Computers, Environment and Urban Systems xxx (2015) xxx–xxx
Contents lists available at ScienceDirect
Computers, Environment and Urban Systems
journal homepage: www.elsevier.com/locate/compenvurbsys
An efficient data processing framework for mining the massive trajectory of moving objects

Yuanchun Zhou a,1, Yang Zhang a,1, Yong Ge b, Zhenghua Xue a, Yanjie Fu c, Danhuai Guo a, Jing Shao a, Tiangang Zhu a, Xuezhi Wang a, Jianhui Li a,*

a Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
b Department of Computer Science, University of North Carolina at Charlotte, NC, USA
c MSIS Department, Rutgers, The State University of New Jersey, NJ, USA
ARTICLE INFO
Article history:
Available online xxxx
Keywords:
Big data
Trajectory of moving object
Compression contribution model
Parallel linear referencing
Two-step consistent hashing
ABSTRACT
Recently, there has been increasing development of positioning technology, which enables us to collect large scale trajectory data for moving objects. Efficient processing and analysis of massive trajectory data has thus become an emerging and challenging task for both researchers and practitioners. Therefore, in this paper, we propose an efficient data processing framework for mining massive trajectory data. This framework includes three modules: (1) a data distribution module, (2) a data transformation module, and (3) a high performance I/O module. Specifically, we first design a two-step consistent hashing algorithm, which takes into account load balancing, data locality, and scalability, for the data distribution module. In the data transformation module, we present a parallel strategy for a linear referencing algorithm with reduced subtask coupling, easily implemented parallelization, and low communication cost. Moreover, we propose a compression-aware I/O module to improve the processing efficiency. Finally, we conduct a comprehensive performance evaluation on a synthetic dataset (1.114 TB) and a real world taxi GPS dataset (578 GB). The experimental results demonstrate the advantages of our proposed framework.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
With the increasing development of sensor networks, global
positioning systems, and wireless communication, people have
collected more and more location traces of various moving objects,
such as human beings and cars. These data have also been transmitted in a timely manner through world-wide information networks, which enables people to monitor and track moving objects in a real-time fashion (Dorigo & Tobler, 1983). Considerable amounts of useful and interesting information are hidden in such
large scale spatio-temporal data (LaValle, Lesser, Shockley,
Hopkins, & Kruschwitz, 2011; Yang et al., 2013), which have
attracted much attention from both researchers and industry.
The movement of many objects is restricted to specific routes.
For instance, the movement of trains is restricted to rail, and most
cars can only move along road networks. Such moving objects produce restricted trajectories, among which various moving patterns
* Corresponding author at: Computer Network Information Center, Chinese Academy of Sciences (CNIC, CAS), 4, 4th South Street, Zhongguancun, P.O. Box 349,
Haidian District, Beijing 100190, China. Tel.: +86 010 5881 2518.
E-mail address: [email protected] (J. Li).
1 These authors contributed equally to this work.
of objects are embedded. For example, there are some speed patterns and traffic jam patterns in the trajectories of vehicles.
There are some periodic patterns and association patterns in the
trajectories of people. A large number of data mining works have
been conducted to discover these patterns in a real-time or offline
fashion (Han, Liu, & Omiecinski, 2012; Li, Ding, Han, & Kays, 2010;
Li, Ji, et al., 2010; Nanni & Pedreschi, 2006; Ossama, Mokhtar, & El-Sharkawi, 2011; Pelekis, Kopanakis, Kotsifakos, Frentzos, &
Theodoridis, 2009). However, most data mining tasks with such
large scale restricted trajectories usually face common challenges
such as (1) data storage and processing intensity, (2) time demand,
and (3) coordinate system transformation.
To address these challenges, in this paper, we propose an efficient
data processing framework for mining the massive trajectory data of
restricted moving objects. Our framework consists of three modules: (1) a data distribution module, (2) a data transformation module, and (3) an I/O performance improvement module. Specifically, we first propose a two-step consistent hashing algorithm for the
data distribution module. The main ideas include: (1) all data are
first distributed to multiple nodes; (2) the data of each node are distributed to multiple disks. Unlike traditional consistent hashing
algorithms, we add a load readjusting step to optimize load balancing. Parallelizing serial algorithms is a popular choice for enhancing
data processing efficiency. For instance, cloud computing infrastructures (Agrawal, Das, & Abbadi, 2011; Armbrust et al., 2009; Yang,
Wu, Huang, Li, & Li, 2011) partition computation into many subtasks,
and then concurrently run those subtasks on multiple servers.
Hence, we propose a parallel linear referencing strategy and establish a data transformation module based on the MPI parallel programming framework. Our parallelization strategy reduces the
coupled interactions among subtasks, reduces interaction overhead
and thus significantly improves the performance. Moreover,
because frequent system I/Os also jeopardize the performance of
processing big data, we design a compression-aware method to
improve system I/O performance. Our method quantitatively analyzes the I/O performance contribution of data compression,
automatically decides when to compress data, and intelligently balances compression rate and ratio.
In our experiments, we perform detailed measurements and
analysis, especially for the data distribution module and I/O performance improvement module. To measure the effectiveness on load
balancing by the number of virtual nodes, we execute the proposed
two-step consistent hashing algorithm for 3000, 5000, 7000, and
9000 files. To meet appropriate load balancing and reduce system
overhead, we set the number of virtual nodes as 400. In addition,
we also compare the difference between the inclusion and exclusion of the load readjusting step (see Step 1.5 in Section 4).
Experimental results show that the load readjusting step can
greatly improve the performance of load balancing. To test the
I/O performance improvement module, we take parallel K-means
(Apache, 2014; Jain, Murty, & Flynn, 1999) clustering as the processing task and measure the performance of our proposed framework with 1.114 TB synthetic data and 578 GB taxi GPS data sets
on a 14-server cluster machine. Table 1 presents the results of
three data storage strategies for each iteration: (1) uniform distribution of data on the servers; (2) uniform distribution of data
on high performance storage (Panasas) with a 1 Gb/s network;
and (3) uniform distribution of data on a hadoop distributed file
system (HDFS in short, see Section 7 for configuration details).
We have two observations from Table 1: (1) data locality may
result in higher efficiency than other storage strategies; (2) all
three methods mentioned above perform very poorly in I/O performance because data reading occupies most of the computational
time and thus leaves CPUs idle most of the time. Therefore, it is desirable
to introduce an I/O performance improvement module (Xue
et al., 2012) and relieve dramatic I/O latency. We quantitatively
analyze the impact of a variety of factors of compression, such as
compression ratio, compression rate, and compression algorithm.
The performance improvement model can also effectively determine when and how to use compression to improve the
performance.
The remainder of this paper is organized as follows. In Section 2,
we introduce the modules of our proposed framework. In Section 3,
we provide a survey on the work regarding linear referencing, I/O
performance improvement methods, the K-means clustering algorithm, and the clustering of trajectories of moving objects.
Section 4 introduces a two-step consistent hash algorithm to allocate data on a server cluster with multiple disks for improving I/O
performance. In Section 5, we design a parallel linear referencing
strategy. In Section 6, we establish a mathematical model and analyze the I/O performance improvement aspect of data compression.
Section 7 shows the experimental results of two large scale data
sets. Finally we conclude our work in Section 8.
2. Data processing framework
In this section, we describe the data processing framework. The
proposed framework mainly includes three parts: a data distribution module, a data transformation module, and a compression-aware I/O performance improvement module. We present the data flow of our proposed framework in Fig. 1.

Table 1
Comparison of three data storage strategies.

Methods       Reading duration (s)    Iteration duration (s)
Local Disk    1162                    1516
Panasas       6540                    6854
HDFS          12,426                  26,132
As shown in Fig. 1, arrows stand for data flows between different processes. First, we distribute raw data to different disks of
multiple computing nodes by the two-step consistent hashing
algorithm (refer to Section 4). Then, a linear referencing transformation module converts these data trunks into projected data
in parallel (see Section 5). Later, an I/O performance improvement
module uses compression-aware strategy to perform proper compression. Finally, we conduct a K-means clustering algorithm in
parallel.
Data allocation plays an important role in processing and analyzing large scale trajectory data. To this end, we propose a two-step consistent hashing strategy based on an existing consistent hashing algorithm. This strategy takes into account factors such as data locality, load balancing, parallelism, and monotonicity.
More details are included in Section 4.
Coordinate system transformation is a critical pretreatment
process in mining the trajectory of moving objects. Linear referencing is a classic mechanism of coordinate system transformation, we
therefore design a parallel strategy for linear referencing in a data
transformation module. The proposed parallel algorithm has low
coupling between subtasks, is easy to implement, and leads to
low communication cost. We detail this component in Section 5.
I/O is the main bottleneck of all big data processing tasks. Using
a compression mechanism can effectively reduce the I/O cost of
data processing. In this paper, we exploit the K-means clustering
algorithm to cluster the trajectories of moving objects. Because
the K-means algorithm reads all the data in each iteration, and
often requires multiple iterations before converging, we quantitatively analyze the impact of compression ratio, compression rate,
compression algorithm, and other factors related to compression
on the processing performance. We also quantitatively analyze
when and how to use compression to improve the processing
performance. We will introduce more details in Section 6.
Unlike existing works, the key modules of our data processing
framework are specifically optimized for performance enhancement at all levels. Additionally, the modules of the framework
are loosely coupled and can be independently applied to other
big data analysis scenarios. In addition, this framework has no
dependency on the data to be processed and can be applied to
other similar big data processing applications.
3. Related work
3.1. K-means algorithm
Extensive research on K-means clustering has been conducted in
data mining research (Agarwal & Mustafa, 2004; Aggarwal, Wolf, Yu,
Procopiuc, & Park, 1999; Aggarwal & Yu, 2002; Tung, Xu, & Ooi,
2005). However, when clustering data at the terabyte scale or larger, a serial K-means algorithm often fails. In contrast,
parallel schemes show their advantages. For big data clustering, the
research community has published several parallel K-means algorithms. Dhillon and Modha (2000) proposed a parallel implementation of the K-means algorithm based on the message-passing model and analyzed the algorithm's scalability and speedup. Li
and Chung (2007) proposed a bisecting parallel K-means algorithm, which balanced the load among multiprocessors with a prediction measure. Based on MapReduce, the most popular parallel programming framework, Lee, Lee, Choi, Chung, and Moon (2012) and Zhao, Ma, and He (2009) proposed parallel K-means algorithms and assessed their performance by speed-up, scale-up, and size-up. In this paper, we exploit the parallel K-means algorithm proposed by Dhillon, which has high scalability and efficiency (Dhillon & Modha, 2001).

Fig. 1. Data processing framework.
3.2. Linear referencing algorithm
Linear Referencing (LR) (Noronha & Church, 2002) is the specification of a location by means of a distance measurement along
a sequence of road sections from a known reference point. Linear
referencing is mainly used to manage data related to linear features such as railways, rivers, roads, oil and gas transmission
pipelines, and power transmission lines. Linear referencing is supported by several well-known Geographic Information System packages, such as ArcGIS (ESRI, 2014), GRASS
GIS (Blazek, 2004), PostGIS (PostGIS, 2014), and GE Global
Transmission Office (Energy, 2014). Because there is almost no suitable MPI-based parallel linear referencing algorithm, we propose an innovative parallel implementation of Linear Referencing
based on the MPI parallel programming model.
3.3. Clustering trajectory of moving objects
Due to the challenge of quickly clustering large numbers of moving objects, many successful and scalable methods for the clustering of moving objects have been proposed. Li, Han, and Yang (2004) proposed the concept of a moving micro-cluster to capture some
regularities of moving objects and handle large datasets. This
algorithm can maintain high quality moving micro-clusters and
leads to fast competitive clustering results at any given time.
Ossama et al. (2011) proposed a pattern-based clustering
algorithm which adapts the K-means algorithm for trajectory data
and overcomes the known drawbacks of the K-means algorithm. Li,
Ding, et al. (2010) proposed the concepts of swarm and closed
swarm, which enable the discovery of interesting moving object
clusters with relaxed temporal constraints. The effectiveness and
efficiency are respectively tested using real data and synthetic
data. Han et al. (2012) proposed a road network aware approach,
NEAT, for fast and effective clustering of spatial trajectories of
moving objects. Experimental results show that the NEAT
approach runs orders of magnitude faster than existing density-based trajectory clustering approaches. Nanni and Pedreschi
(2006) proposed a density-based clustering method for moving
object trajectories to discover interesting time intervals, where
(when) the quality of the achieved clustering is optimal. Jensen,
Lin, and Ooi (2007) proposed a fast and effective scheme for the
continuous clustering of moving objects and used the dissimilarity
notion to improve clustering quality and runtime performance.
Then, they proposed a dynamic summary data structure and used
an average-radius function to detect cluster split events. Pelekis
et al. (2009) studied the uncertainty in the trajectory database
and devised the CenTR-I-FCM algorithm for clustering trajectories
under uncertainty. Li, Ji, et al. (2010) designed a system,
MoveMine, for sophisticated moving object data mining.
MoveMine provides a user-friendly interface and flexible tuning
of the underlying methods. It benefits researchers targeting future
studies in moving object data mining. Each of the existing clustering methods has its own advantages and disadvantages. For the
purpose of generality and parallel performance, we implement
the MPI-based parallel K-means clustering method in our experiment. Genolini and Falissard (2010) proposed a new implementation of K-means, named KmL, which is specifically designed for longitudinal data. It provides scope for dealing
with missing values and runs the algorithm several times, varying
the starting conditions and/or the number of clusters sought; its
graphical interface helps the user choose the appropriate number
of clusters when the classic criterion is not efficient. KmL gives
much better results on non-polynomial trajectories.
3.4. Compression based I/O optimization
More and more researchers are using data compression to
improve I/O performance. Chen, Ganapathi, and Katz (2010) developed a decision-making algorithm that helps MapReduce users
identify when and where to use compression and improve energy
efficiency. However, Chen’s algorithm only considered the compression ratio and the frequency of reading. Abadi, Madden, and
Ferreira (2006) extended C-Store (a column-oriented DBMS) with
a compression sub-system and evaluated a set of compression
schemes. Zukowski, Heman, Nes, and Boncz (2006) compared a
set of super-scalar compression algorithms proposed with compression techniques used in commercial databases and showed
that they significantly alleviate the I/O bottleneck. Lee, Winslett,
Ma, and Yu (2002) proposed three methods for parallel compression of scientific data to reduce the data migration cost and analyzed eight scientific data sets. Welton et al. (2011) harnessed
idle CPU resources to compress network data, to reduce the
amount of data transferred over the network and increase effective
network bandwidth. Different from the above methods, we integrate the compression mechanism into an MPI based clustering
computing and quantitatively analyze the impact of multiple factors related to compression on the framework performance.
4. Data distribution algorithm
Data distribution is significant in big data processing. To properly allocate large scale trajectory data, we argue that the following
design goals (Karger et al., 1997, 1999) are desirable:
Locality: Because networks are a precious resource and usually a bottleneck, especially for large scale data, all data blocks of a data file should be allocated on a single server as much as possible. This effectively reduces network traffic when reading a
big data file.
Please cite this article in press as: Zhou, Y., et al. An efficient data processing framework for mining the massive trajectory of moving objects. Computers,
Environment and Urban Systems (2015), http://dx.doi.org/10.1016/j.compenvurbsys.2015.03.004
4
Y. Zhou et al. / Computers, Environment and Urban Systems xxx (2015) xxx–xxx
Load Balancing: Data files should be uniformly distributed
among all servers for load balancing. Otherwise, processing
duration will be delayed by the node associated with the largest
data.
Parallel: In addition to uniformly distributing all files among all servers for simultaneous data processing, it is effective to stripe a big data file across all disks of a server so that it can be read in parallel. This makes use of aggregated disk I/O when reading a big data file.
Monotonic: When some servers fail or more servers are added,
system rebalancing should incur minimal data movement.
Along this line, we propose a two-step consistent hashing algorithm for big data distribution (see Fig. 2). CHA (Consistent Hashing
Algorithm) (Karger et al., 1997, 1999) is widely used in parallel and
distributed systems. The underlying idea is to hash both data
objects and storage servers using the same hash function. It maps
storage servers to an interval which contains multiple data object
hashes. CHA enables load balancing by assigning data objects on
different servers. However, it’s possible to have a non-uniform distribution of data objects among servers if servers are not enough. It
introduces ‘‘virtual nodes’’ to virtually expand servers. By introducing virtual nodes, it achieves more balanced data distribution and
meets our goals on load balancing. If a server is removed, its interval is taken over by a server with an adjacent interval; if a new server joins, data objects from their adjacent server will move into it.
It guarantees the least data moving when server scale changes and
meets our design goal on remaining monotonic. The two-step consistent hash algorithm (using the well-known FNV-1 hash function)
is described as follows:
Step 1: Assigning data files among servers according to a specified hash function.
Step 1.1: Creating virtual nodes: Create multiple virtual node names for each server by simply adding an index to the server's name.
Step 1.2: Mapping virtual nodes: Take each virtual node's name as the input of the hash function and map all virtual nodes into the hash space, which is a circle covering the range 0 to 2^32 - 1.
Step 1.3: Mapping data files: Take the names of data files as the
input of the same hash function and map all data files into the
same hash space circle.
Step 1.4: Assigning data files to virtual nodes: Start from
where a data file is located and head clockwise on the ring until
finding a virtual node. The data file is assigned to this virtual
node.
Step 1.5: Adjusting load: Sort virtual nodes by data amount.
Move data files from the heavy nodes to the lightest node to
make the lightest approach the average load. Repeat the step
in the remaining nodes until the last node.
Step 2: Striping the data files of each server across its multiple disks.
Step 2.1: Creating virtual nodes: Create multiple virtual node names for each disk by simply adding an index to the disk's name.
Step 2.2: Mapping virtual nodes: Take each virtual node's name as the input of the hash function and map all virtual nodes into the hash space, which is a circle covering the range 0 to 2^32 - 1.
Step 2.3: Mapping data blocks: Split data files into a number of
fixed-size blocks. Take the names of data blocks as the input of
the same hash function, then map all data blocks into the same
hash space circle.
Step 2.4: Assigning data blocks to virtual nodes: Start from
where a data block is and head clockwise on the ring until finding
a virtual node. The data block is assigned to this virtual node.
Here, Step 1 assigns data files to servers without splitting files.
This enables reading a file without any network traffic, and it satisfies our design goal of locality. Step 1.5 is an additional step for a
CHA. A CHA can guarantee that each server holds approximately
the same number of files, but that’s not enough. Because data file
sizes vary, each server actually has a different amount of data,
especially when file sizes vary greatly. Step
1.5 adjusts the data amount to make each server approach the
average. Step 2 stripes a file into data blocks and then uniformly
distributes blocks into multiple disks of a server according to a
CHA. Each file is simultaneously read from multi-disks of a server,
which provides parallelism within the scope of a server.
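To make Step 1 concrete, the following is a minimal Python sketch, assuming the 32-bit FNV-1 hash named above. The class and function names are ours, not the authors', and the load readjustment is shown as a single pass rather than the full repeat-until-last-node loop of Step 1.5.

```python
import bisect

FNV_PRIME, FNV_OFFSET = 0x01000193, 0x811C9DC5

def fnv1_32(data: bytes) -> int:
    # 32-bit FNV-1: multiply by the prime, then XOR in each byte.
    h = FNV_OFFSET
    for b in data:
        h = ((h * FNV_PRIME) & 0xFFFFFFFF) ^ b
    return h

class HashRing:
    """Steps 1.1-1.4: map virtual nodes and file names onto the
    0..2**32 - 1 circle and assign each file clockwise."""
    def __init__(self, servers, virtual_nodes=400):
        pairs = sorted((fnv1_32(f"{s}#{i}".encode()), s)
                       for s in servers for i in range(virtual_nodes))
        self._hashes = [h for h, _ in pairs]
        self._servers = [s for _, s in pairs]

    def locate(self, file_name: str) -> str:
        i = bisect.bisect_right(self._hashes, fnv1_32(file_name.encode()))
        return self._servers[i % len(self._servers)]

def readjust_once(assignment, file_sizes):
    """Step 1.5 (one pass): move files from the heaviest server to the
    lightest until the lightest approaches the average load."""
    loads = {s: sum(file_sizes[f] for f in fs) for s, fs in assignment.items()}
    avg = sum(loads.values()) / len(loads)
    light, heavy = min(loads, key=loads.get), max(loads, key=loads.get)
    for f in sorted(assignment[heavy], key=file_sizes.get):
        # Stop once the lightest reaches the average, or once moving the
        # file would push the heaviest node below the average.
        if loads[light] >= avg or loads[heavy] - file_sizes[f] < avg:
            break
        assignment[heavy].remove(f)
        assignment[light].append(f)
        loads[heavy] -= file_sizes[f]
        loads[light] += file_sizes[f]
    return assignment

# Usage: HashRing(["node01", "node02"]).locate("taxi_20120101.dat")
# Step 2 reuses the same ring construction with disk names and block names.
```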
Unlike conventional data distribution mechanisms, such as the
Lustre file system, the proposed algorithm has the following
advantages: (1) there is no need to store metadata, which avoids
the single point of failure and reduces the system overhead; (2)
this method assigns data files to servers without splitting files in
Step 1, which greatly reduces network overhead when reading or
writing data; (3) it makes full use of multi-disk aggregated I/O
bandwidth to improve the performance; (4) Step 1.5 can make
appropriate readjustments for the hashed files among multiple servers, which achieves the approximate balance among servers; (5)
the proposed algorithm is suitable for both large files and small
files. In summary, the proposed two-step consistent hashing algorithm is superior to other traditional algorithms.
5. Parallel linear referencing strategy
To improve the projection performance, we parallelize the linear referencing algorithm based on the MPI parallel programming
framework. The parallel scheme is shown in Fig. 3.
From Fig. 3, we can see that the parallel strategy has low coupling
among subtasks, which contributes to low communication costs
among subtasks and easy implementation. The three circled digits
in this figure indicate a specified order: (1) digit 1 denotes reading
the road network data from local disks; (2) digit 2 denotes building
the R-Tree index system using the road network data; and (3) digit
3 denotes reading trajectory records then performing a linear
referencing operation.
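The three steps in the figure map naturally onto an SPMD program. Below is a schematic Python sketch, not the authors' code: it assumes the mpi4py and rtree packages, and load_road_segments, read_records, linear_reference, and write_projected are hypothetical placeholders for the I/O and projection routines.

```python
from mpi4py import MPI   # MPI bindings for Python
from rtree import index  # R-tree spatial index (libspatialindex)

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# (1) Every rank reads the (replicated) road network from its local disk,
#     so no rank depends on another for the reference data.
segments = load_road_segments("roads.dat")        # hypothetical loader

# (2) Every rank builds its own in-memory R-tree over the road segments;
#     no communication is needed during the projection phase.
idx = index.Index()
for sid, seg in enumerate(segments):
    idx.insert(sid, seg.bbox)                     # (minx, miny, maxx, maxy)

# (3) Each rank reads only its own shard of trajectory records and
#     performs the linear referencing operation independently.
for rec in read_records(shard=rank, shards=nprocs):   # hypothetical reader
    sid = next(idx.nearest((rec.lon, rec.lat, rec.lon, rec.lat), 1))
    write_projected(linear_reference(rec, segments[sid]))  # hypothetical
```

Because each rank holds its own index and shard, the only coupling between subtasks is the initial replicated read, which is what keeps communication cost low.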
6. Compression-aware I/O improvement module
Fig. 2. Data distribution strategy.
In the compression mechanism, an application does not directly read the raw data from a local disk. Instead, it uses a compression algorithm with a high compression ratio and high compression and decompression speeds, reads the compressed data into memory, and then decompresses it and performs the calculation.
Because a K-means algorithm often iterates many times over the same data set before convergence, the larger the number of iterations, the greater the performance improvement that compression will yield. Although using compression reduces I/O overhead, it introduces CPU overhead for compression and decompression. The improvement of clustering performance by using compression technology is related to multiple factors, such as data size, compression and decompression efficiency, and the number of iterations. It is therefore desirable to design a mathematical model to quantitatively analyze the trade-off. Table 2 lists the parameters used. The parameters are determined by the properties of the dataset and the hardware configuration (see Section 7).
Fig. 3. Parallel linear referencing strategy.
6.1. Task decomposition
In the process of K-means computation, we assign two types of
processes: compression processes and computation processes.
Compression processes entail reading the raw data from local
disks, then compressing the raw data and writing compressed data
back to disks. Computation processes involve reading the compressed data from local disks, and decompressing and calculating
it. All processing of the data is performed in a block-by-block manner, as shown in Fig. 4:
Under parallel conditions, we assume that each compression
process takes Tcomp seconds to address a raw data block with the
size of K. Tcomp consists of three parts:
(1) Tcomp_read: duration of reading a data block with size K.
(2) Tcomp_comp: duration of compressing the data block from the
size of K to the size of K/l.
(3) Tcomp_write: duration of writing compressed data with the size
of K/l back to disks.
Therefore, we can conclude Tcomp = Tcomp_read + Tcomp_comp + Tcomp_write.
Under parallel conditions, each computation process takes Tcalc seconds to address a compressed data block with the size of K/l. Tcalc consists of three parts:
(1) Tcalc_read: duration of reading a compressed data block with
the size of K/l.
(2) Tcalc_uncom: duration of decompressing the compressed data
block from the size of K/l to the size of K.
(3) Tcalc_calc: duration of calculating the uncompressed data with
the size of K.
Therefore, Tcalc = Tcalc_read + Tcalc_uncom + Tcalc_calc.
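Read as throughput models, the two decompositions are straightforward to compute. The sketch below is illustrative only (the paper measures these times rather than modeling them this way): the rate arguments are assumed effective throughputs, with uncom_speed taken over the compressed bytes, which matches the K/comp_coeff term in the later compression-coefficient analysis.

```python
def t_comp(K, l, read_bw, comp_speed, write_bw):
    # Tcomp = Tcomp_read + Tcomp_comp + Tcomp_write:
    # read K, compress K -> K/l, write K/l back to disk.
    return K / read_bw + K / comp_speed + (K / l) / write_bw

def t_calc(K, l, read_bw, uncom_speed, calc_speed):
    # Tcalc = Tcalc_read + Tcalc_uncom + Tcalc_calc:
    # read K/l, decompress K/l -> K, compute over K bytes.
    return (K / l) / read_bw + (K / l) / uncom_speed + K / calc_speed

# Example with the LZO defaults from Table 3 (GB and GB/s); calc_speed
# here is a made-up placeholder, not a measured value:
print(t_calc(K=0.1123, l=2.313, read_bw=0.08969,
             uncom_speed=0.108, calc_speed=0.05))
```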
In general, the K-means algorithm needs to iterate many times
before convergence. Therefore, only in the first iteration is the data
compression part necessary. In the subsequent iterations, it just
uses the compressed data generated in the first iteration. We will
analyze the first iteration and subsequent iterations.
6.2. Analysis for the first iteration
To facilitate our analysis, we draw the timing diagram for compression processes and computation processes according to three
different situations as shown in Fig. 5:
In Fig. 5, Comp, Calc and Idle denote compression, computation
and idle times, respectively. The green, blue and red boxes indicate
the time taken by each compression process to address a raw data
block, the time taken by each computation process to address a compressed data block, and the idle time of the compression/computation processes, respectively.

Table 2
Data and environmental parameters.

Parameters     Definitions
dS             A d-dimensional dataset
d              Dimensionality of dataset dS, 10 by default
g              Number of samples for dS, 15 billion by default
l              Ratio of the size of uncompressed data to the size of compressed data
K (GB)         Size of a raw data block, 0.1123 by default
K/l (GB)       Size of a compressed data block
D (GB)         Size of dataset dS, 1114.875 by default
N              Number of total processes, 336 by default
M              Number of compression processes
avai_band      Disk reading rate in GB/s, 0.08969 by default
We assume that there is a constant C such that Tcomp = C × Tcalc. If and only if this constant satisfies C = M/(N - M), namely Tcomp/Tcalc = M/(N - M), do N - M computational processes need exactly M compression processes to supply the compressed data. At this point, the processing rates of the compression processes and the computation processes are consistent. Only in this situation can the compression process resources and computation process resources be fully utilized.
When C > M/(N - M), as shown in Fig. 5(a), the computation process is faster than the compression process. Computation waits for compression, and the value of M is smaller than that in the optimal situation.

When C < M/(N - M), as shown in Fig. 5(b), the compression process is faster than the computation process. Compression waits for calculation, and M is larger than in the optimal situation.
When compression processes and calculation processes coexist,
we design a synchronization mechanism between compression
processes and calculation processes. When a compression process writes compressed data to disk, the file is named "file.lzo.swp" until all compressed data writing is finished. Then, the compression process changes the file name from "file.lzo.swp" to "file.lzo". The calculation processes continuously check whether the file "file.lzo" exists. If "file.lzo" does not exist, the calculation processes continue to check. When "file.lzo" exists, the calculation processes read it from disk into memory and then perform the calculation.
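A minimal Python sketch of this rename-based handshake follows (the paper gives no code; the function names are ours). The rename is atomic on POSIX file systems, which is what prevents a reader from ever seeing a partially written block.

```python
import os
import time

def publish_compressed(path_base: str, payload: bytes) -> None:
    """Compression side: write under the .swp name, then rename,
    so a reader never observes half-written compressed data."""
    with open(path_base + ".lzo.swp", "wb") as f:
        f.write(payload)
    os.rename(path_base + ".lzo.swp", path_base + ".lzo")

def wait_for_compressed(path_base: str, poll_interval: float = 0.1) -> bytes:
    """Calculation side: poll until the final name appears, then read it."""
    final = path_base + ".lzo"
    while not os.path.exists(final):
        time.sleep(poll_interval)
    with open(final, "rb") as f:
        return f.read()
```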
In the third case, shown in Fig. 5(c), compression and computation are separated in time. First (M = N), there are no computation processes, and all N processes work as compression processes to address all the raw data. Then (M = 0), there are no compression processes, and all N processes work as computation processes.
Fig. 4. Data flow chart.
Fig. 5. Diagram for three situations.
Based on the above analysis, we can obtain the formula for the time T1(M) taken by the first iteration:

$$T_1(M) = \begin{cases} \left\lceil \dfrac{D}{K M} \right\rceil T_{comp} + f\!\left(D \bmod (K M)\right) T_{calc}, & 1 \leq M \leq \dfrac{C}{C+1} N \\[2mm] T_{comp} + \left\lceil \dfrac{D}{K (N-M)} \right\rceil T_{calc}, & \dfrac{C}{C+1} N \leq M \leq N-1 \\[2mm] \left\lceil \dfrac{D}{K M} \right\rceil \left(T_{comp} + T_{calc}\right), & M = N \end{cases} \tag{1}$$

where the function f(x) is:

$$f(x) = \begin{cases} \left\lceil \dfrac{x}{(N-M)\, K} \right\rceil, & x \neq 0 \\[1mm] 1, & x = 0 \end{cases} \tag{2}$$
T1(M) corresponds to the three different situations in Fig. 5(a)-(c). Tcomp and Tcalc depend on the value of M, and the relationship is non-linear, so we perform a polynomial fit of Tcomp and Tcalc versus M. To avoid overfitting, the order of the fitted function is set to three. We can then draw the curve of the function T1(M), as shown in Fig. 6:
In Fig. 6, the parameter D is set to 1141.875; Tcomp and Tcalc are substituted by the fitted functions Tcomp(M) and Tcalc(M); the parameter C is set to Tcomp/Tcalc; and the number of processes N is set to 336. The figure illustrates that:
(1) When a computation process is faster than compression, the
trend of the red curve is relatively flat. The number of compression processes M has little effect on the time T1(M).
(2) When compression is faster than the computation process,
the green curve shows a slowly increasing trend with the
increasing number of compression processes. When M is
close to 280, the curve shows a sharp increasing trend. The
number of compression processes M has a positive effect
on the time T1(M).
(3) When the compression is conducted before the computation, T1(M) will be a constant and is a little larger than the
other two cases.
Based on this analysis, we conclude that the appropriate choice
of M can effectively improve the performance of parallel K-means
application.
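Formula (1) is easier to read as code. The sketch below transcribes the piecewise definition directly; note that in the paper Tcomp and Tcalc themselves vary with M via fitted cubics, so the constant block times passed in here (and the values in the example) are illustrative placeholders only.

```python
import math

def t1(M, N, D, K, t_comp, t_calc):
    """Piecewise T1(M) from Formulas (1)-(2)."""
    if M == N:                      # Fig. 5(c): compress all, then compute
        return math.ceil(D / (K * M)) * (t_comp + t_calc)
    C = t_comp / t_calc
    if M <= C / (C + 1) * N:        # Fig. 5(a): computation waits
        tail = D % (K * M)          # leftover data after full rounds
        f = math.ceil(tail / ((N - M) * K)) if tail else 1
        return math.ceil(D / (K * M)) * t_comp + f * t_calc
    # Fig. 5(b): compression waits for computation
    return t_comp + math.ceil(D / (K * (N - M))) * t_calc

# Sweep M to find the number of compression processes minimizing T1;
# 3.0 and 1.5 are assumed block times, not measured values.
N, D, K = 336, 1141.875, 0.1123
best_M = min(range(1, N + 1), key=lambda M: t1(M, N, D, K, 3.0, 1.5))
print(best_M)
```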
6.3. Analysis for all iterations

6.3.1. Analysis with compression
In the second and subsequent iterations, the execution processes are similar, and it is unnecessary to compress the raw data again. All N processes work as computation processes.
Fig. 6. Time for the first iteration versus the number of compression processes.
Under parallel conditions, each computation process takes Tcalc
seconds to address a compressed data block with the size of K/l.
Tcalc consists of three parts:
(1) Tcalc_read: duration of reading a compressed data block with
the size of K/l.
(2) Tcalc_uncom: duration of decompressing the compressed data
block from K/l to K.
(3) Tcalc_calc: duration of calculating the uncompressed data
block with the size of K.
2
3
D=l 7
Ti ¼ 6
7 T calc
6K 6 l N7
ð3Þ
where S indicates the total number of iterations of parallel K-means
application.
Based on the above analysis, the total time Ttotal_com(S) taken by
the application throughout the whole execution process can be formulated as:
T total
2
3
S
X
D=
l
6 7 T calc
com ðSÞ ¼ T 1 þ
6 K
7
N7
i¼2 6 l
ð4Þ
Under a certain number of compression processes, T 1 indicates the
minimum time taken by the first iteration and can be expressed as:
T 1 ¼ max16M6N ðT 1 ðMÞÞ
ð5Þ
6.3.2. Analysis without compression
When compression is not adopted, each iteration of the application has a similar execution process, and all N processes work as computation processes.

Under parallel conditions, each computation process takes T̄calc seconds to address a raw data block with size K. To avoid ambiguity between the formulas with and without compression, we use T̄calc here instead of Tcalc. T̄calc consists of two parts:

(1) Tcalc_read: duration of reading a raw data block with the size of K.
(2) Tcalc_calc: duration of calculating the raw data block with the size of K.

Therefore, T̄calc = Tcalc_read + Tcalc_calc.

Based on the above analysis for the various iterations, the total time Ttotal(S) taken by the program throughout the whole execution process without compression can be formulated as:

$$T_{total}(S) = \sum_{i=1}^{S} \left\lceil \frac{D}{K N} \right\rceil \bar{T}_{calc} \tag{6}$$

6.4. Compression contribution model

To determine the circumstances under which compression can improve the performance of the application, we perform the subtraction between Ttotal(S) and Ttotal_com(S) as below:

$$T_{total}(S) - T_{total\_com}(S) = S \left\lceil \frac{D}{K N} \right\rceil \bar{T}_{calc} - \tilde{T}_1 - (S-1) \left\lceil \frac{D}{K N} \right\rceil T_{calc} \tag{7}$$

If the compression mechanism is able to improve the performance of the program, the difference between Ttotal(S) and Ttotal_com(S) must satisfy:

$$T_{total}(S) - T_{total\_com}(S) > 0 \tag{8}$$

Therefore, the condition under which the compression mechanism can improve the performance is that the number of iterations S must satisfy the following formula:

$$S \geq \frac{\tilde{T}_1 - \left\lceil \frac{D}{K N} \right\rceil T_{calc}}{\left\lceil \frac{D}{K N} \right\rceil \left(\bar{T}_{calc} - T_{calc}\right)} \tag{9}$$
To facilitate the comparison, we draw the curves of Ttotal(S) and Ttotal_com(S) versus the number of iterations S:
Fig. 7 shows that the application with compression takes more time for the first iteration. With an increasing number of iterations, the application with compression starts to take less time than that without compression, and the gap between the two curves becomes larger. We can conclude that the larger the number of iterations, the
higher the performance improvement the compression will yield.
Especially for those applications with high real-time requirements,
using compression can effectively improve the performance when
dealing with large-scale data and many rounds of iterations.
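Formula (9) gives a direct break-even test. Here is a small sketch of it in Python; the timing values in the example call are assumed placeholders, not measurements from the paper.

```python
import math

def break_even_iterations(T1_min, D, K, N, t_calc_bar, t_calc):
    """Smallest iteration count S satisfying Formula (9), i.e. the point
    where compression starts to pay off. t_calc_bar is the per-block time
    without compression, t_calc the per-block time with compression."""
    blocks = math.ceil(D / (K * N))          # ceil(D / (K N))
    gain_per_iter = blocks * (t_calc_bar - t_calc)
    if gain_per_iter <= 0:
        return None                          # compression never pays off
    return math.ceil((T1_min - blocks * t_calc) / gain_per_iter)

# Illustrative call with assumed timings (seconds per block):
print(break_even_iterations(T1_min=900.0, D=1141.875, K=0.1123, N=336,
                            t_calc_bar=2.0, t_calc=1.2))
```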
7. Experiments and analysis
In this section, we provide an empirical evaluation of the proposed data processing framework on real-world data. Specifically,
we first establish an MPI cluster and conduct experiments to evaluate the three modules of the framework. The MPI cluster consists of
14 computing nodes that are interconnected with an Infiniband
network. Additionally, to evaluate the I/O performance improvement module, we establish a Hadoop cluster for experimental
comparison. The Hadoop cluster consists of 5 computing nodes
interconnected by a Gigabit Ethernet network. Each node in the
Hadoop cluster holds 24 computing cores, 32 GB of memory, and 12 × 1 TB disks. The CPU is an Intel(R) Xeon(R) L5640 @ 2.27 GHz. The replication factor of the Hadoop cluster is set to 3.
7.1. Experiments on data distribution module
Here, we report the performance of our proposed data distribution strategy. We first test the file reading speed. While the
speed is approximately 130 MB/s if the file is read from a single
disk, our module reads a file from 12 disks of a server in parallel
and achieves a higher speed of 860 MB/s. This result validates that
Step 2 effectively improves the I/O performance.
We then measure the effect of the number of virtual nodes.
Specifically, we set different numbers of virtual nodes for each
physical node and apply the consistent hashing algorithm to file sets of four sizes (3000, 5000, 7000, and 9000 files).
Fig. 8 shows the load balance over different numbers of virtual
nodes. The x-axis of Fig. 8 indicates the number of virtual nodes.
The y-axis of Fig. 8 indicates (in log scale) how far the data amount
on servers is spread out from the average. Fig. 8 shows that as the
number of virtual nodes increases, the variance decreases gradually until the number of virtual nodes reaches 400; beyond 400, the variance fluctuates.
We tune the parameter to achieve load balancing and reduce system overhead, and set the number of virtual nodes to 400 in the
following experiments.
Fig. 9 shows the comparison of performances with and without
Step 1.5, where the variance shows how far the data amounts on servers are spread out from the average. The comparison validates that
Step 1.5 helps improve the load balancing as the file number
grows.
In sum, the two-step consistent hashing algorithm improves the data distribution task in all four respects: locality, load balancing, parallelism, and monotonicity.
7.2. Experiments on I/O improvement module
In our experiments, we used an MPI cluster with 336 cores to
run a parallel program for the K-means clustering algorithm that
is implemented by a message-passing interface programming
model. For comparison with the MPI-based implementation of the K-means algorithm, we also implemented a parallel version based on the MapReduce parallel programming framework.
Table 3 lists the initial value of environmental parameters.
These parameters are determined by the properties of the dataset
and the hardware configuration. LZO, LZMA, Gzip and Bzip in
Table 3 indicate four different compression algorithms, respectively, used by the following measurements and analysis. Initial
values of parameters are measured on our computing clusters
and may not be the same as those of others.
7.2.1. Analysis for the first iteration
7.2.1.1. Analysis of first compression then computation (Use LZO only). Fig. 5(c) shows that there is no coexistence of compression
Fig. 8. Comparison of load balancing using various numbers of virtual nodes.
Fig. 9. Comparison of load balancing using CHA with and without Step 1.5.
processes and computation processes and no direct interaction
between them. Therefore, the time Tcomp taken by compression
processes to address a raw data block and the time Tcalc taken by
computation processes to address a compressed data block can
be viewed as fixed values. Then, the total execution time of the
application can be formulated as:
$$T_{total\_com}(S) = \left\lceil \frac{D}{K N} \right\rceil T_{comp} + \sum_{i=1}^{S} \left\lceil \frac{D/l}{(K/l)\, N} \right\rceil T_{calc} \tag{10}$$
To facilitate the analysis, we draw three curves:
(1) Total execution time without compression predicted by
Formula 6, Ttotal(S).
(2) Total execution time with compression predicted by
Formula 10, Ttotal_com(S).
(3) Total execution time with compression by real measured
data.
Table 3
Initial parameters (where avai_band indicates the average disk bandwidth, uncom_speed indicates the speed of decompression, and l indicates the ratio of the size of uncompressed data to the size of compressed data).

Parameters              Default initial value
K                       0.1123 GB
D                       1141.875 GB
N                       336
M                       14-336
avai_band               0.08969 GB/s
uncom_speed for LZO     0.108 GB/s
uncom_speed for LZMA    0.0064 GB/s
uncom_speed for Gzip    0.0053 GB/s
uncom_speed for Bzip    0.0038 GB/s
l for the LZO           2.313
l for the LZMA          5.004
l for the Gzip          4.107
l for the Bzip          5.646

Fig. 10 shows that the application with compression takes a longer time for the first and second iterations. As the number of iterations increases, the application with compression starts to take less time than that without compression, and the growing gap between the blue, green, and red curves shows that the performance improvement by compression becomes more obvious as the number of iterations increases.

Fig. 10. Time with and without compression.
7.2.1.2. Analysis of comparative speeds between compression and
computation (Use LZO only). Fig. 5(a) and (b) show that there are
concurrent compression processes and computation processes. To
facilitate comparative analysis of the first iteration, we draw curves
when the compression is faster than the computation, when the
computation is faster than the compression, and the corresponding
real measured data versus the number of compression processes.
Fig. 11 shows that, when the number of compression processes
is less than 280, the pink curve is more consistent with the red
curve than the green one. When the number of compression processes is greater than 280, the pink curve is more consistent with
the green curve than the red one. After processing the measured
data, we found that when the number of compression processes
is less than 280, the measured data for Tcomp and Tcalc satisfy the following inequality, which means that the computation process is
faster than compression.
$$\frac{T_{comp}}{T_{calc}} > \frac{M}{N - M} \tag{11}$$
When the number of compression processes is greater than 280,
the measured data for Tcomp and Tcalc satisfy the following inequality, which means that compression is faster than the computation.
$$\frac{T_{comp}}{T_{calc}} < \frac{M}{N - M} \tag{12}$$
Based on the above discussions, we can conclude that the T1(M)
function is consistent with the measured data and provides an
accurate description of the relationship between the execution
time of the first iteration and the number of compression processes. T1(M) can also provide a strong criterion to set an appropriate number of compression processes to minimize execution time
of the first iteration.
7.2.2. Analysis for all iterations
7.2.2.1. Analysis of the number of iterations (Use LZO only). We draw
predicted curves in Fig. 12.
As Fig. 12 shows, curves without compression are higher than
those with compression for the first iteration. For the second and
subsequent iterations, the application does not need to compress
data. Then, the curves with compression flatten and lie lower
than those without compression. As the number of iterations
increases, the gap between curves with compression and those
without compression becomes larger. The performance improvement will be more significant as the number of iterations increases.
Fig. 11. Time for the first iteration.
7.2.2.2. Analysis of the compression ratio. Empirically, the higher the compression ratio, the lower the compression speed. We assume that the compression ratio l and the decompression speed uncom_speed satisfy l = comp_coeff/uncom_speed, where comp_coeff is a coefficient related to the specific compression algorithm. The formula for the total execution time of the application can then be transformed to:

$$T_{total\_com}(S) = \tilde{T}_1 + \sum_{i=2}^{S} \left\lceil \frac{D}{K N} \right\rceil \left( \frac{K}{l} \cdot \frac{1}{avai\_band} + \frac{K}{comp\_coeff} + T_{calc\_calc} \right) \tag{13}$$
When the application iterates multiple times, to facilitate the analysis, the performance improvement of the first iteration by compression is negligible, and T̃1 can be viewed as a fixed value. Based on the above assumptions and the formula, we can see that compression only influences Tcalc_read and Tcalc_uncom. To intuitively analyze the influence of the compression ratio on the total execution time, we draw the contour figure for the sum of Tcalc_read and Tcalc_uncom versus the compression ratio and the compression coefficient comp_coeff, as shown in Fig. 13. The value of each curve corresponds to the sum of Tcalc_read and Tcalc_uncom, the x-axis indicates the compression ratio defined in Table 2, and the y-axis indicates the compression coefficient.

Fig. 12. Time versus the number of iterations.

Fig. 13. Comparison of the duration taken by each compression algorithm.
Fig. 13 illustrates that as the compression ratio increases, the time taken to read the compressed data and to decompress it decreases steadily. In addition, we also performed a comparative
analysis on the effects of the four common compression algorithms
on the time Tcalc_read + Tcalc_uncom. The curves in red asterisks, in
green inverted triangles, in purple boxes, and in blue circles indicate the time taken by the LZO, LZMA, Gzip, and Bzip algorithms,
respectively. Among these compression algorithms, the LZMA algorithm outperforms the others.
7.2.2.3. Analysis of the number of processes (Use LZO only). In Fig. 14,
we can see that the total execution time shows a steady decrease
as the number of processes increases. In particular, when the number of processes is small, the performance improvement is significant. Once the number of processes exceeds a certain value, the performance improvement is no longer very clear. In addition,
compression greatly reduces the execution time taken by the
application. When the number of processes is small, compression
does not improve the performance of the program. As the number
of processes increases, the gap between curves with the compression and those without the compression becomes larger and larger.
7.2.2.4. Analysis of the data size (Use LZO only). Fig. 15 shows that
the two predicted time curves are approximately proportional
to the data size. As the amount of data increases, the gap between
the curves with compression and those without compression
becomes larger and larger. When the amount of data approaches 1000 GB, the time taken by the MPI application with compression
is approximately only half of that without compression.
However, the measured time taken by MapReduce application
with compression is not reduced significantly.
In this experiment, the MPI parallel programming and
MapReduce parallel programming models both integrate the LZO
compression algorithm to improve performance. However, the difference of performance between MPI and MapReduce is very
obvious. We think there are two possible reasons: (1) the parallel
K-means algorithm needs to perform a reducing operation for all
computing subtasks. MPI performs the reducing operation by inter-process communication, whereas MapReduce performs it by writing and reading data on disk. For a
data interexchange among computing subtasks with a fixed
Fig. 14. Time versus the number of total processes.
amount of data, the cost of disk I/O is much higher than the cost
of network I/O. Therefore, an MPI parallel programming model is
more suitable for parallel K-means implementation. (2)
Compared to MPI, MapReduce has additional costs of high reliability, data replication, and a fault-tolerance mechanism. For this reason, MPI parallel programming models can provide more efficient
performance for the iterative K-means algorithm.
When the amount of data is less than 400 GB, without compression, the measured curve of the MPI application differs from the predicted result due to memory caching. When the memory can hold all the data, the second
and subsequent iterations no longer need the disk I/O operations.
Therefore, the measured curve by MPI is significantly lower than
the predicted one. Once the amount of data grows beyond what the memory can hold, the memory no longer contributes to the performance improvement between two neighboring iterations.
Therefore, an effective use of memory can greatly improve the
performance of the application. Actually, we can adopt a memory
trick in the parallel K-means algorithm to improve its performance.
Suppose that the order of data blocks accessed in the ith iteration
is: 1, 2, . . . , P, then the order of data blocks accessed in the (i + 1)th
iteration can be set to: P, . . . , 2, 1. In this way, the (i + 1)th iteration
can use the left data of the ith iteration in memory and thereby
reduce the I/O overhead to enhance the overall performance of
the program. Experimental results show that this trick can achieve
a 30% reduction in disk I/O overhead.
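A sketch of this access-order trick follows (the function name is ours): alternating the scan direction lets iteration i + 1 start with the blocks that iteration i left in memory.

```python
def access_order(num_blocks: int, iteration: int):
    """Alternate the block scan direction between iterations so that the
    tail of iteration i (still resident in the page cache) becomes the
    head of iteration i + 1, saving disk reads."""
    order = list(range(1, num_blocks + 1))
    return order if iteration % 2 == 0 else order[::-1]

# iteration 0: [1, 2, ..., P]; iteration 1: [P, ..., 2, 1]; and so on.
```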
7.3. Experiments on clustering of trajectory

In this section, we perform K-means clustering on trajectory data. The trajectory data used in this experiment were collected from 23,876 taxis in a city. For each taxi, the GPS system samples a location record every 30 s. Each record includes the following attributes: license plate number, the current time, whether there is a taxi passenger, current taxi speed (km/h), taxi driving orientation, and taxi location by longitude and latitude. The trajectory data of all the taxis for each day are stored as a file, which consists of approximately 50 million records. The trajectory data for the whole year consist of approximately 18 billion records, which occupy approximately 578 GB of disk space.

For the longitude and latitude of each record, we perform a linear referencing projection operation, after which the original longitude and latitude coordinates are transformed into projected linear coordinates. Each linear coordinate includes: projected longitude, projected latitude, and the distance between a milepost and the projected point. In the clustering stage, the data to be clustered include 8 attributes: road ID, longitude and latitude of the projection point, distance between the milepost and the projection point, whether there is a passenger in the taxi, taxi speed (km/h), taxi driving direction, and driving time (hours).

To improve the clustering performance, we adopt the compression-aware I/O strategy and integrate the LZO compression algorithm into an MPI-based parallel K-means algorithm. We then compare the experimental results between the taxi GPS data (see Table 4) and the synthetic data (see Table 5). Here, Calc, Redu, I/O, and Total indicate the calculation cost, communication (reduction) cost, I/O cost, and total cost in each iteration, respectively.
Table 4 shows that the use of LZO compression increases the calculation cost by 47.88%, but LZO contributes an overall 58.16% performance improvement. Because the size of the compressed taxi GPS data is less than the memory size of the computing nodes, the I/O cost is reduced by 99.99%. However, since the data are cached in memory at each iteration, this I/O improvement cannot be attributed to compression alone. To avoid the impact of the cache, we use the synthetic data, whose compressed size is greater than the memory size. We then perform the clustering task, and the results are listed in Table 5.
Table 5 shows that the use of LZO compression increases the
calculation cost by 21.29%, but LZO brings a 71.83% reduction to
I/O cost. At each iteration of the parallel K-means clustering of trajectories of moving objects, the use of the LZO compression algorithm still contributes a 57.86% cost reduction. The compression-aware I/O performance improvement module thus improves the I/O performance and the overall performance of the framework significantly.
8. Conclusions
In this paper, we present a novel framework for efficient processing of trajectory data. Our proposed framework consists of
three modules: (1) a big data distribution module based on a
two-step consistent hashing algorithm, (2) a data transformation
module based on a parallel linear referencing strategy, and (3) a
compression-aware I/O performance improvement module. We
take a K-means clustering algorithm as an example, and conduct
extensive empirical studies with large scale synthetic data and
real-world GPS data. The experimental results show that our two-step consistent hashing algorithm achieves locality, load balancing, parallelism, and monotonicity while improving the performance significantly. The proposed parallel linear referencing strategy has low coupling and low communication costs among subtasks and can be implemented
easily, and the compression-aware performance improvement
model is capable of providing effective decision support on how
to use compression to improve I/O performance.
For future work, we will extend our compression model to take
network transmission into consideration because data interchange
is sometimes inevitable. We are also interested in applying the
Fig. 15. Time versus the data size.
Table 4
Duration (D) and Duration Trend (DT) with/without LZO compression in one iteration (taxi GPS data).

          Without LZO compression                With LZO compression
          Calc     Redu     I/O      Total       Calc     Redu     I/O      Total
D (s)     114.94   226.39   479.56   820.89      169.97   163.08   0.06     343.45
DT (%)    0.0      0.0      0.0      0.0         +47.88   -27.97   -99.99   -58.16
Table 5
Duration (D) and Duration Trend (DT) with/without LZO compression in one iteration (synthetic data).

          Without LZO compression                With LZO compression
          Calc     Redu     I/O      Total       Calc     Redu     I/O      Total
D (s)     182.20   50.26    1088.85  1321.31     220.99   29.04    306.72   556.75
DT (%)    0.0      0.0      0.0      0.0         +21.29   -42.22   -71.83   -57.86
Acknowledgements
This work is supported by the Chinese ‘‘Twelfth Five-Year’’ Plan for Science & Technology Support under Grant Nos. 2012BAK17B01 and 2013BAD15B02, the Natural Science Foundation of China (NSFC) under Grant Nos. 91224006, 61003138 and 41371386, the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant Nos. XDA06010202 and XDA05050601, and the joint project of Foshan and the Chinese Academy of Sciences under Grant No. 2012YS23.
References
Abadi, D., Madden, S., & Ferreira, M. (2006). Integrating compression and execution
in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD
international conference on management of data (pp. 671–682). Chicago, IL, USA:
ACM.
Agarwal, P. K., & Mustafa, N. H. (2004). K-means projective clustering. In Proceedings
of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of
database systems (pp. 155–165). Paris, France: ACM.
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms
for projected clustering. SIGMOD Record, 28, 61–72.
Aggarwal, C. C., & Yu, P. S. (2002). Redefining clustering for high-dimensional
applications. IEEE Transactions on Knowledge and Data Engineering, 14, 210–225.
Agrawal, D., Das, S., & El Abbadi, A. (2011). Big data and cloud computing: Current state and future opportunities. In Proceedings of the 14th international conference on extending database technology (pp. 530–533). Uppsala, Sweden: ACM.
Apache (2014). Apache Mahout: Scalable machine learning and data mining.
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., et al.
(2009). Above the clouds: A Berkeley view of cloud computing. Berkeley: EECS
Department, University of California.
Blazek, R. (2004). Introducing the linear reference system in GRASS. In FOSS/GRASS
user conference. Bangkok, Thailand.
Chen, Y., Ganapathi, A., & Katz, R. H. (2010). To compress or not to compress –
Compute vs. IO tradeoffs for mapreduce energy efficiency. In P. Barford, J.
Padhye, & S. Sahu (Eds.), Green networking (pp. 23–28). ACM.
Genolini, C., & Falissard, B. (2010). KmL: K-means for longitudinal data. Computational Statistics, 25, 317–328.
Dhillon, I. S., & Modha, D. S. (2001). Method and system for clustering data in
parallel in a distributed-memory multiprocessor system. Google Patents.
Dhillon, I. S., & Modha, D. S. (2000). A data-clustering algorithm on distributed memory multiprocessors. In Revised papers from large-scale parallel data mining, workshop on large-scale parallel KDD systems, SIGKDD (pp. 245–260). Springer-Verlag.
GE Energy (2014). Smallworld Global Transmission Office.
ESRI (2014). ArcGIS desktop help 9.3: An overview of linear referencing.
Guido, D., & Waldo, T. (1983). Push-pull migration laws. Annals of the Association of
American Geographers, 73, 1–17.
Han, B., Liu, L., & Omiecinski, E. (2012). NEAT: Road network aware trajectory
clustering. In Proceedings of the 2012 IEEE 32nd international conference on
distributed computing systems (pp. 142–151). IEEE Computer Society.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM
Computing Surveys, 31, 264–323.
Jensen, C. S., Lin, D., & Ooi, B. C. (2007). Continuous clustering of moving objects.
IEEE Transactions on Knowledge and Data Engineering, 19, 1161–1174.
Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., & Lewin, D. (1997).
Consistent hashing and random trees: Distributed caching protocols for
relieving hot spots on the World Wide Web. In Proceedings of the twenty-ninth
annual ACM symposium on theory of computing (pp. 654–663). El Paso, Texas,
USA: ACM.
Karger, D., Sherman, A., Berkheimer, A., Bogstad, B., Dhanidina, R., Iwamoto, K., et al.
(1999). Web caching with consistent hashing. In Proceedings of the eighth
international conference on World Wide Web (pp. 1203–1213). Toronto, Canada:
Elsevier North-Holland, Inc.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data,
analytics and the path from insights to value. MIT Sloan Management Review, 52,
11.
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing
with MapReduce: A survey. SIGMOD Record, 40, 11–20.
Lee, J., Winslett, M., Ma, X., & Yu, S. (2002). Enhancing data migration performance via parallel data compression. In Proceedings of the 16th international parallel and distributed processing symposium (p. 142). IEEE Computer Society.
Li, Y., & Chung, S. M. (2007). Parallel bisecting k-means with prediction clustering
algorithm. The Journal of Supercomputing, 39, 19–37.
Li, Z., Ding, B., Han, J., & Kays, R. (2010). Swarm: Mining relaxed temporal moving
object clusters. Proceedings of the VLDB Endowment, 3, 723–734.
Li, Y., Han, J., & Yang, J. (2004). Clustering moving objects. In Proceedings of the tenth
ACM SIGKDD international conference on knowledge discovery and data mining
(pp. 617–622). Seattle, WA, USA: ACM.
Li, Z., Ji, M., Lee, J.-G., Tang, L.-A., Yu, Y., Han, J., et al. (2010). MoveMine: Mining
moving object databases. In Proceedings of the 2010 ACM SIGMOD international
conference on management of data (pp. 1203–1206). Indianapolis, Indiana, USA:
ACM.
Nanni, M., & Pedreschi, D. (2006). Time-focused clustering of trajectories of moving
objects. Journal of Intelligent Information Systems, 27, 267–289.
Noronha, V., & Church, R. L. (2002). Linear referencing and alternate expressions of
location for transportation. Santa Barbara: Vehicle Intelligence & Transportation
Analysis Laboratory University of California.
Ossama, O., Mokhtar, H. M. O., & El-Sharkawi, M. E. (2011). Clustering moving
objects using segments slopes. International Journal of Database Management
Systems (IJDMS), 3, 35–48.
Pelekis, N., Kopanakis, I., Kotsifakos, E., Frentzos, E., & Theodoridis, Y. (2009).
Clustering trajectories of moving objects in an uncertain world. In Proceedings of
the 2009 ninth IEEE international conference on data mining (pp. 417–427). IEEE
Computer Society.
PostGIS (2014). PostGIS 1.5.2 manual.
Tung, A. K. H., Xu, X., & Ooi, B. C. (2005). CURLER: Finding and visualizing nonlinear correlation clusters. In Proceedings of the 2005 ACM SIGMOD international conference on management of data (pp. 467–478). Baltimore, Maryland: ACM.
Welton, B., Kimpe, D., Cope, J., Patrick, C. M., Iskra, K., & Ross, R. (2011). Improving I/O forwarding throughput with data compression. In Proceedings of the 2011 IEEE international conference on cluster computing (pp. 438–445). IEEE Computer Society.
Xue, Z., Shen, G., Li, J., Xu, Q., Zhang, Y., & Shao, J. (2012). Compression-aware I/O
performance analysis for big data clustering. In Proceedings of the 1st
international workshop on big data, streams and heterogeneous source mining:
Algorithms, systems, programming models and applications (pp. 45–52). Beijing,
China: ACM.
Yang, C., Wu, H., Huang, Q., Li, Z., & Li, J. (2011). Using spatial principles to optimize distributed computing for enabling the physical science discoveries. Proceedings of the National Academy of Sciences, 108, 5498–5503.
Yang, C., Sun, M., Liu, K., Huang, Q., Li, Z., Gui, Z., et al. (2013). Contemporary
computing technologies for processing big spatiotemporal data. Space-time
integration in geography and GIScience: Research frontiers in the US and China.
Springer.
Zhao, W., Ma, H., & He, Q. (2009). Parallel K-means clustering based on MapReduce.
In Proceedings of the 1st international conference on cloud computing
(pp. 674–679). Beijing, China: Springer-Verlag.
Zukowski, M., Heman, S., Nes, N., & Boncz, P. (2006). Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd international conference on data engineering (p. 59). IEEE Computer Society.