Computers, Environment and Urban Systems xxx (2015) xxx–xxx
Contents lists available at ScienceDirect
Computers, Environment and Urban Systems
journal homepage: www.elsevier.com/locate/compenvurbsys
An efficient data processing framework for mining the massive trajectory of moving objects

Yuanchun Zhou a,1, Yang Zhang a,1, Yong Ge b, Zhenghua Xue a, Yanjie Fu c, Danhuai Guo a, Jing Shao a, Tiangang Zhu a, Xuezhi Wang a, Jianhui Li a,*

a Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
b Department of Computer Science, University of North Carolina at Charlotte, NC, USA
c MSIS Department, Rutgers, The State University of New Jersey, NJ, USA
ARTICLE INFO
Article history:
Available online xxxx
Keywords:
Big data
Trajectory of moving object
Compression contribution model
Parallel linear referencing
Two-step consistent hashing
ABSTRACT
Recently, there has been increasing development of positioning technology, which enables us to collect large scale trajectory data for moving objects. Efficient processing and analysis of massive trajectory data has thus become an emerging and challenging task for both researchers and practitioners. Therefore, in this paper, we propose an efficient data processing framework for mining massive trajectory data. This framework includes three modules: (1) a data distribution module, (2) a data transformation module, and (3) a high performance I/O module. Specifically, we first design a two-step consistent hashing algorithm, which takes into account load balancing, data locality, and scalability, for the data distribution module. In the data transformation module, we present a parallel strategy for a linear referencing algorithm with reduced subtask coupling, easily implemented parallelization, and low communication cost. Moreover, we propose a compression-aware I/O module to improve the processing efficiency. Finally, we conduct a comprehensive performance evaluation on a synthetic dataset (1.114 TB) and a real world taxi GPS dataset (578 GB). The experimental results demonstrate the advantages of our proposed framework.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
With the increasing development of sensor networks, global
positioning systems, and wireless communication, people have
collected more and more location traces of various moving objects,
such as human beings and cars. These data have also been transmitted in a timely manner through world-wide information networks, which enables people to monitor and track moving objects in a real-time fashion (Dorigo & Tobler, 1983). Considerable amounts of useful and interesting information are hidden in such
large scale spatio-temporal data (LaValle, Lesser, Shockley,
Hopkins, & Kruschwitz, 2011; Yang et al., 2013), which have
attracted much attention from both researchers and industry.
The movement of many objects is restricted to specific routes.
For instance, the movement of trains is restricted to rail, and most
cars can only move along road networks. Such moving objects produce restricted trajectories, among which various moving patterns
* Corresponding author at: Computer Network Information Center, Chinese Academy of Sciences (CNIC, CAS), 4, 4th South Street, Zhongguancun, P.O. Box 349,
Haidian District, Beijing 100190, China. Tel.: +86 010 5881 2518.
E-mail address: [email protected] (J. Li).
1 These authors contributed equally to this work.
of objects are embedded. For example, there are some speed patterns and traffic jam patterns in the trajectories of vehicles.
There are some periodic patterns and association patterns in the
trajectories of people. A large number of data mining works have
been conducted to discover these patterns in a real-time or offline
fashion (Han, Liu, & Omiecinski, 2012; Li, Ding, Han, & Kays, 2010;
Li, Ji, et al., 2010; Nanni & Pedreschi, 2006; Ossama, Mokhtar, & El-Sharkawi, 2011; Pelekis, Kopanakis, Kotsifakos, Frentzos, &
Theodoridis, 2009). However, most data mining tasks with such
large scale restricted trajectories usually face common challenges
such as (1) data storage and processing intensity, (2) time demand,
and (3) coordinate system transformation.
To address these challenges, in this paper, we propose an efficient
data processing framework for mining the massive trajectory data of
restricted moving objects. Our framework consists of three modules: (1) a data distribution module, (2) a data transformation module, and (3) an I/O performance improvement module. Specifically, we first propose a two-step consistent hashing algorithm for the
data distribution module. The main ideas include: (1) all data are
first distributed to multiple nodes; (2) the data of each node are distributed to multiple disks. Unlike traditional consistent hashing
algorithms, we add a load readjusting step to optimize load balancing. Parallelizing serial algorithms is a popular choice for enhancing
data processing efficiency. For instance, cloud computing infrastructures (Agrawal, Das, & Abbadi, 2011; Armbrust et al., 2009; Yang,
Wu, Huang, Li, & Li, 2011) partition computation into many subtasks,
and then concurrently run those subtasks on multiple servers.
Hence, we propose a parallel linear referencing strategy and establish a data transformation module based on the MPI parallel programming framework. Our parallelization strategy reduces the
coupled interactions among subtasks, reduces interaction overhead
and thus significantly improves the performance. Moreover,
because frequent system I/Os also jeopardize the performance of
processing big data, we design a compression-aware method to
improve system I/O performance. Our method quantitatively analyzes the I/O performance contribution of data compression,
automatically decides when to compress data, and intelligently balances compression rate and ratio.
In our experiments, we perform detailed measurements and
analysis, especially for the data distribution module and I/O performance improvement module. To measure the effectiveness on load
balancing by the number of virtual nodes, we execute the proposed
two-step consistent hashing algorithm for 3000, 5000, 7000, and
9000 files. To meet appropriate load balancing and reduce system
overhead, we set the number of virtual nodes as 400. In addition,
we also compare the difference between the inclusion and exclusion of the load readjusting step (see Step 1.5 in Section 4).
Experimental results show that the load readjusting step can
greatly improve the performance of load balancing. To test the
I/O performance improvement module, we take parallel K-means
(Apache, 2014; Jain, Murty, & Flynn, 1999) clustering as the processing task and measure the performance of our proposed framework with 1.114 TB synthetic data and 578 GB taxi GPS data sets
on a 14-server cluster machine. Table 1 presents the results of
three data storage strategies for each iteration: (1) uniform distribution of data on the servers; (2) uniform distribution of data
on high performance storage (Panasas) with a 1 Gb/s network;
and (3) uniform distribution of data on a hadoop distributed file
system (HDFS in short, see Section 7 for configuration details).
We have two observations from Table 1: (1) data locality may
result in higher efficiency than other storage strategies; (2) all
three methods mentioned above perform very poorly in I/O performance because data reading occupies most of the computational
time and thus leaves CPUs idle most of the time. Therefore, it is desirable
to introduce an I/O performance improvement module (Xue
et al., 2012) and relieve dramatic I/O latency. We quantitatively
analyze the impact of a variety of factors of compression, such as
compression ratio, compression rate, and compression algorithm.
The performance improvement model can also effectively determine when and how to use compression to improve the
performance.
The remainder of this paper is organized as follows. In Section 2,
we introduce the modules of our proposed framework. In Section 3,
we provide a survey on the work regarding linear referencing, I/O
performance improvement methods, the K-means clustering algorithm, and the clustering of trajectories of moving objects.
Section 4 introduces a two-step consistent hash algorithm to allocate data on a server cluster with multiple disks for improving I/O
performance. In Section 5, we design a parallel linear referencing
strategy. In Section 6, we establish a mathematical model and analyze the I/O performance improvement aspect of data compression.
Section 7 shows the experimental results of two large scale data
sets. Finally we conclude our work in Section 8.
2. Data processing framework
In this section, we describe the data processing framework. The
proposed framework mainly includes three parts: a data distribution module, a data transformation module, and a compression-aware I/O performance improvement module. We present the data flow of our proposed framework in Fig. 1.

Table 1
Comparison of three data storage strategies.

Methods       Reading duration (s)    Iteration duration (s)
Local Disk    1162                    1516
Panasas       6540                    6854
HDFS          12,426                  26,132
As shown in Fig. 1, arrows stand for data flows between different processes. First, we distribute raw data to different disks of
multiple computing nodes by the two-step consistent hashing
algorithm (refer to Section 4). Then, a linear referencing transformation module converts these data trunks into projected data
in parallel (see Section 5). Later, an I/O performance improvement
module uses compression-aware strategy to perform proper compression. Finally, we conduct a K-means clustering algorithm in
parallel.
Data allocation plays an important role in processing and analyzing large scale trajectory data. To this end, we propose a two-step consistent hashing strategy based on an existing consistent hashing algorithm. This strategy takes into account factors such as data locality, load balancing, parallelism, and monotonicity.
More details are included in Section 4.
Coordinate system transformation is a critical pretreatment
process in mining the trajectory of moving objects. Linear referencing is a classic mechanism of coordinate system transformation, we
therefore design a parallel strategy for linear referencing in a data
transformation module. The proposed parallel algorithm has low
coupling between subtasks, is easy to implement, and leads to
low communication cost. We detail this component in Section 5.
I/O is the main bottleneck of all big data processing tasks. Using
a compression mechanism can effectively reduce the I/O cost of
data processing. In this paper, we exploit the K-means clustering
algorithm to cluster the trajectories of moving objects. Because
the K-means algorithm reads all the data in each iteration, and
often requires multiple iterations before converging, we quantitatively analyze the impact of compression ratio, compression rate,
compression algorithm, and other factors related to compression
on the processing performance. We also quantitatively analyze
when and how to use compression to improve the processing
performance. We will introduce more details in Section 6.
Unlike existing works, the key modules of our data processing
framework are specifically optimized for performance enhancement at all levels. Additionally, the modules of the framework
are loosely coupled and can be independently applied to other
big data analysis scenarios. In addition, this framework has no
dependency on the data to be processed and can be applied to
other similar big data processing applications.
3. Related work
3.1. K-means algorithm
Extensive research on K-means clustering has been conducted in
data mining research (Agarwal & Mustafa, 2004; Aggarwal, Wolf, Yu,
Procopiuc, & Park, 1999; Aggarwal & Yu, 2002; Tung, Xu, & Ooi,
2005). However, when clustering data at the terabyte scale or larger, a serial K-means algorithm often fails. In contrast,
parallel schemes show their advantages. For big data clustering, the
research community has published several parallel K-means algorithms. Dhillon and Modha (2000) proposed a parallel implementation of the K-means algorithm based on the message-passing model and analyzed the algorithm's scalability and speedup. Li
and Chung (2007) proposed a bisecting parallel K-means algorithm, which balanced the load among multiprocessors with a prediction measure. Based on MapReduce, the most popular parallel programming framework, Lee, Lee, Choi, Chung, and Moon (2012) and Zhao, Ma, and He (2009) proposed parallel K-means algorithms and assessed their performance by speed-up, scale-up, and size-up. In this paper, we exploit the parallel K-means algorithm proposed by Dhillon, which has high scalability and efficiency (Dhillon & Modha, 2001).

Fig. 1. Data processing framework.
3.2. Linear referencing algorithm
Linear Referencing (LR) (Noronha & Church, 2002) is the specification of a location by means of a distance measurement along
a sequence of road sections from a known reference point. Linear
referencing is mainly used to manage data related to linear features such as railways, rivers, roads, oil and gas transmission
pipelines, and power transmission lines. Linear referencing is supported by several well-known Geographic Information System packages, such as ArcGIS (ESRI, 2014), GRASS
GIS (Blazek, 2004), PostGIS (PostGIS, 2014), and GE Global
Transmission Office (Energy, 2014). Because there is almost no suitable MPI-based parallel linear referencing algorithm, we propose an innovative parallel implementation of Linear Referencing
based on the MPI parallel programming model.
3.3. Clustering trajectory of moving objects
Due to the challenge of quickly clustering large numbers of moving objects, many successful and scalable methods for the clustering of moving objects have been proposed. Li, Han, and Yang (2004) proposed the concept of a moving micro-cluster to capture some
regularities of moving objects and handle large datasets. This
algorithm can maintain high quality moving micro-clusters and
leads to fast competitive clustering results at any given time.
Ossama et al. (2011) proposed a pattern-based clustering
algorithm which adapts the K-means algorithm for trajectory data
and overcomes the known drawbacks of the K-means algorithm. Li,
Ding, et al. (2010) proposed the concepts of swarm and closed
swarm, which enable the discovery of interesting moving object
clusters with relaxed temporal constraints. The effectiveness and
efficiency are respectively tested using real data and synthetic
data. Han et al. (2012) proposed a road network aware approach,
NEAT, for fast and effective clustering of spatial trajectories of
moving objects. Experimental results show that the NEAT
approach runs orders of magnitude faster than existing density-based trajectory clustering approaches. Nanni and Pedreschi
(2006) proposed a density-based clustering method for moving
object trajectories to discover interesting time intervals, where
(when) the quality of the achieved clustering is optimal. Jensen,
Lin, and Ooi (2007) proposed a fast and effective scheme for the
continuous clustering of moving objects and used the dissimilarity
notion to improve clustering quality and runtime performance.
Then, they proposed a dynamic summary data structure and used
an average-radius function to detect cluster split events. Pelekis
et al. (2009) studied the uncertainty in the trajectory database
and devised the CenTR-I-FCM algorithm for clustering trajectories
under uncertainty. Li, Ji, et al. (2010) designed a system,
MoveMine, for sophisticated moving object data mining.
MoveMine provides a user-friendly interface and flexible tuning
of the underlying methods. It benefits researchers targeting future
studies in moving object data mining. Each of the existing clustering methods has its own advantages and disadvantages. For the
purpose of generality and parallel performance, we implement
the MPI-based parallel K-means clustering method in our experiment. Genolini and Falissard (2010) proposed a new implementation of K-means, named KmL, which is specifically designed for longitudinal data. It provides scope for dealing
with missing values and runs the algorithm several times, varying
the starting conditions and/or the number of clusters sought; its
graphical interface helps the user choose the appropriate number
of clusters when the classic criterion is not efficient. KmL gives
much better results on non-polynomial trajectories.
3.4. Compression based I/O optimization
More and more researchers are using data compression to
improve I/O performance. Chen, Ganapathi, and Katz (2010) developed a decision-making algorithm that helps MapReduce users
identify when and where to use compression and improve energy
efficiency. However, Chen’s algorithm only considered the compression ratio and the frequency of reading. Abadi, Madden, and
Ferreira (2006) extended C-Store (a column-oriented DBMS) with
a compression sub-system and evaluated a set of compression
schemes. Zukowski, Heman, Nes, and Boncz (2006) compared a
set of super-scalar compression algorithms proposed with compression techniques used in commercial databases and showed
that they significantly alleviate the I/O bottleneck. Lee, Winslett,
Ma, and Yu (2002) proposed three methods for parallel compression of scientific data to reduce the data migration cost and analyzed eight scientific data sets. Welton et al. (2011) harnessed
idle CPU resources to compress network data, to reduce the
amount of data transferred over the network and increase effective
network bandwidth. Different from the above methods, we integrate the compression mechanism into an MPI based clustering
computing and quantitatively analyze the impact of multiple factors related to compression on the framework performance.
4. Data distribution algorithm
Data distribution is significant in big data processing. To properly allocate large scale trajectory data, we argue that the following
design goals (Karger et al., 1997, 1999) are desirable:
Locality: Because networks are a precious resource and usually a bottleneck, especially for large scale data, all data blocks of a data file should be allocated on a single server as much as possible. This effectively reduces network traffic when reading a
big data file.
Please cite this article in press as: Zhou, Y., et al. An efficient data processing framework for mining the massive trajectory of moving objects. Computers,
Environment and Urban Systems (2015), http://dx.doi.org/10.1016/j.compenvurbsys.2015.03.004
4
Y. Zhou et al. / Computers, Environment and Urban Systems xxx (2015) xxx–xxx
Load Balancing: Data files should be uniformly distributed
among all servers for load balancing. Otherwise, processing
duration will be delayed by the node associated with the largest
data.
Parallel: In addition to uniformly distributing all files among all servers for simultaneous data processing, it is effective to stripe a big data file across all disks of a server so that it can be read in parallel. This makes use of aggregated disk I/O when reading a big data file.
Monotonic: When some servers fail or more servers are added,
system rebalancing should incur minimal data movement.
Along this line, we propose a two-step consistent hashing algorithm for big data distribution (see Fig. 2). CHA (Consistent Hashing
Algorithm) (Karger et al., 1997, 1999) is widely used in parallel and
distributed systems. The underlying idea is to hash both data
objects and storage servers using the same hash function. It maps
storage servers to an interval which contains multiple data object
hashes. CHA enables load balancing by assigning data objects on
different servers. However, it’s possible to have a non-uniform distribution of data objects among servers if servers are not enough. It
introduces ‘‘virtual nodes’’ to virtually expand servers. By introducing virtual nodes, it achieves more balanced data distribution and
meets our goals on load balancing. If a server is removed, its interval is taken over by a server with an adjacent interval; if a new server joins, data objects from their adjacent server will move into it.
It guarantees the least data moving when server scale changes and
meets our design goal on remaining monotonic. The two-step consistent hash algorithm (using the well-known FNV-1 hash function)
is described as follows:
Step 1: Assigning data files among servers according to a specified hash function.
Step 1.1: Creating virtual nodes: Create multiple virtual node names for each server by simply adding an index to the server's name.
Step 1.2: Mapping virtual nodes: Take each virtual node's name as the input of the hash function and map all virtual nodes into the hash space, which is a circle covering the range 0 to 2^32 - 1.
Step 1.3: Mapping data files: Take the names of data files as the
input of the same hash function and map all data files into the
same hash space circle.
Step 1.4: Assigning data files to virtual nodes: Start from
where a data file is located and head clockwise on the ring until
finding a virtual node. The data file is assigned to this virtual
node.
Step 1.5: Adjusting load: Sort virtual nodes by data amount.
Move data files from the heavy nodes to the lightest node to
make the lightest approach the average load. Repeat the step
in the remaining nodes until the last node.
Step 2: Striping the data files of each server across its multiple disks.
Step 2.1: Creating virtual nodes: Create multiple virtual node names for each disk by simply adding an index to the disk's name.
Step 2.2: Mapping virtual nodes: Take each virtual node's name as the input of the hash function and map all virtual nodes into the hash space, which is a circle covering the range 0 to 2^32 - 1.
Step 2.3: Mapping data blocks: Split data files into a number of
fixed-size blocks. Take the names of data blocks as the input of
the same hash function, then map all data blocks into the same
hash space circle.
Step 2.4: Assigning data blocks to virtual nodes: Start from
where a data block is and head clockwise on the ring until finding
a virtual node. The data block is assigned to this virtual node.
Here, Step 1 assigns data files to servers without splitting files.
This enables reading a file without any network traffic, and it satisfies our design goal of locality. Step 1.5 is an additional step for a
CHA. A CHA can guarantee that each server holds approximately
the same number of files, but that’s not enough. Because data file
sizes vary, each server actually has a different amount of data,
especially when file sizes vary greatly. Step
1.5 adjusts the data amount to make each server approach the
average. Step 2 stripes a file into data blocks and then uniformly
distributes blocks into multiple disks of a server according to a
CHA. Each file is simultaneously read from multi-disks of a server,
which provides parallelism within the scope of a server.
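To make Step 1 concrete, the following is a minimal Python sketch, assuming the 32-bit FNV-1 hash named above. The class and function names are ours, not the authors', and the load readjustment is shown as a single pass rather than the full repeat-until-last-node loop of Step 1.5.

```python
import bisect

FNV_PRIME, FNV_OFFSET = 0x01000193, 0x811C9DC5

def fnv1_32(data: bytes) -> int:
    # 32-bit FNV-1: multiply by the prime, then XOR in each byte.
    h = FNV_OFFSET
    for b in data:
        h = ((h * FNV_PRIME) & 0xFFFFFFFF) ^ b
    return h

class HashRing:
    """Steps 1.1-1.4: map virtual nodes and file names onto the
    0..2**32 - 1 circle and assign each file clockwise."""
    def __init__(self, servers, virtual_nodes=400):
        pairs = sorted((fnv1_32(f"{s}#{i}".encode()), s)
                       for s in servers for i in range(virtual_nodes))
        self._hashes = [h for h, _ in pairs]
        self._servers = [s for _, s in pairs]

    def locate(self, file_name: str) -> str:
        i = bisect.bisect_right(self._hashes, fnv1_32(file_name.encode()))
        return self._servers[i % len(self._servers)]

def readjust_once(assignment, file_sizes):
    """Step 1.5 (one pass): move files from the heaviest server to the
    lightest until the lightest approaches the average load."""
    loads = {s: sum(file_sizes[f] for f in fs) for s, fs in assignment.items()}
    avg = sum(loads.values()) / len(loads)
    light, heavy = min(loads, key=loads.get), max(loads, key=loads.get)
    for f in sorted(assignment[heavy], key=file_sizes.get):
        # Stop once the lightest reaches the average, or once moving the
        # file would push the heaviest node below the average.
        if loads[light] >= avg or loads[heavy] - file_sizes[f] < avg:
            break
        assignment[heavy].remove(f)
        assignment[light].append(f)
        loads[heavy] -= file_sizes[f]
        loads[light] += file_sizes[f]
    return assignment

# Usage: HashRing(["node01", "node02"]).locate("taxi_20120101.dat")
# Step 2 reuses the same ring construction with disk names and block names.
```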
Unlike conventional data distribution mechanisms, such as the
Lustre file system, the proposed algorithm has the following
advantages: (1) there is no need to store metadata, which avoids
the single point of failure and reduces the system overhead; (2)
this method assigns data files to servers without splitting files in
Step 1, which greatly reduces network overhead when reading or
writing data; (3) it makes full use of multi-disk aggregated I/O
bandwidth to improve the performance; (4) Step 1.5 can make
appropriate readjustments for the hashed files among multiple servers, which achieves the approximate balance among servers; (5)
the proposed algorithm is suitable for both large files and small
files. In summary, the proposed two-step consistent hashing algorithm is superior to other traditional algorithms.
5. Parallel linear referencing strategy
To improve the projection performance, we parallelize the linear referencing algorithm based on the MPI parallel programming
framework. The parallel scheme is shown in Fig. 3.
From Fig. 3, we can see that the parallel strategy has low coupling
among subtasks, which contributes to low communication costs
among subtasks and easy implementation. The three circled digits
in this figure indicate a specified order: (1) digit 1 denotes reading
the road network data from local disks; (2) digit 2 denotes building
the R-Tree index system using the road network data; and (3) digit
3 denotes reading trajectory records then performing a linear
referencing operation.
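The three steps in the figure map naturally onto an SPMD program. Below is a schematic Python sketch, not the authors' code: it assumes the mpi4py and rtree packages, and load_road_segments, read_records, linear_reference, and write_projected are hypothetical placeholders for the I/O and projection routines.

```python
from mpi4py import MPI   # MPI bindings for Python
from rtree import index  # R-tree spatial index (libspatialindex)

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

# (1) Every rank reads the (replicated) road network from its local disk,
#     so no rank depends on another for the reference data.
segments = load_road_segments("roads.dat")        # hypothetical loader

# (2) Every rank builds its own in-memory R-tree over the road segments;
#     no communication is needed during the projection phase.
idx = index.Index()
for sid, seg in enumerate(segments):
    idx.insert(sid, seg.bbox)                     # (minx, miny, maxx, maxy)

# (3) Each rank reads only its own shard of trajectory records and
#     performs the linear referencing operation independently.
for rec in read_records(shard=rank, shards=nprocs):   # hypothetical reader
    sid = next(idx.nearest((rec.lon, rec.lat, rec.lon, rec.lat), 1))
    write_projected(linear_reference(rec, segments[sid]))  # hypothetical
```

Because each rank holds its own index and shard, the only coupling between subtasks is the initial replicated read, which is what keeps communication cost low.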
6. Compression-aware I/O improvement module
Fig. 2. Data distribution strategy.
In the compression mechanism, an application does not directly read the raw data from a local disk. Instead, it uses a compression algorithm with a high compression ratio and high compression and decompression speeds, reads the compressed data into memory, and then decompresses it and performs the calculation.
Because a K-means algorithm often iterates many times over the same data set before convergence, the larger the number of iterations, the greater the performance improvement that compression will yield. Although using compression reduces I/O overhead, it introduces CPU overhead for compression and decompression. The improvement of clustering performance by using compression technology is related to multiple factors, such as data size, compression and decompression efficiency, and the number of iterations. It is therefore desirable to design a mathematical model to quantitatively analyze the trade-off. Table 2 lists the parameters used. The parameters are determined by the properties of the dataset and the hardware configuration (see Section 7).
Fig. 3. Parallel linear referencing strategy.
6.1. Task decomposition
In the process of K-means computation, we assign two types of
processes: compression processes and computation processes.
Compression processes entail reading the raw data from local
disks, then compressing the raw data and writing compressed data
back to disks. Computation processes involve reading the compressed data from local disks, and decompressing and calculating
it. All processing of the data is performed in a block-by-block manner, as shown in Fig. 4:
Under parallel conditions, we assume that each compression
process takes Tcomp seconds to address a raw data block with the
size of K. Tcomp consists of three parts:
(1) Tcomp_read: duration of reading a data block with size K.
(2) Tcomp_comp: duration of compressing the data block from the
size of K to the size of K/l.
(3) Tcomp_write: duration of writing compressed data with the size
of K/l back to disks.
Therefore, we can conclude Tcomp = Tcomp_read + Tcomp_comp + Tcomp_write.
Under parallel conditions, each computation process takes Tcalc seconds to address a compressed data block with the size of K/l. Tcalc consists of three parts:
(1) Tcalc_read: duration of reading a compressed data block with
the size of K/l.
(2) Tcalc_uncom: duration of decompressing the compressed data
block from the size of K/l to the size of K.
(3) Tcalc_calc: duration of calculating the uncompressed data with
the size of K.
Therefore, Tcalc = Tcalc_read + Tcalc_uncom + Tcalc_calc.
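Read as throughput models, the two decompositions are straightforward to compute. The sketch below is illustrative only (the paper measures these times rather than modeling them this way): the rate arguments are assumed effective throughputs, with uncom_speed taken over the compressed bytes, which matches the K/comp_coeff term in the later compression-coefficient analysis.

```python
def t_comp(K, l, read_bw, comp_speed, write_bw):
    # Tcomp = Tcomp_read + Tcomp_comp + Tcomp_write:
    # read K, compress K -> K/l, write K/l back to disk.
    return K / read_bw + K / comp_speed + (K / l) / write_bw

def t_calc(K, l, read_bw, uncom_speed, calc_speed):
    # Tcalc = Tcalc_read + Tcalc_uncom + Tcalc_calc:
    # read K/l, decompress K/l -> K, compute over K bytes.
    return (K / l) / read_bw + (K / l) / uncom_speed + K / calc_speed

# Example with the LZO defaults from Table 3 (GB and GB/s); calc_speed
# here is a made-up placeholder, not a measured value:
print(t_calc(K=0.1123, l=2.313, read_bw=0.08969,
             uncom_speed=0.108, calc_speed=0.05))
```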
In general, the K-means algorithm needs to iterate many times
before convergence. Therefore, only in the first iteration is the data
compression part necessary. In the subsequent iterations, it just
uses the compressed data generated in the first iteration. We will
analyze the first iteration and subsequent iterations.
6.2. Analysis for the first iteration
To facilitate our analysis, we draw the timing diagram for compression processes and computation processes according to three
different situations as shown in Fig. 5:
In Fig. 5, Comp, Calc and Idle denote compression, computation
and idle times, respectively. The green, blue and red boxes indicate
the time taken by each compression process to address a raw data
block, the time taken by each computation process to address a compressed data block, and the idle time of the compression/computation processes, respectively.

Table 2
Data and environmental parameters.

Parameters     Definitions
dS             A d-dimensional dataset
d              Dimensionality of dataset dS, 10 by default
g              Number of samples for dS, 15 billion by default
l              Ratio of the size of uncompressed data to the size of compressed data
K (GB)         Size of a raw data block, 0.1123 by default
K/l (GB)       Size of a compressed data block
D (GB)         Size of dataset dS, 1114.875 by default
N              Number of total processes, 336 by default
M              Number of compression processes
avai_band      Disk reading rate in GB/s, 0.08969 by default
We assume that there is a constant C such that Tcomp = C × Tcalc. If and only if this constant satisfies C = M/(N - M), namely Tcomp/Tcalc = M/(N - M), do N - M computational processes need exactly M compression processes to supply the compressed data. At this point, the processing rates of the compression processes and the computation processes are consistent. Only in this situation can the compression process resources and computation process resources be fully utilized.
When C > M/(N - M), as shown in Fig. 5(a), the computation process is faster than the compression process. Computation waits for compression, and the value of M is smaller than that in the optimal situation.

When C < M/(N - M), as shown in Fig. 5(b), the compression process is faster than the computation process. Compression waits for calculation, and M is larger than in the optimal situation.
When compression processes and calculation processes coexist,
we design a synchronization mechanism between compression
processes and calculation processes. When a compression process writes compressed data to disk, the file is named "file.lzo.swp" until all compressed data writing is finished. Then, the compression process changes the file name from "file.lzo.swp" to "file.lzo". The calculation processes continuously check whether the file "file.lzo" exists. If "file.lzo" does not exist, the calculation processes continue to check. When "file.lzo" exists, the calculation processes read it from disk into memory and then perform the calculation.
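A minimal Python sketch of this rename-based handshake follows (the paper gives no code; the function names are ours). The rename is atomic on POSIX file systems, which is what prevents a reader from ever seeing a partially written block.

```python
import os
import time

def publish_compressed(path_base: str, payload: bytes) -> None:
    """Compression side: write under the .swp name, then rename,
    so a reader never observes half-written compressed data."""
    with open(path_base + ".lzo.swp", "wb") as f:
        f.write(payload)
    os.rename(path_base + ".lzo.swp", path_base + ".lzo")

def wait_for_compressed(path_base: str, poll_interval: float = 0.1) -> bytes:
    """Calculation side: poll until the final name appears, then read it."""
    final = path_base + ".lzo"
    while not os.path.exists(final):
        time.sleep(poll_interval)
    with open(final, "rb") as f:
        return f.read()
```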
In the third case, shown in Fig. 5(c), compression and computation are separated in time. First (M = N), there are no computation processes, and all N processes work as compression processes to address all the raw data. Then (M = 0), there are no compression processes, and all N processes work as computation processes.
Fig. 4. Data flow chart.
Fig. 5. Diagram for three situations.
Based on the above analysis, we can obtain the formula for the time T1(M) taken by the first iteration:

$$T_1(M) = \begin{cases} \left\lceil \dfrac{D}{K M} \right\rceil T_{comp} + f\!\left(D \bmod (K M)\right) T_{calc}, & 1 \leq M \leq \dfrac{C}{C+1} N \\[2mm] T_{comp} + \left\lceil \dfrac{D}{K (N-M)} \right\rceil T_{calc}, & \dfrac{C}{C+1} N \leq M \leq N-1 \\[2mm] \left\lceil \dfrac{D}{K M} \right\rceil \left(T_{comp} + T_{calc}\right), & M = N \end{cases} \tag{1}$$

where the function f(x) is:

$$f(x) = \begin{cases} \left\lceil \dfrac{x}{(N-M)\, K} \right\rceil, & x \neq 0 \\[1mm] 1, & x = 0 \end{cases} \tag{2}$$
T1(M) corresponds to the three different situations in Fig. 5(a)-(c). Tcomp and Tcalc depend on the value of M, and the relationship is non-linear, so we perform a polynomial fit of Tcomp and Tcalc versus M. To avoid overfitting, the order of the fitted function is set to three. We can then draw the curve of the function T1(M), as shown in Fig. 6:
In Fig. 6, the parameter D is set to 1141.875; Tcomp and Tcalc are substituted by the fitted functions Tcomp(M) and Tcalc(M); the parameter C is set to Tcomp/Tcalc; and the number of processes N is set to 336. The figure illustrates that:
(1) When a computation process is faster than compression, the
trend of the red curve is relatively flat. The number of compression processes M has little effect on the time T1(M).
(2) When compression is faster than the computation process,
the green curve shows a slowly increasing trend with the
increasing number of compression processes. When M is
close to 280, the curve shows a sharp increasing trend. The
number of compression processes M has a positive effect
on the time T1(M).
(3) When the compression is conducted before the computation, T1(M) will be a constant and is a little larger than the
other two cases.
Based on this analysis, we conclude that the appropriate choice
of M can effectively improve the performance of parallel K-means
application.
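Formula (1) is easier to read as code. The sketch below transcribes the piecewise definition directly; note that in the paper Tcomp and Tcalc themselves vary with M via fitted cubics, so the constant block times passed in here (and the values in the example) are illustrative placeholders only.

```python
import math

def t1(M, N, D, K, t_comp, t_calc):
    """Piecewise T1(M) from Formulas (1)-(2)."""
    if M == N:                      # Fig. 5(c): compress all, then compute
        return math.ceil(D / (K * M)) * (t_comp + t_calc)
    C = t_comp / t_calc
    if M <= C / (C + 1) * N:        # Fig. 5(a): computation waits
        tail = D % (K * M)          # leftover data after full rounds
        f = math.ceil(tail / ((N - M) * K)) if tail else 1
        return math.ceil(D / (K * M)) * t_comp + f * t_calc
    # Fig. 5(b): compression waits for computation
    return t_comp + math.ceil(D / (K * (N - M))) * t_calc

# Sweep M to find the number of compression processes minimizing T1;
# 3.0 and 1.5 are assumed block times, not measured values.
N, D, K = 336, 1141.875, 0.1123
best_M = min(range(1, N + 1), key=lambda M: t1(M, N, D, K, 3.0, 1.5))
print(best_M)
```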
6.3. Analysis for all iterations

6.3.1. Analysis with compression
In the second and subsequent iterations, the execution processes are similar, and it is unnecessary to compress the raw data again. All N processes work as computation processes.
Fig. 6. Time for the first iteration versus the number of compression processes.
Under parallel conditions, each computation process takes Tcalc
seconds to address a compressed data block with the size of K/l.
Tcalc consists of three parts:
(1) Tcalc_read: duration of reading a compressed data block with
the size of K/l.
(2) Tcalc_uncom: duration of decompressing the compressed data
block from K/l to K.
(3) Tcalc_calc: duration of calculating the uncompressed data
block with the size of K.
2
3
D=l 7
Ti ¼ 6
7 T calc
6K 6 l N7
ð3Þ
where S indicates the total number of iterations of parallel K-means
application.
Based on the above analysis, the total time Ttotal_com(S) taken by
the application throughout the whole execution process can be formulated as:
T total
2
3
S
X
D=
l
6 7 T calc
com ðSÞ ¼ T 1 þ
6 K
7
N7
i¼2 6 l
ð4Þ
Under a certain number of compression processes, T 1 indicates the
minimum time taken by the first iteration and can be expressed as:
T 1 ¼ max16M6N ðT 1 ðMÞÞ
ð5Þ
6.3.2. Analysis without compression
When compression is not adopted, each iteration of the application has a similar execution process, and all N processes work as computation processes.

Under parallel conditions, each computation process takes T̄calc seconds to address a raw data block with size K. To avoid ambiguity between the formulas with and without compression, we use T̄calc here instead of Tcalc. T̄calc consists of two parts:

(1) Tcalc_read: duration of reading a raw data block with the size of K.
(2) Tcalc_calc: duration of calculating the raw data block with the size of K.

Therefore, T̄calc = Tcalc_read + Tcalc_calc.

Based on the above analysis for the various iterations, the total time Ttotal(S) taken by the program throughout the whole execution process without compression can be formulated as:

$$T_{total}(S) = \sum_{i=1}^{S} \left\lceil \frac{D}{K N} \right\rceil \bar{T}_{calc} \tag{6}$$

6.4. Compression contribution model

To determine the circumstances under which compression can improve the performance of the application, we perform the subtraction between Ttotal(S) and Ttotal_com(S) as below:

$$T_{total}(S) - T_{total\_com}(S) = S \left\lceil \frac{D}{K N} \right\rceil \bar{T}_{calc} - \tilde{T}_1 - (S-1) \left\lceil \frac{D}{K N} \right\rceil T_{calc} \tag{7}$$

If the compression mechanism is able to improve the performance of the program, the difference between Ttotal(S) and Ttotal_com(S) must satisfy:

$$T_{total}(S) - T_{total\_com}(S) > 0 \tag{8}$$

Therefore, the condition under which the compression mechanism can improve the performance is that the number of iterations S must satisfy the following formula:

$$S \geq \frac{\tilde{T}_1 - \left\lceil \frac{D}{K N} \right\rceil T_{calc}}{\left\lceil \frac{D}{K N} \right\rceil \left(\bar{T}_{calc} - T_{calc}\right)} \tag{9}$$
To facilitate the comparison, we draw the curves of Ttotal(S) and Ttotal_com(S) versus the number of iterations S:
Fig. 7 shows that the application with compression takes more time for the first iteration. With an increasing number of iterations, the application with compression starts to take less time than that without compression, and the gap between the two curves becomes larger. We can conclude that the larger the number of iterations, the
higher the performance improvement the compression will yield.
Especially for those applications with high real-time requirements,
using compression can effectively improve the performance when
dealing with large-scale data and many rounds of iterations.
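Formula (9) gives a direct break-even test. Here is a small sketch of it in Python; the timing values in the example call are assumed placeholders, not measurements from the paper.

```python
import math

def break_even_iterations(T1_min, D, K, N, t_calc_bar, t_calc):
    """Smallest iteration count S satisfying Formula (9), i.e. the point
    where compression starts to pay off. t_calc_bar is the per-block time
    without compression, t_calc the per-block time with compression."""
    blocks = math.ceil(D / (K * N))          # ceil(D / (K N))
    gain_per_iter = blocks * (t_calc_bar - t_calc)
    if gain_per_iter <= 0:
        return None                          # compression never pays off
    return math.ceil((T1_min - blocks * t_calc) / gain_per_iter)

# Illustrative call with assumed timings (seconds per block):
print(break_even_iterations(T1_min=900.0, D=1141.875, K=0.1123, N=336,
                            t_calc_bar=2.0, t_calc=1.2))
```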
7. Experiments and analysis
In this section, we provide an empirical evaluation of the proposed data processing framework on real-world data. Specifically,
we first establish an MPI cluster and conduct experiments to evaluate the three modules of the framework. The MPI cluster consists of
14 computing nodes that are interconnected with an Infiniband
network. Additionally, to evaluate the I/O performance improvement module, we establish a Hadoop cluster for experimental
comparison. The Hadoop cluster consists of 5 computing nodes
interconnected by a Gigabit Ethernet network. Each node in the
Hadoop cluster holds 24 computing cores, 32 GB of memory, and 12 × 1 TB disks. The CPU is an Intel(R) Xeon(R) L5640 @ 2.27 GHz. The replication factor of the Hadoop cluster is set to 3.
7.1. Experiments on data distribution module
Here, we report the performance of our proposed data distribution strategy. We first test the file reading speed. While the
speed is approximately 130 MB/s if the file is read from a single
disk, our module reads a file from 12 disks of a server in parallel
and achieves a higher speed of 860 MB/s. This result validates that
Step 2 effectively improves the I/O performance.
We then measure the effect of the number of virtual nodes.
Specifically, we set different numbers of virtual nodes for each
physical node and apply the consistent hashing algorithm to file sets of four sizes (3000, 5000, 7000, and 9000 files).
Fig. 8 shows the load balance over different numbers of virtual
nodes. The x-axis of Fig. 8 indicates the number of virtual nodes.
The y-axis of Fig. 8 indicates (in log scale) how far the data amount
on servers is spread out from the average. Fig. 8 shows that as the
number of virtual nodes increases, the variance decreases gradually until the number of virtual nodes reaches 400; beyond 400, the variance fluctuates.
We tune the parameter to achieve load balancing and reduce system overhead, and set the number of virtual nodes to 400 in the
following experiments.
Fig. 9 shows the comparison of performances with and without
Step 1.5, where the variance shows how far the data amounts on servers are spread out from the average. The comparison validates that
Step 1.5 helps improve the load balancing as the file number
grows.
In sum, the two-step consistent hashing algorithm improves the data distribution task in all four respects: locality, load balancing, parallelism, and monotonicity.
7.2. Experiments on I/O improvement module
In our experiments, we used an MPI cluster with 336 cores to
run a parallel program for the K-means clustering algorithm that
is implemented by a message-passing interface programming
model. For comparison with the MPI-based implementation of the K-means algorithm, we also implemented a parallel version based on the MapReduce parallel programming framework.
Table 3 lists the initial value of environmental parameters.
These parameters are determined by the properties of the dataset
and the hardware configuration. LZO, LZMA, Gzip and Bzip in
Table 3 indicate four different compression algorithms, respectively, used by the following measurements and analysis. Initial
values of parameters are measured on our computing clusters
and may not be the same as those of others.
7.2.1. Analysis for the first iteration
7.2.1.1. Analysis of first compression then computation (Use LZO only). Fig. 5(c) shows that there is no coexistence of compression
Fig. 8. Comparison of load balancing using various numbers of virtual nodes.
Fig. 9. Comparison of load balancing using CHA with and without Step 1.5.
processes and computation processes and no direct interaction
between them. Therefore, the time Tcomp taken by compression
processes to address a raw data block and the time Tcalc taken by
computation processes to address a compressed data block can
be viewed as fixed values. Then, the total execution time of the
application can be formulated as:
$$T_{total\_com}(S) = \left\lceil \frac{D}{K N} \right\rceil T_{comp} + \sum_{i=1}^{S} \left\lceil \frac{D/l}{(K/l)\, N} \right\rceil T_{calc} \tag{10}$$
To facilitate the analysis, we draw three curves:
(1) Total execution time without compression predicted by
Formula 6, Ttotal(S).
(2) Total execution time with compression predicted by
Formula 10, Ttotal_com(S).
(3) Total execution time with compression by real measured
data.
Table 3
Initial parameters (where avai_band indicates the average disk bandwidth, uncom_speed indicates the speed of decompression, and l indicates the ratio of the size of uncompressed data to the size of compressed data).

Parameters              Default initial value
K                       0.1123 GB
D                       1141.875 GB
N                       336
M                       14-336
avai_band               0.08969 GB/s
uncom_speed for LZO     0.108 GB/s
uncom_speed for LZMA    0.0064 GB/s
uncom_speed for Gzip    0.0053 GB/s
uncom_speed for Bzip    0.0038 GB/s
l for the LZO           2.313
l for the LZMA          5.004
l for the Gzip          4.107
l for the Bzip          5.646

Fig. 10 shows that the application with compression takes a longer time for the first and second iterations. As the number of iterations increases, the application with compression starts to take less time than that without compression, and the growing gap between the blue, green, and red curves shows that the performance improvement by compression becomes more obvious as the number of iterations increases.

Fig. 10. Time with and without compression.
7.2.1.2. Analysis of comparative speeds between compression and
computation (Use LZO only). Fig. 5(a) and (b) show that there are
concurrent compression processes and computation processes. To
facilitate comparative analysis of the first iteration, we draw curves
when the compression is faster than the computation, when the
computation is faster than the compression, and the corresponding
real measured data versus the number of compression processes.
Fig. 11 shows that, when the number of compression processes
is less than 280, the pink curve is more consistent with the red
curve than the green one. When the number of compression processes is greater than 280, the pink curve is more consistent with
the green curve than the red one. After processing the measured
data, we found that when the number of compression processes
is less than 280, the measured data for Tcomp and Tcalc satisfy the following inequality, which means that the computation process is
faster than compression.
$$\frac{T_{comp}}{T_{calc}} > \frac{M}{N - M} \tag{11}$$
When the number of compression processes is greater than 280,
the measured data for Tcomp and Tcalc satisfy the following inequality, which means that compression is faster than the computation.
$$\frac{T_{comp}}{T_{calc}} < \frac{M}{N - M} \tag{12}$$
Based on the above discussions, we can conclude that the T1(M)
function is consistent with the measured data and provides an
accurate description of the relationship between the execution
time of the first iteration and the number of compression processes. T1(M) can also provide a strong criterion to set an appropriate number of compression processes to minimize execution time
of the first iteration.
7.2.2. Analysis for all iterations
7.2.2.1. Analysis of the number of iterations (Use LZO only). We draw
predicted curves in Fig. 12.
As Fig. 12 shows, curves without compression are higher than
those with compression for the first iteration. For the second and
subsequent iterations, the application does not need to compress
data. Then, the curves with compression flatten and lie lower
than those without compression. As the number of iterations
increases, the gap between curves with compression and those
without compression becomes larger. The performance improvement will be more significant as the number of iterations increases.
Fig. 11. Time for the first iteration.
7.2.2.2. Analysis of the compression ratio. Empirically, the higher the compression ratio, the lower the compression speed. We assume that the compression ratio l and the decompression speed uncom_speed satisfy l = comp_coeff/uncom_speed, where comp_coeff is a coefficient related to the specific compression algorithm. The formula for the total execution time of the application can then be transformed to:

$$T_{total\_com}(S) = \tilde{T}_1 + \sum_{i=2}^{S} \left\lceil \frac{D}{K N} \right\rceil \left( \frac{K}{l} \cdot \frac{1}{avai\_band} + \frac{K}{comp\_coeff} + T_{calc\_calc} \right) \tag{13}$$
When the application iterates multiple times, to facilitate the analysis, the performance improvement of the first iteration by compression is negligible, and T̃1 can be viewed as a fixed value. Based on the above assumptions and the formula, we can see that compression only influences Tcalc_read and Tcalc_uncom. To intuitively analyze the influence of the compression ratio on the total execution time, we draw the contour figure for the sum of Tcalc_read and Tcalc_uncom versus the compression ratio and the compression coefficient comp_coeff, as shown in Fig. 13. The value of each curve corresponds to the sum of Tcalc_read and Tcalc_uncom, the x-axis indicates the compression ratio defined in Table 2, and the y-axis indicates the compression coefficient.

Fig. 12. Time versus the number of iterations.

Fig. 13. Comparison of the duration taken by each compression algorithm.
Fig. 13 illustrates that as the compression ratio increases, the time taken to read the compressed data and to decompress it decreases steadily. In addition, we also performed a comparative
analysis on the effects of the four common compression algorithms
on the time Tcalc_read + Tcalc_uncom. The curves in red asterisks, in
green inverted triangles, in purple boxes, and in blue circles indicate the time taken by the LZO, LZMA, Gzip, and Bzip algorithms,
respectively. Among these compression algorithms, the LZMA algorithm outperforms the others.
7.2.2.3. Analysis of the number of processes (Use LZO only). In Fig. 14,
we can see that the total execution time shows a steady decrease
as the number of processes increases. In particular, when the number of processes is small, the performance improvement is significant. Once the number of processes exceeds a certain value, the performance improvement is no longer very clear. In addition,
compression greatly reduces the execution time taken by the
application. When the number of processes is small, compression
does not improve the performance of the program. As the number
of processes increases, the gap between curves with the compression and those without the compression becomes larger and larger.
7.2.2.4. Analysis of the data size (Use LZO only). Fig. 15 shows that
the two predicted time curves are approximately proportional
to the data size. As the amount of data increases, the gap between
the curves with compression and those without compression
becomes larger and larger. When the amount of data approaches 1000 GB, the time taken by the MPI application with compression
is approximately only half of that without compression.
However, the measured time taken by MapReduce application
with compression is not reduced significantly.
In this experiment, the MPI parallel programming and
MapReduce parallel programming models both integrate the LZO
compression algorithm to improve performance. However, the difference of performance between MPI and MapReduce is very
obvious. We think there are two possible reasons: (1) the parallel
K-means algorithm needs to perform a reducing operation for all
computing subtasks. MPI performs the reducing operation by inter-process communication, whereas MapReduce performs it by writing and reading data on disk. For a
data interexchange among computing subtasks with a fixed
Fig. 14. Time versus the number of total processes.
amount of data, the cost of disk I/O is much higher than the cost
of network I/O. Therefore, an MPI parallel programming model is
more suitable for parallel K-means implementation. (2)
Compared to MPI, MapReduce has additional costs of high reliability, data replication, and a fault-tolerance mechanism. For this reason, MPI parallel programming models can provide more efficient
performance for the iterative K-means algorithm.
When the amount of data is less than 400 GB, without compression, the measured curve of the MPI application differs from the predicted result due to memory caching. When the memory can hold all the data, the second
and subsequent iterations no longer need the disk I/O operations.
Therefore, the measured curve by MPI is significantly lower than
the predicted one. Once the amount of data grows beyond what the memory can hold, the memory no longer contributes to the performance improvement between two neighboring iterations.
Therefore, an effective use of memory can greatly improve the
performance of the application. Actually, we can adopt a memory
trick in the parallel K-means algorithm to improve its performance.
Suppose that the order of data blocks accessed in the ith iteration
is: 1, 2, . . . , P, then the order of data blocks accessed in the (i + 1)th
iteration can be set to: P, . . . , 2, 1. In this way, the (i + 1)th iteration
can use the left data of the ith iteration in memory and thereby
reduce the I/O overhead to enhance the overall performance of
the program. Experimental results show that this trick can achieve
a 30% reduction in disk I/O overhead.
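A sketch of this access-order trick follows (the function name is ours): alternating the scan direction lets iteration i + 1 start with the blocks that iteration i left in memory.

```python
def access_order(num_blocks: int, iteration: int):
    """Alternate the block scan direction between iterations so that the
    tail of iteration i (still resident in the page cache) becomes the
    head of iteration i + 1, saving disk reads."""
    order = list(range(1, num_blocks + 1))
    return order if iteration % 2 == 0 else order[::-1]

# iteration 0: [1, 2, ..., P]; iteration 1: [P, ..., 2, 1]; and so on.
```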
7.3. Experiments on clustering of trajectory

In this section, we perform K-means clustering on trajectory data. The trajectory data used in this experiment were collected from 23,876 taxis in a city. For each taxi, the GPS system samples a location record every 30 s. Each record includes the following attributes: license plate number, the current time, whether there is a taxi passenger, current taxi speed (km/h), taxi driving orientation, and taxi location by longitude and latitude. The trajectory data of all the taxis for each day are stored as a file, which consists of approximately 50 million records. The trajectory data for the whole year consist of approximately 18 billion records, which occupy approximately 578 GB of disk space.

For the longitude and latitude of each record, we perform a linear referencing projection operation, after which the original longitude and latitude coordinates are transformed into projected linear coordinates. Each linear coordinate includes: projected longitude, projected latitude, and the distance between a milepost and the projected point. In the clustering stage, the data to be clustered include 8 attributes: road ID, longitude and latitude of the projection point, distance between the milepost and the projection point, whether there is a passenger in the taxi, taxi speed (km/h), taxi driving direction, and driving time (hours).

To improve the clustering performance, we adopt the compression-aware I/O strategy and integrate the LZO compression algorithm into an MPI-based parallel K-means algorithm. We then compare the experimental results between the taxi GPS data (see Table 4) and the synthetic data (see Table 5). Here, Calc, Redu, I/O, and Total indicate the calculation cost, communication (reduction) cost, I/O cost, and total cost in each iteration, respectively.
Table 4 shows that the use of LZO compression increases the calculation cost by 47.88%, but LZO contributes an overall 58.16% performance improvement. Because the size of the compressed taxi GPS data is less than the memory size of the computing nodes, the I/O cost is reduced by 99.99%. However, since the data are cached in memory at each iteration, this I/O improvement cannot be attributed to compression alone. To avoid the impact of the cache, we use the synthetic data, whose compressed size is greater than the memory size. We then perform the clustering task, and the results are listed in Table 5.
Table 5 shows that the use of LZO compression increases the
calculation cost by 21.29%, but LZO brings a 71.83% reduction to
I/O cost. At each iteration of the parallel K-means clustering of trajectories of moving objects, the use of the LZO compression algorithm still contributes a 57.86% cost reduction. The compression-aware I/O performance improvement module thus improves the I/O performance and the overall performance of the framework significantly.
8. Conclusions
In this paper, we present a novel framework for efficient processing of trajectory data. Our proposed framework consists of
three modules: (1) a big data distribution module based on a
two-step consistent hashing algorithm, (2) a data transformation
module based on a parallel linear referencing strategy, and (3) a
compression-aware I/O performance improvement module. We
take a K-means clustering algorithm as an example, and conduct
extensive empirical studies with large scale synthetic data and
real-world GPS data. The experimental results show that our two-step consistent hashing algorithm achieves locality, load balancing, parallelism, and monotonicity while improving the performance significantly. The proposed parallel linear referencing strategy has low coupling and low communication costs among subtasks and can be implemented
easily, and the compression-aware performance improvement
model is capable of providing effective decision support on how
to use compression to improve I/O performance.
For future work, we will extend our compression model to take
network transmission into consideration because data interchange
is sometimes inevitable. We are also interested in applying the
Fig. 15. Time versus the data size.
Table 4
Duration (D) and Duration Trend (DT) with/without LZO compression in one iteration (taxi GPS data).

          Without LZO compression                With LZO compression
          Calc     Redu     I/O      Total       Calc     Redu     I/O      Total
D (s)     114.94   226.39   479.56   820.89      169.97   163.08   0.06     343.45
DT (%)    0.0      0.0      0.0      0.0         +47.88   -27.97   -99.99   -58.16
Table 5
Duration (D) and Duration Trend (DT) with/without LZO compression in one iteration (synthetic data).

          Without LZO compression                With LZO compression
          Calc     Redu     I/O      Total       Calc     Redu     I/O      Total
D (s)     182.20   50.26    1088.85  1321.31     220.99   29.04    306.72   556.75
DT (%)    0.0      0.0      0.0      0.0         +21.29   -42.22   -71.83   -57.86
Acknowledgements
This work is supported by the Chinese ‘‘Twelfth Five-Year’’ Plan for Science & Technology Support under Grant Nos. 2012BAK17B01 and 2013BAD15B02, the Natural Science Foundation of China (NSFC) under Grant Nos. 91224006, 61003138 and 41371386, the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant Nos. XDA06010202 and XDA05050601, and the joint project of Foshan and the Chinese Academy of Sciences under Grant No. 2012YS23.
References
Abadi, D., Madden, S., & Ferreira, M. (2006). Integrating compression and execution
in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD
international conference on management of data (pp. 671–682). Chicago, IL, USA:
ACM.
Agarwal, P. K., & Mustafa, N. H. (2004). K-means projective clustering. In Proceedings
of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of
database systems (pp. 155–165). Paris, France: ACM.
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms
for projected clustering. SIGMOD Record, 28, 61–72.
Aggarwal, C. C., & Yu, P. S. (2002). Redefining clustering for high-dimensional
applications. IEEE Transactions on Knowledge and Data Engineering, 14, 210–225.
Agrawal, D., Das, S., & El Abbadi, A. (2011). Big data and cloud computing: Current state and future opportunities. In Proceedings of the 14th international conference on extending database technology (pp. 530–533). Uppsala, Sweden: ACM.
Apache (2014). Apache Mahout: Scalable machine learning and data mining.
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., et al.
(2009). Above the clouds: A Berkeley view of cloud computing. Berkeley: EECS
Department, University of California.
Blazek, R. (2004). Introducing the linear reference system in GRASS. In FOSS/GRASS
user conference. Bangkok, Thailand.
Chen, Y., Ganapathi, A., & Katz, R. H. (2010). To compress or not to compress –
Compute vs. IO tradeoffs for mapreduce energy efficiency. In P. Barford, J.
Padhye, & S. Sahu (Eds.), Green networking (pp. 23–28). ACM.
Genolini, C., & Falissard, B. (2010). KmL: K-means for longitudinal data. Computational Statistics, 25, 317–328.
Dhillon, I. S., & Modha, D. S. (2001). Method and system for clustering data in
parallel in a distributed-memory multiprocessor system. Google Patents.
Dhillon, I. S., & Modha, D. S. (2000). A data-clustering algorithm on distributed memory multiprocessors. In Revised papers from large-scale parallel data mining, workshop on large-scale parallel KDD systems, SIGKDD (pp. 245–260). Springer-Verlag.
GE Energy (2014). Smallworld Global Transmission Office.
ESRI (2014). ArcGIS desktop help 9.3: An overview of linear referencing.
Guido, D., & Waldo, T. (1983). Push-pull migration laws. Annals of the Association of
American Geographers, 73, 1–17.
Han, B., Liu, L., & Omiecinski, E. (2012). NEAT: Road network aware trajectory
clustering. In Proceedings of the 2012 IEEE 32nd international conference on
distributed computing systems (pp. 142–151). IEEE Computer Society.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM
Computing Surveys, 31, 264–323.
Jensen, C. S., Lin, D., & Ooi, B. C. (2007). Continuous clustering of moving objects.
IEEE Transactions on Knowledge and Data Engineering, 19, 1161–1174.
Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., & Lewin, D. (1997).
Consistent hashing and random trees: Distributed caching protocols for
relieving hot spots on the World Wide Web. In Proceedings of the twenty-ninth
annual ACM symposium on theory of computing (pp. 654–663). El Paso, Texas,
USA: ACM.
Karger, D., Sherman, A., Berkheimer, A., Bogstad, B., Dhanidina, R., Iwamoto, K., et al.
(1999). Web caching with consistent hashing. In Proceedings of the eighth
international conference on World Wide Web (pp. 1203–1213). Toronto, Canada:
Elsevier North-Holland, Inc.
LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data,
analytics and the path from insights to value. MIT Sloan Management Review, 52,
11.
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., & Moon, B. (2012). Parallel data processing
with MapReduce: A survey. SIGMOD Record, 40, 11–20.
Lee, J., Winslett, M., Ma, X., & Yu, S. (2002). Enhancing data migration performance via parallel data compression. In Proceedings of the 16th international parallel and distributed processing symposium (p. 142). IEEE Computer Society.
Li, Y., & Chung, S. M. (2007). Parallel bisecting k-means with prediction clustering
algorithm. The Journal of Supercomputing, 39, 19–37.
Li, Z., Ding, B., Han, J., & Kays, R. (2010). Swarm: Mining relaxed temporal moving
object clusters. Proceedings of the VLDB Endowment, 3, 723–734.
Li, Y., Han, J., & Yang, J. (2004). Clustering moving objects. In Proceedings of the tenth
ACM SIGKDD international conference on knowledge discovery and data mining
(pp. 617–622). Seattle, WA, USA: ACM.
Li, Z., Ji, M., Lee, J.-G., Tang, L.-A., Yu, Y., Han, J., et al. (2010). MoveMine: Mining
moving object databases. In Proceedings of the 2010 ACM SIGMOD international
conference on management of data (pp. 1203–1206). Indianapolis, Indiana, USA:
ACM.
Nanni, M., & Pedreschi, D. (2006). Time-focused clustering of trajectories of moving
objects. Journal of Intelligent Information Systems, 27, 267–289.
Noronha, V., & Church, R. L. (2002). Linear referencing and alternate expressions of
location for transportation. Santa Barbara: Vehicle Intelligence & Transportation
Analysis Laboratory University of California.
Ossama, O., Mokhtar, H. M. O., & El-Sharkawi, M. E. (2011). Clustering moving
objects using segments slopes. International Journal of Database Management
Systems (IJDMS), 3, 35–48.
Pelekis, N., Kopanakis, I., Kotsifakos, E., Frentzos, E., & Theodoridis, Y. (2009).
Clustering trajectories of moving objects in an uncertain world. In Proceedings of
the 2009 ninth IEEE international conference on data mining (pp. 417–427). IEEE
Computer Society.
PostGIS (2014). PostGIS 1.5.2 manual.
Tung, A. K. H., Xu, X., & Ooi, B. C. (2005). CURLER: Finding and visualizing nonlinear correlation clusters. In Proceedings of the 2005 ACM SIGMOD international conference on management of data (pp. 467–478). Baltimore, Maryland: ACM.
Welton, B., Kimpe, D., Cope, J., Patrick, C. M., Iskra, K., & Ross, R. (2011). Improving I/O forwarding throughput with data compression. In Proceedings of the 2011 IEEE international conference on cluster computing (pp. 438–445). IEEE Computer Society.
Xue, Z., Shen, G., Li, J., Xu, Q., Zhang, Y., & Shao, J. (2012). Compression-aware I/O
performance analysis for big data clustering. In Proceedings of the 1st
international workshop on big data, streams and heterogeneous source mining:
Algorithms, systems, programming models and applications (pp. 45–52). Beijing,
China: ACM.
Yang, C., Wu, H., Huang, Q., Li, Z., & Li, J. (2011). Using spatial principles to optimize distributed computing for enabling the physical science discoveries. Proceedings of the National Academy of Sciences, 108, 5498–5503.
Yang, C., Sun, M., Liu, K., Huang, Q., Li, Z., Gui, Z., et al. (2013). Contemporary
computing technologies for processing big spatiotemporal data. Space-time
integration in geography and GIScience: Research frontiers in the US and China.
Springer.
Zhao, W., Ma, H., & He, Q. (2009). Parallel K-means clustering based on MapReduce.
In Proceedings of the 1st international conference on cloud computing
(pp. 674–679). Beijing, China: Springer-Verlag.
Zukowski, M., Heman, S., Nes, N., & Boncz, P. (2006). Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd international conference on data engineering (p. 59). IEEE Computer Society.