CCBD 2016
The 7th International Conference on Cloud Computing and Big Data
Book of Program
November 16-18, 2016, Macau, China

Contents
Conference Committee
Conference Schedule
Keynotes
Oral Presentation Sessions
Poster Session
Macau Big Data Public Forum
Abstract
Index (Keynote Speakers; Paper Session)

Conference Committee

Honorary Chairs
Wei Zhao, University of Macau (Macau, China)
Wen Gao, National Science Foundation of China (China)
Hong Mei, Chinese Academy of Sciences (China)

General Chairs
Lionel Ni, University of Macau (Macau, China)
Geoffrey Charles Fox, Indiana University (America)
Benjamin W. Wah, The Chinese University of Hong Kong (Hong Kong, China)
Tei-Wei Kuo, National Taiwan University (Taiwan)

Program Chairs
Yuan Yan Tang, University of Macau (Macau, China)
Ernesto Damiani, University of Milan (Italy)
Chengzhong Xu, Wayne State University (America)
Chunming Hu, Beihang University (China)

Local Organization Chairs
Chi Man Pun, University of Macau (Macau, China)
Jian Tao Zhou, University of Macau (Macau, China)
Long Chen, University of Macau
Yibo Zhang, University of Macau

Publicity Chairs
Macau, China

Publication Chairs
Macau, China

Finance and Registration Chairs
Leong Hou U, University of Macau (Macau, China)

Steering Committee
C. L. Philip Chen, University of Macau (Macau, China)
Runhua Lin, Chinese Institute of Electronics (China)
Guangnan Ni, Academician, Chinese Academy of Engineering (China)
Rulin Liu, Chinese Institute of Electronics (China)
Ke Liu, National Science Foundation of China (China)

Technical Committee
Ameer Al-Nemrat, University of East London (United Kingdom)
Bahman Javadi, University of Western Sydney (Australia)
Bing Tang, Hunan University of Science and Technology (China)
Chao Yin, Jiujiang University (China)
Chen Xu, East China Normal University (China)
Dakai Zhu, University of Texas at San Antonio (America)
Deo Prakash Vidyarthi, Jawaharlal Nehru University (India)
Feida Zhu, Singapore Management University (Singapore)
George Pallis, University of Cyprus (Cyprus)
Hao Chen, Nankai University (China)
Haofen Wang, East China University of Science and Technology (China)
Hung-Chang Hsiao, National Cheng Kung University (Taiwan)
Jingwei Zhang, East China Normal University (China)
Jinoh Kim, Texas A&M University-Commerce (America)
John Yearwood, Deakin University (Australia)
Jyh-Haw Yeh, Boise State University (America)
Kai Chen, Shanghai Jiaotong University (China)
Katarzyna Wac, University of Geneva (Switzerland)
Kejun Dong, Chinese Academy of Sciences (China)
Li Xu, Fujian Normal University (China)
Prasad Kulkarni, University of Kansas (America)
Seetharami Seelam, IBM Research (America)
Serban Maerean, IBM System & Technology Group (America)
Tom Hacker, Purdue University (America)
Weigang Li, University of Brasilia (Brazil)
Xiaojun Hei, Huazhong University of Science and Technology (China)
Xijin Tang, CAS Academy of Mathematics & Systems Science (China)
Xinyi Huang, University of Wollongong (Australia)
Yanming Shen, Dalian University of Technology (China)
Yi Wang, Tsinghua University (China)
Ying Yan, Microsoft Research Asia (China)
Yuan Ding, Google (America)
Yuwei Peng, Wuhan University (China)
Zhipeng Gao, Beijing University of Posts and Telecommunications (China)
Abdelhalim Amer, Argonne National Laboratory (America)
Ahmad Afsahi, Queen's University (United Kingdom)
Gagan Agrawal, The Ohio State University (America)
Haiying Shen, Clemson University (America)
Krishna Kant, Temple University (America)
Michela Taufer, University of Delaware (America)
Rong Ge, Marquette University (America)
Sangmin Seo, Argonne National Laboratory (America)
Toni Cortes, Barcelona Supercomputing Center (Spain)
Xin Yuan, Florida State University (America)
Yunquan Zhang, Chinese Academy of Sciences (China)
Zhiling Lan, Illinois Institute of Technology (America)
Andrzej Goscinski, Deakin University (Australia)
Ching-Hsien Hsu, Chung Hua University (Taiwan)
Ekow Otoo, University of Witwatersrand (South Africa)
Farokh Bastani, UT Dallas (America)
Hairulnizam Mahdin, UTHM (Malaysia)
Jerry Gao, San Jose State University (America)
Jianxin Li, Beihang University (China)
Jicheng Fu, University of Central Oklahoma (America)
M. Shamim Hossain, King Saud University (Saudi Arabia)
Omer Rana, Cardiff University (United Kingdom)
Paul Townend, University of Leeds (United Kingdom)
Ruppa Thulasiram, University of Manitoba (Canada)
San-Yih Hwang, National Sun Yat-sen University (Taiwan)
Wei He, Shandong University (China)
Yinong Chen, Arizona State University (America)
Yun Yang, Swinburne University of Technology (Australia)
Zhonghang Xia, Western Kentucky University (America)
Fu-Hau Hsu, National Central University (Taiwan)
Geoffrey Charles Fox, Indiana University (America)
Shakeel Ahmad, De Montfort University (United Kingdom)
Danilo Ardagna, Politecnico di Milano (Italy)
Xiaoying Bai, Tsinghua University (China)
Jim Buchan, Auckland University of Technology (New Zealand)
Jian Cao, Shanghai Jiaotong University (China)
Keke Chen, Wright State University (America)
Yixiang Chen, East China Normal University (China)
Wanchun Dou, Nanjing University (China)
Schahram Dustdar, Vienna University of Technology (Austria)
Yanbo Han, North China University of Technology (China)
Qing He, Chinese Academy of Sciences (China)
Yuan He, Tsinghua University (China)
Robert C.H. Hsu, Chung Hua University (Taiwan)
Dijiang Huang, Arizona State University (America)
Hai Jin, Huazhong University of Science and Technology (China)
Xiaoyuan Jing, Wuhan University (China)
Jan Jurjens, Technical University of Dortmund (Germany)
Dong Seong Kim, University of Canterbury (New Zealand)
Luigi Lo Iacono, Cologne University of Applied Sciences (Germany)
Jinhu Lu, Chinese Academy of Sciences (China)
Graham Morgan, University of Newcastle upon Tyne (United Kingdom)
Neeli Prasad, Aalborg University (Denmark)
Omer F. Rana, Cardiff University (United Kingdom)
Chao Peng, East China Normal University (China)
Michael Resch, University of Stuttgart (Germany)
Qinbao Song, Xi'an Jiaotong University (China)
C. Chiu Tan, Temple University (America)
Weiqin Tong, Shanghai University (China)
Wei-Tek Tsai, Arizona State University (America)
Andy Ju An Wang, Southern Polytechnic State University (America)
Qing Wang, Chinese Academy of Sciences (China)
Zhengping Wu, University of Bridgeport (America)
Haiyong Xie, Yale University (America)
Shengwu Xiong, Wuhan University of Technology (China)
He Zhang, Nanjing University (China)
Xiaolong Zhang, Wuhan University of Science and Technology (China)
Hong Zhu, Oxford Brookes University (United Kingdom)
Yang Xiang, Deakin University (Australia)
Li Zhang, Tsinghua University (China)
Ziran Zhao, Tsinghua University (China)
Jianping Gu, Nuctech Company Limited (China)

Conference Schedule
CCBD 2016, November 16-18, Macau, China

16 Nov 2016
8:30 Opening
9:00-10:00 Keynote 1: Carlo Ghezzi [1]
10:00-10:30 Coffee break
10:30-12:10 Presentation Session 1: Knowledge Discovery & Data Engineering in Cloud Computing and Big Data - Part I (6 talks in E4-G062)
13:20-15:00 Presentation Session 2: Knowledge Discovery & Data Engineering in Cloud Computing and Big Data - Part II (4 talks in E4-G062)
15:00-15:30 Coffee break
15:30-17:10 Presentation Session 3: Software Engineering, Tools & Services for Cloud Computing and Big Data - Part I (5 talks in E4-G062)
Moving to N1
17:30-19:00 Poster session (32 posters in N1)
19:00 Banquet in Fortune Inn (N1)

17 Nov 2016
9:00-10:00 Keynote 2: Xin Yao [2]
10:00-10:30 Coffee break
10:30-12:10 Presentation Session 4: Software Engineering, Tools & Services for Cloud Computing and Big Data - Part II (6 talks in E4-G062)
13:20-14:40 Presentation Session 5: Architecture & Foundation of Cloud Computing and Big Data (5 talks in E4-G062)
Afternoon: Old Macau (S8 - 1st Floor 1001); Historic Centre of Macau

18 Nov 2016
9:00-10:00 Keynote 3: Ziran Zhao [3]
10:00-10:30 Coffee break
10:30-12:10 Presentation Session 6: Business Models and Applications for Cloud Computing and Big Data (6 talks in E4-G062)
13:20-15:00 Presentation Session 7: Security, Privacy, Trust & Quality of Cloud Computing and Big Data (5 talks in E4-G062)
Short break
16:00-18:00 Macau Big Data Public Forum (E4-G078)
Cocktails at Tromba Rija*, Macau Tower

Notes: The conference is in E4-G062. Length of presentation: 14 minutes; Q&A: 1 minute; setup: 1 minute.
*After the cocktail there will be two shuttle buses: (1) Holiday Inn @ Cotai Central and (2) the Postgraduate Guest House @ University of Macau.

Keynotes

Keynote 1: Tolerating Uncertainty via Evolvable-by-Design Software
Room: E4-G062, 9:00-10:00, Wednesday, 16 Nov. 2016
Carlo Ghezzi, Politecnico di Milano

Abstract: Uncertainty is ubiquitous when software is designed. Requirements are often uncertain and volatile. Assumptions about the behavior of the environment in which the software will be embedded are also often uncertain. The virtual platform on which the software will operate may likewise be subject to uncertain operating conditions. Design-time uncertainty is resolved during operation, and often the way it is resolved changes over time. This leads to the need for software to evolve continuously, to keep guaranteeing satisfaction of its quality goals. Evolution can partly be self-managed by adding self-adaptive capabilities to the software. This requires a careful upfront analysis to understand what the sources of uncertainty are, how they can be resolved during operation, and how they can be managed through dynamic reconfigurations.
Whenever self-adaptation cannot solve the problems, designers must be in the loop to provide new solutions that can be dynamically incorporated into the running system. The talk provides a holistic view of how to handle uncertainty, based on the notion of perpetual development and adaptation. It shows that existing approaches to software development need to be rethought to respond to these challenges: the traditional separation between development and operation (design time and run time) blurs and even fades. The talk especially focuses on modeling and verification, which need to be rethought in the light of perpetual development and evolution. It also focuses on achieving self-adaptation to support continuous satisfaction of non-functional requirements, such as reliability, performance, and energy consumption, in the context of virtualized environments (cloud computing, service-oriented computing).

Biography: Carlo Ghezzi is an ACM Fellow (1999), an IEEE Fellow (2005), and a member of the European Academy of Sciences and of the Italian Academy of Sciences. He received the ACM SIGSOFT Outstanding Research Award (2015) and Distinguished Service Award (2006). He is the current President of Informatics Europe. He is a regular member of the program committees of flagship conferences in the software engineering field, such as ICSE and ESEC/FSE, for which he has also served as Program and General Chair. He has been Editor-in-Chief of ACM Transactions on Software Engineering and Methodology and an associate editor of IEEE Transactions on Software Engineering. Currently he is an associate editor of Communications of the ACM and Science of Computer Programming. Ghezzi's research has mostly focused on different aspects of software engineering. He has co-authored over 200 papers and 8 books and has coordinated several national and international research projects. He has been the recipient of an ERC Advanced Grant.
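The self-adaptation the talk argues for (monitor the running system, detect when a quality goal is violated, reconfigure dynamically) can be caricatured in a few lines. Everything below, the toy "service", the failure rates, and the 0.9 reliability goal, is invented for illustration; this is a minimal sketch of the control-loop idea, not the speaker's framework.

```python
import random

# Toy monitor-analyze-plan-execute loop: the primary configuration degrades
# at run time; the controller watches observed reliability over a sliding
# window and switches to a fallback configuration when the goal is violated.

RELIABILITY_GOAL = 0.9
WINDOW = 20

rng = random.Random(0)
fail_rates = {"primary": 0.4, "fallback": 0.05}  # primary has degraded at run time
config, outcomes = "primary", []

for _ in range(200):
    # Monitor: record whether the current request succeeded.
    outcomes.append(rng.random() >= fail_rates[config])
    recent = outcomes[-WINDOW:]
    # Analyze + Plan: does observed reliability violate the goal?
    if len(recent) == WINDOW and sum(recent) / WINDOW < RELIABILITY_GOAL:
        config = "fallback"  # Execute: reconfigure to the safer variant
```

The point of the sketch is the blurring of design time and run time: the reconfiguration policy is designed upfront, but the decision to apply it is driven entirely by observations made during operation.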
Keynote 2: From Ensemble Learning to Learning in the Model Space
Room: E4-G062, 9:00-10:00, Thursday, 17 Nov. 2016
Xin Yao, Southern University of Science and Technology of China

Abstract: Ensemble learning has been shown to be very effective in solving many challenging regression and classification problems. Multi-objective learning offers not only a novel method to construct and learn ensembles automatically, but also better ways to balance accuracy and diversity in an ensemble. This talk introduces the basic ideas behind multi-objective learning. It describes how ensembles can be used in mining data streams from the point of view of online learning. In particular, the importance of diversity in online learning is demonstrated. Finally, a novel approach to data stream mining is presented, learning in the model space, which can handle very challenging data streams. The effectiveness of such an approach is illustrated by concrete examples in cognitive fault diagnosis.

Biography: Xin Yao is a Chair Professor of Computer Science at the Southern University of Science and Technology in Shenzhen, China. He is an IEEE Fellow and was the President (2014-15) of the IEEE Computational Intelligence Society (CIS). His work won the 2001 IEEE Donald G. Fink Prize Paper Award, the 2010 IEEE Transactions on Evolutionary Computation Outstanding Paper Award, the 2010 BT Gordon Radley Award for Best Author of Innovation (Finalist), the 2011 and 2015 IEEE Transactions on Neural Networks Outstanding Paper Awards, and many other best paper awards. He won the prestigious Royal Society Wolfson Research Merit Award in 2012 and the IEEE CIS Evolutionary Computation Pioneer Award in 2013. He was the Editor-in-Chief (2003-08) of IEEE Transactions on Evolutionary Computation and is an associate editor or editorial board member of more than ten other journals. His major research interests include evolutionary computation, ensemble learning, and their applications, especially in software engineering.
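Why diversity matters in an ensemble can be shown with a toy simulation (invented for illustration, not the speaker's method): eleven independent voters that are each only 70% accurate are, under majority vote, far more accurate than any single voter, precisely because their errors are diverse rather than correlated.

```python
import random

# Majority vote over independent 70%-accurate members vs. a single member.
rng = random.Random(42)
N_MEMBERS, N_CASES, P_CORRECT = 11, 2000, 0.7

single_hits, ensemble_hits = 0, 0
for _ in range(N_CASES):
    # True = that member classifies this case correctly (errors independent).
    votes = [rng.random() < P_CORRECT for _ in range(N_MEMBERS)]
    single_hits += votes[0]
    ensemble_hits += sum(votes) > N_MEMBERS // 2  # majority vote is correct

single_acc = single_hits / N_CASES      # around 0.70
ensemble_acc = ensemble_hits / N_CASES  # well above the individual accuracy
```

If the members all made the same mistakes (zero diversity), the majority vote would be no better than a single member; the gap in this simulation is entirely a product of error independence.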
Keynote 3: Human Millimeter-wave Holographic Imaging and Automatic Target Recognition
Room: E4-G062, 9:00-10:00, Friday, 18 Nov. 2016
Ziran Zhao, Deputy Director of the Institute for Security Detection Technology, Tsinghua University

Abstract: Millimeter-wave (MMW) holographic imaging is one of the most effective methods for human inspection because it can acquire three-dimensional images of the human body in a single scan. Owing to millimeter waves' high penetration through fabrics and contrast in reflectivity, contraband such as guns and explosives on the human body can easily be distinguished in MMW images. Moreover, millimeter waves are non-ionizing radiation and pose no potential health threat. Our imaging system uses a linear antenna array to improve scanning speed; image reconstruction is performed via the Fast Fourier Transform (FFT) and a spatial spherical-wave expansion. A linear antenna array, however, introduces artifacts into the reconstructed images, and system errors and background scattering can also degrade MMW images. We propose a set of calibration and denoising methods to eliminate these influences, and our experiments indicate that these methods markedly improve image quality. Automatic Target Recognition (ATR) on millimeter-wave holographic images is a key step toward meeting the requirements of intelligent devices. Object detection methods designed for color images are not very effective on MMW images of the human body. We therefore propose a synthetic object detection method for MMW images based on machine learning. Since, according to previous work, both multi-layer models and sparse coding can improve recognition accuracy, we select saliency, SIFT, and HOG features to describe MMW images and build a two-layer model to encode these features. The encoded features are fed to a linear SVM for target/non-target classification.
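The last stage of this pipeline, a linear SVM over encoded feature vectors, can be sketched as follows. The hand-rolled subgradient (Pegasos-style) trainer and the synthetic two-dimensional "features" are illustrative stand-ins for the two-layer saliency/SIFT/HOG encodings described above, not the authors' implementation.

```python
import random

def train_linear_svm(xs, ys, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient training of a linear SVM (no bias term)."""
    rng = random.Random(seed)
    w, t = [0.0] * len(xs[0]), 0
    for _ in range(epochs):
        for i in rng.sample(range(len(xs)), len(xs)):  # shuffled pass
            t += 1
            eta = 1.0 / (lam * t)
            margin = ys[i] * sum(wj * xj for wj, xj in zip(w, xs[i]))
            w = [(1 - eta * lam) * wj for wj in w]      # regularization shrink
            if margin < 1:                              # hinge-loss subgradient
                w = [wj + eta * ys[i] * xj for wj, xj in zip(w, xs[i])]
    return w

# Synthetic "encoded features": targets cluster near (2, 2), clutter near (-2, -2).
rng = random.Random(1)
xs = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(100)] + \
     [[rng.gauss(-2, 1), rng.gauss(-2, 1)] for _ in range(100)]
ys = [1] * 100 + [-1] * 100                             # target / non-target
w = train_linear_svm(xs, ys)
acc = sum((sum(wj * xj for wj, xj in zip(w, x)) > 0) == (y > 0)
          for x, y in zip(xs, ys)) / len(xs)
```

In practice one would use a tuned off-the-shelf linear SVM; the sketch only shows how the target/non-target decision reduces to the sign of a dot product with the learned weight vector.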
As the amount of training data contributes to the performance of an SVM classifier, we built a training set of over 30,000 human-body MMW images, generated from 3,154 original images via several image augmentation techniques. The experimental results show that training-set augmentation improves the overall target detection rate in MMW images from 70% to 85%, which demonstrates the effectiveness of our method.

Biography: Dr. Zhao Ziran received his B.S. and Ph.D. from Tsinghua University in 1998 and 2004, respectively. In 1994, he joined Tsinghua University and became an associate professor in 2008. He received a joint appointment as executive deputy director of the Institute for Security Detection Technology in 2012. Dr. Zhao's research interest lies generally in the area of imaging and detection technology. In particular, he works on applying new image reconstruction algorithms to a wide range of applications, including millimeter-wave imaging, terahertz imaging, radiation imaging, and cosmic-ray muon tomography. He has been devoted to solving scientific and technical problems in security detection technology and to providing high-tech equipment for anti-terrorism. He won the National Patent Gold Award in 2009, and he is a key member of the Tsinghua University radiation imaging innovative research team, which won the National Science and Technology Progress Award (Innovative Research Team) in 2013.

Oral Presentation Sessions
Each presentation has 16 minutes: 1 minute setup + 14 minutes oral presentation + 1 minute Q&A.

Presentation Session 1: Knowledge Discovery & Data Engineering in Cloud Computing and Big Data - Part I
10:30-12:10, Wednesday, 16 Nov. 2016, Room: E4-G062, Chair: Jingzhi Guo
10:30-10:46 [5] Yi Tan: Multi-view Clustering via Co-regularized Nonnegative Matrix Factorization with Correlation Constraint
10:46-11:02 [13] Anyong Qin: Minimum Description Length Principle Based Atomic Norm for Synthetic Low-rank Matrix Recovery
11:02-11:18 [16] Huapeng Yu: Transfer Learning for Face Identification with Deep Face Model
11:18-11:34 [22] Luyan Xiao: When Taxi Meets Bus: Night Bus Stop Planning over Large-scale Traffic Data
11:34-11:50 [23] Li Zhang: Large-scale Classification of Cargo Images Using Ensemble of Exemplar-SVMs
11:50-12:06 [24] Manhua Jiang: Characterizing On-Bus WiFi Passenger Behaviors by Approximate Search and Cluster Analysis

Presentation Session 2: Knowledge Discovery & Data Engineering in Cloud Computing and Big Data - Part II
13:20-15:00, Wednesday, 16 Nov. 2016, Room: E4-G062, Chair: Ryan U
13:20-13:36 [28] Chang Lu: Data Mining Applied to Oil Well Using K-means and DBSCAN
13:36-13:52 [39] Qingquan Lai: Using Weighted SVM for Identifying User from Gait with Smart Phone
13:52-14:08 [43] Yunpeng Shen: Learning the Distribution of Data for Embedding
14:08-14:24 [56] Zhenyu Liao: Event Detection on Online Videos using Crowdsourced Time-Sync Comment

Presentation Session 3: Software Engineering, Tools & Services for Cloud Computing and Big Data - Part I
15:30-17:10, Wednesday, 16 Nov. 2016, Room: E4-G062, Chair: Bob Zhang
15:30-15:46 [11] Zhigang Xu: A VM Scheduling Strategy Based on Hierarchy and Load for OpenStack
15:46-16:02 [14] Xing Liu: Hitchhike: An I/O Scheduler Enabling Writeback for Small Synchronous Writes
16:02-16:18 [22] Xutian Zhuang: Queries over Large-scale Incremental Data of Hybrid Granularities
16:18-16:34 [36] Xichun Yue: An Optimized Approach to Protect Virtual Machine Image Integrity in Cloud Computing
16:34-16:50 [41] Chan-Fu Kuo: On Construction of an Energy Monitoring Service Using Big Data Technology for Smart Campus

Presentation Session 4: Software Engineering, Tools & Services for Cloud Computing and Big Data - Part II
10:30-12:10, Thursday, 17 Nov. 2016, Room: E4-G062, Chair: Zhiguo Gong
10:30-10:46 [51] Jou-Fan Chen: Financial Time-series Data Analysis using Deep Convolutional Neural Networks
10:46-11:02 [53] Bo Li: Performance Comparison and Analysis of Yarn's Schedulers with Stress Cases
11:02-11:18 [59] Shaohuai Shi: Benchmarking State-of-the-Art Deep Learning Software Tools
11:18-11:34 [62] Enqing Tang: Performance Comparison between Five NoSQL Databases
11:34-11:50 [69] Manyi Cai: A Protocol for Extending Analytics Capability of SQL Database
11:50-12:06 [48] Yu-Fu Chen: Binary Classification and Data Analysis for Modeling Calendar Anomalies in Financial Markets

Presentation Session 5: Architecture & Foundation of Cloud Computing and Big Data
13:20-14:40, Thursday, 17 Nov. 2016, Room: E4-G062, Chair: Jiantao Zhou
13:20-13:36 [8] Li Zhang: Low Complexity WSSOR-based Linear Precoding for Massive MIMO Systems
13:36-13:52 [25] Ni Luo: Affinity Propagation Clustering for Intelligent Portfolio Diversification and Investment Risk Reduction
13:52-14:08 [33] Binyang Li: IMFSSC: An In-Memory Distributed File System Framework for Super Computing
14:08-14:24 [55] Tsz Fai Chow: Utilizing Real-Time Travel Information, Mobile Applications and Wearable Devices for Smart Public Transportation
14:24-14:40 [50] Chin Chou: Performance Modeling for Spark Using SVM

Presentation Session 6: Business Models and Applications for Cloud Computing and Big Data
10:30-12:10, Friday, 18 Nov. 2016, Room: E4-G062, Chair: Szu-Hao Huang
10:30-10:46 [49] Szu-Hao Huang: Decision Support System for Real-Time Trading based on On-Line Learning and Parallel Computing Techniques
10:46-11:02 [64] Xiaoxue Hu: Efficient Power Allocation under Global Power Cap and Application-Level Power Budget
11:02-11:18 [66] Mei-Chen Wu: Trend Behavior Research by Pattern Analysis in Financial Big Data - A Case Study of Taiwan Index Futures Market
11:18-11:34 [67] Yu-Hsiang Hsu: Applying Market Profile Theory to Analyze Financial Big Data and Discover Financial Market Trading Behavior - A Case Study of Taiwan Futures Market
11:34-11:50 [70] Simon Fong: Competitive Intelligence Study on Macau Food and Beverage Industry
11:50-12:06 [71] Guoshuai Zhao: Finding Optimal Meteorological Observation Locations by Multi-Source Urban Big Data Analysis

Presentation Session 7: Security, Privacy, Trust & Quality of Cloud Computing and Big Data
13:20-15:00, Friday, 18 Nov. 2016, Room: E4-G062, Chair: Pattarasinee Bhattarakosol
13:20-13:36 [17] Siping Shi: Design and Implementation of A Role-Based Access Control for Categorized Resource in Smart Community Systems
13:36-13:52 [34] Weidian Zhan: A Secure and VM-Supervising VDI System Based on OpenStack
13:52-14:08 [60] Tipaporn Juengchareonpoon: A Mobile Cloud System for Enhancing Multimedia File Transfer with IP Protection
14:08-14:24 [61] Wenhan Zhu: Distinguish True or False 4K Resolution using Frequency Domain Analysis and Free-Energy Modelling
14:24-14:40 [68] Lin Yang: Protecting Link Privacy for Large Correlated Social Networks

Poster Session
Room: N1, 17:30-19:00, Wednesday, 16 Nov. 2016
[4] Energy Saving of Elevator Group Under Up-peak Flow Based on Geese-PSO - Chunzhi Wang, Hubei University of Technology
[6] An ACO-based Link Load-Balancing Algorithm in SDN - Chunzhi Wang, Hubei University of Technology
[7] A Cost-effective Approach of Building Multi-tenant Oriented Lightweight Virtual HPC Cluster - Rongzhen Li, National University of Defense Technology
[9] Multi-view Latent Space Learning based on Local Discriminant Embedding - Xinge You, Huazhong University of Science and Technology
[10] Improving Government-Data Learning via Distributed Clustering Analysis - Yurong Zhong, Institute of Electronic Science and Technology
[12] Hadoop-MapReduce Job Scheduling Algorithms Survey - Ehab Mohamed, Beijing University of Aeronautics and Astronautics
[15] A New Template Update Scheme for Visual Tracking - Xiaohuan Lu, Harbin Institute of Technology Shenzhen Graduate School
[18] A Robust Appearance Model for Object Tracking - Yi Li, Harbin Institute of Technology Shenzhen Graduate School
[19] GA-Based Sweep Coverage Scheme in WSN - Peng Huang, Sichuan Agricultural University
[20] A Short Text Similarity Algorithm for Finding Similar Police 110 Incidents - Lei Duan, Beijing University of Aeronautics and Astronautics
[26] An Improved K-means Text Clustering Algorithm by Optimizing Initial Cluster Centers - Caiquan Xiong, Hubei University of Technology
[27] The Implementation of Air Pollution Monitoring Service Using Hybrid Database Converter - Jia-Yow Weng, Tunghai University
[29] Super Resolution Reconstruction of Brain MR Image based on Convolution Sparse Network - Chang Liu, Chengdu University
[30] Evacuation Behaviors and Link Selection Strategy based on Artificial Fish Swarm Algorithm - Xinlu Zong, Hubei University of Technology
[31] A Synthetic Targets Detection Method for Human Millimeter-wave Holographic Imaging System - Li Zheng, Nuctech Company Limited
[32] An Efficient Distributed Clustering Protocol Based on Game-Theory for Wireless Sensor Networks - Xuegang Wu, Chongqing University
[35] Performance Evaluation for Distributed Join Based on MapReduce - Jingwei Zhang, Guilin University of Electronic Technology
[37] Breaking the Top-k Restriction of the kNN Hidden Databases - Zhiguo Gong, University of Macau
[38] Online Fake Drug Detection System in Heterogeneous Platforms using Big Data Analysis - Yubin Zhao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
[40] Blood Pressure Monitoring on the Cloud System in Elderly Community Centres: A Data Capturing Platform for Application Research in Public Health - Kelvin Tsoi, Chinese University of Hong Kong
[42] A Smart Cloud Robotic System based on Cloud Computing Services - Lujia Wang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
[44] On Blind Quality Assessment of JPEG Images - Guangtao Zhai, Shanghai Jiao Tong University
[45] Research on The Application of Distributed Self-adaptive Task Allocation Mechanism in Distribution Automation System - Haitian Li, North China Electric Power University
[46] Noise-Robust SLIC Superpixel for Natural Images - Jiantao Zhou, University of Macau
[47] Big Data Analysis on Radiographic Image Quality - Jianping Gu, Nuctech Company Limited
[52] A Practical Model for Analyzing Push-based Virtual Machine Live Migration - Cho-Chin Lin, National Ilan University
[54] Classification of Parkinson's Disease and Essential Tremor Based on Structural MRI - Li Zhang, Chengdu University
[57] Synthetic Data Generator for Classification Rules Learning - Runzong Liu, Chongqing University
[58] A Flash Light System for Individuals with Visual Impairment Based on TPVM - Wenbin Fang, Shanghai Jiao Tong University
[63] Collective Extraction for Opinion Targets and Opinion Words from Online Reviews - Xiangxiang Jiang, Guilin University of Electronic Technology
[65] An Adaptive Tone Mapping Algorithm Based on Gaussian Filter - Chang Liu, Chongqing University
[72] Research on Algorithm of PSO in Image Segmentation of Cement-Based - Xiaojie Deng, Hubei University of Technology

Macao Big Data Public Forum 2016 (澳門大數據公開論壇 2016)

Big Data is an emerging field where innovative technology offers alternatives to resolve the inherent problems that appear when working with huge amounts of data, providing new ways to reuse and extract value from information. The 1st Macao Big Data Public Forum will be held on November 18, 2016 at the University of Macau, jointly with CCBD 2016. The forum will discuss the current situation of Big Data and its potential future development.

Date: November 18, 2016, 16:00-18:00
Venue: E4-G078
Chair: Prof. Lionel Ni, Vice Rector of the University of Macau
Speakers: Prof. Qiang Yang, Hong Kong University of Science and Technology; Dr. Yu Zheng, Microsoft Research; Prof. Lei Chen, Hong Kong University of Science and Technology

Agenda
15:30-16:00 Guests arrival & registration
16:00-16:10 Welcome speech by Prof. Lionel Ni
16:10-16:40 Keynote 1: Prof. Qiang Yang, Hong Kong University of Science and Technology
16:40-17:10 Keynote 2: Dr. Yu Zheng, Microsoft Research
17:10-17:40 Keynote 3: Prof. Lei Chen, Hong Kong University of Science and Technology
17:40-17:55 Joint Q&A session
17:55-18:00 Closing by Prof. Lionel Ni

Official Hotel: Holiday Inn Macao Cotai Central
Organizers: University of Macau; Macao Convention & Exhibition Association

Abstract

[1] Tolerating Uncertainty via Evolvable-by-Design Software
Carlo Ghezzi, Politecnico di Milano

Abstract: Uncertainty is ubiquitous when software is designed. Requirements are often uncertain and volatile. Assumptions about the behavior of the environment in which the software will be embedded are also often uncertain. The virtual platform on which the software will operate may likewise be subject to uncertain operating conditions. Design-time uncertainty is resolved during operation, and often the way it is resolved changes over time. This leads to the need for software to evolve continuously, to keep guaranteeing satisfaction of its quality goals. Evolution can partly be self-managed by adding self-adaptive capabilities to the software. This requires a careful upfront analysis to understand what the sources of uncertainty are, how they can be resolved during operation, and how they can be managed through dynamic reconfigurations. Whenever self-adaptation cannot solve the problems, designers must be in the loop to provide new solutions that can be dynamically incorporated into the running system. The talk provides a holistic view of how to handle uncertainty, based on the notion of perpetual development and adaptation. It shows that existing approaches to software development need to be rethought to respond to these challenges: the traditional separation between development and operation (design time and run time) blurs and even fades. The talk especially focuses on modeling and verification, which need to be rethought in the light of perpetual development and evolution.
It also focuses on achieving self-adaptation to support continuous satisfaction of non-functional requirements, such as reliability, performance, and energy consumption, in the context of virtualized environments (cloud computing, service-oriented computing).

[2] From Ensemble Learning to Learning in the Model Space
Xin Yao, Southern University of Science and Technology

Abstract: Ensemble learning has been shown to be very effective in solving many challenging regression and classification problems. Multi-objective learning offers not only a novel method to construct and learn ensembles automatically, but also better ways to balance accuracy and diversity in an ensemble. This talk introduces the basic ideas behind multi-objective learning. It describes how ensembles can be used in mining data streams from the point of view of online learning. In particular, the importance of diversity in online learning is demonstrated. Finally, a novel approach to data stream mining is presented, learning in the model space, which can handle very challenging data streams. The effectiveness of such an approach is illustrated by concrete examples in cognitive fault diagnosis.

[3] Human Millimeter-wave Holographic Imaging and Automatic Target Recognition
Ziran Zhao, Deputy Director of the Institute for Security Detection Technology, Tsinghua University

Abstract: Millimeter-wave (MMW) holographic imaging is one of the most effective methods for human inspection because it can acquire three-dimensional images of the human body in a single scan. Owing to millimeter waves' high penetration through fabrics and contrast in reflectivity, contraband such as guns and explosives on the human body can easily be distinguished in MMW images. Moreover, millimeter waves are non-ionizing radiation and pose no potential health threat. Our imaging system uses a linear antenna array to improve scanning speed; image reconstruction is performed via the Fast Fourier Transform (FFT) and a spatial spherical-wave expansion.
A linear antenna array, however, introduces artifacts into the reconstructed images, and system errors and background scattering can also degrade MMW images. We propose a set of calibration and denoising methods to eliminate these influences, and our experiments indicate that these methods markedly improve image quality. Automatic Target Recognition (ATR) on millimeter-wave holographic images is a key step toward meeting the requirements of intelligent devices. Object detection methods designed for color images are not very effective on MMW images of the human body. We therefore propose a synthetic object detection method for MMW images based on machine learning. Since, according to previous work, both multi-layer models and sparse coding can improve recognition accuracy, we select saliency, SIFT, and HOG features to describe MMW images and build a two-layer model to encode these features. The encoded features are fed to a linear SVM for target/non-target classification. As the amount of training data contributes to the performance of an SVM classifier, we built a training set of over 30,000 human-body MMW images, generated from 3,154 original images via several image augmentation techniques. The experimental results show that training-set augmentation improves the overall target detection rate in MMW images from 70% to 85%, which demonstrates the effectiveness of our method.

[4] Energy Saving of Elevator Group Under Up-peak Flow Based on Geese-PSO
Chunzhi Wang, Hubei University of Technology

Abstract: Vertical elevators are commonly used in high-rise buildings. An Elevator Group Control System (EGCS) dispatches elevator cars to meet passengers' calls on different floors. Optimizing an EGCS aims at improving its transport capacity and service quality, which is a typical combinatorial optimization problem.
Particle Swarm Optimization (PSO) is good at solving combinatorial optimization problems, but it easily falls into local optima. In this paper, inspired by the flight characteristics of goose flocks, we propose an improved PSO algorithm named Geese-PSO. A novel coding method offers a natural way to meet our requirements. Finally, to realize energy saving in elevator group optimization, we derive the energy-saving function and the time-cost function, build the elevator group control model, and give the optimization scheme. Simulation results demonstrate the effectiveness of the approach. [5] Multi-view Clustering via Co-regularized Nonnegative Matrix Factorization with Correlation Constraint Yi Tan, Guizhou Normal University Abstract: With the increasing availability of multi-view nonnegative data in practical applications, multi-view learning based on nonnegative matrix factorization (NMF) has attracted more and more attention. However, previous works either struggle to generate meaningful clustering results when views are of heterogeneous quality, or are sensitive to noise. To address these problems, we propose a co-regularized nonnegative matrix factorization method with a correlation constraint (CO-NMFCC) for multi-view clustering, which jointly exploits both consistent and complementary information across multiple views. Unlike previous works, we aim at integrating information from multiple views efficiently while remaining robust to the presence of noisy views. More specifically, a correlation constraint is imposed on the low-dimensional space to learn a common representation shared by multiple views. Meanwhile, we exploit the complementary information of multiple views through co-regularization to accommodate imbalances in view quality. Experiments on two real datasets demonstrate that CO-NMFCC is an effective and promising algorithm for practical applications.
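As a point of reference for the PSO-based approach of [4], the canonical particle update that variants such as Geese-PSO modify can be sketched as follows. This is a minimal continuous-space illustration with illustrative parameter values; the goose-flight modifications and the elevator coding scheme from the paper are not reproduced here.

```python
import random

def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Minimize f over [-10, 10]^dim with a basic particle swarm."""
    pos = [[random.uniform(-10, 10) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # each particle's personal best
    gbest = min(pbest, key=f)                # swarm-wide best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # inertia + cognitive pull (pbest) + social pull (gbest)
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = min(pbest + [gbest], key=f)
    return gbest

best = pso(lambda x: sum(v * v for v in x), dim=3)
```

The "local optima" weakness the abstract mentions stems from the social term pulling all particles toward one `gbest`; Geese-PSO's flock-inspired changes target exactly that attraction structure.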
[6] An ACO-based Link Load-Balancing Algorithm in SDN Chunzhi Wang, Hubei University of Technology Abstract: Software Defined Networking (SDN) is a novel network architecture that separates the data and control planes via OpenFlow. Its centralized control enables the acquisition and allocation of global network resources, so link load balancing in SDN is not as difficult as in traditional networks. This paper proposes a link load-balancing algorithm based on Ant Colony Optimization (LLBACO). The algorithm uses the ACO search rule and takes link load, delay, and packet loss as the factors influencing an ant's choice of next node. To maintain link load balance and reduce end-to-end transmission delay, the ants find the widest and shortest path among all paths. Simulation results show that, compared with existing algorithms, LLBACO can balance the link load of the network effectively, improve the Quality of Service (QoS), and decrease network overhead. [7] A Cost-effective Approach to Building a Multi-tenant Oriented Lightweight Virtual HPC Cluster Rongzhen Li, National University of Defense Technology Abstract: HPC is considered increasingly important, but only a small set of large enterprises and governments have the capability to use this high-performance approach, partly because of software dependency problems that rigidly restrict the usage of HPC applications. Based on a Fat-Tree network topology and a virtual HPC cluster model, this paper provides a cloud-based HPC delivery model and resolves the dependencies of the HPC application software stack without destroying the initial HPC environments. Extensive experiments were conducted, and the results validate the feasibility and efficiency of our approach.
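The probabilistic next-node rule used by ACO-style algorithms such as LLBACO in [6] can be sketched as follows. The classic rule weights each candidate link by pheromone and heuristic desirability; folding link load, delay, and packet loss into the `heuristic` value is an illustrative assumption, not the paper's exact formula.

```python
import random

def choose_next_node(current, candidates, pheromone, heuristic, alpha=1.0, beta=2.0):
    """Pick the next hop with probability ~ tau^alpha * eta^beta (classic ACO rule).

    pheromone[(u, v)]: pheromone deposited on link u->v
    heuristic[(u, v)]: desirability of link u->v, e.g. 1 / (load + delay + loss)
    """
    weights = [(pheromone[(current, v)] ** alpha) * (heuristic[(current, v)] ** beta)
               for v in candidates]
    total = sum(weights)
    r = random.uniform(0, total)     # roulette-wheel selection over the weights
    acc = 0.0
    for v, w in zip(candidates, weights):
        acc += w
        if acc >= r:
            return v
    return candidates[-1]
```

An ant repeatedly applies this rule hop by hop until it reaches the destination; links on good paths then receive extra pheromone, biasing later ants toward wide, lightly loaded routes.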
[8] Low-Complexity WSSOR-based Linear Precoding for Massive MIMO Systems Li Zhang, Anhui University Abstract: In a massive MIMO system, where the base station has hundreds of antennas and serves many users, regularized zero-forcing (RZF) precoding can achieve high performance but suffers from high complexity due to the required large-scale matrix inversion. To solve this problem, we propose a precoding scheme based on the weighted symmetric successive over-relaxation (WSSOR) method to approximate the matrix inversion. The proposed method reduces the computational complexity by about one order of magnitude while approaching the performance of RZF precoding. We also propose a simple way to choose the optimal relaxation parameter in massive MIMO systems, and the chosen weighting factor depends only on the system configuration parameters. Simulation results show that the proposed WSSOR-based precoding approaches the near-optimal performance of RZF precoding within a small number of iterations. [9] Multi-view Latent Space Learning Based on Local Discriminant Embedding Xinge You, Huazhong University of Science and Technology Abstract: In many computer vision systems, one object can be described by different features or extracted from different sources. These varying features or sources usually exhibit heterogeneous properties and can be referred to as multi-view data of the object. An individual view usually contains information about one particular aspect and cannot describe the problem completely, whereas multi-view data can contain complete and complementary information about the problem. This gives rise to the need to combine the information of multi-view data to better describe the problem, as well as to discover the connections and differences between multiple views. The complementary principle and the consensus principle are two important principles for effective multi-view learning algorithms.
When views capture information that is unique but not complete enough to give uniform learning performance, these views may degrade the learning performance, so simply concatenating multiple views into a single view is not an ideal solution. In this paper, we propose a multi-view latent space learning algorithm which assumes that the multiple views are generated from the same latent space via distinct transformations. Under this assumption, our algorithm performs well even when views are incomplete, and the learned space contains the valuable information of each view as well as the underlying connections between the views. Owing to the local discriminant embedding of the input space, this multi-view latent space is well suited to classification and recognition problems. The proposed algorithm is evaluated on two tasks: indoor scene classification on MIT Scene 67 and abnormal object classification on the Abnormal Objects database. Extensive experiments show that the proposed algorithm achieves notable improvements compared with many other outstanding methods. [10] Improving Government-Data Learning via Distributed Clustering Analysis Yurong Zhong, Institute of Electronic Science and Technology Abstract: Clustering analysis is a study of great value, and the amount of large-scale government data that needs to be handled by cluster analysis is growing steadily. Efficient analysis techniques must be adopted to handle such large-scale data. The traditional serial programming model has serious scalability shortcomings and cannot satisfy the computing and storage demands of large-scale government-data processing. Distributed computing technology, represented by MapReduce, has good scalability, can greatly improve the execution efficiency of data-intensive algorithms, and can exploit the computing power of clusters built from commodity hardware.
Against the background of a "data platform for public petitions", this work studies how to combine cluster analysis technology with current massive government data, extracting useful information from the characteristics hidden in the data so as to provide comprehensive analyses for system managers and decision makers. This paper focuses on combining a basic distributed clustering algorithm with the TF-IDF algorithm and develops a case-feature analysis module based on distributed clustering. Based on the information of the cases, it clusters them according to their characteristics and then extracts hidden information from the clustering results. [11] A VM Scheduling Strategy Based on Hierarchy and Load for OpenStack Zhigang Xu, Beijing University of Aeronautics and Astronautics Abstract: In the cloud computing environment, one of the most important modules is the scheduler. As the most popular open-source cloud platform, OpenStack provides a large number of scheduling strategies, but none of them considers the hierarchies of VMs and hosts, through which we can guarantee VM security, and none is based on the network load of the host. This paper proposes a scheduling strategy based on hierarchies and load. We define service levels and security levels for VMs and hosts, and then filter out the hosts whose levels do not match those of the VM. Each of the remaining hosts is assigned a weight according to its overall load, including CPU, memory, disk, and network. In the end, the host with the highest weight is selected to create the VM. We build a prototype system on OpenStack to demonstrate our design and test our solution.
The experiments show that VMs are created on appropriate hosts. [12] Hadoop-MapReduce Job Scheduling Algorithms Survey Ehab Mohamed, Beijing University of Aeronautics and Astronautics Abstract: The big data computing era has become a fact of daily life. As data-intensive workloads become a reality in many scientific fields, finding an efficient strategy for massive data computing systems has become a multi-objective optimization problem. Processing these huge data volumes on distributed hardware clusters such as clouds requires a powerful computation model like Hadoop-MapReduce. In this paper, we study various schedulers developed for Hadoop in cloud environments, along with their features and issues. Most existing studies consider performance improvement from a single point of view (scheduling, data locality, data correctness, etc.), but very little literature addresses multi-objective improvements (quality requirements, scheduling entities, and dynamic environment adaptation), especially in heterogeneous parallel and distributed systems. Hadoop and MapReduce are two important pillars of big data for handling structured and unstructured data, and the design of a node-selection algorithm is essential to optimize MapReduce performance. This paper surveys previous work on Hadoop-MapReduce scheduling and gives some suggestions for its improvement. [13] Minimum Description Length Principle Based Atomic Norm for Synthetic Low-rank Matrix Recovery Anyong Qin, Chongqing University Abstract: Recovering the underlying low-rank structure of clean data corrupted by sparse noise/outliers has been attracting increasing interest. However, in many low-rank problems, neither the exact rank of the estimated matrix nor the particular locations and values of the outliers are known, and conventional methods fail to separate the low-rank and sparse components, especially gross outliers.
We therefore exploit the advantages of the minimum description length principle and the atomic norm to overcome these limitations. In this paper, we first apply the atomic norm to find all candidate atoms of the low-rank and sparse terms, respectively, and then minimize the description length of the model and the residual in order to select the appropriate atoms of the low-rank and sparse matrices. Experimental results on synthetic data sets demonstrate the effectiveness and robustness of the proposed method. [14] Hitchhike: An I/O Scheduler Enabling Writeback for Small Synchronous Writes Xing Liu, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: Small synchronous writes are pervasive and manifest at various levels of the software stack, ranging from device drivers to application software. Given the block interface, these writes can cause serious write amplification, excess disk seeks or flash wear, and expensive flush operations, which together significantly degrade overall I/O performance. To address these issues, we present a novel block I/O scheduler named Hitchhike, which identifies small writes and embeds them into other data blocks via compression. With Hitchhike, a small write and another write can be completed in one atomic block operation, eliminating the write amplification and the overhead of excess disk seeks. We implemented Hitchhike on top of the CFQ and Deadline I/O schedulers in Linux 2.6.32 and evaluated it with the Filebench benchmark. Our results show that, compared to traditional approaches, Hitchhike can significantly improve the performance of small synchronous writes.
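The embed-by-compression idea behind Hitchhike in [14] can be illustrated with a toy sketch: compress a carrier block, and if the space saved can hold a small synchronous write plus a tiny header, ship both payloads in one atomic block write. All names and the on-block layout here are hypothetical; the actual scheduler operates at the block layer inside the kernel.

```python
import zlib

BLOCK_SIZE = 4096

def try_hitchhike(carrier_block, small_write):
    """Try to embed a small write into the slack of a compressed carrier block.

    Returns a full-size block carrying both payloads, or None if the carrier
    does not compress enough (caller falls back to two separate writes).
    """
    packed = zlib.compress(carrier_block)
    header = len(packed).to_bytes(4, "big") + len(small_write).to_bytes(4, "big")
    if len(header) + len(packed) + len(small_write) > BLOCK_SIZE:
        return None
    combined = header + packed + small_write
    return combined + b"\x00" * (BLOCK_SIZE - len(combined))  # pad to block size

def unpack(block):
    """Recover (carrier, small_write) from a hitchhiked block."""
    packed_len = int.from_bytes(block[:4], "big")
    small_len = int.from_bytes(block[4:8], "big")
    carrier = zlib.decompress(block[8:8 + packed_len])
    small = block[8 + packed_len:8 + packed_len + small_len]
    return carrier, small
```

The win is that the small write rides along for free: one block write, one flush, instead of a separate seek-and-flush for a few bytes of synchronous data.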
[15] A New Template Update Scheme for Visual Tracking Xiaohuan Lu, Harbin Institute of Technology Shenzhen Graduate School Abstract: Under the particle filter framework, single object tracking involves two phases: sparse representation, which can be regarded as matching evaluation, and template update, which accounts for appearance changes of the target. Template update is the most direct and basic phase for ensuring high-quality tracking. However, most template update schemes cannot capture the latest appearance of the target, leading to low-quality tracking. In this paper, we propose a new template update scheme that captures the latest trends of the target. Experimental results on popular benchmark video sequences show that the proposed template update scheme is feasible and effective. [16] Transfer Learning for Face Identification with a Deep Face Model Huapeng Yu, Chengdu University Abstract: Deep face models learned on big datasets surpass humans at face recognition on difficult unconstrained face datasets. In practice, however, we often lack the resources to learn such a complex model, or we have only very limited training samples (sometimes only one per class) for a specific face recognition task. In this paper, we address these problems by transferring an already learned deep face model to the specific task at hand. We empirically transfer the hierarchical representations of a deep face model as the source model and then learn higher-layer representations on a specific small training set to obtain a final task-specific target model. Experiments on face identification tasks with a public small dataset and practical real faces verify the effectiveness and efficiency of our approach to transfer learning. We also empirically explore an important open problem: the attributes and transferability of different layer features of a deep model.
We argue that lower-layer features are both local and general, while higher-layer features are both global and specific, embracing both intra-class invariance and inter-class discrimination. The results of unsupervised feature visualization and supervised face identification strongly support this view. [17] Design and Implementation of Role-Based Access Control for Categorized Resources in Smart Community Systems Siping Shi, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: With the progressive development of smart communities, the security of smart community systems has become an important issue. Role-based access control is one way to solve this problem. However, existing implementations of role-based access control are not fine-grained and take no account of resource category information. Because every resource in such a model is authorized in the same way, it cannot meet the security requirements of smart community systems. In this paper, we propose an improved role-based access control model for categorized resources that addresses the special requirements of smart community systems. The model integrates community category information into the definition of roles so as to limit the number of roles. The new model was fully implemented in a community management system with 14,500 users from 14 communities. We compared our system to Spring Security, an existing open-source security framework, and demonstrated the advantages of our access control model. [18] A Robust Appearance Model for Object Tracking Yi Li, Harbin Institute of Technology Shenzhen Graduate School Abstract: The patch strategy is widely adopted in visual tracking to address partial occlusions.
However, most patch-based tracking methods either assume all patches share the same importance or exploit simple priors for computing the importance of each patch, which may depress tracking performance when the target object is non-rigid or background information is included in the initial bounding box. To this end, an importance-aware appearance model over the target and background patches is built, which adaptively evaluates the importance of each target/background patch by means of local self-similarity. In addition, we propose a novel bi-directional multi-voting scheme, which integrates a multi-voting scheme and a two-side agreement scheme to produce a reliable target-background confidence map. Combining the importance-aware appearance model and the bi-directional multi-voting scheme, a robust patch-based tracking method is proposed. Experimental results demonstrate that the proposed tracking method outperforms other state-of-the-art methods on a set of challenging tracking tasks. [19] GA-Based Sweep Coverage Scheme in WSN Peng Huang, Sichuan Agricultural University Abstract: The minimum-number-of-required-sensors problem in sweep coverage, one of the important coverage problems in WSNs, uses a small number of mobile sensor nodes to satisfy both POI (Point of Interest) coverage and data delivery; it is a dynamic coverage problem. Finding the minimum number of mobile sensor nodes with a uniform speed that guarantees sweep coverage has been proved NP-hard. In this paper, we investigate this problem with the goal of minimizing the number of mobile sensor nodes with limited data buffer size while satisfying dynamic POI coverage and data delivery simultaneously. A GA-based Sweep Coverage scheme (GASC) is proposed to solve the problem.
In GASC, random route generation is first introduced to create the initial routes for POIs, and then a Genetic Algorithm is employed to optimize these routes. Computational results show that the proposed GASC approach outperforms previously published methods. [20] A Short Text Similarity Algorithm for Finding Similar Police 110 Incidents Lei Duan, Beijing University of Aeronautics and Astronautics Abstract: Finding similar police 110 incidents in an incident dataset plays an important role in recognizing related cases, from which investigators can find more clues and make better decisions on police deployment. We aim at finding 110 incidents whose case features and semantics are similar to a given incident. A short text similarity algorithm called Police Incident Mover's Distance is presented. Our algorithm is developed from a novel semantic similarity algorithm, Word Mover's Distance (WMD). To emphasize the significance of case features in incident text, the method introduces traditional term frequency-inverse document frequency (TF-IDF) scores as term weights in the WMD. The algorithm was verified on a practical dataset from a public security department, and experiments show that it is effective and improves the accuracy of finding similar police incidents. [21] When Taxi Meets Bus: Night Bus Stop Planning over Large-scale Traffic Data Luyan Xiao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: With more and more citizens traveling for life or work at night, there is a big gap between demand and supply for public transportation services in China. In this paper, we address the problem of night-bus stop planning by investigating the characteristics of taxi GPS trajectories and transactions, rather than leveraging subjective and costly surveys of citizen mobility patterns. Our method has two stages.
In the first stage, we extract Pick-up and Drop-off Records (PDRs) from the taxi GPS trajectories and transactions to capture citizens' travel patterns at night. In the second stage, we propose DC-DBSCAN, an improved DBSCAN clustering algorithm with a Distance Constraint, to detect hot locations as candidate night-bus stops from the PDR dataset. We take the service range of a bus stop into consideration and optimize the candidates by considering cost and convenience factors. Finally, our experiments demonstrate that our method is valid and performs better than K-means. [22] Queries over Large-scale Incremental Data of Hybrid Granularities Xutian Zhuang, South China Normal University Abstract: The development of the Internet and Web systems in recent years has made dealing with large-scale data difficult and challenging. What we usually need to process is incremental data, whose scale grows over time. Nowadays, many general queries over such data are traditionally based on the full set of raw data, which becomes a great performance challenge as the data increases substantially. Since these data are not updated after generation, this paper proposes a query model, called the hybrid-granularity model, in which data and queries can be preprocessed to form intermediate result sets of different granularities. With query transformation, submitted queries can take advantage of intermediate preprocessing results to obtain the required final results. We also describe query transformation and methods to find the best-performing execution plan under the hybrid-granularity model for a specific query. Finally, we analyze and experimentally verify the performance advantage of the proposed model by comparison with the original model.
The proposed solution has been used in several practical systems, showing that it guarantees the correctness of query results while significantly improving query response efficiency. [23] Large-scale Classification of Cargo Images Using an Ensemble of Exemplar-SVMs Li Zhang, Tsinghua University Abstract: This paper develops a large-scale classification algorithm for cargo X-ray images using an ensemble of exemplar-SVMs. Large-scale or fine-grained classification is very helpful for customs to improve inspection efficiency and relieve inspectors. However, the large intra-class variation and small inter-class variation of cargo images make it almost impossible to classify them using traditional per-class SVMs. Nevertheless, typical images with salient and representative features of certain classes can easily be distinguished from others. Inspired by the ensemble of exemplar-SVMs for object detection, we develop a classification method using an ensemble of exemplar-SVMs of typical image patches. First, typical image patches are defined and a method for extracting them is discussed. Then, for each typical image patch, a linear SVM is trained using the patch itself as the positive sample and all others as negative samples. In the classification step, a fast detection method based on WTA-hash is used: images are first matched to typical patches and then assigned to the category of the corresponding typical image patches. A semantic tree built according to the HS code is used to trade off specificity for accuracy. [24] Characterizing On-Bus WiFi Passenger Behaviors by Approximate Search and Cluster Analysis Manhua Jiang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: On-bus WiFi has emerged as a promising market in recent years. It is an interesting problem to investigate the characteristics of bus passengers using the spatio-temporal data collected by smart WiFi devices.
In this paper, we analyze passenger behavior logs, including WiFi connection, WiFi disconnection, web authorization, data traffic, etc. We aggregate these passengers' activities into online events by finding the relations among them. We describe all the trips of bus passengers and cluster these trips by the passengers' Origin-Destination pairs (ODs) to find the distribution of passengers' points of interest. Our results show that only 8.33% of on-bus WiFi passengers browse the web. The average time spent on on-bus WiFi is only 6 minutes per connection, but the total time a passenger spends on on-bus WiFi is about 25 minutes a day. On-bus WiFi passengers' average network data traffic is periodic: it grows slowly until Sunday and drops on Monday, which indicates that passengers prefer to use on-bus WiFi more at weekends. 44.7% of the passengers are active in only one place, and 39% are active in two places. [25] Performance Modeling for Spark Using SVM Ni Luo, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: At present, Spark is widely used in many enterprises. Although Spark is much faster than Hadoop for some applications, its configuration parameters can have a great impact on its performance due to the number and complexity of the parameters and the varied characteristics of applications. Unfortunately, no prior research has predicted the performance of Spark from its configuration sets. In this paper, we employ a machine learning method, the Support Vector Machine (SVM), to build performance models for Spark. The input configuration sets are collected by first running Spark applications with randomly modified and combined property values. In this way, we also determine the range of each property and gain a deeper understanding of how these properties work in Spark.
We also use an Artificial Neural Network to model the performance of Spark and find that the error rate of the ANN is on average 1.98 times that of the SVM for three workloads from HiBench. [26] An Improved K-means Text Clustering Algorithm by Optimizing Initial Cluster Centers Caiquan Xiong, Hubei University of Technology Abstract: The K-means clustering algorithm is an influential algorithm in data mining. The traditional K-means algorithm is sensitive to the initial cluster centers, so the clustering result depends excessively on them. To overcome this shortcoming, this paper proposes an improved K-means text clustering algorithm that optimizes the initial cluster centers. The algorithm first calculates the density of each data object in the data set and then judges whether a data object is an isolated point. After removing all isolated points, a set of high-density data objects is obtained. Afterwards, it chooses k high-density data objects, maximally distant from one another, as the initial cluster centers. The experimental results show that the improved K-means algorithm improves the stability and accuracy of text clustering. [27] The Implementation of an Air Pollution Monitoring Service Using a Hybrid Database Converter Jia-Yow Weng, Tunghai University Abstract: As air pollution becomes more and more serious and harms human health, people have started to pay attention to real-time monitoring of air pollution factors and to the recording and analysis of such data. Our system must frequently fetch data from air pollution monitoring stations, and as the data grows quickly, a relational database (RDB) struggles to process such huge volumes. To keep monitoring running smoothly, unconsolidated historical data must be removed, yet historical data is an important target when analyzing air pollution. Therefore, how to dump data to NoSQL without changing the RDB system becomes important.
To achieve this goal, this paper proposes an air pollution monitoring system that combines a Hadoop cluster to dump data from the RDB to NoSQL and to back it up. This not only reduces the load on the RDB but also preserves service performance. Dumping data to NoSQL must not affect real-time monitoring in the air pollution monitoring system, so we focus on uninterrupted web service. We improve efficiency by optimizing the dump method, and data backup lets the service restart quickly, via MapReduce and distributed databases, when the RDB is impaired. We evaluate three different conversion modes and adopt the best data conversion for our system. Finally, the air pollution monitoring system provides messages about variations in air pollution factors, serving as an important basis for environmental detection and analysis and helping people live in a more comfortable environment. [28] Data Mining Applied to Oil Wells Using K-means and DBSCAN Chang Lu, Beijing Institute of Technology Abstract: Oil is essential to our life, mainly in transportation, and thus the productivity of oil wells is very important. Classifying oil wells makes it easier to manage them and ensure good productivity. Machine learning is an emerging technology for analyzing data, and clustering is a good way to perform classification. This paper applies two clustering methods to data from the Dagang oil well and then analyzes not only the classification results but also the choice of method for future analysis. [29] Super Resolution Reconstruction of Brain MR Images Based on a Convolutional Sparse Network Chang Liu, Chengdu University Abstract: To recover high-resolution MR images from their low-resolution counterparts, this paper proposes a super-resolution reconstruction method for low-resolution MR images based on a convolutional neural network.
In the proposed network, convolution operations and non-linear mapping are employed to adapt naturally to MR images and to learn the end-to-end mapping from low- to high-resolution images. On the one hand, the convolution operation is natural for image processing; on the other hand, non-linear mapping helps to explore the non-linear relationship between low- and high-resolution images and enhances the sparsity of the feature representation. The experiments demonstrate that the proposed convolutional sparse network can restore detail information from low-resolution MR images and achieves better performance for super-resolution reconstruction. [30] Evacuation Behaviors and Link Selection Strategy Based on the Artificial Fish Swarm Algorithm Xinlu Zong, Hubei University of Technology Abstract: Frequent accidents have occurred in public places in recent years, causing heavy casualties and economic losses, so research into effective evacuation is necessary. Providing an evacuation route guidance strategy for evacuated individuals is the key to emergency evacuation. This paper proposes an evacuation model based on the Artificial Fish Swarm Algorithm (AFSA). We define each evacuee as an intelligent artificial fish and use the preying, swarming, and following behaviors of the fish swarm to simulate the mental activity, path selection, and behavioral preferences of individuals. The model embodies the characteristics of evacuation rules and the uncertainties in the process, and it yields optimal path planning. In this paper, we take Zhuankou Stadium as the experimental environment to simulate the evacuation process under an emergency. Simulation results show that, compared with existing algorithms, this method can balance congestion and improve the efficiency and fidelity of evacuation.
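The "preying" behavior that AFSA-based models such as the evacuation model in [30] build on can be sketched as follows. This is a generic minimization form with illustrative parameter values (`visual`, `step`, `tries`), not the paper's evacuation-specific variant.

```python
import random

def prey(fish, fitness, visual=1.0, step=0.3, tries=5):
    """AFSA 'preying' behavior (minimization): probe random points within the
    visual range and take a bounded step toward the first improving probe;
    if no probe improves, make a small random move instead."""
    for _ in range(tries):
        probe = [x + random.uniform(-visual, visual) for x in fish]
        if fitness(probe) < fitness(fish):
            # normalized step toward the better position, classic AFSA style
            norm = sum((p - x) ** 2 for x, p in zip(fish, probe)) ** 0.5
            return [x + step * random.random() * (p - x) / (norm + 1e-12)
                    for x, p in zip(fish, probe)]
    return [x + random.uniform(-step, step) for x in fish]
```

In an evacuation setting, `fitness` would score a candidate position or link choice (e.g. congestion plus remaining distance to an exit), while the swarming and following behaviors add attraction toward less crowded neighbors and better-performing leaders.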
[31] A Synthetic Targets Detection Method for Human Millimeter-wave Holographic Imaging System Li Zheng, Nuctech Company Limited Abstract: Automatic Target Recognition (ATR) technology is of great significance in security inspection, while traditional object detection methods have proved inefficient on human-body millimeter-wave images. In this paper, we propose a synthetic object detection method for millimeter-wave images. We choose saliency, SIFT and HOG features to form image descriptors. Using sparse representation, the features are encoded again and fed to a linear SVM for target/non-target classification. Previous work has shown that the amount of training samples influences the efficiency of SVM classifiers. Thus, we utilize several simulation methods for data augmentation, aiming to increase the number of training samples before training the linear SVM classifiers. The experimental results show that our approach is efficient in target detection on human-body millimeter-wave images. Moreover, classifiers trained on larger sets with simulated samples perform better in classification on our testing dataset. [32] An Efficient Distributed Clustering Protocol Based on Game-Theory for Wireless Sensor Networks Xuegang Wu, Chongqing University Abstract: Clustering is known as an effective way to reduce energy dissipation and prolong network lifetime in wireless sensor networks (WSNs). Game theory (GT) has been used to find optimal solutions to clustering problems. However, the residual energy of nodes was not considered when calculating the equilibrium probability in earlier studies. Besides, from the perspective of energy consumption, definitions of the payoffs in local clustering games are required when calculating the equilibrium probability.
Based on these considerations, a hybrid of the equilibrium probability attained by playing local clustering games and a cost-dependent exponential function is proposed to obtain the probability of becoming a CH (Cluster Head), in which new definitions of payoffs from the perspective of energy consumption are used. In this paper, we propose an efficient distributed, game-theory based clustering protocol (EDGC) for wireless sensor networks. [33] IMFSSC: An In-Memory Distributed File System Framework for Super Computing Binyang Li, Beijing University of Aeronautics and Astronautics Abstract: Supercomputing has been widely applied in theoretical physics, theoretical chemistry, climate modeling, biology simulation and medicine research for high-performance and energy-efficient computing. Many scientific applications are I/O sensitive, and users have to tolerate high latency when supercomputing center storage processes thousands of I/O requests. In this paper, IMFSSC, an in-memory distributed file system framework for supercomputing, is proposed, which is intended to improve latency for I/O sensitive applications and to relieve congestion when I/O requests burst. IMFSSC consists of modules for multiple master-slave support and load balancing among a huge number of computing nodes, and uses memory space to store data to minimize latency. Additional features will be added in the near future, and it is expected to provide better support for large-scale computer systems. Finally, the performance of the framework is tested for evaluation, showing high scalability and good I/O performance. [34] A Secure and VM-Supervising VDI System Based on OpenStack Weidian Zhan, Beijing University of Aeronautics and Astronautics Abstract: Against the background of data explosion and cloud computing, this paper investigates a branch of cloud computing technology known as VDI (virtual desktop infrastructure).
Users can access data and information via cloud desktops with endpoint devices. The paper studies OpenStack, a famous open-source cloud platform that has been widely used, and introduces a secure, optimized and highly available VDI system based on it. The system provides responsive and highly available desktop connections through multi-threaded VM operation, and enhances the security of the user-login process with security labels and device authentication. [35] Performance Evaluation for Distributed Join Based on MapReduce Jingwei Zhang, Guilin University of Electronic Technology Abstract: Inner join is a fundamental and frequent operation in large-scale data analysis. MapReduce is the most widely available framework for large-scale data analysis, and a variety of inner-join algorithms have been put forward to run in the MapReduce environment. Usually, those algorithms are designed for specific scenarios, but inner joins can show very different performance when data volume, reference ratio, data skew rate, running environment, etc., are varied. This paper summarizes and implements those well-known join algorithms in a uniform MapReduce environment. Considering the number of tables, broadcast cost, data skew, join rate and related factors, we designed and conducted a large number of experiments to compare the time performance of those join algorithms. Based on the experimental results, we analyzed and summarized the performance and applicability of those algorithms in different scenarios, which can serve as a reference for performance improvement in large-scale data analysis under different circumstances. [36] An Optimized Approach to Protect Virtual Machine Image Integrity in Cloud Computing Xichun Yue, Beijing University of Aeronautics and Astronautics Abstract: The development of cloud computing is surely unprecedented in the IT industry, with many companies adopting this new technology, and the related companies undoubtedly benefit a lot from cloud computing.
Meanwhile, the security of cloud platforms has become one of the concerns for companies. As an important underlying component, the virtual machine image is also in need of special protection. In this paper, we propose an optimized approach to protect virtual machine image integrity. In the approach, we propose an integrity protection architecture, optimize a hardware environment as the fundamental deployment environment, design a measurement module to measure and verify images, and design a strategy module to handle the results. Finally, we integrate it with OpenStack and evaluate its security and performance. The experiments demonstrate that our approach protects image integrity well, and the measurement speed is three times faster than the ordinary approach with only a little more resource consumption. [37] Breaking the Top-k Restriction of the kNN Hidden Databases Zhiguo Gong, University of Macau Abstract: With the increasing development of location-based services (LBS), spatial data have become accessible on the web. Often, such services provide a public interface which allows users to find the k nearest points to an arbitrary query point. These services may be abstractly modeled as a hidden database behind a kNN query interface; we refer to this as a kNN hidden database. The kNN interface is the only way we can access such hidden databases and can be quite restrictive. A key restriction enforced by such a kNN interface is the top-k output constraint: given an arbitrary query, the system only returns the k nearest points to the query point (where k is typically a small number such as 10 or 50). This restriction prevents many third-party services from being developed over the hidden databases. In this paper, we investigate the interesting problem of "breaking" the kNN restriction of such web databases to find more than the k nearest points. To the best of our knowledge, this is the first work to study this problem over kNN hidden databases.
We investigate and design a set of algorithms which can efficiently address this problem. Beyond that, we also perform a set of experiments over synthetic and real-world datasets which illustrate the effectiveness of our algorithms. [38] Online Fake Drug Detection System in Heterogeneous Platforms using Big Data Analysis Yubin Zhao, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: The widespread use of the internet provides extensive heterogeneous platforms for drug sales. The internet has greatly facilitated the development of merchandise sales; meanwhile, many fake drug sellers that have been strongly restricted in the market by law enforcement agencies build their own sales platforms on the internet. In order to counter fake drug websites and reduce time and human resource consumption, it is necessary to screen and identify drug information on the internet automatically. In this paper, we develop an automatic drug information screening and content analysis system which extracts information online, mines hidden relationships and finds the sources of the sellers. Our major contributions are as follows: (1) We apply a focused crawler technique to transform the unstructured data on drug websites into structured data, and store the data in a local database. (2) An integrated fake drug identification method is proposed which consists of an image recognition module and an information retrieval module. Based on this method, fake drug websites are not only identified one by one, but we also extract the hidden connections among multiple platforms. Experimental results demonstrate that our system can successfully identify a large number of fake drug websites.
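The top-k restriction discussed in abstract [37] can be illustrated with a toy 1-D example: repeatedly probe the hidden kNN interface just beyond the radius already covered to discover points it would otherwise never return. The probing heuristic below is our own simplified sketch, not the paper's actual algorithms, and the database contents are hypothetical.

```python
def knn_interface(db, q, k):
    # Hidden kNN interface: only the k nearest points are ever returned.
    return sorted(db, key=lambda p: abs(p - q))[:k]

def break_top_k(q, want, query, max_rounds=10):
    """Discover more than k nearest neighbours of q by probing just
    beyond the radius covered so far (1-D toy sketch)."""
    found = set(query(q))
    for _ in range(max_rounds):
        if len(found) >= want:
            break
        r = max(abs(p - q) for p in found)
        # probe on both sides, beyond everything seen so far
        found |= set(query(q + 2 * r)) | set(query(q - 2 * r))
    return sorted(found, key=lambda p: abs(p - q))[:want]

db = [1, 2, 3, 5, 8, 13, 21, 34]  # hidden spatial points (1-D for simplicity)
top5 = break_top_k(q=4, want=5, query=lambda x: knn_interface(db, x, k=3))
```

A single query at q = 4 returns only three points; the extra probes recover five, i.e. more than the interface's k.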
[39] Using Weighted SVM for Identifying User from Gait with Smart Phone Qingquan Lai, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: With the development of authentication technology, fingerprint and speech authentication have been applied to most smart devices, which means we are stepping into the era of biometric-based authentication. As a stable biological feature, gait has been used to establish authentication models in many studies. Most of these studies are based on extracting cycles or statistics from the gait data, which are used as features in the authentication process with simple machine learning algorithms. The approach presented in this paper extracts frequency-series features from gait data collected by the acceleration sensor and uses a weighted support vector machine to recognize users. Further, this paper uses the same methodology to perform the experiment, which shows an improved performance of 3.57% EER. [40] Blood Pressure Monitoring on the Cloud System in Elderly Community Centres: A Data Capturing Platform for Application Research in Public Health Kelvin Tsoi, Chinese University of Hong Kong Abstract: Technology for cloud frameworks in healthcare data management and analytics has opened new horizons for public health research. Hypertension is a significant modifiable risk factor for cardiovascular diseases, and telemonitoring of blood pressure (BP) has been suggested as an effective tool for BP control. However, elderly people often have difficulties when using electronic health monitoring devices at home. BP data capturing with cloud technology in elderly community centres, under guidance and with a healthcare provider alert function, is a pioneering effort. In this study, the infrastructure of data collection is constructed on the cloud to capture behavioral data on BP meter use and BP readings. BP data are generated by daily BP measurements and uploaded to the cloud.
All personal characteristics, electronic health records, BP data and call logs with nurses can be encrypted and stored on the cloud. The remote platform on the cloud can provide efficient analytic performance on huge volumes of data with a high velocity of data creation in a population-based study. Data mining on the BP measurements will help to better understand ways to control hypertension. This platform can also potentially be used in other epidemiological studies in public health. [41] On Construction of an Energy Monitoring Service Using Big Data Technology for Smart Campus Chan-Fu Kuo, Tunghai University Abstract: The prosperity of modern human civilization is built on a huge amount of resources and energy. With the increasing population and technological advancements, the demand for energy will continue to increase, so saving energy has become an important issue. Reducing expenses by cutting electricity consumption and unnecessary energy use is very important for institutions as well as universities. In this work, we propose a system that collects electricity usage data in campus buildings through smart meters and environmental sensors, and processes the huge amount of data with big data processing techniques. We introduce cloud computing and a big data processing architecture as solutions for building a real-time energy monitoring system for a smart campus. We use the Hadoop ecosystem, built on the big data processing architecture, to improve our system's capacity for big data storage and processing. We compare the performance of Hive and HBase for searching energy data, and the performance of a relational database versus a distributed big data database for data search. We also present a method to identify abnormal electrical conditions through the MapReduce framework, and compare the real-time processing performance of Spark and Hadoop.
The proposed system has been implemented at Tunghai University. The system interface vividly displays the electricity usage of campus buildings; thus, users can monitor campus electricity usage and historical data at any time and any place. [42] A Smart Cloud Robotic System based on Cloud Computing Services Lujia Wang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Abstract: In this paper, we present a smart service robotic system based on cloud computing services. The design and implementation of the infrastructure, computation components and communication components are introduced. The proposed system can offload the complex computation and storage load of robots to the cloud and provide various services to the robots. The computation components can dynamically allocate resources to the robots, and the communication components allow easy access to the cloud and provide flexible resource management. Furthermore, we model the task-scheduling problem and propose a max-heaps algorithm. The simulation results demonstrate that the proposed algorithm minimizes the overall task costs. [43] Learning the Distribution of Data for Embedding Yunpeng Shen, Chongqing University Abstract: One of the central problems in machine learning and pattern recognition is how to deal with high-dimensional data, whether for visualization or for classification and clustering. Most dimensionality reduction technologies, designed to cope with the curse of dimensionality, are based on the Euclidean distance metric. In this work, we propose an unsupervised nonlinear dimensionality reduction method which attempts to preserve the distribution of the input data, called distribution preserving embedding (DPE). This is done by minimizing the dissimilarity between the densities estimated in the original and embedded spaces. In theory, patterns in data can effectively be described by the distribution of the data.
Therefore, DPE is able to discover the intrinsic patterns (structure) of data, including both global and local structures. Additionally, DPE can be extended naturally to cope with the out-of-sample problem. Extensive experiments on different data sets, compared with other competing methods, are reported to demonstrate the effectiveness of the proposed approach. [44] On Blind Quality Assessment of JPEG Images Guangtao Zhai, Shanghai Jiao Tong University Abstract: JPEG is still the most widely used image compression format. Perceptual quality assessment for JPEG images has been extensively studied for the last two decades. While a large number of no-reference perceptual quality metrics have been proposed over the years, it is shown in this paper that on existing image quality databases, statistically, the performance of many of those metrics is no better than the quality factor (Q) for JPEG images, as used in the popular implementation by the IJG (Independent JPEG Group). It should be noted that Q, or the quantization table computed from Q, is almost always available at the decoder end, so we focus our analysis on no-reference, or blind, quality assessment metrics. This research highlights the fact that despite the progress achieved in this area, JPEG quality assessment is still a topic worth revisiting and further investigation. [45] Research on the Application of Distributed Self-adaptive Task Allocation Mechanism in Distribution Automation System Haitian Li, North China Electric Power University Abstract: With the development of the distribution network and the intelligentization of terminal devices in distribution automation systems, it has become a key point to research task allocation methods and self-adaptive task allocation models from the perspective of distributed systems, in order to improve the utilization of the devices in the whole distribution network and optimize system performance on intelligent terminal devices and the master station.
This paper proposes a mathematical model of the system performance index and puts forward four kinds of typical distributed self-adaptive task allocation models for distribution automation systems, the OQOD, OQND, NQND and NQOD models, according to the impact factors in the mathematical model and existing distributed self-adaptive algorithms, such as the self-adaptive Min-Min algorithm. This paper also analyzes the structural characteristics and application environments of three kinds of distributed self-adaptive task allocation models, and further puts forward feasible suggestions for future research on each model. [46] Noise-Robust SLIC Superpixel for Natural Images Jiantao Zhou, University of Macau Abstract: Superpixel algorithms aim to semantically group neighboring pixels into coherent regions. They can significantly boost the performance of subsequent vision processing tasks such as image segmentation. Recently, simple linear iterative clustering (SLIC) [1] has drawn huge attention for its state-of-the-art segmentation performance and high computational efficiency. However, the performance of SLIC degrades dramatically for noisy images. In this work, we propose three measures to improve the robustness of SLIC against noise: 1) a new pixel intensity distance measurement is designed by explicitly considering the within-cluster noise variance; 2) the spatial distance measurement is refined by exploiting the variation of pixel locations in a cluster; and 3) a noise-robust estimator is proposed to update the cluster centers by excluding possible outliers caused by noise. Extensive experimental results on synthetic noisy images validate the effectiveness of these improvements. In addition, we apply the proposed noise-robust SLIC to a superpixel-based noise level estimation task to demonstrate its practical usage. [47] Big Data Analysis on Radiographic Image Quality Jianping Gu, Nuctech Company Limited.
Abstract: Mass data generated by in-service radiographic products in routine work contain information on image quality. Analyzing them can supplement the time-consuming radiographic Quality Assurance Test Procedure to evaluate image quality, understand product performance at sites, locate risks, and give directions to manufacturers for follow-up actions. This article illustrates methodologies for extracting information from mass data and for applying big data to visual quality tracking, analysis, control, and risk mitigation. [48] Binary Classification and Data Analysis for Modeling Calendar Anomalies in Financial Markets Yu-Fu Chen, National Chiao Tung University Abstract: This paper studies the day-of-the-week effect by means of several binary classification algorithms in order to achieve the most effective and efficient decision trading support system. This approach utilizes an intelligent data-driven model to predict the influence of calendar anomalies and develop profitable investment strategies. Advanced technologies, such as time-series feature extraction, machine learning, and binary classification, are used to improve system performance and make the evaluation of the trading simulation trustworthy. Through experiments on the component stocks of the S&P 500, the results show that the accuracy can reach 70% when adopting two discriminant feature representation methods, "multi-day technical indicators" and "intra-day trading profile." The binary classification method based on the LDA-Linear Prior kernel outperforms other learning techniques and provides the investor with stable and profitable portfolios at low risk. In addition, we believe this paper is a FinTech example which combines advanced interdisciplinary research, including financial anomalies and big data analysis technology.
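The calendar-anomaly study in abstract [48] starts from day-of-the-week features. A minimal sketch of such a feature, computing the average return per weekday from (date, daily return) pairs; the return values below are hypothetical, not data from the paper:

```python
from datetime import date
from collections import defaultdict

def weekday_profile(records):
    """Average return per weekday (0 = Monday ... 4 = Friday) from
    (date, daily_return) pairs -- the kind of calendar feature used
    in day-of-the-week studies."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for d, r in records:
        sums[d.weekday()] += r
        counts[d.weekday()] += 1
    return {wd: sums[wd] / counts[wd] for wd in sums}

# Hypothetical daily returns for one trading week.
week = [
    (date(2016, 11, 14), -0.004),  # Monday
    (date(2016, 11, 15),  0.002),  # Tuesday
    (date(2016, 11, 16),  0.001),  # Wednesday
    (date(2016, 11, 17),  0.003),  # Thursday
    (date(2016, 11, 18),  0.006),  # Friday
]
profile = weekday_profile(week)
```

A classifier for the day-of-the-week effect would consume such per-weekday aggregates, alongside the technical-indicator features the abstract describes.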
[49] Decision Support System for Real-Time Trading based on On-Line Learning and Parallel Computing Techniques Szu-Hao Huang, National Chiao Tung University Abstract: A novel intraday algorithmic trading strategy is developed in this paper based on various machine learning techniques and parallel computing architectures. The proposed binary classification framework predicts the price trend of Taiwan stock index futures thirty minutes ahead. Traditional learning-based approaches collect all samples during the training period as learning material; the major contribution of this paper is to collect a subset of similar historical financial data to train the real-time trading model. This goal is achieved by an on-line learning technique which is required to compute an accurate model within a training time limit. In addition, the proposed joint-AdaBoost algorithm improves system performance based on the concepts of paired feature learning and planar weak classifier design. The core execution components of this algorithm can be further accelerated with the Open Computing Language (OpenCL) parallel computing platform. The experimental results show that the proposed learning algorithm improves the prediction accuracy of the final classifier from 53.8% to 61.68%. Compared to the pure CPU implementation, the OpenCL version, which uses the CPU and GPGPU simultaneously, reduces the calculation time by a factor of about 83.02. This efficiency improvement decreases the delay before an investment opportunity, which is a critical issue in real-time financial decision support system applications. To sum up, this paper proposes a novel learning framework based on the joint-AdaBoost algorithm with similar learning samples and OpenCL parallel computation. The extended financial decision support system is also shown to work effectively and efficiently in our simulation experiments trading Taiwan stock index futures.
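For readers unfamiliar with the AdaBoost family used in abstract [49], the following is a much simplified sketch of discrete AdaBoost with 1-D threshold stumps. It is not the paper's joint-AdaBoost with planar weak classifiers, and the training data here is a hypothetical toy set:

```python
import math

def stump_predict(threshold, polarity, x):
    # Weak classifier: +polarity above the threshold, -polarity below.
    return polarity if x >= threshold else -polarity

def train_adaboost(xs, ys, rounds=3):
    """Discrete AdaBoost over 1-D threshold stumps (toy sketch)."""
    n = len(xs)
    w = [1.0 / n] * n
    model = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        for t in xs:                      # candidate thresholds
            for pol in (1, -1):
                err = sum(wi for wi, x, y in zip(w, xs, ys)
                          if stump_predict(t, pol, x) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = max(err, 1e-12)             # avoid log(inf) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, pol))
        # re-weight: boost misclassified samples, shrink correct ones
        w = [wi * math.exp(-alpha * ys[i] * stump_predict(t, pol, xs[i]))
             for i, wi in enumerate(w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return model

def predict(model, x):
    score = sum(a * stump_predict(t, pol, x) for a, t, pol in model)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
model = train_adaboost(xs, ys, rounds=3)
acc = sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)
```

The paper's contribution layers paired feature learning on top of this ensemble idea and offloads the weak-classifier evaluation loop to OpenCL.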
[50] Affinity Propagation Clustering for Intelligent Portfolio Diversification and Investment Risk Reduction Chin Chou, National Chiao Tung University Abstract: In this paper, an intelligent portfolio selection method based on the affinity propagation clustering algorithm is proposed to solve the stable investment problem. The goal of this work is to minimize the volatility of a portfolio selected from the component stocks of the S&P 500 index. Each stock can be viewed as a node in a graph, and the similarity measurements of stock price variations between companies are taken as the edge weights. The affinity propagation clustering algorithm solves this graph problem by repeatedly updating the responsibility and availability message-passing matrices. This research tries to find the most representative and discriminant features to model stock similarity. The tested features are divided into two major categories: time-series covariance and technical indicators. Historical price and trading volume data are used to simulate the portfolio selection and volatility measurement. After grouping the investment targets into a small set of clusters, the selection process chooses a fixed number of stocks from different clusters to form the portfolio. The experimental results show that the proposed system, using affinity propagation clustering with proper similarity features, can generate more stable portfolios than average cases with similar settings. [51] Financial Time-series Data Analysis using Deep Convolutional Neural Networks Jou-Fan Chen, National Chiao Tung University Abstract: A novel financial time-series analysis method based on deep learning techniques is proposed in this paper. In recent years, the explosive growth of deep learning research has led to several successful applications in various artificial intelligence and multimedia fields, such as visual recognition, robot vision, and natural language processing.
In this paper, we focus on time-series data processing and prediction in financial markets. Traditional feature extraction approaches in intelligent trading decision support systems apply several technical indicators and expert rules to extract numerical features. The major contribution of this paper is to improve the algorithmic trading framework with the proposed planar feature representation methods and deep convolutional neural networks (CNN). The proposed system is implemented and benchmarked on historical datasets of Taiwan Stock Index Futures. The experimental results show that the deep learning technique is effective in our trading simulation application and may have great potential to model noisy financial data and complex social science problems. In the future, we expect that the proposed methods and deep learning framework can be applied to more innovative applications in the next financial technology (FinTech) generation. [52] A Practical Model for Analyzing Push-based Virtual Machine Live Migration Cho-Chin Lin, National Ilan University Abstract: Cloud computing employs virtualization technology to satisfy service requests from customers. Virtual machine live migration can provide non-stop services while unexpected events occur. The cost of enforcing live migration is measured by the total number of duplicated pages and the impact caused by downtime. In this paper, a practical model for analyzing push-based virtual machine live migration is proposed. Based on the model, the patterns in the numbers of duplicated memory frames across the iterations are analyzed for various dirtying frequencies. In addition, our model, which abstracts the live migration strategy into a policy function, is useful for developing formal methods to conduct complex analysis.
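The iterative pattern of duplicated pages analyzed in abstract [52] can be seen in a toy pre-copy simulation: each round pushes the pages dirtied while the previous round was in flight, until the remaining dirty set is small enough to copy during downtime. The constant dirty rate and the numbers below are assumptions for illustration, not the paper's model:

```python
def migration_rounds(total_pages, dirty_rate, stop_threshold, max_rounds=30):
    """Toy push-based pre-copy model: each round re-sends the pages
    dirtied during the previous round; stop (and take downtime) once
    the remaining dirty set is below stop_threshold. dirty_rate is the
    assumed constant fraction of sent pages dirtied per round."""
    transferred = []          # pages pushed in each pre-copy round
    to_send = total_pages
    for _ in range(max_rounds):
        transferred.append(to_send)
        to_send = int(to_send * dirty_rate)
        if to_send <= stop_threshold:
            break
    return transferred, to_send  # per-round pages, final downtime copy

rounds, final = migration_rounds(total_pages=10000, dirty_rate=0.2,
                                 stop_threshold=50)
# rounds shrink geometrically: 10000, 2000, 400, 80 -> final copy of 16 pages
```

With a constant dirty rate below 1, the per-round duplicated pages decay geometrically, which is the kind of iteration pattern the paper's policy-function model makes precise for varying dirtying frequencies.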
[53] Performance Comparison and Analysis of Yarn's Schedulers with Stress Cases Bo Li, Beijing University of Posts and Telecommunications Abstract: Hadoop, a popular distributed storage and computing platform, has been widely used in many companies. Yarn, the resource management platform in Hadoop, plays an important role in resource management because it affects the cluster's energy efficiency and usability for applications. The schedulers are the brain of Yarn; they manage and schedule resources from the cluster to applications. In this paper, we conduct experiments to compare and analyze the performance of Yarn's schedulers. We use various scenarios to demonstrate the strengths and weaknesses of each scheduler from the perspectives of response speed, cluster efficiency, scheduler speciality, etc. Experimental results demonstrate that the FIFO Scheduler has better performance and data-locality awareness for batch job processing than the other schedulers, but the Capacity Scheduler and the Fair Scheduler have better response speed and cluster usability than the FIFO Scheduler, which suffers from a starvation problem in mixed scenarios. [54] Classification of Parkinson's disease and Essential Tremor Based on Structural MRI Li Zhang, Chengdu University Abstract: Parkinson's disease (PD) and essential tremor (ET) are two kinds of tremor disorders which often confuse doctors in clinical diagnosis. Early experiments on structural MRI have already shown that Parkinson's disease can cause pathological changes in the brain region named Caudate_R (a part of the basal ganglia) while essential tremor cannot. Although there is much research work on the classification of PD and ET, it has not achieved automatic classification of the two diseases. Big data, however, brings new opportunities to the classification of PD and ET.
To achieve this, we propose a machine learning framework based on principal component analysis (PCA) and support vector machines (SVM) for the classification of Parkinson's disease and essential tremor. This framework is a two-stage method: first, we use principal component analysis (PCA) to extract discriminative features from the structural MRI data; then an SVM classifier is employed to classify PD and ET. We used both statistical analysis and the machine learning method to test the differences between PD and ET in specific brain regions. As a result, the machine learning method shows better performance in extracting the differential brain regions. The highest classification accuracy is up to 93.75% in the differential brain regions. [55] Utilizing Real-Time Travel Information, Mobile Applications and Wearable Devices for Smart Public Transportation Tsz Fai Chow, Chinese University of Hong Kong Abstract: We propose a cloud platform that utilizes real-time travel information, a mobile application and wearable devices for smart public transportation. This platform is capable of retrieving the required data automatically, reporting real-time public transportation information and providing users with personalized recommendations for using public transit. Novel features of this platform include measuring the current walking speed of the user and using real-time estimated arrival times of public transit at different locations for travel recommendations. We also present our on-going work of developing the proposed platform for the public transportation system in Hong Kong. We aim to develop this platform to aid passengers' decisions and reduce their journey times, thereby improving their commuting experience and encouraging the use of public transportation.
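The recommendation idea in abstract [55], combining the user's measured walking speed with real-time vehicle arrival estimates, can be sketched as a simple catchable-stop selection. The stop names, distances and arrival times below are hypothetical, and a real system would also model headways for missed vehicles:

```python
def best_stop(stops, walking_speed_mps, now=0.0):
    """Pick the stop whose next vehicle the user can actually catch and
    that yields the earliest boarding time, given the user's measured
    walking speed. stops: list of (name, distance_m, eta_s), where
    eta_s is the real-time estimated arrival of the next vehicle."""
    best = None
    for name, distance_m, eta_s in stops:
        arrive = now + distance_m / walking_speed_mps
        if arrive > eta_s:
            continue          # user would miss it; no headway model here
        if best is None or eta_s < best[1]:
            best = (name, eta_s)
    return best               # (stop name, boarding time) or None

# Hypothetical stops: (name, walking distance in metres, vehicle ETA in s).
stops = [("A", 300, 180), ("B", 600, 500), ("C", 150, 90)]
choice = best_stop(stops, walking_speed_mps=1.4)
```

At 1.4 m/s the user misses the vehicles at stops A and C, so the recommendation falls to stop B, illustrating why the measured walking speed changes the answer.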
[56] Event Detection on Online Videos using Crowdsourced Time-Sync Comments Zhenyu Liao, Tongji University Abstract: In recent years, more and more people like to watch videos online because of the convenience and social features. Due to limited entertainment time, there is a new requirement: people prefer to watch hot video segments rather than an entire video. However, it is quite time-consuming to extract the highlight segments of videos manually, because the number of videos uploaded to the internet is huge. In this paper, we propose a model for event detection on videos using time-sync comments provided by online users. In the model, three features of time-sync comments are extracted first. Then, user behavior relevance in the time series is analyzed to find the video shots that people are most interested in. Lastly, a metric and its optimization for scoring video shots for event detection are introduced. Experiments on several movies show that the events detected by our method coincide with the highlights of the movies. [57] Synthetic Data Generator for Classification Rules Learning Runzong Liu, Chongqing University Abstract: A standard data set is useful for empirically evaluating classification rule learning algorithms. However, there is still no standard data set that is general enough for various situations. Data sets from the real world are limited to specific applications; the sizes of the attributes, rules and samples of real data are fixed. A data generator is proposed here to produce synthetic data sets which can be as big as the experiments demand. The sizes of the attributes, rules and samples of the synthetic data sets can easily be changed to meet the demands of evaluating different learning algorithms. In the generator, related attributes are created first, and then rules are created based on the attributes.
Samples are produced following the rules. Three decision tree algorithms are evaluated using synthetic data sets produced by the proposed data generator.

[58] A Flash Light System for Individuals with Visual Impairment Based on TPVM
Wenbin Fang, Shanghai Jiao Tong University
Abstract: We propose a flashlight system to aid visually impaired people using the paradigm of temporal psychovisual modulation (TPVM), a new display mode that takes advantage of the limited flicker fusion frequency of human eyes and the high refresh rate of modern display devices to achieve a visual bifurcation effect. Structured light in the visible spectrum is projected onto the road surface, a synchronized camera detects its deformation, and the recognition system then calculates the road flatness (e.g. smooth road, up- or down-stairs). To minimize the visual disturbance to other people, the TPVM display technique effectively conceals the structured light, making it appear as a normal flashlight to normal observers. The system works in the visible spectrum to minimize the cost of the camera and the projector. We design a fast and reliable recognition system with the Iterative Dichotomiser 3 (ID3) algorithm to classify the condition of the pavement ahead as smooth road, wall ahead, or up/down-stairs. Experimental results are presented to validate the proposed system.

[59] Benchmarking State-of-the-Art Deep Learning Software Tools
Shaohuai Shi, Hong Kong Baptist University
Abstract: Deep learning has been shown to be a successful machine learning method for a variety of tasks, and its popularity has resulted in numerous open-source deep learning software tools becoming publicly available. Training a deep network is usually a very time-consuming process. To address the huge computational challenge in deep learning, many tools exploit hardware features such as multi-core CPUs and many-core GPUs to shorten the training time.
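A rule-based synthetic data generator in the spirit of abstract [57] can be sketched as follows. The function names, the ternary attribute encoding, and the two-condition rule shape are illustrative assumptions, not the authors' design: rules over randomly chosen attributes are created first, then samples are drawn and labeled by the first rule they satisfy.

```python
import random

def make_rules(n_attrs, n_rules, rng):
    """Create rules: each is a set of (attribute -> required value) tests plus a class label."""
    rules = []
    for label in range(n_rules):
        attrs = rng.sample(range(n_attrs), k=2)
        rules.append(({a: rng.randint(0, 2) for a in attrs}, label))
    return rules

def make_samples(n_attrs, rules, n_samples, rng):
    """Draw random attribute vectors; label each by the first rule it satisfies."""
    data = []
    while len(data) < n_samples:
        x = [rng.randint(0, 2) for _ in range(n_attrs)]
        for tests, label in rules:
            if all(x[a] == v for a, v in tests.items()):
                data.append((x, label))
                break                              # vectors matching no rule are discarded
    return data

rng = random.Random(42)
rules = make_rules(n_attrs=6, n_rules=3, rng=rng)
data = make_samples(6, rules, n_samples=100, rng=rng)
print(len(data), len(data[0][0]))
```

Scaling `n_attrs`, `n_rules`, and `n_samples` adjusts the data set size on demand, which is the generator property the abstract emphasizes.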
However, different tools exhibit different features and running performance when training different types of deep networks on different hardware platforms, which makes it difficult for end users to select an appropriate pair of software and hardware. In this paper, we aim to make a comparative study of state-of-the-art GPU-accelerated deep learning software tools, including Caffe, CNTK, TensorFlow, and Torch. We benchmark the running performance of these tools with three popular types of neural networks on two CPU platforms and three GPU platforms. Our contribution is two-fold. First, for deep learning end users, our benchmarking results can serve as a guide to selecting an appropriate software tool and hardware platform. Second, for deep learning software developers, our in-depth analysis points out possible future directions to further optimize the training performance.

[60] A Mobile Cloud System for Enhancing Multimedia File Transfer with IP Protection
Tipaporn Juengchareonpoon, Chulalongkorn University
Abstract: Fast transferring and sharing of large multimedia files over the mobile network is a challenge. Most users encounter a long delay before playing a file, which creates a bad experience. One common mechanism is to buffer the file in internal memory before playing it; hence, playing a large media file without long delays or interruptions remains elusive. This paper proposes a new mechanism, named STEM, that can enhance media file sharing and transfer speed over the Internet on smartphones and mobile devices. In addition, this technique can also protect the intellectual property of the transferred media.

[61] Distinguish True or False 4K Resolution using Frequency Domain Analysis and Free-Energy Modelling
Wenhan Zhu, Shanghai Jiao Tong University
Abstract: With the prevalence of Ultra-High Definition (UHD) display terminals, 4K resolution (3840×2160 pixels) content is becoming a major selling point for online video media.
However, due to the insufficiency of natural UHD content, a large number of false 4K videos are circulating on the web. Such "4K" content, usually upscaled from lower resolutions, often frustrates enthusiastic consumers and is in fact a waste of scarce bandwidth resources. In this paper, we propose to use frequency domain analysis to distinguish natural 4K content from false content. The basic assumption is that true 4K content has much stronger high-frequency responses than upscaled versions. We use free-energy modelling to approximate the human viewing process so as to minimize the impact of the structural complexity of visual content. We set up a database containing more than 1,000 original 4K frames together with versions upscaled using many widely used interpolation algorithms. Experimental results show that the proposed method has an accuracy rate higher than 90%.

[62] Performance Comparison between Five NoSQL Databases
Enqing Tang, Tsinghua University
Abstract: Recently, NoSQL databases and their related technologies have been developing rapidly and are widely applied in many scenarios thanks to their BASE (Basic Availability, Soft state, Eventual consistency) properties. At present, there are more than 225 kinds of NoSQL databases. However, the overwhelming number of databases and their constantly updated versions make it challenging for people to compare their performance and choose an appropriate one. This paper evaluates the performance of five NoSQL clusters (Redis, MongoDB, Couchbase, Cassandra, HBase) using a measurement tool, YCSB (Yahoo! Cloud Serving Benchmark), explains the experimental results by analyzing each database's data model and mechanism, and provides advice to NoSQL developers and users.

[63] Collective Extraction for Opinion Targets and Opinion Words from Online Reviews
Xiangxiang Jiang, Guilin University of Electronic Technology
Abstract: Online reviews are very important for many Web applications.
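The frequency-domain intuition behind abstract [61] — upscaled content loses high-frequency energy — can be illustrated with a toy experiment. Random noise images stand in for video frames, and the 0.25 cutoff and nearest-neighbor upscaling are illustrative choices, not the paper's method:

```python
import numpy as np

def high_freq_ratio(img, cutoff=0.25):
    """Fraction of spectral energy above a normalized frequency cutoff."""
    F = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(F) ** 2
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # normalized distance from the spectrum center
    r = np.hypot((yy - h / 2) / h, (xx - w / 2) / w)
    return power[r > cutoff].sum() / power.sum()

rng = np.random.default_rng(1)
native = rng.normal(size=(64, 64))            # stand-in for a native-resolution crop
small = rng.normal(size=(32, 32))
upscaled = np.kron(small, np.ones((2, 2)))    # naive 2x pixel-replication upscale

r_native = high_freq_ratio(native)
r_upscaled = high_freq_ratio(upscaled)
print(r_native > r_upscaled)
```

Pixel replication is a convolution with a 2×2 box kernel, whose transfer function attenuates high frequencies, so the upscaled image's high-frequency energy ratio drops below that of the native one — the separating statistic the abstract relies on.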
Extracting opinion targets and opinion words from online reviews is one of the core tasks of review analysis and mining. Traditional extraction methods fall into two categories: pipeline-based methods and propagation-based ones. The former extract opinion targets and opinion words separately, ignoring the opinion relations between them. The latter extract opinion targets and opinion words iteratively by exploiting nearest-neighbor rules or syntactic patterns, which can lead to poor results due to the limitations of a predefined window size and the propagation of dependency-parsing errors. To address these shortcomings of traditional methods, we propose a collective extraction method for opinion targets and opinion words based on the word alignment model. To tackle the time-consuming and error-prone problem of manual annotation, we further devise a semi-supervised extraction method based on active learning. Finally, we carry out a series of experiments on real-world datasets to validate the effectiveness of the proposed methods.

[64] Efficient Power Allocation under Global Power Cap and Application-Level Power Budget
Xiaoxue Hu, Beijing University of Aeronautics and Astronautics
Abstract: Web-related applications, which are typically multi-server, highly parallel and long-running, are common in datacenters. They compute over large-scale datasets and consume plenty of energy. Until now, most research has focused on trading a loss of performance for energy savings. However, managing power is more important than merely reducing it. In this paper, we add energy consumption to the list of managed resources and help managers control the power profile of Web-related applications in the datacenter. Tenants put forward the power budget and corresponding response time target of their own applications before they rent servers.
We design strategies to make every application in the cluster run under a global power cap and its own power budget. We first propose a Global Feedback Power Allocation Policy to periodically allocate the global power cap among the applications. We also devise a Local Efficient Power Policy to determine the application-level power cap and allocate it among the servers running the application. Extra budget in each period can be used during the rest of the tenancy to increase the application-level power cap and minimize the response time. We use Shell, WWW, DNS and Mail workloads to evaluate our policy.

[65] An Adaptive Tone Mapping Algorithm Based on Gaussian Filter
Chang Liu, Chongqing University
Abstract: A new adaptive tone mapping (TM) algorithm based on a Gaussian filter is proposed to display High Dynamic Range (HDR) images on conventional digital display devices. Unlike the conventional luminance mapping function, the proposed algorithm uses a separable two-dimensional Gaussian filter and an empirical parameter to obtain better detail and faster operation. The Gaussian filter is mainly used for edge-preserving smoothing. The empirical parameter is introduced to adaptively adjust the overexposed luminance image after mapping. Multiple regression models are utilized to link the empirical parameter to image statistics, namely the mean, logarithmic mean and variance of the luminance values. Experimental results show that the proposed algorithm retains acceptable image contrast and color information and, moreover, outperforms previous methods in running speed.

[66] Trend Behavior Research by Pattern Analysis in Financial Big Data - A Case Study of Taiwan Index Futures Market
Mei-Chen Wu, National Chiao Tung University
Abstract: Market structure provides concrete information about the market. Price patterns can be regarded as evidence of supply and demand states in the market: price shifts higher as demand exceeds the available supply, and vice versa.
These patterns convey valuable information about what is going to happen in the market. The purpose of this study is to investigate the underlying relation between price patterns in the Taiwan Futures Exchange (TAIFEX) Futures Index Market and the trends that follow them. The directions of price shifts following a pattern are forecast through supervised learning and testing with an artificial neural network (ANN). This research applies change-point analysis (CPA) from statistics together with the theory of perceptually important points (PIP). CPA finds the locations where shifts in value occur; the PIP algorithm then performs feature extraction on the pattern, and the extracted PIPs are fed to the ANN to forecast the following trends. To validate the research concept, a control model is built based on an online time-segmentation algorithm for comparison. The results of this research show that robust patterns found by CPA can forecast the market trend direction with up to 83.6% accuracy, indicating that TAIFEX Futures market directions can be forecast from robust patterns in historical prices and thus rejecting the hypothesis that the TAIFEX Futures Index Market follows a random walk. In contrast, the control model built on online time segmentation can also forecast, but not as accurately as the CPA method. In conclusion, analyzing the patterns reflected in the market effectively provides valuable insight into its trend behavior.

[67] Applying Market Profile Theory to Analyze Financial Big Data and Discover Financial Market Trading Behavior - A Case Study of Taiwan Futures Market
Yu-Hsiang Hsu, National Chiao Tung University
Abstract: With the financial market constantly changing, prices are affected by many factors, making their direction hard to predict, especially during market corrections. Investors who want to make profits can look for relatively low-risk entry points.
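The PIP feature-extraction step mentioned in abstract [66] can be sketched as follows, using the vertical-distance variant of the algorithm (PIP also has Euclidean and perpendicular-distance variants; the toy price series is illustrative):

```python
def perceptually_important_points(series, n_points):
    """Iteratively keep the point with the largest vertical distance to the
    chord joining its nearest already-selected neighbors."""
    kept = [0, len(series) - 1]                  # always keep the endpoints
    while len(kept) < n_points:
        best_i, best_d = None, -1.0
        for left, right in zip(kept, kept[1:]):
            for i in range(left + 1, right):
                t = (i - left) / (right - left)
                chord = series[left] + t * (series[right] - series[left])
                d = abs(series[i] - chord)       # vertical distance to the chord
                if d > best_d:
                    best_i, best_d = i, d
        kept.append(best_i)
        kept.sort()
    return kept

prices = [10, 10.5, 12, 11, 9, 9.5, 13, 12.5, 11, 10]
print(perceptually_important_points(prices, 5))  # → [0, 2, 4, 6, 9]
```

The five indices retained are the endpoints plus the three most salient turning points, which is the compact pattern representation fed to the ANN in the study.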
This thesis is based on Market Profile Theory and studies the displacement of the point of control relative to the points of control of historical trading days, as well as changes in time-price-opportunity counts, to find the best extremely short-term entry and exit points. Finally, this thesis anticipates finding potential market behavioral knowledge through experiments and statistical analysis, which can help traders make profits in very short-term trading and confirms that the Taiwan Stock Exchange Capitalization Weighted Stock Index (TAIEX) futures market does not satisfy the weak-form efficient market hypothesis. This thesis finds that the point of control of a historical trading day can serve as a reference entry point; using the point of control over the past five days as a reference yields better profit performance across all historical trading days. It also shows that the point of control is the price accepted by the most traders. When the point of control shifts to a new price in a very short time, the difference between the time-price-opportunity counts on either side of the point of control can also serve as a reference entry point.

[68] Protecting Link Privacy for Large Correlated Social Networks
Lin Yang, Harbin Institute of Technology Shenzhen Graduate School
Abstract: Privacy is widely studied in various research domains, and protecting social network privacy is gradually attracting more and more research effort. Most existing approaches seldom consider the situation in which social data may be correlated. In this paper, we make the first attempt to study this issue by modeling such correlation as the probability that a vertex could be a potential friend of a given vertex. By not allowing the potential-friend vertex to be selected as a neighbor vertex in the perturbed graph, we protect not only the direct neighbors but also the highly correlated indirect neighbor vertices.
We then define privacy and utility measurements for evaluating whether a perturbed graph is good. Experiments performed on three datasets and compared with a state-of-the-art algorithm demonstrate that our approach achieves especially good results on dense graphs, while remaining comparable to the baseline elsewhere.

[69] A Protocol for Extending Analytics Capability of SQL Database
Manyi Cai, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Abstract: To extend the capability of big data analytics in SQL databases, we propose an interaction protocol and communication framework called Dex. Using the Dex protocol, we can integrate a SQL database system and a big data system into a unified data analysis platform. The integrated system allows users to call complex analytics functions available in the big data analytics system with a simple SQL statement. We prototype our idea on PostgreSQL and Spark and demonstrate promising performance gains over pure SQL UDF solutions.

[70] Competitive Intelligence Study on Macau Food and Beverage Industry
Simon Fong, University of Macau
Abstract: Due to the dynamic nature of commerce, it is important to capture useful external information to support strategy building and decision making in the current competitive market with its large volume of data. A review and summary of current competitive intelligence development, mainly new concepts and tools, is conducted in this report, and a system is proposed for the food and beverage industry aimed at obtaining competitive advantage in the Macau market.

[71] Finding Optimal Meteorological Observation Locations by Multi-Source Urban Big Data Analysis
Guoshuai Zhao, Xi'an Jiaotong University
Abstract: In this paper, we address the site selection problem for building meteorological observation stations by recommending suitable locations. The functions of these stations are meteorological observation and prediction in regions that lack them.
Thus, two specific problems are solved in this paper. One is how to predict the meteorology in regions without stations by using the known meteorological data of other regions. The other is how to select the best locations for new observation stations. We design an extensible two-stage framework for station placement comprising a prediction model and a selection model, which makes it convenient for executives to add further real-life factors to the model. We consider not only selecting the locations that provide the most accurate predicted data but also how to minimize the cost of building new observation stations. We evaluate the proposed approach using real meteorological data from Shaanxi province. The experimental results show that our model outperforms existing commonly used methods.

[72] Research on Algorithm of PSO in Image Segmentation of Cement-Based
Xiaojie Deng, Hubei University of Technology
Abstract: This paper adopts the OTSU segmentation method. To verify the superiority of chaos particle swarm optimization (PSO), test functions are used before segmentation to assess the accuracy and efficiency of the chaos PSO algorithm. The OTSU method is then optimized with, and compared across, four kinds of optimization algorithms in order to select the best image segmentation, providing a scalable processing platform for future research.
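The OTSU thresholding that abstract [72] optimizes with chaos PSO can be sketched as follows. For clarity the threshold search here is exhaustive over all 256 gray levels rather than PSO-driven, and the bimodal test data is synthetic:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (OTSU's criterion)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                        # gray-level probabilities
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()        # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2         # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t

rng = np.random.default_rng(0)
# Bimodal stand-in image: dark background around 60, bright regions around 180.
img = np.concatenate([
    rng.normal(60, 10, 2000), rng.normal(180, 10, 1000)
]).clip(0, 255).astype(np.uint8)
t = otsu_threshold(img)
print(t)
```

A PSO-based variant would search the same between-class-variance objective with a particle swarm instead of scanning all 256 candidate thresholds, which pays off when the objective is evaluated over large images many times.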
Index

Keynote Speakers
Carlo Ghezzi [1]
Xin Yao [2]
Ziran Zhao [3]

Paper Session
Chunzhi Wang [4]
Yi Tan [5]
Chunzhi Wang [6]
Rongzhen Li [7]
Li Zhang [8]
Xinge You [9]
Yurong Zhong [10]
Zhigang Xu [11]
Ehab Mohamed [12]
Anyong Qin [13]
Xing Liu [14]
Xiaohuan Lu [15]
Huapeng Yu [16]
Siping Shi [17]
Yi Li [18]
Peng Huang [19]
Lei Duan [20]
Luyan Xiao [21]
Xutian Zhuang [22]
Li Zhang [23]
Manhua Jiang [24]
Ni Luo [25]
Caiquan Xiong [26]
Jia-Yow Weng [27]
Chang Lu [28]
Chang Liu [29]
Xinlu Zong [30]
Li Zheng [31]
Xuegang Wu [32]
Binyang Li [33]
Weidian Zhan [34]
Jingwei Zhang [35]
Xichun Yue [36]
Zhiguo Gong [37]
Yubin Zhao [38]
Qingquan Lai [39]
Kelvin Tsoi [40]
Chan-Fu Kuo [41]
Lujia Wang [42]
Yunpeng Shen [43]
Guangtao Zhai [44]
Haitian Li [45]
Jiantao Zhou [46]
Jianping Gu [47]
Yu-Fu Chen [48]
Szu-Hao Huang [49]
Chin Chou [50]
Jou-Fan Chen [51]
Cho-Chin Lin [52]
Bo Li [53]
Li Zhang [54]
Tsz Fai Chow [55]
Zhenyu Liao [56]
Ruizong Liu [57]
Wenbin Fang [58]
Shaohuai Shi [59]
Tipaporn Juengchareonpoon [60]
Wenhan Zhu [61]
Enqing Tang [62]
Xiangxiang Jiang [63]
Xiaoxue Hu [64]
Chang Liu [65]
Mei-Chen Wu [66]
Yu-Hsiang Hsu [67]
Lin Yang [68]
Manyi Cai [69]
Simon Fong [70]
Guoshuai Zhao [71]
Xiaojie Deng [72]