Anomaly Detection and Preprocessing
By Ibrahim Khamis
A Thesis Presented to the Masdar Institute of Science and Technology in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computing and Information Science
© 2014 Masdar Institute of Science and Technology. All rights reserved.

Abstract
In sustainable environments, efficient anomaly (outlier) detection is essential for monitoring and controlling the system and for supporting the decision-making process. Anomaly detection is inherently difficult because it requires deciding what is normal and what is unusual, and distinguishing between the two. A further difficulty is that the definition of normal can change. Sensor nodes in wireless sensor networks have limited energy resources, which hinders the dissemination of the gathered data to a central location. This motivated our research to use the limited computational capabilities of these sensor nodes to build a normal model of the gathered data. Our goal is to determine what is normal and what is abnormal, and to distinguish between the two. We developed an algorithm called “Two-layered Data Capture Anomaly Detection”. The algorithm sends the anomalies (2% of the data) together with roughly 2% or 4% of the normal data for further processing and classification. For testing purposes we also deployed three different machine learning and data mining tools, and three separate data sets were used to validate the system. The performance of the proposed method is evaluated and compared with results obtained by applying state-of-the-art methods to the same data sets. In these tests our method provided very promising results.

This research was supported by the Government of Abu Dhabi to help fulfill the vision of the late President Sheikh Zayed Bin Sultan Al Nahyan for sustainable development and empowerment of the UAE and humankind.
Acknowledgments
Praise be to Allaah. I would like to extend my gratitude to my family members for their patience and support. I would also like to take this opportunity to thank those who actively guided and helped me in this research. Foremost, I would like to express my deep appreciation to my advisor, Dr. Zeyar Aung, for his continuous support of my M.Sc. study and research. His guidance, patience, motivation, and support helped me to develop a deep understanding of the subject. Besides my advisor, I would like to thank my thesis supervisory committee members, Dr. Khaled Elbassioni and Dr. Wei Lee Woon, for their valuable time, comments, and advice.
Ibrahim Khamis
Masdar City, April 30, 2014

Contents
1 Introduction
1.1. Background and Motivation
1.2. Objectives and Contributions
1.3. Relevance to Masdar/UAE
1.4. Publication
1.5. Thesis Organization
2 Literature Review
2.1. Wireless Sensor Network (WSN)
2.2. Data Mining for Outlier Detection
2.3. Outlier Detection for WSNs
2.3.1. Statistical-based Techniques
2.3.2. Nearest Neighbor-based Techniques
2.3.3. Clustering-based Techniques
2.3.4. Classification-based Techniques
2.3.5. Comparison of WSN Outlier Detection Techniques
2.3.6. Recent Trends
2.4. Shortcomings of Outlier Detection Techniques
2.5. Requirements for Outlier Detection in WSNs
3 Proposed Method
3.1. Data Capture Anomaly Detection (DCAD)
3.2. From DCAD to TLDCAD
3.3. Use Case (Scenario)
4 Experimental Setups and Results
4.1. Datasets
4.1.1. Synthetic Datasets
4.1.2. Grand Saint Bernard (GSB) Dataset
4.1.3. Wind Tower Dataset
4.2. Outlier Detection Performance Measurements
4.3. Algorithms Explored
4.4. Experiment I: Preprocessing Approach
4.4.1. Experiment I Results
4.5. Experiment II: Classification Approach
4.5.1. Experiment II Results
5 Conclusion and Future Work
5.1. Conclusion
5.2. Future Work
A Abbreviations
B Masdar Wind Tower
Bibliography

List of Tables
Table 1: Comparison of Features for Multivariate Outlier Detection Techniques for WSNs, adopted from [32].
Table 2: Comparing Different Approaches on Outlier Data, adopted from [54].
Table 3: Example of Wind Tower Data.
Table 4: Confusion Matrix
Table 5: Preprocessing Experiment Flow of TLDCAD vs. DCAD as a Preprocessor.
Table 6: DCAD Preprocessing Process Summary.
Table 7: TLDCAD Preprocessing Process Summary.
Table 8: Average Results for 5,000 Synthetic Data Points in Experiment I-A.
Table 9: Average Results for 50,000 Synthetic Data Points in Experiment I-B.
Table 10: Best Achieved Results for TLDCAD with SVM vs. DCAD as a Classifier in Experiment II.
Table 11: Masdar Wind Tower Photographs and Images.

List of Figures
Figure 1: Data Capture Anomalies Detection (DCAD) Algorithm [24].
Figure 2: Our Proposed Two-layered DCAD Algorithm.
Figure 3: Thesis Contributions.
Figure 4: Three outlier sources in WSNs and their corresponding detection techniques, adopted from [32].
Figure 5: Categorization of WSN Outlier Detection Methods using Data Mining [23].
Figure 6: Generic Categorization of Outlier Detection Methods [28].
Figure 7: Outlier Detection Technique for WSNs, adopted from [32].
Figure 8: Advantage of Mahalanobis Distance.
Figure 9: Recent Developments in Outlier Detection in WSN
Figure 10: DCAD Illustration.
Figure 11: Effective Radius.
Figure 12: TLDCAD.
Figure 13: WSN, adopted from [53].
Figure 14: GSB Data Scatter Plot.
Figure 15: Wind Tower Data Scatter Plot.
Figure 16: Precision (P) and Recall (R), adopted from [35].
Figure 17: Flowchart of DCAD as a Preprocessor vs. TLDCAD.
Figure 18: One of the Ten Folds: Illustration of TLDCAD vs. DCAD as Preprocessor.
Figure 19: Flowchart of DCAD as a Classifier vs. TLDCAD.
Figure 20: One of the Ten Folds: Illustration of TLDCAD & SVM vs DCAD as a Classifier.
Figure 21: FFIDCAD with effective n [19].
Figure 22: Wind Tower air flow diagram (Photographed by Ibrahim Khamis)
Figure 23: Wind Tower Image (Photographed by Ibrahim Khamis)
Figure 24: Wind Tower Bank of 75 high-pressure nozzles while introducing mist to the Wind Tower ventilation tube from inside (Photographed by Ibrahim Khamis)
Figure 25: Wind Tower Background (Photographed by Ibrahim Khamis)
Figure 26: Wind Tower how it works (Photographed by Ibrahim Khamis)
Figure 27: Wind Tower Thermal Comfort (Photographed by Ibrahim Khamis)

CHAPTER 1: Introduction

1.1. Background and Motivation
A wireless sensor network (WSN) is a group of spatially dispersed, dedicated sensors that monitor and record the physical conditions of the environment and organize the collected data at a central location. WSNs measure environmental conditions such as temperature, sound, pollution level, humidity, wind speed and direction, and pressure. WSNs are widely used in areas such as the manufacturing industry [26], the military [15], environmental monitoring [3], smart power grids [6], smart buildings and homes [16], and many other applications that require distributed, location-aware data sensing [2]. WSNs are cheaper and more practical than wired networks. However, they are vulnerable to intrusions and faults [6], and their nodes are resource-constrained devices. In general, WSN data needs to be mined to detect anomalies as efficiently as possible. Once found, the anomalies are sent to the base station or central location for further processing. Outliers are encountered in many applications.
The following terms are commonly used for them in the data mining community: uncommon behavior in data, rare instances, outliers, anomalies, deviations, exceptions, and irregularities [1]. Hawkins provided the following definition of an outlier: “An outlier is defined as an observation that deviates too much from other observations that it arouses suspicions that it was generated by a different mechanism from other observations” [28].
Anomaly detection is inherently difficult because it is essentially the problem of deciding what is not normal; frequently there are no predetermined examples or models of "abnormal" data, and these need to be determined from the statistical properties of the data. Anomaly detection in wireless sensor nodes is even more challenging because the nodes have limited power and computing resources. It is virtually impossible to disseminate all the gathered data to a central location to detect the anomalies. On the other hand, anomalies are of particular interest because they may represent faults, intrusions, malicious attacks, or even fire alarms, and they could also automatically trigger actions such as dispatching a repair crew to fix faulty sensors.
This motivated our research to make use of the limited computational capabilities of these devices by building a normal model of the gathered data. Data that deviates from this model can then be classified as "anomalous" and forwarded to a central location for further processing. This process runs inside the devices and hence saves the power that would be needed to transmit all the data.

1.2. Objectives and Contributions
The objective of this thesis is to develop an efficient and robust algorithm for anomaly detection and preprocessing in energy-constrained devices such as WSN nodes.
The aim is to find a balance among the three desirable factors of speed, accuracy, and low energy consumption for anomaly preprocessing and detection in WSNs. Towards this end, we propose a novel algorithm that we call “Two-layered Data Capture Anomaly Detection” (TLDCAD). This algorithm is based on an existing technique, the ellipsoidal Data Capture Anomaly Detection (DCAD) method [24], which is illustrated in Figure 1. DCAD is an anomaly detection algorithm that uses the mean and covariance matrix of the data to define an ellipse that captures the overall distribution of the data.
Figure 1: Data Capture Anomalies Detection (DCAD) Algorithm [24].
The anomalies are then detected by setting a threshold that defines an outer boundary of the ellipse encompassing 98% of the data. Any data points that fall outside this boundary are considered anomalies and are captured and sent for further processing. In this way, DCAD helps to improve energy efficiency by selecting only points that are considered “interesting”. However, this is only part of the problem, since there should also be a way to select interesting normal points. TLDCAD addresses this by setting a second threshold on the data (in our experiments we tried the 94% and 96% levels) to capture an additional layer of data points that lie between this new level and the original 98% level. These points are labeled as normal and sent to a central computer for further processing.
Figure 2: Our Proposed Two-layered DCAD Algorithm.
Our aim is to reduce the power consumption of resource-constrained devices such as wireless sensors. We reduce the noise level by preprocessing inside the sensor node and then send a reduced sample of the data (before any noise from communication between intermediate nodes is introduced) to the central computer or to a central node with a light version of SVM for further classification, visualization, and exploration.
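The two-threshold idea described above can be sketched in a few lines. This is only an illustrative sketch, not the thesis implementation: the function names (`fit_normal_model`, `squared_mahalanobis`, `two_layer_split`) and the synthetic `readings` are hypothetical, and the boundaries are taken here as empirical quantiles of the squared Mahalanobis distance, which is an assumption; DCAD [24] may derive its threshold differently.

```python
import numpy as np

def fit_normal_model(data):
    """Estimate the mean and inverse covariance that define the ellipse."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    return mean, np.linalg.inv(cov)

def squared_mahalanobis(data, mean, inv_cov):
    """Squared Mahalanobis distance of every row of `data` from `mean`."""
    diff = data - mean
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

def two_layer_split(data, outer=0.98, inner=0.96):
    """Return boolean masks for anomalies and for the normal 'band' layer.

    Assumption: each boundary is the empirical quantile of the squared
    distances, so the outer ellipse covers `outer` of the data.
    """
    mean, inv_cov = fit_normal_model(data)
    d2 = squared_mahalanobis(data, mean, inv_cov)
    outer_t2 = np.quantile(d2, outer)
    inner_t2 = np.quantile(d2, inner)
    anomalies = d2 > outer_t2                    # outside the 98% ellipse
    band = (d2 > inner_t2) & (d2 <= outer_t2)    # between the two ellipses
    return anomalies, band

# Hypothetical two-dimensional sensor readings (e.g. temperature, humidity).
rng = np.random.default_rng(0)
readings = rng.multivariate_normal([20.0, 50.0], [[4.0, 3.0], [3.0, 9.0]], size=5000)

anomalies, band = two_layer_split(readings)
forwarded = anomalies | band   # only these points would leave the node
print(anomalies.sum(), band.sum(), forwarded.sum())
```

With the 96% inner level, the forwarded sample contains roughly equal numbers of anomalies and normal points, which is the balance the text refers to; everything else stays on the node.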
We accomplish this by adopting a new approach to data preprocessing that provides sampled data using a two-level ellipse. This produces balanced data sets with around 50% anomalies. The sampled data can then be used with classifiers that require relatively balanced data sets, such as the typical Support Vector Machine (SVM). The data is further processed by the SVM to provide more accurate classification results. With this distributed approach we combine the speed of the ellipse method on the WSN node with the accuracy of the SVM on the central computer. In effect, we use a distributed data mining approach to try to answer whether anomalies can become classifiable.
The contributions of this thesis can be summarized as follows.
- Contribution 1: A new model for lightweight anomaly preprocessing and detection which applies two separate probability thresholds (TLDCAD).
- Contribution 2: Two distinct usage mechanisms, where TLDCAD can be used either as a preprocessor or as a classifier.
- Contribution 3: Comparative evaluation of TLDCAD on both synthetic and real data sets.
Figure 3 shows these three main research contributions in pictorial format.
Figure 3: Thesis Contributions.
The advantages of the proposed approach can be summarized as follows:
- It provides a new approach to data preprocessing by acquiring the most informative sampled data using two ellipses.
- It reduces the power consumption in resource-constrained devices such as wireless sensor nodes; thus, it improves the sustainability and detection capability of the whole WSN.
- It reduces the effect of communication channel noise by preprocessing inside the sensor nodes and then sending a reduced set of sampled data to the server for further exploration.
- It improves the reliability of the data by providing more efficient WSN data samples. In addition, it improves the security and privacy of the data, because only parts of the data are communicated.
- It produces more balanced data sets, which are better for classification, with around 50% or 25% outliers instead of just 2% outliers. (Some classifiers, such as a typical Support Vector Machine, do not work well with extremely unbalanced data sets.)

1.3. Relevance to Masdar/UAE
The field of energy efficiency has evolved to span many areas of science and applications that need to be redesigned and reframed for it. Masdar Institute is a global institute focused on this and other related challenges. Techniques that facilitate the use of energy-constrained devices for data collection are directly relevant to Masdar's vision, which is centered on energy efficiency and green technologies. Moreover, the proposed method is tested on data collected from sensors attached to the Wind Tower air-cooling structure at Masdar Institute.

1.4. Publication
Some portions of the research described in this thesis have been published in the following paper [13]:
I. Khamis and Z. Aung, “Outlier preprocessing in wireless sensor networks: A two-layered ellipse approach,” in Proceedings of the 6th IEEE International Conference on Developments in eSystems Engineering (DeSE), 2013, pp. 1-6.

1.5. Thesis Organization
The remainder of the thesis is organized as follows. Chapter 2 gives an overview of current technologies in WSNs and anomaly detection. Chapter 3 explains the proposed algorithm. Chapter 4 describes the experimental setup and the results, followed by the conclusion and future work in Chapter 5.

CHAPTER 2: Literature Review

2.1. Wireless Sensor Network (WSN)
A wireless sensor network (WSN) is a network that consists of a number of nodes, each of which is connected wirelessly to other nodes in the network.
WSNs are feasible solutions in situations where it is difficult, costly, or even impractical to implement wired networks [24]. Each sensor node can host many types of sensors, which allow the node to collect many types of data. Because a node carries many sensors, the data it collects is multidimensional. Some sensor nodes now have very good computational capabilities; for example, the Waspmote is one of the sensor devices available to developers. This sensor node illustrates how wireless sensors are becoming more like mini computers. Recent papers have targeted the computational capability of modern sensor nodes to detect anomalies locally in a decentralized mode [19] [24].
One of the important roles of a WSN is to detect important events or faults in the network nodes. Detecting important events or anomalies at the node level reduces the amount of data to be transmitted over the network, since only the detected events are transmitted instead of the whole data set. In such situations the need for some kind of event detection system becomes crucial. This is where data mining techniques come in, as discussed in the following section.

2.2. Data Mining for Outlier Detection
Narita and Kitagawa defined data mining as systematically extracting useful information from data [20]. The aim of data mining is to find patterns in data sets. Data mining can be either supervised or unsupervised. Supervised data mining uses labeled training data to build a classification model, which is then used to classify new test data. Unsupervised data mining does not use labeled data to classify new data; it normally uses a technique such as clustering to build clusters around the normal data.
Hence, supervised outlier detection algorithms learn a model from the labeled training data and decide whether each test data point is normal or abnormal. Unsupervised outlier detection finds outliers without prior knowledge of the data [21]. For example, when the data are clustered, the clusters represent the normal data and any data points that fall outside the cluster borders are considered outliers. In addition, the aim of traditional pattern recognition is to find the majority of the data and treat outliers as noise. However, noise for one person could be a signal for another [12]. Outliers can indicate important events in some situations and can be more important than the normal data. For example, data sensed during a fire alarm is more important than the normal data. Outlier detection is an important field of data mining [1]. Outliers go by many names in the data mining community, including uncommon behavior in data, rare instances, anomalies, deviations, exceptions, and irregularities [1].
There are many studies on outlier detection. For example, Jiang and Yang treated clusters as units and searched for outlier clusters [12]. In this case the whole cluster becomes an outlier. Lee et al. proposed a novel method for trajectory outlier detection [14], in which a trajectory that is abnormal among the other trajectories becomes an outlier. Moreover, Menold et al. pointed out that a data point can be compared to the median of the past and present values, and is an outlier if the result exceeds a certain threshold [18]. This shows the use of temporal data (data related to time). On the other hand, some people are concerned with the privacy issues of outlier detection. Challagalla et al.
inferred that the detection of outliers threatens some organizations and raises their concerns about the privacy of the analyzed data; for that reason it is important to incorporate some sort of privacy protection in the outlier detection technique [4].

2.3. Outlier Detection for WSNs
There are many ways to categorize outlier detection. As shown in Figure 4, Zhang et al. [32] divided outlier detection in WSNs into three branches according to the source of the outliers. The first is fault detection in WSNs, which deals with noise and errors; the second is event detection in WSNs, which deals with events; and the third is intrusion detection in WSNs, which handles malicious attacks.
Figure 4: Three outlier sources in WSNs and their corresponding detection techniques, adopted from [32].
Qu classified the outlier detection methods as illustrated in Figure 5 [23]. This classification divides the outlier detection methods into five main branches: distribution-based, depth-based, clustering, distance-based, and density-based. The most widely used are the density-based and distance-based methods. Li and Kitagawa said that the distance-based method is one of the most common and simplest methods used for outlier detection [14].
Figure 5: Categorization of WSN Outlier Detection Methods using Data Mining [23].
Figure 6: Generic Categorization of Outlier Detection Methods [28].
A related yet more generic classification of outlier detection methods (not only for WSNs) is provided by Xi [28], as illustrated in Figure 6. In this classification the author divided the outlier detection algorithms into three main categories. The first main category is classic outlier detection, which in turn is divided into four subcategories: statistical-based, distance-based, deviation-based, and density-based approaches.
The second main category is spatial outlier detection, which is a modification of the classic approach that takes into account the spatial attributes of the data. Spatial attributes are the attributes that relate to location. The third main category, implicitly stated by Xi, is the “recent advances” in outlier detection. This category has two subcategories: the high-dimension-based approach and the SVM-based approach.
Zhang et al. [32] proposed a similar taxonomy for outlier detection techniques in WSNs, as shown in Figure 7. The main categories are statistical-based (further subdivided into parametric and non-parametric), nearest-neighbor-based, clustering-based, classification-based, and spectral-decomposition-based. The parametric techniques are divided into Gaussian-based and non-Gaussian-based. The non-parametric techniques are divided into kernel-based and histogram-based. The classification-based techniques are divided into support-vector-machine-based and Bayesian-network-based. The Bayesian-network-based techniques are subdivided again into naïve Bayesian network-based, Bayesian belief network-based, and dynamic Bayesian network-based. The spectral decomposition category contains only principal component analysis. A comparison of various features of the WSN outlier detection methods is also given in Table 1.
Janssens et al. [9] also compared some outlier detection methods from Machine Learning (ML) and Knowledge Discovery in Databases (KDD). The ML techniques used are SVM and Parzen windows, and the KDD techniques used are heuristic local-density estimation methods such as LOF and LOCI. Janssens et al. used the one-class classification framework, which they selected in order to be able to use the AUC (Area Under the Curve), a well-known performance measure. They found that Support Vector Domain Description (SVDD) is one of the best performing methods [9].
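As a brief illustration of the AUC measure mentioned above, the sketch below computes it directly from its probabilistic reading: the chance that a randomly chosen outlier receives a higher outlier score than a randomly chosen normal point, with ties counted as one half. The function name `auc` and the small `scores` and `labels` lists are made up for illustration and are not taken from [9].

```python
def auc(scores, labels):
    """Area under the ROC curve from outlier scores and 0/1 labels.

    Computed as the fraction of (outlier, normal) pairs in which the
    outlier has the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]  # outliers
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal points
    if not pos or not neg:
        raise ValueError("AUC needs at least one outlier and one normal point")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outlier scores from a one-class detector (higher = more anomalous).
scores = [0.9, 0.8, 0.35, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
value = auc(scores, labels)
print(value)
```

Because it depends only on the ranking of the scores, this measure allows one-class detectors with very different score scales to be compared without fixing a detection threshold.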
Now, let us discuss each category of outlier detection techniques for WSNs.
Figure 7: Outlier Detection Technique for WSNs, adopted from [32].
Table 1: Comparison of Features for Multivariate Outlier Detection Techniques for WSNs, adopted from [32]. Techniques compared: Subramaniam et al. [47], Rajasegarar et al. [48], Rajasegarar et al. [49], Janakiram et al. [50], Hill et al. [51], and Chatzigiannakis et al. [52]. Column headers: Sensor data, Outlier type, Correlation, Attribute, Spatial, Temporal, Local, Global, Individual, Collaboration, Aggregate, Centralized.

2.3.1. Statistical-based Techniques
Statistical-based techniques are the earliest methods used to detect outliers, and they are model-based. The two categories in this field are the parametric and non-parametric approaches. The parametric approach assumes that the data follows a known distribution; if the input data does not follow the assumed distribution, the results may be unreliable. The parametric approach has two subcategories: Gaussian-based and non-Gaussian-based. Non-parametric techniques do not assume any distribution for the data and are divided into kernel-based and histogram-based [32]. Their advantage is that they do not require any assumption about the distribution of the data.

2.3.2. Nearest Neighbor-based Techniques
This approach uses the nearest-neighbor values to find outliers and is one of the most commonly used methods [32]; however, it does not scale well as the number of input data variables increases.

2.3.3. Clustering-based Techniques
Tao and Pi observed that in many applications outlier and clustering results are needed at the same time [27]. In clustering-based outlier detection, the data are clustered and the data points that lie outside the clusters are considered anomalies.
One of the latest examples of clustering-based anomaly detection in WSNs is the Data Capture Anomaly Detection (DCAD) algorithm [24]. This algorithm uses a hyperellipsoidal boundary (cluster) to draw a normal model around the data, and the data points that fall outside this ellipsoidal boundary are considered anomalies [5].
Figure 8: Advantage of Mahalanobis Distance.
DCAD [24] and IDCAD [19] exploit the advantage of using the Mahalanobis distance for clustering the data. If the Euclidean distance is used, the distance from p2 to its nearest neighbor is greater than the distance from p1 to its nearest neighbor; with the Mahalanobis distance, as in Figure 8, the two distances are the same. (The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point's distance or residual from a common point.) Hence this feature is incorporated in DCAD and iterative DCAD (IDCAD) to find the ellipsoidal model that best fits the data in order to detect the outliers. IDCAD uses the same concept as DCAD, but it detects outliers online, in contrast to DCAD, which works in batch mode.

2.3.4. Classification-based Techniques
The classification approach is well known in data mining: the classification algorithm takes labeled input as training data, builds a model from it, and then accepts new data (test data) and labels it according to the built model. This section discusses two main types of classifiers: the support vector machine (SVM) and the Bayesian network. The Bayesian network classifier is subdivided in [32] into three subcategories: naïve Bayesian network-based, Bayesian belief network-based, and dynamic Bayesian network-based. The SVM classifier was explored in the field of WSNs in [32] and [25], and it shows very promising results.
Bahrepour et al. [54] explored many techniques and found that the Quarter-Sphere SVM is one of the best performers in terms of computational cost and detection accuracy. The results of that paper are reproduced in Table 2.

Table 2: Comparing Different Approaches on Outlier Data, adopted from [54].

| Technique | Accuracy on Artificial Data | Accuracy on Real Data |
|---|---|---|
| Standard SVM | 98.12% | 97.64% |
| Quarter-Sphere SVM | 98.53% | 98.05% |
| FFNN | 96.95% | 96.04% |
| The Fusion-based Approach (Naïve Bayes) | 84.90% | 91.00% |
| The Fusion-based Approach (FFNN) | 85.95% | 98.21% |
| Naïve Bayes | 94.84% | 75.07% |

The Quarter-Sphere SVM outperforms the other methods in terms of accuracy. However, this holds in a centralized mode; in distributed anomaly detection, where each node has to perform its anomaly detection locally on board, it has the disadvantage of high computational cost. In [24] the authors stated that the SVM has an issue in terms of its computational complexity, which is due to the kernel matrix computation and the linear optimization calculations. On the other hand, the dynamic Bayesian network model has the advantage of being able to operate on several data streams at once [32]. However, Bayesian network algorithms in general face challenges whenever the number of input variables becomes large in WSNs [32].

2.3.5. Comparison of WSN Outlier Detection Techniques
Tables 1 and 2 above compare, respectively, the features and the performance of various WSN outlier detection techniques. The most important techniques that emerge from the literature [32] are the SVM, dynamic Bayesian networks, and clustering. However, since the SVM is computationally complex for distributed WSN systems and dynamic Bayesian networks do not scale well with the number of variables, the obvious choice from Table 1 is clustering. The clustering technique shown in this table was proposed by Rajasegarar et al. [19].
According to Table 1, this technique has the following advantages. It works well with multivariate data, and it handles temporal correlations through a time window. It was also evaluated in both centralized and distributed anomaly detection settings, with promising results and less communication overhead.

2.3.6. Recent Trends
It is observed from one of the recent works [19] that the most promising direction in the WSN domain is unsupervised learning, mainly clustering, which has already been investigated in recent works such as the one-class SVM and IDCAD. However, the SVM still needs to be refined because of its computational complexity. The best choice so far is therefore IDCAD, because it is unsupervised, simple, has low computational complexity, has been implemented in a distributed environment, and has shown good results. Moreover, IDCAD is implemented in an online environment, which is a practical way to deal with the streaming nature of WSN data. For an overview of the latest works, see Figure 9.

Figure 9: Recent Developments in Outlier Detection in WSN

2.4. Shortcomings of Outlier Detection Techniques
Zhang et al. listed the shortcomings of the existing outlier detection techniques as follows [32].
- Most of the techniques ignore the multivariate nature of WSN data and assume univariate data, although an anomaly can be formed by a combination of more than one variable.
- Many techniques do not consider the correlations between the variables.
- Open questions remain: what is the appropriate sliding-window size for temporal data, and what is the appropriate choice of neighboring nodes?
- The work on distinguishing between the types of outlier is not sufficient; many techniques still do not distinguish between errors and outliers, which may lead to the loss of some important events (outliers).
- User-defined thresholds for determining outliers are vulnerable to the dynamic nature of WSN data.
- Many techniques do not consider the mobility of some WSNs and assume that the WSN is static.

2.5. Requirements for Outlier Detection in WSNs
Zhang et al. also enumerated the requirements for outlier detection techniques [32]. The many shortcomings of the existing outlier detection techniques in the field of WSNs motivate the development of dedicated outlier detection techniques for WSNs. The following are some important requirements for WSN outlier detection.
- Detection needs to be distributed to reduce communication overhead.
- Detection needs to be online to handle the streaming nature of WSN data.
- Unsupervised methods are preferable, since labeled data in WSNs are not easy to obtain.
- The detection rate must be high and the false alarm rate should be as low as possible.
- The technique must not be complicated or computationally complex, to suit the restricted resources of WSNs.
- The relations between the data must be considered. Time and the locations of neighbors are also important to take into account.
- The technique must discriminate between errors and measurements in an effective manner.

CHAPTER 3
3 Proposed Method
3.1. Data Capture Anomaly Detection (DCAD)

Figure 10: DCAD Illustration.

Firstly, a review of Data Capture Anomaly Detection (DCAD) [19] is presented, as this is the basis for the proposed TLDCAD algorithm. DCAD is used mainly for outlier detection in WSNs. It works by first constructing an ellipse which captures a given percentage (normally 98%) of the data. Data points which fall outside this ellipse are then classified as outliers, while points falling inside the ellipse are classified as normal.
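The batch procedure just described can be sketched in a few lines. The following is a minimal illustration, not the thesis implementation: it assumes two-dimensional data (as used in the experiments), and the function names `fit_ellipse`, `mahalanobis_sq`, `chi2_inv_2d`, and `dcad_outliers` are hypothetical. For two degrees of freedom the inverse chi-squared value has the closed form -2 ln(1 - p), so no statistics library is needed.

```python
import math


def fit_ellipse(points):
    """Return the sample mean and the inverse sample covariance of 2-D points."""
    k = len(points)
    mx = sum(p[0] for p in points) / k
    my = sum(p[1] for p in points) / k
    sxx = sum((p[0] - mx) ** 2 for p in points) / (k - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (k - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (k - 1)
    det = sxx * syy - sxy * sxy
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv


def mahalanobis_sq(p, mean, inv):
    """Squared Mahalanobis distance of point p from the ellipse center."""
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


def chi2_inv_2d(prob):
    """Inverse chi-squared CDF for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - prob)


def dcad_outliers(points, coverage=0.98):
    """Flag every point that falls outside the ellipse covering `coverage`."""
    mean, inv = fit_ellipse(points)
    t_sq = chi2_inv_2d(coverage)
    return [mahalanobis_sq(p, mean, inv) > t_sq for p in points]
```

With `coverage=0.98` the squared effective radius is about 7.82, so any point whose squared Mahalanobis distance exceeds that value is reported as an outlier. Data with more than two attributes would need a general matrix inverse and a general chi-squared quantile function instead of the closed form used here.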
However, DCAD sends parameters only (the mean and the covariance matrix), and if it is used for data preprocessing (sampling) it will send all the data: 98% normal and 2% anomalous. This motivated us to add another layer so that the data can be sampled (preprocessed) for use in a classification process with the support vector machine. In addition to providing outlier detection, TLDCAD also sends an additional effective sample of normal data for classification purposes.

DCAD and TLDCAD are based on the assumption that the data are normally distributed. We therefore start with the two important parameters of the bivariate Gaussian distribution, the mean and the covariance matrix, which are given in equations (1) and (2).

Let $X_k = \{x_1, \ldots, x_k\}$ be the data samples at time points $\{t_1, \ldots, t_k\}$, where each $x_j$ is a $d$-dimensional vector in $\mathbb{R}^d$. That is, the vector $x_j$ is a data instance related to time point $t_j$ and is composed of $d$ attributes (features).

$$m_k = \frac{1}{k}\sum_{j=1}^{k} x_j \qquad (1)$$

$$S_k = \frac{1}{k-1}\sum_{j=1}^{k} (x_j - m_k)(x_j - m_k)^T \qquad (2)$$

where $m_k$ and $S_k$ are the sample mean and sample covariance of $X_k$ respectively. The hyperellipsoid of effective radius $t$ centered at $m_k$ with covariance matrix $S_k$ is defined as:

$$e_k(m_k, S_k^{-1}, t) = \{\, x \in \mathbb{R}^d \mid (x - m_k)^T S_k^{-1} (x - m_k) \le t^2 \,\} \qquad (3)$$

where $S_k^{-1}$ is the characteristic matrix of $e_k$, and $t$ is the effective radius of $e_k$. The following quantity (4) represents the Mahalanobis distance from $x$ to $m_k$ (the Mahalanobis distance can be seen as the Euclidean distance scaled by the inverse of the covariance matrix), where $S_k^{-1}$ is the characteristic matrix of $e_k$:

$$\|x - m_k\|_{S_k^{-1}} = \sqrt{(x - m_k)^T S_k^{-1} (x - m_k)} \qquad (4)$$

The boundary surface of the ellipsoid is given by equation (5).

$$\partial e_k = \{\, x \in \mathbb{R}^d \mid (x - m_k)^T S_k^{-1} (x - m_k) = t^2 \,\} \qquad (5)$$

Figure 11: Effective Radius.

Definition 1: Any point that is outside $e_k$ is considered an anomaly. This is determined by computing the Mahalanobis distance of the point from the center of the data:

$$x \text{ is anomalous for } e_k \iff (x - m_k)^T S_k^{-1} (x - m_k) > t^2$$

Using $t^2 = (\chi^2_d)^{-1}_{p}$ with $p = 0.98$ results in an ellipsoidal boundary that covers at least 98% of the data under the assumption that the data were drawn from a normal distribution [19]. At this point, $t$ is the effective radius of the ellipse.
$(\chi^2_d)^{-1}_{p}$ is the inverse of the chi-squared statistic with $d$ degrees of freedom and probability $p$. In other words, the 98% of data points that are normal with respect to $e_k$ are defined as the set of data points that lie within the boundary obtained by setting $t^2 = (\chi^2_d)^{-1}_{0.98}$.

3.2. From DCAD to TLDCAD
Since we are looking mainly for an effective preprocessing tool in addition to the outlier detection provided by DCAD, we were motivated to extend the above DCAD method into TLDCAD (Two-layered DCAD). TLDCAD generates an additional inner ellipse in such a way that the band between the original ellipse generated by DCAD and the new inner ellipse covers either 2% or 4% of the data, namely the outermost normal data points.

For the 2% normal data points with respect to $e_k$, we take the set of data points that lie between the boundaries obtained by setting $t^2 = (\chi^2_d)^{-1}_{0.96}$ and $t^2 = (\chi^2_d)^{-1}_{0.98}$.

For the 4% normal data points with respect to $e_k$, we take the set of data points that lie between the boundaries obtained by setting $t^2 = (\chi^2_d)^{-1}_{0.94}$ and $t^2 = (\chi^2_d)^{-1}_{0.98}$.

Note that the purpose of the DCAD algorithm in our Experiment I is to serve as a data preprocessing tool that partitions the data and then sends all the data for further exploration, visualization, and classification. In contrast, the TLDCAD algorithm partitions the data and then sends only the outlier data plus a small subset of the normal data (either 2% or 4%) for further processing, for example with the support vector machine. TLDCAD has the advantage of providing more balanced data (2% normal vs. 2% anomalies, or 4% normal vs. 2% anomalies) in comparison to DCAD, which provides 98% normal vs. 2% anomalies.

Figure 12: TLDCAD.
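The two-layer selection can be illustrated with a short sketch. It is an assumption-laden example rather than the thesis code: it takes squared Mahalanobis distances that have already been computed for two-dimensional data, uses the closed form -2 ln(1 - p) for the inverse chi-squared value with two degrees of freedom, and the names `chi2_inv_2d` and `tldcad_partition` are hypothetical. The inner boundary is placed at a coverage of `outer - band`, so a band of 0.04 corresponds to the "4% normal" setting.

```python
import math


def chi2_inv_2d(prob):
    """Inverse chi-squared CDF for 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - prob)


def tldcad_partition(dist_sq, outer=0.98, band=0.04):
    """Split sample indices into (inner, band, outliers).

    dist_sq holds squared Mahalanobis distances from the ellipse center.
    Only the band and outlier samples would be transmitted for classification.
    """
    t_outer = chi2_inv_2d(outer)         # outer ellipse, about 7.82 for 98%
    t_inner = chi2_inv_2d(outer - band)  # inner ellipse, about 5.63 for 94%
    inner, ring, outliers = [], [], []
    for i, d in enumerate(dist_sq):
        if d > t_outer:
            outliers.append(i)   # anomalous: outside the outer ellipse
        elif d > t_inner:
            ring.append(i)       # outermost normal points, kept as the sample
        else:
            inner.append(i)      # bulk of the normal data, not transmitted
    return inner, ring, outliers
```

For example, `tldcad_partition([1.0, 6.0, 7.0, 8.0, 5.0])` places samples 1 and 2 in the band and sample 3 among the outliers, so three of the five samples would be sent onward.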
Note that for TLDCAD the rationale for choosing the outermost 4% as representatives of the "normal" data is that these data points are very close in Mahalanobis distance to the decision boundary. The remaining points are likely to be redundant, since they are far away in Mahalanobis distance from the decision boundary, and can hence be removed without much consequence.

3.3. Use Case (Scenario)

Figure 13: WSN, adopted from [53].

Fig. 13 shows the mode of operation of the standard DCAD algorithm. Data that have been classified as anomalous are transmitted via the WSN, through a gateway, to the central computing facility. Note that the original aim of DCAD was anomaly detection; however, there could be other potential applications. For example, it would be useful to be able to return an efficient but representative subset of the data for training machine learning and other decision support algorithms. Achieving this requires an extension to the basic DCAD algorithm, namely a method for selecting a critical subset of the normal data. TLDCAD provides a simple but effective way of achieving this. In our preprocessing experiments, this helps the WSN to greatly reduce energy consumption by communicating only 4% or 6% of the data points instead of 100% of them. The sampled data also reduce the running time of the algorithm used at the central node of the WSN, or at the external processing computer, when a large data stream is collected from a large number of sensor nodes.

Moreover, DCAD originally sends only parameters and outliers, which are of little use for machine learning and further classification purposes.

CHAPTER 4
4 Experimental Setups and Results
4.1. Datasets
4.1.1.
Synthetic Datasets
Synthetic data were generated by sampling from a bivariate Gaussian distribution, as was done in [19]. The parameters of the distribution are simply the mean and the covariance matrix. The synthetic data are two-dimensional only for visualization purposes. We generated 7 datasets of the following sizes:
1. 500
2. 1,000
3. 2,000
4. 3,000
5. 4,000
6. 5,000
7. 50,000
Note that we found by experiment that our method works better with smaller datasets, so we focused on small datasets of 500 to 5,000 data points. In addition, we did not extend our experiments beyond 50,000 data points due to the research time limit.

4.1.2. Grand Saint Bernard (GSB) Dataset
The GSB dataset was gathered in 2007 from 23 sensors that were deployed at the Grand-St-Bernard pass between Italy and Switzerland [44]. We extracted the data gathered during October by station 10. Also for visualization purposes, we extracted only two features:
Column 9: Ambient Temperature.
Column 12: Relative Humidity.
The extracted GSB data consist of 17,302 two-dimensional data points.

Figure 14: GSB Data Scatter Plot.

Note that in the GSB dataset there seems to be a sudden fault in the humidity sensor of the node, which is depicted by the U-shaped structure at the bottom of the GSB scatter plot in Fig. 14 [19].

4.1.3. Wind Tower Dataset
One of the obvious and important places to gather data is the Masdar Wind Tower project. The wind data were collected at the bottom of the cooling opening of the innovative Masdar Wind Tower. The Wind Tower is a modern implementation of the traditional Arabic wind tower that has been used to provide cooling for traditional Arabic houses [45], [46]. The data were collected from 10:15:15, 02-10-2013 until 11:01:40, 08-10-2013. The dataset contains 8,082 two-dimensional data points. The two collected attributes are:
Column 1 = Relative Humidity.
Column 2 = Temperature.
Let us highlight that the gathered data are two-dimensional, where each dimension represents one attribute of the sensed data, and that we focused on two dimensions in order to be able to visualize the data in our experiments. However, more dimensions could be considered in future work. Table 3 provides an example of two-dimensional vectors containing the two attributes, temperature and humidity.

Table 3: Example of Wind Tower Data.

| Humidity | Temperature |
|---|---|
| 43 | 32.4 |
| 75 | 31.4 |
| 99.9 | 28 |
| 25 | 37.4 |

Note that the data shown in Table 3 are not consecutive data points; they only illustrate various wind tower data values at different times.

Figure 15: Wind Tower Data Scatter Plot.

Note that in the Wind Tower dataset there seems to be 100% saturation of the relative humidity sensor of the Arduino node (see the vertical line structure at the right of the Wind Tower scatter plot in Fig. 15).

Note also that we restricted ourselves to two-dimensional data for several reasons. First, the number of WSN nodes could become very large in the future, so large amounts of data will be gathered, and we did not want to complicate the project at this stage. Second, some sensors can become faulty or malfunction due to the harsh summer climate and the dusty weather. Third, our project currently requires only relative humidity and temperature analysis, and two dimensions are easier to use for continuous data collection and for future online data monitoring and control.
In summary, at present we did not run experiments with more than two attributes, both to be able to visualize the data and because deployments of tens or hundreds of sensors could produce very large data collections. Since the analysis currently required for the wind tower covers only temperature and relative humidity, going beyond two dimensions is not essential for now.

4.2. Outlier Detection Performance Measurements
The detection rate and the false alarm rate (also known as the false positive rate, FPR), together with receiver operating characteristic (ROC) curves, are usually used to show the trade-off between the detection rate and the false alarm rate in WSNs [32]. Intrusion detection is an important application of outlier detection, and the metrics commonly used in that field are ROC analysis, precision, recall, the F-measure, and the confusion matrix [5].

Table 4: Confusion Matrix

| Actual labels | Predicted: Normal | Predicted: Anomalies |
|---|---|---|
| Normal | True Negative (TN) | False Positive (FP) |
| Anomalies | False Negative (FN) | True Positive (TP) |

TN and TP are the correctly classified cases. From Table 4 the following equations can be defined to calculate the precision, recall, and F-value:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Figure 16: Precision (P) and Recall (R), adopted from [35].

In Figure 16 the relevant items are to the left of the straight line, while the retrieved items are within the oval. The red regions represent errors. On the left these are the relevant items not retrieved (false negatives), while on the right they are the retrieved items that are not relevant (false positives) [35].

We selected precision, recall, and F1 as our metrics because they are well known from the data mining perspective; in addition, F1 [5] can be seen as a balanced measure that contains both precision and recall.

4.3.
Algorithms Explored
The data sampled from the ellipses (using both DCAD and TLDCAD) are evaluated with two main types of classifiers: the Support Vector Machine (SVM) and the Artificial Neural Network (ANN).

4.4. Experiment I: Preprocessing Approach
In the preprocessing approach, an additional layer is added to DCAD to obtain TLDCAD, and the output data are then labeled with two labels (classes): "anomalous" and "normal". We then send the data to a classifier, either the Support Vector Machine (SVM) or the Artificial Neural Network (ANN), and compare the results obtained from the DCAD output and the TLDCAD output to draw the final conclusion. The flowchart in Fig. 17 summarizes the methodology used for the preprocessing approach. Table 5 shows how the data are generated and how they are preprocessed using the ellipses for the SVM classifier. (The flow is virtually the same for the ANN classifier.)

Table 5: Preprocessing Experiment Flow of TLDCAD vs. DCAD as a Preprocessor.
Step 1. Synthetic data generation: scatter plot of 5,000 normally distributed synthetic data points.
Step 2. Preprocessor output:
  TLDCAD's output: 2% normal data (between the two ellipses) and 2% anomalous data (outside the outer ellipse).
  TLDCAD's output: 4% normal data (between the two ellipses) and 2% anomalous data (outside the outer ellipse).
  DCAD's output: 98% normal data (within the ellipse) and 2% anomalous data (outside the ellipse).
Step 3. Scatter plots for 2% normal data vs. 2% anomalous data, 4% normal data vs. 2% anomalous data, and 98% normal data vs. 2% anomalous data.
Step 4. SVM output for 2% normal data vs. 2% anomalous data, 4% normal data vs. 2% anomalous data, and 98% normal data vs. 2% anomalous data.

Table 6: DCAD Preprocessing Process Summary. DCAD (98% Normal vs. 2% Outliers) processes summary

Table 7: TLDCAD Preprocessing Process Summary. TLDCAD (4% Normal vs.
2% Outliers) processes summary

Figure 17: Flowchart of DCAD as a Preprocessor vs. TLDCAD.

Figure 18: One of the Ten Folds: Illustration of TLDCAD vs. DCAD as Preprocessor.

4.4.1. Experiment I Results
Note that for both approaches we focus on the "2% vs. 4%" TLDCAD, since it gives better results than "2% vs. 2%".

Tables 8 and 9 show promising and even competitive results from the preprocessing approach. Note how the F1 measure increases as the normal layer grows. For example, the average F1 for the SVM cross validation was 0.963 for the 2% TLDCAD and increased to 0.989 for the 4% TLDCAD, which is very competitive with the DCAD 98% normal output, whose F1 measure is 0.999. That is, by using TLDCAD with 4% normal data we can obtain very competitive accuracy measures while using only 6% of the data, rather than using 100% of the data to gain an increase of only about 1% in F1.

The main advantage of TLDCAD over DCAD in the context of preprocessing is its runtime, which is much shorter than that of DCAD. This is especially important in WSNs, as it helps to greatly reduce power consumption. The much reduced running times of TLDCAD compared with DCAD on a standard PC for various experimental setups can be observed in Tables 8 and 9.

We can also observe from the 5,000 data point results that when we increase the normal data sample from 2% to 4% we achieve approximately the same accuracy as that obtained from classifying the 98% normal data points. Not only for the 5,000 but also for the 50,000 data points, the performance with 4% normal data is almost the same as with 98%, while for the SVM the 4% TLDCAD is around 7 times faster than the DCAD preprocessor.

Table 8: Average Results for 5,000 Synthetic Data Points in Experiment I-A.
| Classifier | Metric | TLDCAD: 2% normal vs. 2% anomalous | TLDCAD: 4% normal vs. 2% anomalous | DCAD: 98% normal vs. 2% anomalous |
|---|---|---|---|---|
| SVM with SMO | Precision | 0.958763 | 0.994845 | 0.998775 |
| SVM with SMO | Recall | 0.968750 | 0.984694 | 0.999591 |
| SVM with SMO | F1 | 0.963731 | 0.989744 | 0.999183 |
| SVM with SMO | Time (sec) | 3.645893 | 3.780674 | 13.269996 |
| ANN using 10 hidden layers | Precision | 0.900000 | 0.969697 | 1.000000 |
| ANN using 10 hidden layers | Recall | 0.750000 | 0.941176 | 0.989276 |
| ANN using 10 hidden layers | F1 | 0.818182 | 0.955224 | 0.994609 |
| ANN using 10 hidden layers | Time (sec) | 2.000225 | 2.985272 | 5.224756 |

Table 9: Average Results for 50,000 Synthetic Data Points in Experiment I-B.

| Classifier | Metric | TLDCAD: 2% normal vs. 2% anomalous | TLDCAD: 4% normal vs. 2% anomalous | DCAD: 98% normal vs. 2% anomalous |
|---|---|---|---|---|
| SVM with SMO | Precision | 0.905149 | 0.999484 | 1.000000 |
| SVM with SMO | Recall | 1.000000 | 0.961787 | 0.992368 |
| SVM with SMO | F1 | 0.950213 | 0.980273 | 0.996169 |
| SVM with SMO | Time (sec) | 9.808853 | 12.681529 | 85.694093 |
| ANN using 10 hidden layers | Precision | 0.993243 | 0.996656 | 0.999728 |
| ANN using 10 hidden layers | Recall | 0.967105 | 0.980263 | 0.999728 |
| ANN using 10 hidden layers | F1 | 0.980000 | 0.988391 | 0.999728 |
| ANN using 10 hidden layers | Time (sec) | 7.051017 | 11.293773 | 248.290314 |

4.5. Experiment II: Classification Approach

Figure 19: Flowchart of DCAD as a Classifier vs. TLDCAD.

The method used in the classification approach is explained in Fig. 19: we feed labeled data to our classification process (TLDCAD and SVM), feed the same data to DCAD used as a classifier, and then compare the outputs to draw the final conclusion. Figure 20 depicts the 10-fold cross validation for the classification approach.

Figure 20: One of the Ten Folds: Illustration of TLDCAD & SVM vs DCAD as a Classifier.

4.5.1. Experiment II Results
Table 10 shows promising results and the ability of our approach to outperform DCAD on all the synthetic datasets except the 50,000 dataset. On the GSB real dataset and the Masdar Institute's Wind Tower dataset, our approach achieved the same F1 measure as DCAD.
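The F1 values reported in Tables 8 to 10 combine precision and recall, with anomalies treated as the positive class. As a small sketch of how such values are obtained from confusion-matrix counts, the hypothetical helper below computes precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R); the function name and the example counts are illustrative assumptions and are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: anomalies correctly flagged, fp: normal points flagged as anomalies,
    fn: anomalies that were missed. Empty denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only: 90 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(90, 10, 30)
print(round(p, 6), round(r, 6), round(f1, 6))
```

With these illustrative counts the precision is 0.90 and the recall is 0.75, which gives an F1 of about 0.818. The same precision, recall, and F1 combination appears in the ANN column for the 2% TLDCAD in Table 8, which shows how a low recall pulls the F1 measure down even when precision is high.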
Note that in Experiment II, for the classification approach, we had to shuffle the data over many rounds to obtain the data ordering that works best for the experiment. The average over these random rounds is not reported, since our aim at this stage is to show that TLDCAD is able to outperform DCAD in terms of the F1 measure.

From Table 10 we can also see that SVM & TLDCAD outperforms DCAD in classifying and detecting anomalies by 7 percentage points on the 500-point synthetic dataset, where SVM & TLDCAD reaches an F1 of 1.00 against 0.93 for DCAD.

Table 10: Best Achieved Results for TLDCAD with SVM vs. DCAD as a Classifier in Experiment II (F1 measures).

| # | Dataset | SVM & TLDCAD, 2% vs. 2% | SVM & TLDCAD, 2% vs. 4% | DCAD, 2% vs. 98% (parameters only: mean and covariance matrix) |
|---|---|---|---|---|
| 1 | 500 synthetic | 0.81 | 1.00 | 0.93 |
| 2 | 1,000 synthetic | 0.80 | 0.94 | 0.90 |
| 3 | 2,000 synthetic | 0.90 | 0.92 | 0.89 |
| 4 | 3,000 synthetic | 0.95 | 0.97 | 0.94 |
| 5 | 4,000 synthetic | 0.95 | 0.98 | 0.97 |
| 6 | 5,000 synthetic | 0.98 | 0.99 | 0.98 |
| 7 | 50,000 synthetic | 0.96 | 0.98 | 0.99 |
| 8 | GSB (downloaded) | 0.43 | 1.00 | 1.00 |
| 9 | Wind Tower (gathered data) | 0.10 | 1.00 | 1.00 |

CHAPTER 5
5 Conclusion and Future Work
5.1. Conclusion
In this research work we followed two different approaches. In the first approach, the DCAD algorithm is used to partition the data and then send all the data for classification purposes. Building on this, we proposed the TLDCAD algorithm, which sends a reduced amount of data compared with DCAD. The outputs of the two algorithms were compared, and the results obtained are promising for TLDCAD. This comparison was conducted using synthetic datasets. In the second approach, we used DCAD as a classifier and compared it to our TLDCAD algorithm combined with the SVM. The results obtained for this approach were also promising and open a wide door for outlier detection and preprocessing in energy-constrained devices.
The main points about our TLDCAD algorithm are summarized below:
- It is based on the DCAD algorithm.
- It is faster than DCAD as a data preprocessor (sampling method).
- It is able to provide more accurate classification results on small and medium datasets (up to 5,000 data points).
- It is good for security and privacy purposes, because only a subset of the data, not all of it, is communicated.

5.2. Future Work
As future work on the TLDCAD code, we are considering generalizing the algorithm to work with data of more than two dimensions. For now, the cost of the matrix operations is just a constant. However, in higher dimensions our technique will scale with the number of attributes in terms of running time. This is because the matrix multiplication will dominate the running time of the TLDCAD algorithm, since it has three nested loops and the code does not involve any recursion. This gives a running time of $O(d^3)$, and at that point more efficient algorithms should be implemented to reduce the running time. If we also count the data samples, then we have a loop over the number of samples $k$, which gives a matrix multiplication $k$ times. The final running time could then be expressed as $O(k d^3)$.

On the other hand, we can work with multiple clusters, or try to exploit the formulas of the iterative DCAD to make our algorithm work online instead of in the current batch mode. Figure 21 shows how online ellipses detect anomalies along the time dimension. If we proceed in the time dimension, then we need to exploit formulas (12) and (13) and add our own modification to make our algorithm work online.

Figure 21: FFIDCAD with effective n [19].
[Equations (12) and (13): update formulas of FFIDCAD with effective n, from [19].]

The equations shown above are for the Forgetting Factor Iterative DCAD (FFIDCAD) with effective n. FFIDCAD applies the forgetting factor λ to the older samples, which gives more weight to the new sample than to the old ones so that the old samples are gradually forgotten. In order to limit the growth of k, FFIDCAD (with effective n) uses a constant $n_{eff}$ in Equation 8. This constant is used instead of k when $k \ge n_{eff}$ [19].

APPENDIX A
A Abbreviations
ANN: Artificial Neural Network
DCAD: Data Capture Anomaly Detection
FFIDCAD: Forgetting Factor Iterative Data Capture Anomaly Detection
IDCAD: Iterative Data Capture Anomaly Detection
ML: Machine Learning
SVM: Support Vector Machine
TLDCAD: Two-layered Data Capture Anomaly Detection
WSN: Wireless Sensor Network

APPENDIX B
B Masdar Wind Tower
Table 11: Masdar Wind Tower Photographs and Images.
Figure 22: Wind Tower air flow diagram (Photographed by Ibrahim Khamis)
Figure 23: Wind Tower Image (Photographed by Ibrahim Khamis)
Figure 24: Wind Tower Bank of 75 high-pressure nozzles while introducing mist to the Wind tower Ventilation tube from inside (Photographed by Ibrahim Khamis)
Figure 25: Wind Tower Background (Photographed by Ibrahim Khamis)
Figure 26: Wind Tower how it works (Photographed by Ibrahim Khamis)
Figure 27: Wind Tower Thermal Comfort (Photographed by Ibrahim Khamis)

Bibliography
[1] S. Alam, G. Dobbie, P. Riddle, M. A. Naeem, “A swarm intelligence based clustering approach for outlier detection,” in Proceedings of the 2010 IEEE Congress on Evolutionary Computation (CEC), 2010, pp. 1-7.
[2] M. A. Azim, Z. Aung, W. Xiao, V. Khadkikar, and A. Jamalipour, “Localization in wireless sensor networks by constrained simultaneous perturbation stochastic approximation technique,” in Proceedings of the 6th IEEE International Conference on Signal Processing and Communication Systems (ICSPCS), 2012, pp. 1-9.
[3] M. A.
Azim, F. M. Kiaie, and M. H. Ahmed, “Environmental forest monitoring using wireless sensor networks,” in Wireless Sensor Networks: Current Status and Future Trends, CRC Press, 2012, pp. 61-78.
[4] A. Challagalla, S. S. S. Dhiraj, D. V. L. N. Somayajulu, T. S. Mathew, S. Tiwari, and S. S. Ahmad, “Privacy preserving outlier detection using hierarchical clustering methods,” in Proceedings of the 34th IEEE Annual Computer Software and Applications Conference Workshops (COMPSACW), 2010, pp. 152-157.
[5] P. Dokas, L. Ertoz, V. Kumar, A. Lazarevic, J. Srivastava, and P. N. Tan, “Data mining for network intrusion detection,” in Proceedings of the 2002 NSF Workshop on Next Generation Data Mining (NGDM), 2002, pp. 21-30.
[6] M. A. Faisal, Z. Aung, J. Williams, and A. Sanchez, “Securing advanced metering infrastructure using intrusion detection system with data stream mining,” in Proceedings of the 2012 Pacific Asia Workshop on Intelligence and Security Informatics (PAISI), 2012, pp. 96-111.
[7] S. Ganapathy, N. Jaisankar, P. Yogesh, and A. Kannan, “An intelligent system for intrusion detection using outlier detection,” in Proceedings of the 2011 International Conference on Recent Trends in Information Technology (ICRTIT), 2011, pp. 119-123.
[8] D. M. Hawkins, Identification of Outliers, Chapman and Hall, London, 1980.
[9] J. H. M. Janssens, I. Flesch, and E. O. Postma, “Outlier detection with one-class classifiers from ML and KDD,” in Proceedings of the 2009 International Conference on Machine Learning and Applications (ICMLA), 2009, pp. 147-153.
[10] Q. Ji-lin, Q. Wen, S. Ying, and F. Yu-mei, “A nonparametric outlier detection method for financial data,” in Proceedings of the 2009 International Conference on Management Science and Engineering (ICMSE), 2009, pp. 1442-1447.
[11] S.-Y. Jiang and Q.-b. An, “Clustering-based outlier detection method,” in Proceedings of the 5th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) Volume 2, 2008, pp.
429-433.
[12] S.-Y. Jiang and A.-M. Yang, “Framework of clustering-based outlier detection,” in Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) Volume 1, 2009, pp. 475-479.
[13] I. Khamis and Z. Aung, “Outlier preprocessing in wireless sensor networks: A two-layered ellipse approach,” in Proceedings of the 6th IEEE International Conference on Developments in eSystems Engineering (DeSE), 2013, pp. 1-6.
[14] J. G. Lee, J. Han, and X. Li, “Trajectory outlier detection: A partition-and-detect framework,” in Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), 2008, pp. 140-149.
[15] S. H. Lee, S. Lee, H. Song, and H.-S. Lee, “Wireless sensor network design for tactical military applications: Remote large-scale environments,” in Proceedings of the 2009 IEEE Conference on Military Communications (MILCOM), 2009, pp. 1-7.
[16] D. Li, Z. Aung, S. Sampalli, J. Williams, and A. Sanchez, “Privacy preservation scheme for multicast communications in smart buildings of the smart grid,” Smart Grid and Renewable Energy, vol. 4, no. 4, 2013, pp. 313-324.
[17] Y. Li and H. Kitagawa, “DB-outlier detection by example in high dimensional datasets,” in Proceedings of the 2007 IEEE International Workshop on Databases for Next Generation Researchers (SWOD), 2007, pp. 73-78.
[18] P. H. Menold, R. K. Pearson, and F. Allgower, “Online outlier detection and removal,” in Proceedings of the 7th Mediterranean Conference on Control and Automation (MED), 1999, pp. 1110-1133.
[19] M. Moshtaghi, C. Leckie, S. Karunasekera, J. C. Bezdek, S. Rajasegarar, and M. Palaniswami, “Incremental elliptical boundary estimation for anomaly detection in wireless sensor networks,” in Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), 2011, pp. 467-476.
[20] K. Narita and H.
Kitagawa, “Outlier detection for transaction databases using association rules,” in Proceedings of the 9th International Conference on Web-Age Information Management (WAIM), 2008, pp. 373-380. 53 Bibliography [21] J. H. Oh, J. Gao, and K. Rosenblatt, “Biological data outlier detection based on Kullback-Leibler divergence,” Proceedings of the 2008 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2008, pp. 249-254. [22] K. Prakobphol and J. Zhan, “A novel outlier detection scheme for network intrusion detection systems,” in Proceedings of the 2008 International Conference on Information Security and Assurance (ISA), 2008, pp. 555-560. [23] J. Qu, “Outlier detection based on Voronoi diagram,” in Proceedings of the 4th International Conference on Advanced Data Mining and Applications (ADMA), 2008, pp. 516-523. [24] S. Rajasegarar, J. C. Bezdek, C. Leckie, and M. Palaniswami, “Elliptical anomalies in wireless sensor networks,” ACM Transactions on Sensor Networks, vol. 6, no. 1, 2009, pp. 1-28. [25] S. Rajasegarar, C. Leckie, J. C. Bezdek, and M. Palaniswami, “Centered hyperspherical and hyperellipsoidal one-class support vector machines for anomaly detection in sensor networks,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, 2010, pp. 518-533. [26] I. Silva, L. A. Guedes, P. Portugal, and F. Vasques, “Reliability and availability evaluation of wireless sensor networks for industrial applications,” Sensors, vol. 12, no. 1, 2012, pp. 806-838. [27] Y. Tao and D. Pi, “Unifying density-based clustering and outlier detection,” in Proceedings of the 2nd International Workshop on Knowledge Discovery and Data Mining (WKDD), 2009, pp. 644-647. [28] J. Xi, “Outlier detection algorithms in data mining,” in Proceedings of the 2nd International Symposium on Intelligent Information Technology Application (IITA), 2008, pp. 94-97. 54 Bibliography [29] Z. Yang, N. Meratnia, and P. 
Havinga, “An online outlier detection technique for wireless sensor networks using unsupervised quarter-sphere support vector machine,” in Proceedings of the International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2008, pp. 151–156. [30] Y. Zhang, N. Meratnia, and P. Havinga, “Adaptive and online one-class support vector machine-based outlier detection techniques for wireless sensor networks,” in Proceedings of the 2009 International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2009, pp. 990-995. [31] Y. Zhang, N. Meratnia, and P. J. M. Havinga, “Ensuring high sensor data quality through use of online outlier detection techniques,” International Journal of Sensor Networks, vol. 7, no. 3, 2010, pp. 141-151. [32] Y. Zhang, N. Meratnia, and P. Havinga, “Outlier detection techniques for wireless sensor networks: A survey,” IEEE Communications Surveys and Tutorials, vol. 12, no. 2, 2010, pp. 159-170. [33] http://db.csail.mit.edu/labdata/labdata.html [34] http://mathpax.com/images/statistics.pdf [35] http://en.wikipedia.org/wiki/Precision_and_recall [36] https://portal.masdar.ac.ae/Pages/NewsDetail.aspx?NID=401 [37] http://www.libelium.com/products/meshlium/wireless-sensor-networks [38] http://www.libelium.com/products/waspmote [39] http://www.libelium.com/development/developers/ [40] http://www.techopedia.com/definition/25651/wireless-sensor-network-wsn [41] http://www.seeedstudio.com/depot/grove-rtc-p-758.html?cPath=25_30 55 Bibliography [42] http://www.seeedstudio.com/depot/grove-temperaturehumidity-sensor-pro-p838.html [43] http://www.seeedstudio.com/depot/sd-card-shield-p-492.html?cPath=132_134 [44] http://lcav.epfl.ch/cms/lang/en/pid/86035 [45] http://www.masdar.ac.ae/campus-community/the-campus/windtower [46] http://masdarcity.ae/en/110/frequently-asked-questions/ [47] S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogerakiand, and D. 
Gunopulos, Online Outlier Detection in Sensor Data using Nonparametric Models, J. Very Large Data Bases, VLDB 2006. [48] S. Rajasegarar, C. Leckie, M. Palaniswami, and J.C. Bezdek, Distributed Anomaly Detection in Wireless Sensor Networks, Proc. IEEE ICCS, 2006. [49] S. Rajasegarar, C. Leckie, M. Palaniswami, and J. C. Bezdek, Quarter Sphere Based Distributed Anomaly Detection in Wireless Sensor Networks, Proc. IEEE International Conference on Communications, pp. 3864-3869, 2007. [50] D. Janakiram, A. Mallikarjuna, V. Reddy, and P. Kumar, Outlier Detection in Wireless Sensor Networks using Bayesian Belief Networks, Proc. IEEE Comsware, 2006. [51] D.J. Hill, B.S. Minsker, and E. Amir, Real-Time Bayesian Anomaly Detection for Environmental Sensor Data, Proc. 32nd Congress of the International Association of Hydraulic Engineering and Research, 2007. [52] V. Chatzigiannakis, S. Papavassiliou, M. Grammatikou, and B.Maglariset, Hierarchical Anomaly Detection in Distributed Large-Scale Sensor Networks, Proc. ISCC, 2006. [53] http://en.wikipedia.org/wiki/Wireless_sensor_network 56 Bibliography [54] M. Bahrepour,Y. Zhang, N. Meratnia, and P. Havinga, “Use of Event Detection Approaches for Outlier Detection in Wireless Sensor Networks,” IEEE Communications Surveys and Tutorials, 2009, pp. 439-444. 57