Download Anomaly Detection and Preprocessing

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Transcript
Anomaly Detection and
Preprocessing
By
Ibrahim Khamis
A Thesis Presented to the
Masdar Institute of Science and Technology
In Partial Fulfillment of the Requirements for the
Degree of
Master of Science
In
Computing and Information Science
© 2014 Masdar Institute of Science and Technology
All rights reserved
Abstract
In sustainable environments, efficient anomaly (outlier) detection is essential to help
monitor and control the system with the decision making process. Anomaly detection
is an inherently difficult problem due to its decisions of what is normal and what is
unusual, and the ability to distinguish between the two. Another serious difficulty is
that the definition of normal can change. Sensor nodes in wireless sensor networks
have limited energy resources and this hinders the dissemination of the gathered data
to a central location. This stimulated our research to make use of the limited
computational capabilities of these sensor nodes to build a normal model of the data
gathered. In our research, our goal is to determine what is normal and what is
abnormal and to distinguish between Normal & abnormal. We developed an
algorithm called “Two-layered Data Capture Anomaly Detection”. Our algorithm
sends anomalies (2%) as well as roughly (2% or 4%) of normal data for further data
processing and classification purposes. For testing purposes we also deployed three
different machine learning and data mining tools. Three separate data sets were also
used to validate the system. The performance of the proposed method is evaluated and
compared with results obtained from the application of state of the art methods on the
same data sets. In these tests our method provided very promising results.
ii
This research was supported by the Government of Abu Dhabi to help fulfill the vision
of the late President Sheikh Zayed Bin Sultan Al Nahyan for sustainable development
and empowerment of the UAE and humankind.
iii
Acknowledgments
Praise be to Allaah, I would like to extend my gratitude to my family members for
their patience and support. I would also like to take this opportunity to thanks those
who actively guided and helped me in this research. Foremost, I would like to express
my deep appreciation to my advisor Dr. Zeyar Aung for his continuous support for
my M. Sc. Study and research. His guidance, patience, motivation, and support helped
me to develop a deep understanding of the subject. Beside my advisor, I would like to
thank my thesis supervisor committee members: Dr. Khaled Elbassioni and Dr. Wei
Lee Woon for their valuable time, comments, and advice.
Ibrahim Khamis
Masdar City, April 30, 2014
iv
Contents
_____________________________________________________________________
1 Introduction .......................................................................................................................... 1
1.1.
Background and Motivation ...................................................................................... 1
1.2.
Objectives and Contributions .................................................................................... 2
1.3.
Relevance to Masdar/UAE......................................................................................... 6
1.4.
Publication ................................................................................................................. 7
1.5.
Thesis Organization ................................................................................................... 7
2 Literature Review ................................................................................................................. 8
2.1.
Wireless Sensor Network (WSN) ............................................................................... 8
2.2.
Data Mining for Outlier Detection............................................................................. 9
2.3.
Outlier Detection for WSNs ..................................................................................... 10
2.3.1.
Statistical-based Techniques ............................................................................... 16
2.3.2.
Nearest Neighbor-based Techniques .................................................................. 16
2.3.3.
Clustering-based Techniques............................................................................... 16
2.3.4.
Classification-based Techniques .......................................................................... 17
2.3.5.
Comparison of WSN Outlier Detection Techniques ............................................ 19
2.3.6.
Recent Trends ...................................................................................................... 19
2.4.
Shortcomings of Outlier Detection Techniques ...................................................... 20
2.5.
Requirements for Outlier Detection in WSNs ......................................................... 21
3 Proposed Method ................................................................................................................ 22
3.1.
Data Capture Anomaly Detection (DCAD) ............................................................... 22
3.2.
From DCAD to TLDCAD ............................................................................................ 25
3.3.
Use case ( scenario) ................................................................................................. 27
4 Experimental Setups and Results ....................................................................................... 29
4.1.
Datasets ................................................................................................................... 29
4.1.1.
Synthetic Datasets ............................................................................................... 29
4.1.2.
Grand Saint Bernard (GSB) Dataset ..................................................................... 30
4.1.3.
Wind Tower Dataset ............................................................................................ 31
4.2.
Outlier Detection Performance Measurements ...................................................... 33
4.3.
Algorithms Explored ................................................................................................ 35
v
4.4.
4.4.1.
4.5.
4.5.1.
Experiment I: Preprocessing Approach ................................................................... 35
Experiment I Results ............................................................................................ 39
Experiment II: Classification Approach .................................................................... 41
Experiment II Results ........................................................................................... 43
5 Conclusion and Future Work.............................................................................................. 44
5.1.
Conclusion ............................................................................................................... 44
5.2.
Future Work ............................................................................................................ 45
A Abbreviations .................................................................................................................... 47
B Masdar Wind Tower .......................................................................................................... 48
Bibliography ............................................................................................................................ 51
vi
List of Tables
_____________________________________________________________________
Table 1: Comparison of Features for Multivariate Outlier Detection Techniques for WSNs,
adopted from ‎[32]. .................................................................................................................. 15
Table 2: Comparing Different Approaches on Outlier Data, adopted from [54]. ................... 18
Table 3: Example of Wind Tower Data. ................................................................................... 31
Table 4: Confusion Matrix ....................................................................................................... 33
Table 5: Preprocessing Experiment Flow of TLDCAD vs. DCAD as a Preprocessor. ................ 35
Table 6: DCAD Preprocessing Process Summary. .................................................................... 36
Table 7: TLDCAD Preprocessing Process Summary. ................................................................ 36
Table 8: Average Results for 5,000 Synthetic Data Points in Experiment I-A.......................... 40
Table 9: Average Results for 50,000 Synthetic Data Points in Experiment I-B........................ 40
Table 10: Best Achieved results for TLDCAD with SVM vs. DCAD as a Classifier in Experiment
II. .............................................................................................................................................. 43
Table 11: Masdar Wind Tower Photographs and Images. ...................................................... 48
vii
List of Figures
_____________________________________________________________________
Figure 1: Data Capture Anomalies Detection (DCAD) Algorithm ‎[24]. ...................................... 3
Figure 2: Our Proposed Two-layered DCAD Algorithm. ............................................................ 4
Figure 3: Thesis Contributions. ................................................................................................. 5
Figure 4: Three outlier sources in WSNs and their corresponding detection techniques,
adopted from [32]. ................................................................................................................... 11
Figure 5: Categorization of WSN Outlier Detection Methods using Data Mining ‎[23]. .......... 11
Figure 6: Generic Categorization of Outlier Detection Methods ‎[28]. .................................... 12
Figure 7: Outlier Detection Technique for WSNs, adopted from ‎[32]. ................................... 14
Figure 8: Advantage of Mahalonobis Distance........................................................................ 17
Figure 9: Recent Developments in Outlier Detection in WSN ................................................. 20
Figure 10: DCAD Illustration. ................................................................................................... 22
Figure 11: Effective Radius. ..................................................................................................... 24
Figure 12: TLDCAD. .................................................................................................................. 26
Figure 13: WSN, adopted from [53]. ....................................................................................... 27
Figure 14: GSB Data Scatter Plot. ............................................................................................ 30
Figure 15: Wind tower Data Scatter Plot. ............................................................................... 32
Figure 16: Precision (P) and Recall (R), adopted from ‎[35]. .................................................... 34
Figure 17: Flowchart of DCAD as a Preprocessor vs. TLDCAD. ................................................ 37
Figure 18: One of the Ten Folds: Illustration of TLDCAD vs. DCAD as Preprocessor. .............. 38
Figure 19: Flowchart of DCAD as a Classifier vs. TLDCAD........................................................ 41
Figure 20: One of the Ten Folds: Illustration of TLDCAD & SVM vs DCAD as a Classifier. ....... 42
Figure 21: FFIDCAD with effective n ‎[19]. ............................................................................... 46
Figure 22: Wind Tower air flow diagram (Photographed by Ibrahim Khamis) ....................... 48
Figure 23: Wind Tower Image (Photographed by Ibrahim Khamis) ........................................ 48
Figure 24: Wind Tower Bank of 75 high-pressure nozzles while introducing mist to the Wind
tower Ventilation tube from inside (Photographed by Ibrahim Khamis) ............................... 48
Figure 25: Wind Tower Background (Photographed by Ibrahim Khamis)............................... 49
Figure 26: Wind Tower how it works (Photographed by Ibrahim Khamis) ............................. 49
Figure 27: Wind Tower Thermal Comfort (Photographed by Ibrahim Khamis) ...................... 50
viii
CHAPTER 1: Introduction
1
CHAPTER
1
1 Introduction
1.1.
Background and Motivation
A Wireless sensor network (WSN) refers to a group of spatially dispersed and
dedicated sensors for monitoring and recording the physical conditions of the
environment and organizing the collected data at a central location. WSNs measure
environmental conditions like temperature, sound, pollution level, humidity, wind
speed and direction, pressure, etc. WSNs are widely used in areas such as
manufacturing industry [26], military [15], environmental monitoring [3], smart
power grids [6], smart buildings/homes [16], and many other applications that require
distributed location-aware data sensing [2].
The advantage of using WSNs is that they are cheaper and more practical than wired
networks. However WSNs are vulnerable to intrusions and faults [6] and they are
resource constrained devices. In general, WSN data needs to be mined to detect
anomalies as efficiently as possible. Once found, these will then be sent to the base
station or central location for further processing.
Outliers are encountered in many applications. Here are some terms that are
commonly used in the data mining community: uncommon behavior in data, rare
instances,
outliers,
anomalies,
deviations,
exceptions,
rare
instances,
and
irregularities [1]. Hawkins provided the following definition of an outlier: "An outlier
CHAPTER 1: Introduction
2
is defined as an observation that deviates too much from other observations that it
arouses suspicions that it was generated by a different mechanism from other
observations” [28]. Anomaly detection is an inherently difficult problem as it is
essentially the problem of deciding what is not normal; frequently there are no predetermined examples or models for "abnormal" data and these needs to be determined
from the statistical properties of the data.
Anomaly detection in wireless sensor nodes is even more challenging because they
have limited power and computing resources. It is virtually impossible to disseminate
all the gathered data to a central location to detect the anomalies. On the other hand
anomalies are important data that of interest of us since they may represent faults,
intrusion, malicious attacks or even fire alarms and also more further could be an
automatic signaling for some actions like dispatching a repair crew to fix the faulty
sensors for example. This motivated our research to make use of the limited
computational capabilities of these devices by building a normal model of the data
gathered. In this way, data that deviates from this model can be classified as
"anomalous" and subsequently forwarded to a central location for further processing.
This process is done inside these devices and hence saves the power that would be
needed to transmit all the data.
1.2.
Objectives and Contributions
The objective of this thesis is to develop an efficient and robust algorithm for anomaly
detection and preprocessing in energy constrained devices such as WSNs. The aim is
to find a balance amongst the three desirable factors of speed, accuracy, and low
energy consumption for anomaly preprocessing and detection in WSNs. Towards this
end, a novel algorithm is proposed which we call the “Two-layered Data Capture
CHAPTER 1: Introduction
3
Anomaly Detection” (TLDCAD) is proposed. This algorithm is based on an existing
technique, the Ellipsoidal Data Capture Anomaly Detection (DCAD) method [24],
which is illustrated in Figure 1. DCAD is an anomaly detection algorithm which uses
mean and covariance matrix of the data to define an ellipse which captures the overall
distribution of the data.
Figure 1: Data Capture Anomalies Detection (DCAD) Algorithm ‎[24].
The anomalies are then detected by setting a threshold which defines an outer
boundary of the ellipse which encompasses 98% of the data. Any data points which
fall outside of this boundary are considered to be anomalies and are captured and sent
for further processing.
In this way, DCAD helps to improve energy efficiency by selecting only points
which are considered “interesting”. However, this is only part of the problem, since
there should also be a way to select interesting normal points. TLDCAD seeks to
address this problem by setting a second threshold on the data (in our experiments we
CHAPTER 1: Introduction
4
tried 94% and 96% levels) to capture an additional layer of data points which lie
between this new level and the original 98% level. These points are subsequently
labeled as normal and sent to a central computer for further processing.
Figure 2: Our Proposed Two-layered DCAD Algorithm.
Our aim is to reduce the power consumption in resource constrained devices such
as wireless sensors. We reduce the noise level by preprocessing inside the sensor node
then send a reduced sampled data (before any in-between nodes communication
noise) to the central computer or central node with a light version of SVM for further
classification, visualization, and exploration. We accomplish that by adapting a new
approach of data preprocessing by providing sampled data using a two level ellipse.
This produces balanced data sets with around 50% anomalies. This sampled data is
used then to work with some classifiers that require relatively balanced data sets such
as the typical Support vector Machine (SVM). The data is further processed by the
SVM to provide more accurate classification results. With this distributed approach
CHAPTER 1: Introduction
5
we combine the speed of the ellipse method on the WSN’s node with the accuracy of
the SVM on the central computer. We could say that we are using a distributed data
mining approach in order to try to answer whether anomalies can become classifiable.
The contributions of this thesis can be summarized as follows.

Contribution 1: A new model for lightweight anomaly preprocessing and
detection which applies two separate probability thresholds (TLDCAD).

Contribution 2: Two distinct usage mechanisms where TLDCAD can be
used either as a pre-processor, or as a classifier.

Contribution 3: Comparative evaluation of TLDCAD on both synthetic
and real data sets.
Figure 3 shows these three main research contributions in pictorial format.
Figure 3: Thesis Contributions.
CHAPTER 1: Introduction
6
The advantages of the proposed approach can be summarized as follows:

It provides a new approach of data preprocessing by acquiring the most
informative sampled data using two ellipses.

It reduces the power consumption in resource constrained devices like
wireless sensor nodes; Thus, it improves the sustainability and detection
capability of the whole WSN.

It reduces the effect of communication channel’s noise by preprocessing
inside the sensor nodes then sending a reduced set of sampled data to the
server for further exploration. It improves the reliability of data by
providing more efficient WSN’s data samples. In addition, it also improves
security and privacy of the data, because only parts of the data are
communicated.

It produces more balanced data sets, which are better for classification,
with around 50% or 25% outliers instead of just 2% outliers. (Some
classifiers like a typical Support Vector Machine do not work well with
extremely unbalanced data sets.)
1.3.
Relevance to Masdar/UAE
The field of energy efficiency evolved with rich areas of science and applications
that need to be redesigned and reframed for such a new field. Masdar Institute is a
global institute which is focused on this and other related challenges. Techniques
which facilitate the use of energy constrained devices for data collection have a very
direct relevance with Masdar's vision which is centered on energy efficiencies and
CHAPTER 1: Introduction
7
green technologies. Moreover, the proposed method is tested on data collected from
sensors attached to Wind Tower air cooling structure in Masdar Institute.
1.4.
Publication
Some portions of the research described in this thesis have been published in the
following paper [13].
I. Khamis and Z. Aung, “Outlier preprocessing in wireless sensor
networks: A two-layered ellipse approach,” in Proceedings of the
6th IEEE International Conference on Developments in eSystems
Engineering (DeSE), 2013, pp. 1-6.
1.5.
Thesis Organization
The remainder of the thesis is organized as follow. In Chapter 2; it gives an
overview of the current technologies in WSN and anomaly detection. Then it explains
the proposed algorithm in Chapter 3. After that it describes the experimental setup and
the results in Chapter 4, followed by conclusion and future work in Chapter 5.
CHAPTER 2: Literature Review
8
CHAPTER
2
3
2 Literature Review
2.1.
Wireless Sensor Network (WSN)
Wireless Sensor Network (WSN) is a network that consists of number of nodes; each
one is connected to other nodes wirelessly in the network. WSN are feasible solutions
in situations where it is difficult, or costly or even impractical to implement wired
networks [24]. There are many types of sensors settling in every sensor node. These
sensors allow the node to collect many types of data. Since there are many sensors on
the sensor node, then it becomes a subject of multidimensional features data
collections.
Some sensor nodes these days have very good computational capabilities for
example the Waspmote Error! Reference source not found. is one of the sensors
evices for developers. This sensor node is one of the examples that depict how
wireless sensors are becoming more like mini computers. In recent research some
authors of recent papers targeted the computational capability of the modern sensor
nodes to detect anomalies locally in a decentralized mode [19] [24].
One of the important roles of the WSN is to detect important events or faults in
the network nodes. Detecting the important event or anomalies at the node level will
reduce the amount of data to be transmitted over the network since only the detected
event is transmitted instead of transmitting the whole data set. In such situation the
CHAPTER 2: Literature Review
9
need for some kind of event detection system become very crucial. Here comes the
role of data mining techniques as discussed in the following section.
2.2.
Data Mining for Outlier Detection
Narita and Kitagawa defined data mining as systematically extracting useful
information from data [20]. The aim of data mining is to find patterns from data sets.
Data mining can be either supervised or unsupervised. Supervised data mining
involves the use of training labeled data to build a classification model, and then this
classification model is used to classify the new testing data. Unsupervised data mining
do not use labeled data to classify the new data; it normally uses some technique like
clustering to build clusters around the normal data.
Hence the supervised outlier detection algorithms learn a model by the labeled
training data and decide on the test data whether it is normal or abnormal.
Unsupervised outlier detection finds outlier without prior knowledge of the data [21].
For example, when the data are clustered then the clusters represent the normal data
and any data points fall out the clusters boarder are considered to be outliers. In
addition the aim of traditional pattern recognition is to find the majority of data and
deal with outliers as noise. However, noise for one person could be a signal for other
person [12].
Outliers can indicate important events in some situations and can be of more
importance than the normal data. For examples the fire alarm sensed data is more
important than the normal data. Outlier detection is an important field of data mining
techniques [1]. Outliers are named in many ways and here are some terms that are
used in data mining community: Uncommon behavior in data, rare instances, outliers,
anomalies, devotions, exceptions rare instances, and irregularities [1].
CHAPTER 2: Literature Review
10
There are many studies about outlier detection. For example, Jiang and Yang
made clusters as a unit and find the outliers clusters as a unit [12]. In this case the
whole cluster becomes an outlier. Lee et al. proposed a novel work for trajectory
outlier detection [14]. The abnormal trajectory among other trajectories becomes an
outlier. Moreover, Menold et al. raised the point that; data point can be compared to
the median of the past and present value and the result is outlier if it exceeds certain
threshold [18]. This show the implementation of the temporal data (data related to
time). On the other hand some people are concerned with the privacy issue of outlier
detection. Challagalla et al. inferred that detection of outliers threats some
organizations and raises their concerns' about the privacy of the analyzed data; for that
reason it is important to incorporate some sort of privacy protection in the outlier
detection technique [4].
2.3.
Outlier Detection for WSNs
There are many ways to categories outlier detection. In Figure 4, Zhang et al. [32]
divided outlier detection in WSNs to three branches in terms of the outlier sources,
the first is Fault detection in WSNs and it deals with noise and errors, the second
division is event detection in WSNs which deal with events. The last division is
intrusion detection in WSNs and this one handles the malicious attacks.
CHAPTER 2: Literature Review
11
Figure 4: Three outlier sources in WSNs and their corresponding detection
techniques, adopted from [32].
Qu classified the outlier detection methods as illustrated in Figure 5 [23]. In this
classification they classified the outlier detection methods to five main braches:
distribution based, depth based, clustering, distance based, and density based. The
widely used ones are density-based and distance-based. Li and Kitagawa said that
distance based method is one of the most common and simplest methods that is used
for outlier detection [14].
Figure 5: Categorization of WSN Outlier Detection Methods using Data Mining ‎[23].
CHAPTER 2: Literature Review
12
Figure 6: Generic Categorization of Outlier Detection Methods ‎[28].
A related yet more generic classification of outlier detection methods in general
(not necessarily for WSNs only) is provided by Xi [28] as illustrated in Figure 6. In
this classification the author divided the outlier detection algorithms to three main
categories. The first main category is classic outlier which in turn is divided to four
sub categories; statistical based, distance based, deviation based, and density based
approaches. The second main category is the spatial outlier and this is just a
modification of the classic based approach by taking into account the spatial attributes
of the data. Spatial attributes are the attributes that relate to location. The third outlier
CHAPTER 2: Literature Review
13
detection main category implicitly stated by Xi is the “recent advances” in outlier
detection. In this category there are two sub categories; high dimension based
approach and SVM based approach.Zhang et al. [32] proposed a similar taxonomy for
outlier detection techniques in WSNs as shown in Figure 7. The main categories are
statistical based which is further subdivided to parametric and non-parametric, nearest
neighbor based, clustering based, classification based, and finally the spectral
decomposition based. The parametric based is divided to Gaussian based and nonGaussian based. The non-parametric based is divided to kernel based and histogram
based. The classification based is divided to Support vector machine based and
Bayesian network-based. The Bayesian network based is subdivided again to naïve
Bayesian network based, Bayesian believe network based, and dynamic Bayesian
network based. The spectral decomposition is subdivided to the principle component
analysis only. The comparison of various features of the WSN outlier detection
methods are also given in Table 1.
Janssens et al. [9] also compared some outlier detection method from Machine
Learning (ML) and Knowledge Discovery in Databases (KDD). The ML techniques
used are SVM and Parzen Windows, and the KDD techniques used are heuristic localdensity estimation methods such as LOF and LOCI. Janssens et al. used the one class
classification framework. He selected this framework to be able to use AUC (Area
under the Curve) which is a famous performance measurement tool. They found that
Support Vector Domain Description (SVDD) is one of the best performing
methods [9].
Now, let us discuss each category of outlier detection techniques for WSNs.
CHAPTER 2: Literature Review
Figure 7: Outlier Detection Technique for WSNs, adopted from ‎[32].
14
CHAPTER 2: Literature Review
15
Table 1: Comparison of Features for Multivariate Outlier Detection Techniques for WSNs, adopted from ‎[32].
Techniques
Sensor data
Outlier type
Correlation
Attribute
Spatial
Local
Temporal
Subramaniam
et al. [47]
●
Rajasegarar et
al. [48]
Rajasegarar et
al. [49]
Janakiram et
al. [50]]
Hill et al. [51]
●
Chatzigiannakis
et al. [52]
Individual
Global
Collaboration
Aggregate
●
●
Centralized
●
●
●
Individual
●
●
●
●
●
●
●
●
●
CHAPTER 2: Literature Review
16
2.3.1. Statistical-based Techniques
Statistical based are the earliest method used to detect outliers and they are model
based. The two categories in this field is the parametric based approach and nonparametric based approach. The parametric based approach assumes that the data has
a known distribution. In this method if the input data does not follow the assumed
distribution then it may cause some problems. The parametric approach has the
following sub categories Gaussian based and non-Gaussian based. Non parametric do
not assume any data distribution for the data. The non-parametric based is divided to
kernel based and histogram based [32] the advantage of non-parametric is that they do
not require and assumption about the distribution of the data.
2.3.2. Nearest Neighbor-based Techniques
This approach makes use of the nearest neighbor values to find outliers and this
approach is one of the most commonly used methods [32] however this technique
does not scale well when the number of the input data variables increase.
2.3.3. Clustering-based Techniques
Tao and Pi observed that in many applications outlier and clustering results are
needed at the same time [27]. In outlier detection techniques, the data are clustered
and hence the data that are outside the cluster are considered to be anomalies. One of
the latest novel examples in anomaly clustering in WSNs is the Data Capture
Anomaly Detection DCAD algorithm [24]. This algorithm use the hyper elliptical
boundary (cluster) to draw a normal model around the data and the data points that
fall outside this ellipsoidal boundary are considered to be anomalies [5].
CHAPTER 2: Literature Review
17
Figure 8: Advantage of Mahalonobis Distance.
The DCAD [24] and IDCAD [19] exploited the advantage of using Mahalonobis
distance for clustering the data. If the Euclidian distance is used instead of the
Mahalonobis distance then the distance from p2 to its nearest neighbor is greater than
the distance from p1 to its nearest neighbor however when using the Mahalonobis
distance as in Figure 8 then the two distances are the same. (The Mahalonobis
distance is a descriptive statistic that provides a relative measure of a data point's
distance or residual from a common point Error! Reference source not found..)
ence this feature is incorporated in DCAD and iterative DCAD (IDCAD) to find an
ellipsoidal model that best fit the data in order to detect the outliers. The IDCAD use
the same concept as DCAD however it detects outliers online in contrast to DCAD
which uses the batch mode.
2.3.4. Classification-based Techniques
Classification approach is well known in data mining where the classification
algorithm takes labeled input as training data and draws a model from this training
CHAPTER 2: Literature Review
18
data then it accepts a new data (testing data) and labels them according to the built
model. In this section two main types of classifiers are discussed. The support vector
machine (SVM) and the Naïve Bayes. The Naïve Bayes classifier is subdivided
in [32] to three more sub categories; Naïve Bayesian network-based, Bayesian belief
network-based, and dynamic Bayesian network-based. The SVM classifier was
explored in the field of WSNs in [32] and [25], and it shows very promising results.
Bahrepour et al.[54] explored many techniques in his paper and found that the
Quarter-Sphere SVM is one of the out performers in terms of computational cost and
detection accuracy. The results in the paper are reproduced in the Table 2.
Table 2: Comparing Different Approaches on Outlier Data, adopted from [54].
Technique
Standard SVM
Quarter-Sphere SVM
FFNN
The Fusion-based Approach (Naïve Bayes)
The Fusion-based Approach (FFNN)
Naïve Bayes
Accuracy
On artificial
Data
98.12%
98.53%
96.95%
84.90%
85.95%
94.84%
Accuracy
On Real Data
97.64%
98.05%
96.04%
91.00%
98.21%
75.07 %
The Quarter-Sphere SVM outperforms other methods in terms of the accuracy.
However this is in a centralized mode but it is has the disadvantage of computational
cost in the distributed anomaly detection method where each node has to do its
anomaly detection locally onboard. In [24] the authors stated that the SVM has an
issue in terms of its computational complexity. This is due to the kernel matrix
computation and the linear optimization calculations.
On the other hand the Dynamic Bayesian network model has the advantage of
being able to operate on several data streams at once [32]. However, the Bayesian
CHAPTER 2: Literature Review
19
networks algorithms in general are facing challenges whenever the numbers of the
input data variables become large in WSNs [32].
2.3.5. Comparison of WSN Outlier Detection Techniques
Tables.1 and Tables.2 shown above respectively compare the features and the
performances of various WSN outlier detection techniques. The important ones that
we can see from the literatures and papers [32] are the SVM, Dynamic Bayesian
Networks and clustering. However since the SVM is computationally complex for
distributed WSN’s systems and the Dynamic Bayesian Networks do not scale well
with the number of the variables the obvious choice from table1 is the clustering. The
clustering technique shown in this table is proposed by the Rajasegarar et al.[19]. This
technique has the following advantage points from Table1. It works will with the
multivariate variables; it takes care of dealing with temporal correlations through a
time window. It also experimented to compute the centralized and distributed
anomaly detection approach with promising results and less communication overhead.
2.3.6. Recent Trends
It is observed from one of the recent works [19] that the best direction in WSNs
domain is to use unsupervised learning and mainly the clustering which already
investigated in some recent works like the one class SVM and the IDCAD. However
the SVM is still to be refined more because of its computational complexity.
Nevertheless the best choice till now is the IDCAD because it is unsupervised, simple,
has low computational complexity and it is implemented in the distributed
environment and had shown good results. Moreover IDCAD is implemented in an
CHAPTER 2: Literature Review
20
online environment which is a practical approach to deal with the streaming nature of
the WSNs data for an overview of latest works see Figure 9.
Figure 9: Recent Developments in Outlier Detection in WSN
2.4.
Shortcomings of Outlier Detection Techniques
Zhang et al. listed the shortcomings of the existing outlier detection techniques as
follows [32].

Most of the techniques ignore the multivariate nature of the WSN and assume
univariate variables where anomaly can be formed by a combination of more
than one variable.

Many techniques do not consider the correlations between the variables.

Questions need to be answered what is the appropriate sliding window size for
temporal data and what is the appropriate choice of the neighboring nodes?

The work on distinguishing between the types of outlier is not sufficient, and
still many techniques do not distinguish between the errors and outliers and
that may lead to the loss of some important events (outliers).
CHAPTER 2: Literature Review

21
The use of user defined threshold in order to determine outliers are vulnerable
to the dynamic nature of the WSN data.

Many techniques do not consider the mobility of some WSNs and assume the
static condition of WSNs.
2.5.
Requirements for Outlier Detection in WSNs
Zhang et al. also enumerated the requirements outlier detection techniques as
follows [32].

Since there are many shortcomings in the outlier detection techniques in the
field of WSNs, these shortcomings motivate the development of dedicated
outlier detection techniques for WSNs. The following are some important
WSNs outlier detection requirements.

Detection needs to be distributed to reduce communication overhead.

Detection needs to be online to handle the streaming nature of WSNs data.

It is better to use unsupervised methods since the labeled data in WSNs is not
easy to get.

The detection rate must be high and the false alarm rate should be as low as
possible.

The technique must be not complicated or computationally complex to suite
the nature of the restricted resources on WSNs.

The relation between the data must be considered. Also the time and neighbors
locations are important to be taken into account.

The technique must discriminate between the errors and the measurements in
an effective manner.
CHAPTER 3: Proposed Method
22
CHAPTER
3
3 Proposed Method
3.1.
Data Capture Anomaly Detection (DCAD)
Figure 10: DCAD Illustration.
Firstly, a review of the Data Capture Anomaly Detection (DCAD)[19] is presented as
this is the basis for the proposed TLDCAD algorithm. The DCAD is used mainly for
outlier detection in WSNs. It works by first constructing an ellipse which captures a
given percentage (normally 98%) of the data. Hence, data points which fall outside
this ellipse are classified as outliers while points falling inside the ellipse are
classified as normal. However DCAD sends parameters only (mean and covariance
CHAPTER 3: Proposed Method
23
matrix only) and if it is used for data preprocessing (sampling) it will send 98% of
normal and 2% of anomalies. That motivate us to add another layer to be able to
sample the data (preprocess) to use them in a classification process with the Support
vector machine. TLDCAD in addition of providing outlier detection it also sends an
additional effective sample of normal data for the classification purposes.
DCAD and TLDCAD are based on the assumption that the data is normally
distributed; then we need to refresh our minds by starting with the two important
parameters in the bivariate Gaussian distribution, the mean and the covariance matrix
which are given in the following equation (1) and equation (2).
Let X = {
sample
} are data samples at time points {
(1
) is d-dimensional vector in
}, where each
. That is, the vector
is a data
instance related to time point j and is composed of d attributes (features).
∑
∑(
Where
and
)(
)
are the sample mean and sample covariance of
respectively.
The hyper ellipsoid of effective radius t centered at
with covariance matrix
is
defined as:
(
)
Where
}
|
is the characteristic matrix of
, and ’ ‘is the effective radius of
.
The following quantity (4) represents the Mahalonobis distance (Mahalonobis
distance could be seen as Euclidean distance divided by the covariance matrix) from
to
and
is the characteristic matrix of
.
CHAPTER 3: Proposed Method
24
The boundary surface of the ellipsoid is given by equation (5).
(
)
}
Figure 11: Effective Radius.
Definition 1: Any point that is outside
is considered anomaly. And that is known
by computing the Mahalonobis distance of the point from the center of the data:
x is anomalous for
Using
( )
⇔
with
results in an ellipsoidal boundary that covers at
least 98% of the data under the assumption that the data was drawn from a normal
distribution.[19]. At this point:
is the effective radius of the ellipse. ( )
inverse of the chi squared statistic with d-degrees of freedom and probability .
is the
CHAPTER 3: Proposed Method
25
In other words, 98% of data points that are normal with respect to
set of data points that lie between the corresponding values of
and
3.2.
is defined as a
by setting
.
From DCAD to TLDCAD
Since we are looking mainly for an effective preprocessing tool in addition to outlier
detection provided by DCAD, then we motivated to extend the above DCAD method
into TLDCAD (Two-layered DCAD) by generating a new additional inner ellipse in
such a way the band between the original ellipse generated in DCAD and the new
inner ellipse covers either 2% or 4% of the outermost normal data points.
For the (2%) normal data points with respect to
between the corresponding values of
by setting
For the (4%) normal data points with respect to
lie between the corresponding values of
, we take a set of data points that lie
and
.
, we take a set of data points that
by setting
and
.
Note that the purpose of the DCAD algorithm in our experiments I is to be used as a
data preprocessing tool to partition the data and then send the all the data for further
exploration, visualization, and classification purposes. In contrary, the TLDCAD
algorithm is used to partition the data and then send only the outlier data plus a small
subset of the normal data (either 2% or 4%) for further processing with the Support
Vector Machine for example. The TLDCAD has the advantage of providing more
balanced data (2% Normal vs. 2% Anomalies) or (4% Normal vs. 2% Anomalies) in
comparison to the DCAD that provides (98% Normal vs. 2% Anomalies).
CHAPTER 3: Proposed Method
26
Figure 12: TLDCAD.
Note that For the TLDCAD the rationale between choosing the outermost 4% as
representatives of the "normal" data because these data points are very close in
Mahalonobis distance to the decision boundary and the rest are likely to be redundant
since they are far away in Mahalonobis distance from the decision boundary and can
hence be removed without much consequence.
CHAPTER 3: Proposed Method
3.3.
27
Use case ( scenario)
Figure 13: WSN, adopted from [53].
Fig.13 shows the mode of operation of the standard DCAD algorithm. Data that
has been classified as anomalous is transmitted via the WSN to the central computing
facility via a gateway.
Note that the original aim of DCAD was anomaly detection, however, there could
be other potential applications. For example, it would be useful to be able to return an
efficient but representative subset of the data, which would be useful for training
machine learning and other decision support algorithms. To achieve this would
require an extension to the basic DCAD algorithm. What is required is a method for
selecting a critical subset of the normal data.
TLDCAD provides a simple but effective way of achieving this. This will help the
WSN to greatly reduce energy consumption with communication 4% or 6% of
effective data points instead of communicating the 100% of the data points in our
preprocessing experiments. The sampled data will reduce the running time of the
algorithm used at the central node of the WSN or at the external processing computer
for the huge data collected from a huge data stream collected from a huge number of
sensor nodes in WSN.
CHAPTER 3: Proposed Method
28
Moreover, the DCAD originally just send parameters and outliers which are of no
much use for the machine learning and further classification purposes.
CHAPTER 4: Experimental Setups and Results
29
CHAPTER
4
4 Experimental Setups and Results
4.1.
Datasets
4.1.1.
Synthetic Datasets
Synthetic data was generated by sampling from a bivariate Gaussian distribution
as was done in [19]. The parameters of the distribution are simply the mean and
covariance matrix:
(
)
The synthetic data are two dimensional only for visualization purposes.
We generated 7 datasets:
1. 500
2. 1,000
3. 2,000
4. 3,000
5. 4,000
6. 5,000
7. 50,000
Note that we found out by experiment our method works better with smaller data
sets so we focused on small data from 500 to 5,000. In addition, we did not extend our
experiment up to 50,000 or more due to the research time limit.
CHAPTER 4: Experimental Setups and Results
4.1.2.
30
Grand Saint Bernard (GSB) Dataset
The GSB dataset was gathered in the year 2007 from 23 sensors that deployed at
the Grand-St-Bernard pass between Italy and Switzerland [44]. We extracted the data
Gathered during October by station 10. Also for visualization purposes we only
extracted two features:

Column 9: Ambient Temperature.

Column 12: Relative Humidity.
The size of the extracted GSB data is 17, 302 data points of 2 dimensions.
Figure 14: GSB Data Scatter Plot.
CHAPTER 4: Experimental Setups and Results
31
Note that for GSB data set it seems to be there is a sudden fault in the humidity
sensor of the node (which is depicted by the U-shaped graphed at the bottom of the
GSB scatter plot in Fig. (14)[19].
4.1.3.
Wind Tower Dataset
One of the obvious and important places to gather data is the Masdar Wind tower
project. The Wind Data are collected from the Innovative Masdar Wind Tower project
at the bottom of the wind tower cooling opening. The Wind Tower is a modern
implementation of the traditional Arabic wind tower that has been used to provide
cooling for the traditional Arabic houses [45], [46].
The data are collected from 10:15:15, 02-10-2013 up until 11:01:40, 08-10-2013.
The data size = 8,082 of two dimensional data points.
The two collected attributes are:
Column 1 = Relative Humidity.

Column 2 = Temperature.
Let us highlight that; the data gathered in reality are two dimensions, where each
dimension represents one attribute of the sensed data, and we focused on 2
dimensions to be able to visualize the data in our experiments. However, we could go
for more dimensions in the future works. Table 3 provides an example of 2dimensional vectors containing the two attributes: temperature and humidity.
Table 3: Example of Wind Tower Data.
Time
Measurements
Humidity
Temperature
43
32.4
75
31.4
99.9
28
CHAPTER 4: Experimental Setups and Results
25
32
37.4
Note that for table. 3 the data shown are not consecutives data points, they are just
for showing various wind tower data values at different times.
Figure 15: Wind tower Data Scatter Plot.
Note that for the Wind tower data set it seems to be there is 100% saturation of the
relative humidity sensor of the Arduino node (see the vertical line structure at the
right of the Wind Data scatter plot in Fig. 15.
Note also that we were only concerned with 2-diminonal data because the scale of
the WSN nodes in the future could be huge and lots of data will be gathered and that
CHAPTER 4: Experimental Setups and Results
33
is enough not to make the project complicated for now; since some sensors can get
faulty or malfunction due to the harsh climate in the summer and the dusty wither and
hence decided to not work for more than 2 dominions. On the other hand we did not
need for now more than relative humidity and temperature analysis for our project
also it easier to use 2 dominions for continues data collection and future online data
monitoring and control.
In addition, At present we did not run experiments with more than two attributes
to visualize the data, and for the possibility of huge data collection of tens or hundreds
of sensor deployments, on the other hand the required data collection from the wind
tower for analyzes is only for temperature and relative humidity for the current
situation, so that going for more than 2 dominions for now is not essentially relevant.
4.2.
Outlier Detection Performance Measurements
Detection rate and false alarm rate also known as false positive rate (FPR) and
receiver operating characteristic curves are usually used to show the tradeoff between
the detection rate and the false alarm rate in WSN [32], nevertheless; Intrusion
detection is an important aspect of outlier detection and the metrics used commonly in
this field are ROC analysis, precision, recall, F-manures and confusion matrix [5].
Table 4: Confusion Matrix
Predicted labels
Confusion matrix
Normal
Anomalies
True Negative
False Positive
(TN)
(FP)
False Negative
True Positives
(FN)
(TP)
Normal
Actual Labels
Anomalies
CHAPTER 4: Experimental Setups and Results
34
(Correctly classified)
From Table 4 the following equations can be defined to calculate the precision,
recall, and F-value.
Figure 16: Precision (P) and Recall (R), adopted from ‎[35].
In Figure 16 the relevant items are to the left of the straight line while the
retrieved items are within the oval. The red regions represent errors. On the left these
are the relevant items not retrieved (false negatives), while on the right they are the
retrieved items that are not relevant (false positives) [35].
We decided to select the Precision and Recall and F1 as our metrics because they
are well known from the data mining prospective, also the F1 [5] could be seen as a
balanced measure that contain booth precision and recall.
CHAPTER 4: Experimental Setups and Results
4.3.
35
Algorithms Explored
The conducted testing on sampled data from the ellipses (using both DCAD and
TLDCAD) are evaluated on two main types of classifiers: Support Vector Machine
(SVM) and Artificial Neural Network (ANN).
4.4.
Experiment I: Preprocessing Approach
The main methods used in the preprocessing approach are adding an additional
layer to DCAD to get the TLDCAD and then label the output data with two labels
(classes): “anomalous” and “normal”. Then, we send the data to a classifier Support
Vector Machine (SVM) or Artificial Neural Networks (ANN) and then compare the
output data from the DCAD and TLDCAD to draw the final conclusion. The
flowchart in Fig. 17 summarize the methodology used for the preprocessing approach.
Table 5 shows of how the data are generated and how they are preprocessed using
the ellipses for the SVM classifier. (That is virtually the same for the ANN classifier.)
Table 5: Preprocessing Experiment Flow of TLDCAD vs. DCAD as a Preprocessor.
Step 1
Synthetic data generation: scatter plot of 5,000 normally distributed synthetic data
Step 2
TLDCAD’s‎output:
2% normal data
(between the two ellipses) and 2%
anomalous data (outside the outer
ellipse)
TLDCAD’s‎output:
4% normal data
(between the two ellipses) and 2%
anomalous data (outside the outer
ellipse)
DCAD’s‎output:
98% normal data
(within the ellipse)
and 2% anomalous data (outside
the ellipse)
Step 3
Scatterd plot for
Scatterd plot for
Scatterd plot for
CHAPTER 4: Experimental Setups and Results
Step 4
36
2% normal data vs. 2%
anomalous data
4% normal data vs. 2% anomalous
data
98% normal data vs. 2%
anomalous data
SVM output for
2% normal data vs. 2%
anomalous data
SVM output for
4% normal data vs. 2% anomalous
data
SVM output for 98% normal data
vs. 2% anomalous data
Table 6: DCAD Preprocessing Process Summary.
DCAD (98% Normal vs. 2% Outliers) processes summary
Table 7: TLDCAD Preprocessing Process Summary.
TLDCAD (4% Normal vs. 2% Outliers) processes summary
CHAPTER 4: Experimental Setups and Results
Figure 17: Flowchart of DCAD as a Preprocessor vs. TLDCAD.
37
CHAPTER 4: Experimental Setups and Results
Figure 18: One of the Ten Folds: Illustration of TLDCAD vs. DCAD as Preprocessor.
38
CHAPTER 4: Experimental Setups and Results
39
4.4.1. Experiment I Results
Note for both approaches, we are focusing on the “2% vs. 4%” TLDCAD since it
has better results than “2% vs. 2%”.
Tables 8 and 9 show some promising and even competitive results from the
preprocessing approach. Note that how the F1 measure value increases with the
increase of the normal layered data.
For example the average value for the SVM cross validation was 0.963 for 2%
TLDCAD and increased to 0.989 which is very competitive to the DCAD 98%
normal output which has the value for F1 measure of 0.998. That is by using
TLDCAD with 4% normal data we can obtain very competitive accuracy measures
and using only 6% of the data instead of using 100% of the data in order to have just
1% of accuracy increase.
The main advantage of TLDCAD over DCAD in the context of pre-processing is
in terms of TLDCAD's runtime, which is much shorter than that of DCAD. This is
especially important in WSNs as it will help to greatly reduce power consumption.
The much reduced running times of TLDCAD over DCAD on a standard PC for
various experimental setups can be observed in Tables 8 and 9. On the other hand we
can observe from the 5,000 data points results that when we increase the normal data
sample from 2% to 4% we can achieve approximately the same accuracy of that
obtained from 98% normal data point’s classification!. Not even for the 5000 but also
for the 50,000 the performance of using 4% normal vs. 98% is almost the same, while
the 4% TLDCAD is around 7 times faster than the DCAD Preprocessor.
CHAPTER 4: Experimental Setups and Results
40
Table 8: Average Results for 5,000 Synthetic Data Points in Experiment I-A.
SVM with
SMO
ANN using
10 hidden
layers
TLDCAD: 2%
normal
TLDCAD: 4%
normal
DCAD: 98%
normal
vs. 2%
anomalous
vs. 2%
anomalous
vs. 2%
anomalous
Precision
0.958763
0.994845
0.998775
Recall
0.968750
0.984694
0.999591
F1
0.963731
0.989744
0.999183
Time
(sec)
3.645893
3.780674
13.269996
Precision
0.900000
0.969697
1.000000
Recall
0.750000
0.941176
0.989276
F1
0.818182
0.955224
0.994609
Time
(sec)
2.000225
2.985272
5.224756
Table 9: Average Results for 50,000 Synthetic Data Points in Experiment I-B.
SVM with
SMO
TLDCAD: 2%
normal
TLDCAD: 4%
normal
DCAD: 98%
normal
vs. 2%
anomalous
vs. 2%
anomalous
vs. 2%
anomalous
Precision
0.905149
0.999484
1.000000
Recall
1.000000
0.961787
0.992368
F1
0.950213
0.980273
0.996169
Time
(sec)
9.808853
12.681529
85.694093
Precision
0.993243
0.996656
0.999728
0.967105
0.980263
0.999728
0.980000
0.988391
0.999728
7.051017
11.293773
248.290314
ANN using 10 Recall
hidden layers F1
Time sec
CHAPTER 4: Experimental Setups and Results
4.5.
41
Experiment II: Classification Approach
Figure 19: Flowchart of DCAD as a Classifier vs. TLDCAD.
The methods used in this Classification approach are explained in Fig. 19 where
we have to feed a labeled data to our classification process (TLDCAD and SVM) and
feed the same data to the DCAD used as classifier, and then we compare the output to
draw the final conclusion. In Figure 20 the 10-fold cross validation is depicted for the
classification approach.
CHAPTER 4: Experimental Setups and Results
Figure 20: One of the Ten Folds: Illustration of TLDCAD & SVM vs DCAD as a Classifier.
42
CHAPTER 4: Experimental Setups and Results
43
4.5.1. Experiment II Results
Table 10 show some promising results and the ability to our approach to
outperform the DCAD for all the synthetic datasets except for the 50,000 data set, on
the other hand our approach was equally in f1 measures with the GSB real dataset and
the Masdar Institute’s Wind Tower dataset.
Note that in Experiment II for the classification approach we have to randomize
the data many rounds to get the data structure that matches best for Experiment II.
However the average of these random rounds is not reported since the aim for us now
is to show the ability for the TLDCAD to outperform the DCAD in terms of F1
measure.
From Table.10 we can also see how the (SVM & TLDCAD) outperforms the
DCAD in the classifying and detecting anomalies by 7% and that is depicted by the
(SVM & TLDCAD) when they hit the 1.00 vs. DCAD of just 0.93 on the 500
synthetic dataset.
Table 10: Best Achieved results for TLDCAD with SVM vs. DCAD as a Classifier in
Experiment II.
#
1
2
3
4
5
6
7
8
9
Datasets
500 synthetic
1,000 ~
2000 ~
3000 ~
4000 ~
5000 ~
50,000~
GSB downloaded
Wind Tower
gathered Data
F1 Measures
SVM & TLDCAD
2% vs. 2% 2% vs. 4%
0.81
0.80
0.90
0.95
0.95
0.98
0.96
0.43
0.10
1.00
0.94
0.92
0.97
0.98
0.99
0.98
1.00
1.00
DCAD
2% vs. 98% (Parameters Only)
mean and covariance matrix.
0.93
0.90
0.89
0.94
0.97
0.98
0.99
1.00
1.00
CHAPTER 5: Conclusion and Future Work
44
CHAPTER
5
6
5 Conclusion and Future Work
5.1.
Conclusion
In this research work, we have two different approaches; the first approach is the
DCAD algorithm is used in this thesis to partition the data and then sends all the data
for classification purposes. Building on this work we proposed a TLDCAD algorithm
to send reduced amount of data than the data obtained by the DCAD. The output of
the two algorithms is compared; the results obtained show promising results for
TLDCAD. The current work is conducted using synthetic datasets. In addition we
moved further for a second approach, that is using DCAD as a classifier and compare
it to our algorithm TLDCAD joined with SVM. The results obtained for this approach
where also promising and open a wide door for outlier detection and preprocessing in
energy constrained devices..
Summary points are provided below with the illustration of our TLDCAD
algorithm:

It is based on the DCAD Algorithm.

It is faster than DCAD as data preprocessor (sampling method).

It is able to provide more accurate classification results on small and medium
data sets (

5000 data points).
It is good for security and privacy proposes because a subset but not all the
data are communicated.
CHAPTER 5: Conclusion and Future Work
5.2.
45
Future Work
The technique would scale with the number of attributes in terms of running time
since the matrix multiplication is involved. As a future work for TLDCAD code, we
are considering to generalize the algorithm to work with more than 2-dimensional
data. The matrix contribution for now is just a constant. However by going to higher
diminutions, our technique should scale with the number of attributes in terms of
running time. That is because the matrix multiplication will become the main
dominant of the TLDCAD algorithm’s running time since it has three nested loops
and the code does not have any recursion involved. This will give us an
, and at
that time more efficient running time algorithms should be implemented to reduce the
running time. However if we count the data samples then we are having a loop over
the number of samples k which will give us a matrix multiplication k times. The final
running time could then be expressed as
.
On the other hand, we can work with multiple clusters. Or try to exploit the
following formulas of the iterative DCAD to make our algorithm work online instead
of the current batch mode.
The below images are showing how online ellipses detecting anomalies with the
time dimension. If we proceed in the time dimension then we need to exploit the
following formulas (12) and (13) to add our own modification to make our algorithm
work online perfectly.
CHAPTER 5: Conclusion and Future Work
46
Figure 21: FFIDCAD with effective n ‎[19].
[
(
)(
–
(
–
)
–
)
(
–
)
]
The equations shown above are for the Forgetting factor Iterative DCAD (FFIDCAD)
with effective n. thee FFIDCAD add the forgetting factor λ to the older samples which
gives more weight to the new sample over the old ones in order to forget the old
samples. In order to limit the growth of k in FFIDCAD (with effective n) it has to use
a constant
in Equation 8. This constant is used instead of k when k ≥
. [19].
Abbreviations
APPENDIX
A
A Abbreviations
ANN: Artificial Neural Network
DCAD: Data Capture Anomaly Detection
FIDCAD: Forgetting Factor Iterative Data Capture Anomaly Detection
IDCAD: Iterative Data Capture Anomaly Detection
ML: Machine Learning
SVM: Support Vector Machine
TLDCAD: Two-layered Data Capture Anomaly Detection
WSN: Wireless Sensor Network
47
Masdar Wind Tower
APPENDIX
B
B Masdar Wind Tower
Table 11: Masdar Wind Tower Photographs and Images.
Figure 23: Wind Tower Image
(Photographed by Ibrahim Khamis)
Figure 22: Wind Tower air flow diagram
(Photographed by Ibrahim Khamis)
48
Figure 24: Wind Tower Bank of 75
high-pressure nozzles while
introducing mist to the Wind tower
Ventilation tube from inside
(Photographed by Ibrahim Khamis)
Masdar Wind Tower
Figure 25: Wind Tower Background (Photographed by Ibrahim Khamis)
Figure 26: Wind Tower how it works (Photographed by Ibrahim Khamis)
49
Masdar Wind Tower
Figure 27: Wind Tower Thermal Comfort (Photographed by Ibrahim Khamis)
50
Bibliography
Bibliography
[1]
S. Alam, G. Dobbie, P. Riddle, M. A. Naeem, “A swarm intelligence based
clustering approach for outlier detection,” in Proceedings of the 2010 IEEE
Congress on Evolutionary Computation (CEC), 2010, pp.1-7.
[2]
M. A. Azim, Z. Aung, W. Xiao, V. Khadkikar, and A. Jamalipour,
“Localization in wireless sensor networks by constrained simultaneous
perturbation stochastic approximation technique,” in Proceedings of the 6th
IEEE International Conference on Signal Processing and Communication
Systems (ICSPCS), 2012, pp. 1-9.
[3]
M. A. Azim, F. M. Kiaie, and M. H. Ahmed, “Environmental forest
monitoring using wireless sensor networks,” in Wireless Sensor Networks:
Current Status and Future Trends, CRC Press, 2012, pp. 61-78.
[4]
A. Challagalla, S. S. S. Dhiraj, D. V. L. N. Somayajulu, T. S. Mathew, S.
Tiwari, and S. S. Ahmad, “Privacy preserving outlier detection using
hierarchical clustering methods,” in Proceedings of the 34th IEEE Annual
Computer Software and Applications Conference Workshops (COMPSACW),
2010, pp. 152-157.
[5]
P. Dokas, L. Ertoz, V. Kumar, A. Lazarevic, J. Srivastava, and P. N. Tan,
“Data mining for network intrusion detection,” in Proceedings of the 2002
NSF Workshop on Next Generation Data Mining (NGDM), 2002, pp. 21-30.
[6]
M. A. Faisal, Z. Aung, J. Williams, and A. Sanchez, “Securing advanced
metering infrastructure using intrusion detection system with data stream
51
Bibliography
mining,” in Proceedings of the 2012 Pacific Asia Workshop on Intelligence
and Security Informatics (PAISI), 2012, pp. 96-111.
[7]
S. Ganapathy, N. Jaisankar, P. Yogesh, and A. Kannan, “An intelligent system
for intrusion detection using outlier detection,” in Proceedings of the 2011
International Conference on Recent Trends in Information Technology
(ICRTIT), 2011, pp. 119-123.
[8]
D. M. Hawkins. Identification of Outliers, Chapman and Hall, London, 1980.
[9]
J. H. M. Janssens, I. Flesch, and Eric O. Postma, “Outlier detection with oneclass classifiers from ML and KDD,” in Proceedings of the 2009 International
Conference on Machine Learning and Applications (ICMLA), 2009, pp. 147153.
[10]
Q. Ji-lin, Q. Wen, S. Ying, and F. Yu-mei, “A nonparametric outlier detection
method for financial data,” in Proceedings of the 2009 International
Conference on Management Science and Engineering (ICMSE), pp. 14421447.
[11]
S.-Y. Jiang and Q.-b. An, “Clustering-based outlier detection method,” in
Proceedings of the 5th International Conference on Fuzzy Systems and
Knowledge Discovery (FSKD) Volume 2, 2008, pp. 429-433.
[12]
S.-Y. Jiang and A.-M. Yang, “Framework of clustering-based outlier
detection,” in Proceedings of the 6th international conference on Fuzzy
Systems and Knowledge Discovery (FSKD) Volume 1, 2009, pp. 475-479.
[13]
I. Khamis and Z. Aung, “Outlier preprocessing in wireless sensor networks: A
two-layered ellipse approach,” in Proceedings of the 6th IEEE International
Conference on Developments in eSystems Engineering (DeSE), 2013, pp. 1-6.
52
Bibliography
[14]
J. G. Lee, J. Han, and X. Li, “Trajectory outlier detection: A partition-anddetect framework,” in Proceedings of the 24th IEEE International Conference
on Data Engineering (ICDE), 2008, pp. 140-149.
[15]
S. H. Lee, S. Lee, H. Song, and H.-S. Lee, "Wireless sensor network design
for tactical military applications: Remote large-scale environments,” in
Proceedings of the 2009 IEEE Conference on Military Communications
(MILCOM), 2009, pp. 1-7.
[16]
D. Li, Z. Aung, S. Sampalli, J. Williams, and A. Sanchez, “Privacy
preservation scheme for multicast communications in smart buildings of the
smart grid,” Smart Grid and Renewable Energy, vol. 4, no. 4, 2013, pp. 313324.
[17]
Y. Li and H. Kitagawa, “Db-outlier detection by example in high dimensional
datasets,” in Proceedings of the 2007 IEEE International Workshop on
Databases for Next Generation Researchers (SWOD), 2007, pp. 73-78.
[18]
P. H. Menold, R. K. Pearson, and F. Allgower, “Online outlier detection and
removal,” in Proceedings of the 7th Mediterranean Conference on Control
and Automation (MED), 1999, pp. 1110-1133.
[19]
M. Moshtaghi, C. Leckie, S. Karunasekera, J. C. Bezdek, S. Rajasegarar, and
M. Palaniswami, “Incremental elliptical boundary estimation for anomaly
detection in wireless sensor networks,” in Proceedings of the 11th IEEE
International Conference on Data Mining (ICDM), 2011, pp. 467-476.
[20]
K. Narita and H. Kitagawa, “Outlier detection for transaction databases using
association rules,” in Proceedings of the 9th International Conference on
Web-Age Information Management (WAIM), 2008, pp. 373-380.
53
Bibliography
[21]
J. H. Oh, J. Gao, and K. Rosenblatt, “Biological data outlier detection based
on Kullback-Leibler divergence,” Proceedings of the 2008 IEEE International
Conference on Bioinformatics and Biomedicine (BIBM), 2008, pp. 249-254.
[22]
K. Prakobphol and J. Zhan, “A novel outlier detection scheme for network
intrusion detection systems,” in Proceedings of the 2008 International
Conference on Information Security and Assurance (ISA), 2008, pp. 555-560.
[23]
J. Qu, “Outlier detection based on Voronoi diagram,” in Proceedings of the
4th International Conference on Advanced Data Mining and Applications
(ADMA), 2008, pp. 516-523.
[24]
S. Rajasegarar, J. C. Bezdek, C. Leckie, and M. Palaniswami, “Elliptical
anomalies in wireless sensor networks,” ACM Transactions on Sensor
Networks, vol. 6, no. 1, 2009, pp. 1-28.
[25]
S. Rajasegarar, C. Leckie, J. C. Bezdek, and M. Palaniswami, “Centered
hyperspherical and hyperellipsoidal one-class support vector machines for
anomaly detection in sensor networks,” IEEE Transactions on Information
Forensics and Security, vol. 5, no. 3, 2010, pp. 518-533.
[26]
I. Silva, L. A. Guedes, P. Portugal, and F. Vasques, “Reliability and
availability evaluation of wireless sensor networks for industrial applications,”
Sensors, vol. 12, no. 1, 2012, pp. 806-838.
[27]
Y. Tao and D. Pi, “Unifying density-based clustering and outlier detection,” in
Proceedings of the 2nd International Workshop on Knowledge Discovery and
Data Mining (WKDD), 2009, pp. 644-647.
[28]
J. Xi, “Outlier detection algorithms in data mining,” in Proceedings of the 2nd
International Symposium on Intelligent Information Technology Application
(IITA), 2008, pp. 94-97.
54
Bibliography
[29]
Z. Yang, N. Meratnia, and P. Havinga, “An online outlier detection technique
for wireless sensor networks using unsupervised quarter-sphere support vector
machine,” in Proceedings of the International Conference on Intelligent
Sensors, Sensor Networks and Information Processing (ISSNIP), 2008, pp.
151–156.
[30]
Y. Zhang, N. Meratnia, and P. Havinga, “Adaptive and online one-class
support vector machine-based outlier detection techniques for wireless sensor
networks,” in Proceedings of the 2009 International Conference on Advanced
Information Networking and Applications Workshops (WAINA), 2009, pp.
990-995.
[31]
Y. Zhang, N. Meratnia, and P. J. M. Havinga, “Ensuring high sensor data
quality through use of online outlier detection techniques,” International
Journal of Sensor Networks, vol. 7, no. 3, 2010, pp. 141-151.
[32]
Y. Zhang, N. Meratnia, and P. Havinga, “Outlier detection techniques for
wireless sensor networks: A survey,” IEEE Communications Surveys and
Tutorials, vol. 12, no. 2, 2010, pp. 159-170.
[33]
http://db.csail.mit.edu/labdata/labdata.html
[34]
http://mathpax.com/images/statistics.pdf
[35]
http://en.wikipedia.org/wiki/Precision_and_recall
[36]
https://portal.masdar.ac.ae/Pages/NewsDetail.aspx?NID=401
[37]
http://www.libelium.com/products/meshlium/wireless-sensor-networks
[38]
http://www.libelium.com/products/waspmote
[39]
http://www.libelium.com/development/developers/
[40]
http://www.techopedia.com/definition/25651/wireless-sensor-network-wsn
[41]
http://www.seeedstudio.com/depot/grove-rtc-p-758.html?cPath=25_30
55
Bibliography
[42]
http://www.seeedstudio.com/depot/grove-temperaturehumidity-sensor-pro-p838.html
[43]
http://www.seeedstudio.com/depot/sd-card-shield-p-492.html?cPath=132_134
[44]
http://lcav.epfl.ch/cms/lang/en/pid/86035
[45]
http://www.masdar.ac.ae/campus-community/the-campus/windtower
[46]
http://masdarcity.ae/en/110/frequently-asked-questions/
[47]
S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogerakiand, and D.
Gunopulos, Online Outlier Detection in Sensor Data using Nonparametric
Models, J. Very Large Data Bases, VLDB 2006.
[48]
S. Rajasegarar, C. Leckie, M. Palaniswami, and J.C. Bezdek, Distributed
Anomaly Detection in Wireless Sensor Networks, Proc. IEEE ICCS, 2006.
[49]
S. Rajasegarar, C. Leckie, M. Palaniswami, and J. C. Bezdek, Quarter Sphere
Based Distributed Anomaly Detection in Wireless Sensor Networks, Proc.
IEEE International Conference on Communications, pp. 3864-3869, 2007.
[50]
D. Janakiram, A. Mallikarjuna, V. Reddy, and P. Kumar, Outlier Detection in
Wireless Sensor Networks using Bayesian Belief Networks, Proc. IEEE
Comsware, 2006.
[51]
D.J. Hill, B.S. Minsker, and E. Amir, Real-Time Bayesian Anomaly Detection
for Environmental Sensor Data, Proc. 32nd Congress of the International
Association of Hydraulic Engineering and Research, 2007.
[52]
V. Chatzigiannakis, S. Papavassiliou, M. Grammatikou, and B.Maglariset,
Hierarchical Anomaly Detection in Distributed Large-Scale Sensor Networks,
Proc. ISCC, 2006.
[53]
http://en.wikipedia.org/wiki/Wireless_sensor_network
56
Bibliography
[54]
M. Bahrepour,Y. Zhang, N. Meratnia, and P. Havinga, “Use of Event Detection
Approaches for Outlier Detection in Wireless Sensor Networks,” IEEE
Communications Surveys and Tutorials, 2009, pp. 439-444.
57