A Proposed Statistical Approach for Outlier Detection

Dr. Amr Mohamed Mohamed Kamal
Ph.D. in Computers and Information
Information Technology Department, College of Applied Sciences, Ministry of Higher Education, Ibri, Sultanate of Oman.
Email: [email protected]

Abstract: This paper reviews some applications, causes, techniques, and approaches of anomaly detection, together with important issues that need to be addressed when dealing with anomalies. It also proposes a statistical approach for anomaly detection.

Keywords: Anomaly detection; anomaly score; outlier detection; outlier score; deviation detection; data cleaning; discordant observation; exception mining.

1. Literature review:
Gupta provides a comprehensive and structured overview of a large set of outlier definitions for various forms of temporal data [3]. Ranjan presents a new clustering approach for anomaly intrusion detection based on the K-medoids clustering method and certain modifications of it [5]. The proposed algorithm achieves a high detection rate and overcomes the disadvantages of the K-means algorithm [5]. Shon concentrated on machine learning techniques for detecting attacks from internet anomalies [6]. That machine learning framework consists of two major components: a Genetic Algorithm (GA) for feature selection and a Support Vector Machine (SVM) for packet classification. Thiprungsri examined the use of clustering technology to automate fraud filtering during an audit [4].

2. Introduction:
An anomaly is a pattern in the data that does not conform to the expected behaviour. In anomaly detection, the goal is to find objects that are different from most other objects. Anomalous objects are often known as outliers, since on a scatter plot of the data they lie far away from the other data points.
Anomaly detection is also known as deviation detection, because anomalous objects have attribute values that deviate significantly from the expected or typical attribute values, or as exception mining, because anomalies are exceptional in some sense. There is a variety of anomaly detection approaches from several areas, including statistics, machine learning, and data mining. All of them try to capture the idea that an anomalous data object is unusual or in some way inconsistent with other objects. Although unusual objects or events are, by definition, relatively rare, this does not mean that they do not occur frequently in absolute terms. Anomalous values may indicate either a problem or a new phenomenon to be investigated. When they occur, their consequences can be quite dramatic, and quite often in a negative sense.

The following examples illustrate some applications for which anomalies are of considerable interest:

Fraud detection: The purchasing behavior of someone who steals a credit card is probably different from that of the original owner. Credit card companies attempt to detect a theft by looking for buying patterns that characterize theft or by noticing a change from typical behavior.

Intrusion detection: Unfortunately, attacks on computer systems and computer networks are commonplace. While some of these attacks, such as those designed to disable or overwhelm computers and networks, are obvious, other attacks, such as those designed to secretly gather information, are difficult to detect. Many of these intrusions can only be detected by monitoring systems and networks for unusual behavior.

Ecosystem disturbances: In the natural world, there are atypical events that can have a significant effect on human beings. Examples include hurricanes, floods, droughts, heat waves, global warming, and fires. The goal is often to predict the likelihood of these events and to identify their causes.
Public health: If all children in a city are vaccinated for a particular disease, e.g., measles, then the occurrence of a few cases scattered across various hospitals in the city is an anomalous event that may indicate a problem with the vaccination programs in the city.

Although much of the recent interest in anomaly detection has been driven by applications in which anomalies are the focus, historically, anomaly detection (and removal) has been viewed as a technique for improving the analysis of data objects. For instance, a relatively small number of outliers can distort the mean and standard deviation of a set of values or alter the set of clusters produced by a clustering algorithm. The term cluster refers to a group of data objects among which there exists a certain degree of similarity [1]. Therefore, anomaly detection (and removal) is often a part of data processing.

3. Some issues of anomalies:

3.1 Data from different classes: An object may be different from other objects (anomalous) because it is of a different type or class. To illustrate, someone committing credit card fraud belongs to a different class of credit card users than those people who use credit cards legitimately.

3.2 Natural variation: Many data sets can be modeled by statistical distributions, such as a normal (Gaussian) distribution, where most of the objects are near a center (average object) and the probability of a data object decreases rapidly as the distance of the object from the center of the distribution increases.

3.3 Data measurement and collection errors: Errors in the data collection or measurement process are another source of anomalies. A measurement may be recorded incorrectly because of human error, a problem with the measuring device, or the presence of noise. The goal is to eliminate such anomalies, since they provide no interesting information and only reduce the quality of the data and of the subsequent data analysis.
Indeed, the removal of this type of anomaly is the focus of data preprocessing, specifically data cleaning. So, noise should be removed before outlier detection.

4. Techniques for anomaly detection:
This section gives a high-level description of some anomaly detection techniques and their associated definitions of an anomaly.

4.1 Model-based techniques: Many anomaly detection techniques first build a model of the data. Anomalies are objects that do not fit the model very well. For example, a model of the distribution of the data can be created by using the data to estimate the parameters of a probability distribution. An object does not fit the model very well, i.e., it is an anomaly, if it is not very likely under the distribution. If the model is a set of clusters, then an anomaly is an object that does not strongly belong to any cluster [4]. When a regression model is used, an anomaly is an object that is relatively far from its predicted value [4]. Because anomalous and normal objects can be viewed as defining two distinct classes, classification techniques can be used for building models of these two classes [1]. In some cases, it is difficult to build a model, e.g., because the statistical distribution of the data is unknown or no training data are available. In these situations, techniques that do not require a model, such as those described below, can be used.

4.2 Proximity-based techniques: It is often possible to define a proximity measure between objects, and a number of anomaly detection approaches are based on proximities. Anomalous objects are those that are distant from most of the other objects. Many of the techniques in this area are based on distances and are referred to as distance-based outlier detection techniques [1].

4.3 Density-based techniques: Objects that are in regions of low density are relatively distant from their neighbors, and can be considered anomalous [5].

5. Use of class labels:
There are three basic approaches to anomaly detection: unsupervised, supervised, and semi-supervised [4]. The major distinction is the degree to which class labels (anomaly or normal) are available for at least some of the data.

5.1 Supervised anomaly detection: Labels are available for both normal data and anomalies [4].

5.2 Unsupervised anomaly detection: No labels are assumed. The approach is based on the assumption that anomalies are very rare compared to normal data. In such cases, the objective is to assign to each instance a score (or a label) that reflects the degree to which the instance is anomalous [4].

5.3 Semi-supervised anomaly detection: Labels are available only for normal data. In the semi-supervised setting, the objective is to find an anomaly label or score for a set of given objects by using the information from labeled normal objects.

6. Important issues that need to be addressed when dealing with anomalies:

6.1 Number of attributes used to define an anomaly: Since an object may have many attributes, it may have anomalous values for some attributes but ordinary values for other attributes. Furthermore, an object may be anomalous even if none of its attribute values is individually anomalous. For example, it is common to have people who are 70 cm tall (children) or who weigh 150 kg, but uncommon to have a 70 cm tall person who weighs 150 kg. A general definition of an anomaly must specify how the values of multiple attributes are used to determine whether or not an object is an anomaly. This is a particularly important issue when the dimensionality of the data is high.

6.2 Global versus local perspective: An object may seem unusual with respect to all objects, but not with respect to objects in its local neighborhood. For example, a person whose height is 2.3 m is unusually tall with respect to the general population, but not with respect to professional basketball players.
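
The global versus local distinction in 6.2 can be illustrated with a small sketch. It measures how far a height lies from a group's mean in units of that group's standard deviation. The function name `sd_distance` and the group means and standard deviations are hypothetical values chosen only for illustration; they are not taken from the paper or from real population statistics.

```python
def sd_distance(value, mean, sd):
    """Distance of value from a group mean, in units of the group's standard deviation."""
    return (value - mean) / sd

# Hypothetical reference groups: illustrative numbers only, not real statistics.
general_population = {"mean": 1.70, "sd": 0.10}
basketball_players = {"mean": 2.00, "sd": 0.15}

height = 2.3  # metres, the example used in the text

global_score = sd_distance(height, **general_population)
local_score = sd_distance(height, **basketball_players)

print(f"relative to the general population: {global_score:.1f} standard deviations")
print(f"relative to basketball players:     {local_score:.1f} standard deviations")
```

Under these assumed parameters the same height is far from the centre of one group and much closer to the centre of the other, so whether it is reported as anomalous depends on the reference group.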
6.3 Degree to which a point is an anomaly: Some techniques report an object in a binary fashion: it is either an anomaly or it is not. Frequently, this does not reflect the underlying reality that some objects are more extreme anomalies than others. Hence, it is desirable to have some assessment of the degree to which an object is anomalous. This assessment is known as the anomaly or outlier score.

7. Statistical approaches:
Depending on whether we are working with a population or a sample, a numerical measure is known as either a parameter or a statistic. A parameter is a measure computed from the entire population. As long as the population does not change, the value of the parameter will not change [2]. A statistic is a measure computed from a sample that has been selected from a population. The value of the statistic will depend on which sample is selected [2].

Statistical approaches are model-based approaches; i.e., a model is created for the data, and objects are evaluated with respect to how well they fit the model. Most statistical approaches to outlier detection are based on building a probability distribution model and considering how likely objects are under that model. This paper presents one of the statistical approaches for outlier detection.

8. Probabilistic definition of an outlier:
An outlier is an object that has a low probability with respect to a probability distribution model. If the data are assumed to have a Gaussian distribution, then the mean and standard deviation of the underlying distribution can be estimated by computing the mean and standard deviation of the data. The probability of each object under the distribution can then be estimated. A wide variety of statistical tests have been devised to detect outliers, or discordant observations. So, there are two basic assumptions:
1. Normal objects are in the center of the data space.
2. Outliers are located at the border of the data space, beyond [µ ± 3σ], where µ is the mean and σ is the standard deviation.
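
The two assumptions above can be turned into a minimal sketch: estimate the mean and standard deviation from the data, then flag every value outside the interval [mean ± k·sd]. This is an illustration under stated assumptions, not the paper's implementation. The function name `three_sigma_outliers`, the parameter `k`, and the sample `readings` are hypothetical and were chosen only for this example.

```python
import statistics

def three_sigma_outliers(values, k=3.0):
    """Return the values lying outside [mean - k*sd, mean + k*sd]."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    lower, upper = mean - k * sd, mean + k * sd
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements: 20 readings near 10 plus one gross error (25.0).
readings = [10.0, 10.1, 9.9] * 6 + [10.0, 10.1, 25.0]

print(three_sigma_outliers(readings))
```

Note that the suspect value itself enters the estimates of the mean and standard deviation, so in a small sample a single extreme value can widen the interval enough to hide itself or other outliers.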
So, the proposed approach uses the statistical concepts of central tendency (sample mean, median, and mode) and measures of variation (variance and standard deviation).

9. Important issues that need to be addressed when dealing with the probabilistic definition of an outlier:

9.1 Identifying the specific distribution of a data set: Probability is the way decision makers express their uncertainty about outcomes and events. Discrete distributions (such as the uniform, binomial, multinomial, hypergeometric, Poisson, negative binomial, and geometric) and continuous distributions (such as the normal, gamma, exponential, chi-square, and Weibull) are used frequently in business decision making. Discrete random variables are determined by counting. Continuous random variables are determined by measuring. Of course, if the wrong model is chosen, then an object can be erroneously identified as an outlier.

9.2 The number of attributes used: A data set is univariate, bivariate, or multivariate depending on whether it contains information on one variable only, on two variables, or on more than two [9]. Most statistical outlier detection techniques apply to a single attribute, but some techniques have been defined for multivariate data. In this paper, I propose a framework for detecting outliers in a univariate environment.

10. Detecting outliers in a Univariate Normal Distribution:
The Gaussian (normal) distribution is one of the most frequently used distributions in statistics, and I will use it to describe a simple approach to statistical outlier detection. For continuous probability distributions, we find the probability that a value lies within a specified range. The graph of the normal distribution, called the normal curve, is the bell-shaped curve that describes the distribution of many sets of data that occur in nature, industry, and research.
The mathematical equation for the probability distribution of the continuous variable depends on the two parameters µ and σ, its mean and standard deviation. Here I shall denote the density function of X by n(x; µ, σ). The density function of the normal random variable X, with mean µ and variance σ², is

$$f(x) = n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}, \qquad -\infty < x < \infty,$$

where π = 3.14159... and e = 2.71828... [7]

Once µ and σ are specified, the normal curve is completely determined. The area under a probability curve must be equal to 1, and therefore the more variable the set of observations, the lower and wider the corresponding curve will be.

10.1 Properties of the normal curve:
1. The highest point on the normal curve is located at the mean, which is also the median and the mode of the distribution.
2. The curve is symmetric about a vertical axis through the mean µ.
3. If a random variable has a small variance or standard deviation, we would expect most of the values to be grouped around the mean. A large value of σ indicates greater variability, and therefore the area is more spread out.
4. The normal curve approaches the horizontal axis asymptotically as we proceed in either direction away from the mean.
5. The total area under the curve and above the horizontal axis is equal to 1.

I shall now show that the parameters µ and σ² are indeed the mean and the variance of the normal distribution. To evaluate the mean, I write

$$E(X) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} x\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}\,dx$$

Setting z = (x − µ)/σ gives σz = x − µ. Differentiating both sides with respect to x gives σ (dz/dx) = 1, so dx = σ dz. So, we obtain

$$E(X) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} (\mu + \sigma z)\, e^{-z^{2}/2}\,dz = \mu\,\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-z^{2}/2}\,dz + \frac{\sigma}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z\, e^{-z^{2}/2}\,dz$$

The first term is µ times the area under a normal curve with mean zero and variance 1, and hence equal to µ. The second integral is equal to zero, because its integrand is an odd function of z.
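
The result E(X) = µ can also be checked numerically. The sketch below integrates the density n(x; µ, σ) and x·n(x; µ, σ) with a simple midpoint rule. The helper names `normal_pdf` and `integrate`, the parameter values, and the integration range of ten standard deviations on each side are assumptions made for this illustration only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density n(x; mu, sigma) of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

mu, sigma = 5.0, 2.0  # arbitrary illustrative parameters
# The tails beyond 10 standard deviations contribute a negligible amount.
a, b = mu - 10 * sigma, mu + 10 * sigma

area = integrate(lambda x: normal_pdf(x, mu, sigma), a, b)      # should be close to 1
mean = integrate(lambda x: x * normal_pdf(x, mu, sigma), a, b)  # should be close to mu

print(area, mean)
```

The first value approximates the total area under the curve and the second approximates E(X); both should agree with 1 and µ up to the numerical error of the integration.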
The variance of the normal distribution is given by

$$\sigma^{2} = E\left[(X-\mu)^{2}\right] = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} (x-\mu)^{2}\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}\,dx$$

Again setting z = (x − µ)/σ, so that σz = x − µ and dx = σ dz, we obtain

$$E\left[(X-\mu)^{2}\right] = \frac{\sigma^{2}}{\sqrt{2\pi}}\int_{-\infty}^{\infty} z^{2}\, e^{-z^{2}/2}\,dz$$

Integrating by parts with u = z and dv = z e^{−z²/2} dz, so that du = dz and v = −e^{−z²/2}, we find

$$E\left[(X-\mu)^{2}\right] = \frac{\sigma^{2}}{\sqrt{2\pi}}\left(\left[-z\, e^{-z^{2}/2}\right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-z^{2}/2}\,dz\right) = \sigma^{2}(0 + 1) = \sigma^{2}$$

Changing µ shifts the distribution left or right. Changing σ increases or decreases the spread, as shown in Fig. 1. No matter what µ and σ are, the area between µ − σ and µ + σ is about 68%; the area between µ − 2σ and µ + 2σ is about 95%; and the area between µ − 3σ and µ + 3σ is about 99.7%. Almost all values fall within 3 standard deviations of the mean. Often, the three-sigma interval [µ ± 3σ] is called a tolerance interval that contains almost all of the measurements in a normally distributed population [8], as shown in Fig. 2.

[Fig. 1: Effect of changing µ and σ on the normal curve f(x). Changing µ shifts the distribution left or right; changing σ increases or decreases the spread.]
[Fig. 2: Tolerance interval.]

There is a unique normal curve for every combination of µ and σ. In theory, there is an unlimited number of such combinations. Fortunately, we are able to transform all the observations of any normal random variable X to a new set of observations of a normal random variable Z with mean zero and variance 1. This can be done by means of the transformation Z = (X − µ)/σ. Whenever X assumes a value x, the corresponding value of Z is given by z = (x − µ)/σ. Therefore, if X falls between the values x = x1 and x = x2, the random variable Z will fall between the corresponding values z1 = (x1 − µ)/σ and z2 = (x2 − µ)/σ.
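
The transformation to Z and the 68%, 95%, and 99.7% figures can be reproduced with a short sketch. It standardizes the interval endpoints and evaluates the standard normal cumulative distribution through the error function from Python's standard library. The function names `standardize`, `std_normal_cdf`, and `prob_between`, and the parameter values µ = 100 and σ = 15, are assumptions made for this illustration.

```python
import math

def standardize(x, mu, sigma):
    """Transform a value x of X into the corresponding value z of Z."""
    return (x - mu) / sigma

def std_normal_cdf(z):
    """P(Z < z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_between(x1, x2, mu, sigma):
    """P(x1 < X < x2) computed as P(z1 < Z < z2)."""
    z1 = standardize(x1, mu, sigma)
    z2 = standardize(x2, mu, sigma)
    return std_normal_cdf(z2) - std_normal_cdf(z1)

mu, sigma = 100.0, 15.0  # arbitrary illustrative parameters
for k in (1, 2, 3):
    p = prob_between(mu - k * sigma, mu + k * sigma, mu, sigma)
    print(f"within {k} standard deviation(s): {p:.4f}")
```

Because the endpoints are standardized first, the printed probabilities do not depend on the particular µ and σ chosen.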
So, all normal distributions can be converted into the standard normal curve by subtracting the mean and dividing by the standard deviation. Consequently, we can write

$$P(x_1 < X < x_2) = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{x_1}^{x_2} e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}\,dx = \frac{1}{\sqrt{2\pi}}\int_{z_1}^{z_2} e^{-z^{2}/2}\,dz = \int_{z_1}^{z_2} n(z; 0, 1)\,dz = P(z_1 < Z < z_2)$$

But it is very important to notice that:
1) Not all continuous random variables are normally distributed.
2) Both the mean and the standard deviation are extremely sensitive to outliers. Effectively, one "bad point" can completely skew the mean.
3) It is important to evaluate how well the data are approximated by a normal distribution.

11. A proposed statistical approach for outlier detection:
1) Look at the histogram and check whether it appears bell-shaped.
2) Compute descriptive summary measures (mean, median, and mode).
3) Do about 68% of the observations lie within 1 standard deviation of the mean? Do about 95% of the observations lie within 2 standard deviations of the mean? Do about 99.7% of the observations lie within 3 standard deviations of the mean?
4) Be cautious about sample size, because the observed distribution is strongly influenced by it.

12. Conclusion:
1. Outlier detection using the univariate normal distribution is a very promising technique for detecting critical information in data, and it can be applied in various application domains.
2. The nature of the outlier detection problem depends on the scope of the application domain.
3. Different problem formulations require different techniques.

References:
[1] Hongbo Du, "Data Mining Techniques and Applications – An Introduction", Cengage Learning EMEA, Cheriton House, North Way, Andover, Hampshire, SP10 5BE, UK, 2010.
[2] David F. Groebner, Patrick W. Shannon, Phillip C. Fry, and Kent D. Smith, "Business Statistics – A Decision Making Approach", seventh edition, Pearson International Edition, Upper Saddle River, New Jersey, U.S.A., 2008.
[3] Manish Gupta, Jing Gao, Charu C. Aggarwal, and Jiawei Han, "Outlier Detection for Temporal Data: A Survey", IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, January 2014.
[4] Sutapat Thiprungsri and Miklos A. Vasarhelyi, "Cluster Analysis for Anomaly Detection in Accounting Data: An Audit Approach", The International Journal of Digital Accounting Research, Vol. 11, pp. 69-84, ISSN: 15778517, 2011.
[5] Ravi Ranjan and G. Sahoo, "A new clustering approach for anomaly intrusion detection", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 4, No. 2, March 2014.
[6] Taeshik Shon, Yongdue Kim, Cheolwon Lee, and Jongsub Moon, "A Machine Learning Framework for Network Anomaly Detection using SVM and GA", Proceedings of the 2005 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, U.S.A., 2005.
[7] Derek L. Waller, "Statistics for Business", Elsevier, Book Aid International, Sabre Foundation, 2008.
[8] Bruce L. Bowerman, Richard T. O'Connell, J.B. Orris, and Emily S. Murphree, "Essential of Business Statistics", McGraw-Hill, Irwin, 2010.
[9] Heinz Kohler, "Statistics for Business and Economics", Thomson Learning, Inc., 2002.