A Vertical Distance-based Outlier Detection Method with Local Pruning

Dongmei Ren, Imad Rahal, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105, USA
1-701-231-6403
[email protected]

Kirk Scott
Computer Science
The University of Alaska Anchorage
Anchorage, AK 99508

ABSTRACT
"One person's noise is another person's signal." Outlier detection is used to clean up datasets and also to discover useful anomalies, such as criminal activities in electronic commerce, computer intrusion attacks, terrorist threats, and agricultural pest infestations. Thus, outlier detection is critically important in the information-based society. This paper focuses on finding outliers in large datasets using distance-based methods. First, to speed up outlier detection, we revise Knorr and Ng's distance-based outlier definition; second, a vertical data structure, instead of traditional horizontal structures, is adopted to facilitate efficient outlier detection. We tested our method against the National Hockey League dataset and show an order of magnitude of speed improvement compared to contemporary distance-based outlier detection approaches.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data Mining

General Terms
Algorithms, Performance, Experimentation, Theory.

Keywords
Outlier Detection, Distance-based, Neighborhood Search, P-Tree, Pruning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM'04, November 8-13, 2004, Washington, DC, USA. Copyright 2004 ACM 1-58113-874-1/04/0011…$5.00.

1. INTRODUCTION
Many studies that consider outlier identification as their primary objective are in statistics. Barnett and Lewis provide a comprehensive treatment, listing about 100 discordance tests for normal, exponential, Poisson, and binomial distributions [21]. The choice of an appropriate discordance test depends on a number of factors: (a) the distribution, (b) whether or not the distribution parameters, such as the mean and the variance, are known, (c) the number of expected outliers, and (d) the types of expected outliers, such as upper or lower outliers in an ordered sample [1]. Yet, despite all of these options and decisions, there is no guarantee of finding outliers, either because there may not be any test developed for a specific combination, or because no standard distribution can adequately model the observed distribution [1].

The second category of outlier studies in statistics is depth-based. Each data object is represented as a point in a k-dimensional space and is assigned a depth; outliers are more likely to be data objects with smaller depths. Many definitions of depth have been proposed. In theory, depth-based approaches can work for large values of k, where k is the number of dimensions. In practice, however, while efficient algorithms exist for k equal to 2 or 3, depth-based approaches become inefficient for large datasets when k is larger than 3, because they rely on the computation of k-dimensional convex hulls, which has a lower bound complexity of Ω(N^(k/2)), where N is the number of samples [5].

In data mining, Knorr and Ng proposed a unified definition for outliers: an object O in a dataset T is a UO(p, D) outlier if at least fraction p of the objects in T lie at a distance greater than or equal to D from O [1][2]. They show that for many discordance tests in statistics, if an object O is an outlier according to a specific discordance test, then O is also a UO(p, D) outlier for some suitably defined p and D.
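As a concrete illustration, the UO(p, D) condition can be checked directly from the definition by counting, for each object, how many others lie at distance D or more. The following is a naive O(N²) sketch of ours, not the paper's algorithm; the function and variable names are our own:

```python
from math import dist  # Euclidean distance, Python 3.8+

def uo_outliers(points, p, D):
    """Return points O where at least fraction p of the dataset
    lies at distance >= D from O (Knorr and Ng's UO(p, D) test)."""
    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for x in points if dist(o, x) >= D)
        if far >= p * n:
            outliers.append(o)
    return outliers

# A tight cluster plus one isolated point:
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(uo_outliers(data, p=0.8, D=5))  # [(10, 10)]
```

The double loop over all pairs is exactly the cost that the pruning and vertical techniques described later are designed to avoid.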
They proposed a cell structure-based outlier detection algorithm, where the whole set of cells resembles a data cube, and showed that their method works well with low dimensionality. However, the method has two shortcomings. One is that it cannot achieve good performance with very large and high-dimensional datasets. The other is that it takes a global view of the dataset, making it difficult to detect some kinds of outliers in datasets with more complex structures, where data are distributed with different densities and arbitrary shapes. There are other definitions and methods for outliers, such as density-based and clustering-based methods; we focus on distance-based outlier detection in this paper.

In this paper, we propose a vertical "by-neighbor" outlier detection method with local pruning, which can detect outliers efficiently and scale well to large datasets. Our contributions are: 1) we introduce a neighborhood-based outlier definition which improves the speed of the outlier detection process; based on this definition, we propose two properties which can be used as a pruning method and a "by-neighbor" labeling method, respectively. 2) Using the pruning and "by-neighbor" techniques, we propose a new outlier detection algorithm which works much more efficiently over large datasets than current approaches in terms of speed and scalability. 3) Our method is implemented using a vertical representation of data, named P-trees, which speeds up our method further. Experiments show that our vertical outlier detection method outperforms current distance-based outlier detection methods significantly in terms of speed and scalability to data size.

The paper is organized as follows. Section 2 first introduces our restatement of Knorr and Ng's distance-based outlier definition in terms of neighborhood.
Next, based on this restatement, a pruning method and a "by-neighbor" labeling method are proposed. Section 3 gives an overview of the P-tree technology¹. Our outlier detection algorithm with local pruning is described in section 4. We experimentally demonstrate our improvements in efficiency and scalability in section 5. In section 6, we provide a theoretical complexity study of our work. Finally, we conclude the paper with possible future research extensions.

¹ Patents are pending on the P-tree technology. This work was partially supported by GSA Grant ACT#: K96130308.

2. DEFINITION OF DISTANCE-BASED (DB) OUTLIERS
In this section, a formal definition of outliers is given based on neighborhood. We also propose some properties which can be used to prune non-outlier data points.

2.1 Definitions
Definition of a DB outlier. In [1], Knorr and Ng defined a distance-based outlier as follows: an object O in a dataset T is a DB(p, D) outlier if at least fraction p of the objects in T lie at a distance greater than or equal to a threshold distance D from O. An equivalent statement is: an object O in a dataset T is a DB(p, D) outlier if at most fraction (1-p) of the objects in T lie at a distance less than D from O. We give a neighborhood-based outlier definition formally in the following paragraphs.

Definition 1 (Neighborhood). The neighborhood of a data point O with radius r is defined as the set Nbr(O, r) = {x ∈ X | |O - x| ≤ r}, where X is the dataset, x is a point in X, and |O - x| is the distance between O and x. It is also called the r-neighborhood of O, and the points in this neighborhood are called the neighbors of O. The number of neighbors of O is denoted as N(Nbr(O, r)).

Definition 2 (Outliers). Based on the neighborhood definition, a point O is considered a DB(p, D) outlier if N(Nbr(O, D)) ≤ (1-p)*|X|, where |X| is the total size of the dataset X. We define the outlier set as the subset of X given by Ols(X, p, D) = {x ∈ X | N(Nbr(x, D)) ≤ (1-p)*|X|}.

2.2 Properties
Property 1. The maximum distance between two points in the neighborhood Nbr(O, D) is 2*D.

Proof: Let O1 and O2 be any two points in Nbr(O, D). By the triangle inequality, Dist(O1, O2) ≤ Dist(O1, O) + Dist(O, O2) ≤ D + D = 2*D. Hence 2*D is the maximum distance between any two points in Nbr(O, D), and Property 1 holds.

Property 2. If N(Nbr(O, D/2)) > (1-p)*|X|, then none of the points in Nbr(O, D/2) are outliers.

Proof: Let Q be an arbitrary point in Nbr(O, D/2). By Property 1, every point of Nbr(O, D/2) lies within distance D of Q, so Q's D-neighborhood Nbr(Q, D) completely contains Nbr(O, D/2). Since N(Nbr(Q, D)) ≥ N(Nbr(O, D/2)) > (1-p)*|X|, Q is not an outlier according to our outlier definition.

Property 2 can be used as a pruning rule: all points in Nbr(O, D/2) can be pruned off at once. The pruning makes the detection process more efficient, as we shall demonstrate in our experiments. Figure 1 shows the pruning pictorially, where the neighborhood Nbr(O, D/2) is pruned.

Figure 1. Pruning Nbr(O, D/2)

Property 3. If N(Nbr(O, 2D)) < (1-p)*|X|, then all the points in Nbr(O, D) are outliers.

Proof: For any point Q in Nbr(O, D), Nbr(Q, D) ⊆ Nbr(O, 2D) holds; that is, the D-neighborhood of Q is a subset of the 2D-neighborhood of O. From set theory, N(Nbr(Q, D)) ≤ N(Nbr(O, 2D)) < (1-p)*|X|. According to the outlier definition, all the points in Nbr(O, D) are outliers.

According to Property 3, all points in Nbr(O, D) can be labeled as outliers at once. Property 3 therefore provides a powerful means to label outliers by neighborhood, instead of point by point, which will speed up the outlier detection process significantly.
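Properties 2 and 3 combine naturally into a detection loop: a dense D/2-neighborhood clears all of its members at once, and a sparse 2D-neighborhood labels all D-neighbors outliers at once. The following is a minimal horizontal (list-based) sketch of our own; it ignores the vertical P-tree machinery introduced later, and all names are ours:

```python
from math import dist  # Euclidean distance, Python 3.8+

def neighbors(points, o, r):
    """Indices of points within distance r of o (the r-neighborhood)."""
    return {i for i, x in enumerate(points) if dist(o, x) <= r}

def detect(points, p, D):
    """'By-neighbor' outlier detection using Properties 2 and 3."""
    n = len(points)
    threshold = (1 - p) * n      # outlier iff N(Nbr(x, D)) <= threshold
    unprocessed = set(range(n))
    outliers = set()
    while unprocessed:
        i = unprocessed.pop()
        o = points[i]
        if len(neighbors(points, o, D)) > threshold:
            # Property 2: a dense D/2-neighborhood contains no outliers,
            # so all of its members can be pruned at once.
            half = neighbors(points, o, D / 2)
            if len(half) > threshold:
                unprocessed -= half
        else:
            outliers.add(i)
            # Property 3: if even the 2D-neighborhood is sparse,
            # every D-neighbor is an outlier and is labeled at once.
            if len(neighbors(points, o, 2 * D)) < threshold:
                near = neighbors(points, o, D)
                outliers |= near
                unprocessed -= near
    return outliers

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10.5, 10.5)]
print(sorted(detect(data, p=0.6, D=3)))  # [4, 5]
```

Both shortcut branches shrink the unprocessed set by whole neighborhoods rather than single points, which is where the speedup over the naive pairwise scan comes from.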
Figure 2 shows the "by-neighbor" process pictorially: all the points in Nbr(O, D) can be labeled as outliers without undergoing further processing.

Figure 2. Labeling outliers by-neighbor

3. REVIEW OF P-TREES
Most data mining algorithms assume that the data being mined have some sort of structure, such as relational tables in databases or data cubes in data warehouses [13]. Traditionally, data are represented horizontally and processed tuple by tuple (i.e., row by row) in the database and data mining areas. Such horizontally oriented record structures are known to scale poorly to very large datasets. In our previous work, we proposed a novel vertical data structure, the P-tree. In the P-tree approach, we decompose the attributes of relational tables into separate files by bit position and compress the vertical bit files using a data-mining-ready structure called the P-tree. Instead of processing horizontal data vertically, we process these vertical P-trees horizontally through fast logical operations. Since P-trees markedly compress the data and the P-tree logical operations scale extremely well, this common vertical data structure approach has the potential of addressing the curse of non-scalability with respect to data size. A number of papers proposing P-tree-based data mining algorithms have been published, in which it was explored and experimentally shown that P-trees facilitate efficient data mining over large datasets. In this section, we briefly review the features of the P-tree that will be used in this paper, including its optimized logical operations.

3.1 Construction of P-Trees
Given a dataset with d attributes, X = (A1, A2, …, Ad), and the binary representation of the jth attribute Aj as bj.m, bj.m-1, …, bj.i, …, bj.1, bj.0, we decompose each attribute into bit files, one file for each bit position [15]. To build a P-tree, a bit file is recursively partitioned into halves, and each half into sub-halves, until each sub-half is pure: entirely 1-bits or entirely 0-bits. The construction of P-trees is illustrated by an example in Figure 3. For simplicity, assume each transaction has one attribute. We represent the attribute in binary, e.g., (7)10 = (111)2, and vertically decompose the attribute into three separate bit files, as shown in b). The corresponding basic P-trees, P1, P2 and P3, are constructed as shown in c), d) and e). As shown in c) of Figure 3, the root value of the P1 tree, also called the root count, is 3, which is the count of 1s in the entire bit file. The second level of P1 contains the counts of 1s in the two halves separately, which are 0 and 3.

Figure 3. Construction of P-trees

3.2 P-Tree Operations
AND, OR and NOT logic operations are the most frequently used P-tree operations. For efficient implementation, we use a variation of P-trees called Pure-1 trees (P1-trees). A tree is pure-1, denoted as p1, if all the values in the sub-tree are 1s. Figure 4 shows the P1-trees corresponding to the P-trees in c), d), and e) of Figure 3. Figure 5 shows the results of the AND (a), OR (b) and NOT (c) operations on P-trees.

Figure 4. P1-trees for the transaction set in Figure 3

Figure 5. AND, OR and NOT operations

3.3 Predicate P-Trees
There are many variants of predicate P-trees, such as value P-trees, tuple P-trees, mask P-trees, etc. In this section we describe inequality P-trees, which will be used to search for neighbors in section 4.2. An inequality P-tree represents the data points within a dataset X satisfying an inequality predicate, such as x ≥ v or x ≤ v. Without loss of generality, we discuss the two inequality P-trees Px≥v and Px≤v, calculated as follows.

Calculation of Px≥v: Let x be an m-bit data point within a dataset X, and let Pm, Pm-1, …, P0 be the P-trees for the vertical bit files of X.
Let v = bm…bi…b0, where bi is the ith binary bit of v, and let Px≥v be the predicate tree for the predicate x ≥ v. Then Px≥v = Pm opm … Pi opi Pi-1 … op1 P0, i = 0, 1, …, m, where: 1) opi is AND if bi = 1, and opi is OR if bi = 0; 2) the operators are right binding, meaning they are associated from right to left, e.g., P2 op2 P1 op1 P0 is equivalent to (P2 op2 (P1 op1 P0)). For example, the inequality tree Px≥101 = (P2 AND (P1 OR P0)).

Calculation of Px≤v: The calculation of Px≤v is similar. Let x be an m-bit data point within a dataset X, and let P'm, P'm-1, …, P'0 be the complement P-trees for the vertical bit files of X. Let v = bm…bi…b0, where bi is the ith binary bit of v, and let Px≤v be the predicate tree for the predicate x ≤ v. Then Px≤v = P'm opm … P'i opi P'i-1 … opk+1 P'k, k ≤ i ≤ m, where: 1) opi is AND if bi = 0, and opi is OR if bi = 1; 2) k is the rightmost bit position with value "0", i.e., bk = 0 and bj = 1 for all j < k; 3) the operators are right binding. For example, the inequality tree Px≤101 = (P'2 OR P'1).

4. A VERTICAL OUTLIER DETECTION METHOD WITH LOCAL PRUNING
In section 4.1, we propose a "by-neighbor" outlier detection method with local pruning based on Properties 2 and 3 from section 2. In section 4.2, the method is implemented using the P-tree data representation; inequality P-trees and the optimized logical operations of P-trees facilitate efficient neighborhood search, which further speeds up the outlier detection process.

4.1 The "By-neighbor" Outlier Detection Algorithm with Local Pruning
The "by-neighbor" outlier detection algorithm is based on our neighborhood-based outlier definition and Properties 2 and 3. The algorithm proceeds as follows.

Step 1: Select one point O arbitrarily from the dataset.

Step 2: Search for the D-neighborhood of O and calculate the number of neighbors, N(Nbr(O, D)).

In case N(Nbr(O, D)) > (1-p)*|X|, where |X| is the total size of the dataset, reduce the radius from D to D/2 and calculate N(Nbr(O, D/2)). If N(Nbr(O, D/2)) > (1-p)*|X|, then by Property 2 none of the D/2-neighbors are outliers; those points need not undergo further testing, which is one of the sources of our algorithm's efficiency. In Figure 6, all points in the small gray neighborhood are pruned off.

In case N(Nbr(O, D)) ≤ (1-p)*|X|, O is an outlier; the algorithm then expands the neighborhood radius from D to 2D and computes N(Nbr(O, 2D)). If N(Nbr(O, 2D)) < (1-p)*|X|, then by Property 3 all the D-neighbors can be labeled as outliers; those points are inserted into the outlier set and need not undergo further detection either, which contributes to the speed improvement as well. Otherwise, only point O is an outlier. In Figure 6, all points in the large gray neighborhood are outliers.

Step 3: The process is repeated until all points in the dataset X have been examined.

Figure 6. "By-neighbor" outlier detection with local pruning

4.2 Vertical Approach Using P-Trees
The algorithm is implemented using the vertical data model, P-trees. By using P-trees, the above outlier detection process can be made even faster. The speed comes from three sources: 1) the neighborhood search is fast because it is conducted by inequality P-tree ANDing; 2) the computation of the number of neighbors is efficient because it is done by extracting the value of the root node of the neighborhood P-tree; and 3) the algorithm prunes points by neighborhood using the optimized P-tree AND operation, which accelerates the process significantly.

The vertical method works as follows. First, the dataset to be mined is represented as a set of P-trees.
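As a toy illustration of this first step and of the inequality predicate from section 3.3, the following sketch of ours uses plain uncompressed bit vectors in place of real (compressed) P-trees, evaluates Px≥v with the right-binding AND/OR rule, and reads the neighbor count off as the root count (the number of 1s). All function names are ours, not a P-tree API:

```python
def bit_slices(values, m):
    """Vertically decompose m-bit integers: P[i][j] is bit i of value j.
    Each list plays the role of one (uncompressed) basic P-tree."""
    return [[(v >> i) & 1 for v in values] for i in range(m)]

def p_and(a, b):
    return [x & y for x, y in zip(a, b)]

def p_or(a, b):
    return [x | y for x, y in zip(a, b)]

def px_ge(P, v):
    """Inequality P-tree for x >= v: right-binding combination,
    AND where the bit of v is 1, OR where it is 0; bits below the
    rightmost 1 of v do not constrain the comparison."""
    if v == 0:
        return [1] * len(P[0])          # every x satisfies x >= 0
    k = (v & -v).bit_length() - 1       # rightmost 1-bit position of v
    result = P[k]
    for i in range(k + 1, len(P)):
        result = p_and(P[i], result) if (v >> i) & 1 else p_or(P[i], result)
    return result

values = [7, 2, 5, 1, 6]                # toy one-attribute dataset
P = bit_slices(values, 3)
mask = px_ge(P, 5)                      # which points satisfy x >= 5
print(mask, "root count =", sum(mask))  # [1, 0, 1, 0, 1] root count = 3
```

A one-dimensional D-neighborhood of a point o could then be obtained by ANDing the masks for x ≥ o-D and x ≤ o+D, with the neighbor count read from the root count, which is the pattern the vertical algorithm exploits.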
Secondly, one point O in the dataset is selected arbitrarily; the D-neighbors are then found using the fast computation of an inequality P-tree and are represented with an inequality P-tree called the neighborhood P-tree. In the neighborhood P-tree, a '1' means the corresponding point is a neighbor of O, while a '0' means it is not. Thirdly, the number of D-neighbors is calculated efficiently by extracting the value of the root node of the neighborhood P-tree [15]. Finally, the pruning or "by-neighbor" outlier determination is performed. The pruning can be executed efficiently by ANDing the unprocessed P-tree, which represents the unprocessed dataset, with the complement of the D/2-neighborhood P-tree; the pruned points are thereby masked out of the result P-tree. The "by-neighbor" outlier labeling can be executed by ORing the D-neighborhood P-tree with the outlier P-tree, a P-tree representing the outlier set. By searching for neighbors using inequality P-trees and pruning using the P-tree AND operation, the vertical approach improves the outlier detection process dramatically in terms of speed. The formal algorithm is shown in Figure 7.

Algorithm: Vertical Outlier Detection with Local Pruning
Input: D: distance threshold, f: outlier fraction (the p of DB(p, D)), T: dataset
Output: Ols: outlier set
// N: total number of points in dataset T
// PT: P-tree represented dataset
// PU: P-tree represented unprocessed dataset
// PN: P-tree represented neighborhood
// PO: P-tree represented outlier set
// PNO: P-tree represented non-candidate (pruned) set

// Build up the P-tree set for T
PT ← T;  PU ← PT;
WHILE (PU.size() > 0) {
    P ← an arbitrary point in PU;
    PN ← findNeighborhood(P, D);
    m ← PN.rootCount();              // retrieve the value of the root node
    IF m <= (1-f)*N                  // P is an outlier
        PO ← PO ∪ P;
        IF findNeighborhood(P, 2D).rootCount() < (1-f)*N
            PO ← PO ∪ PN;            // label all D-neighbors as outliers (∪ is P-tree OR)
            PU ← PU ∩ PN';           // remove them from the unprocessed set
        ENDIF
    ELSE                             // P is not an outlier
        PN ← findNeighborhood(P, D/2);
        m ← PN.rootCount();
        IF m > (1-f)*N
            PNO ← PNO ∪ PN;          // record the non-outliers (∪ is P-tree OR)
            PU ← PU ∩ PN';           // pruning (∩ is P-tree AND, ' is complement)
        ENDIF
    ENDIF
    remove P from PU;
}
ENDWHILE

Figure 7. Pseudo-code of the algorithm

5. EXPERIMENTAL DEMONSTRATION
In this section, we design an experiment to compare our method with Knorr and Ng's nested-loop approach (denoted NL in Figures 8 and 9) [2]. We implemented a P-tree-based outlier detection method without pruning, denoted PODM, and a P-tree-based outlier detection method with pruning, denoted PODMP, as discussed in detail in section 4. Our purpose is to show our approach's efficiency and scalability. We ran the experiments on a 1400-MHz AMD machine with 1 GB of main memory, running Debian Linux 4.0. We tested the algorithms on the National Hockey League (NHL) 1996 dataset. To show how our algorithm scales as data size increases, the dataset was divided into five groups of increasing size (256, 1024, 4096, 16384 and 65536 points), where the larger groups contain repeated records.

Figure 8 demonstrates that the two P-tree-based methods achieve a significant improvement in speed over Knorr and Ng's nested-loop method. It also shows that, by using pruning and "by-neighbor" labeling, PODMP benefits from additional speed improvements: it has an order of magnitude of speed improvement over the nested-loop approach. As for scalability, PODMP is the best among the three; it scales well with increasing dataset sizes, as Figure 9 shows pictorially.

Figure 8. Run-time comparison of NL, PODM, and PODMP

Figure 9. Scalability comparison of NL, PODM, and PODMP

6. COMPUTATIONAL COMPLEXITY ANALYSIS
In this section, we analyze and compare our vertical outlier detection method with current methods with respect to computational complexity.
For our method, the worst case is O(k*N), where k is the complexity of the P-tree operations (which varies with the number of P-trees) and N is the total number of points in the dataset. The best case is O(k), which occurs when all the points in the dataset are pruned after processing the first point. With pruning, the complexity can be described as O(k*M), where M ≤ N; in the worst case, M = N. Table 1 shows the comparison.

Table 1. Complexity comparison
  Algorithm     | Complexity
  Nested-loop   | O(N*N)
  Tree indexed  | O(N*logN)
  Cell-based    | linear in N, exponential in d (the dimension)
  Our approach  | O(k*N), where k is much smaller than logN

7. CONCLUSION AND FUTURE WORK
Outlier detection is becoming critically important in many areas, such as the monitoring of criminal activities in electronic commerce and credit card fraud. In this paper, we propose a neighborhood-based outlier definition, a pruning rule, and a "by-neighbor" labeling technique, on which a novel outlier detection algorithm is based. Pruning and "by-neighbor" labeling speed up the outlier detection process significantly. Our method is implemented using a vertical data representation, the P-tree, instead of traditional horizontal structures, which speeds up our approach further. Experiments show that our method has an order of magnitude of speed improvement over current distance-based outlier detection approaches, and that the scalability of our algorithm outperforms state-of-the-art approaches as well. An order-of-complexity analysis clearly shows the reason for these observed performance improvements.

In future work, we will explore the use of P-trees in density-based outlier detection. The encouraging results shown in this paper demonstrate great potential for improving the efficiency of density-based outlier detection as well.

8. REFERENCES
[1] Knorr, Edwin M. and Raymond T. Ng. A Unified Notion of Outliers: Properties and Computation.
3rd International Conference on Knowledge Discovery and Data Mining Proceedings, 1997, pp. 219-222.
[2] Knorr, Edwin M. and Raymond T. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. Very Large Data Bases Conference Proceedings, 1998, pp. 24-27.
[3] Knorr, Edwin M. and Raymond T. Ng. Finding Intensional Knowledge of Distance-Based Outliers. Very Large Data Bases Conference Proceedings, 1999, pp. 211-222.
[4] Ramaswamy, Sridhar, Rajeev Rastogi, and Kyuseok Shim. Efficient Algorithms for Mining Outliers from Large Data Sets. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000. ISSN 0163-5808.
[5] Breunig, Markus M., Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. OPTICS-OF: Identifying Local Outliers. Proceedings of PKDD '99, Prague, Czech Republic, Lecture Notes in Computer Science (LNAI 1704), Springer-Verlag, 1999, pp. 262-270.
[6] Arning, Andreas, Rakesh Agrawal, and Prabhakar Raghavan. A Linear Method for Deviation Detection in Large Databases. 2nd International Conference on Knowledge Discovery and Data Mining Proceedings, 1996, pp. 164-169.
[7] Bowman, B., Debray, S. K., and Peterson, L. L. Reasoning about Naming Systems. ACM Trans. Program. Lang. Syst., 15, 5 (Nov. 1993), 795-825.
[8] Ding, W., and Marchionini, G. A Study on Video Browsing Strategies. Technical Report UMIACS-TR-97-40, University of Maryland, College Park, MD, 1997.
[9] Fröhlich, B. and Plate, J. The Cubic Mouse: A New Device for Three-Dimensional Input. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '00), The Hague, The Netherlands, April 1-6, 2000. ACM Press, New York, NY, 2000, 526-531.
[10] Lamport, L. LaTeX User's Guide and Document Reference Manual. Addison-Wesley, Reading, MA, 1986.
[11] Sannella, M. J. Constraint Satisfaction and Debugging for Interactive User Interfaces. Ph.D. Thesis, University of Washington, Seattle, WA, 1994.
[12] Sarawagi, S., Agrawal, R., and Megiddo, N. Discovery-Driven Exploration of OLAP Data Cubes. EDBT '98.
[13] Han, Jiawei and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
[14] Shekhar, Shashi, Chang-Tien Lu, and Pusheng Zhang. Detecting Graph-Based Spatial Outliers. Intelligent Data Analysis 6 (2002), 451-468, IOS Press.
[15] Ding, Q., Khan, M., Roy, A., and Perrizo, W. The P-tree Algebra. Proceedings of the ACM SAC, Symposium on Applied Computing, 2002.
[16] Khan, M., Ding, Q., and Perrizo, W. K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees. Proceedings of PAKDD 2002, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, May 2002, pp. 517-528.
[17] Pan, F., Wang, B., Ren, D., Hu, X., and Perrizo, W. Proximal Support Vector Machine for Spatial Data Using Peano Trees. CAINE 2003.
[18] Rahal, Imad and William Perrizo. An Optimized Approach for KNN Text Categorization Using P-trees. Proceedings of the ACM SAC, Symposium on Applied Computing (Nicosia, Cyprus), March 2004.
[19] Pan, Fei, Imad Rahal, Yue Cui, and William Perrizo. Efficient Ranking of Keyword Queries Using P-trees. 19th International Conference on Computers and Their Applications, pp. 278-281, 2004.
[20] Serazi, Maum, Amal Perera, Qiang Ding, Vasiliy Malakhov, Imad Rahal, Fei Pan, Dongmei Ren, Weihua Wu, and William Perrizo. "DataMIME™". ACM SIGMOD, Paris, France, June 2004.
[21] Barnett, V. and T. Lewis. Outliers in Statistical Data. John Wiley & Sons.