Name: Jisu Oh, Shan Huang
Date: April 12, 2004
Course: Csci 8715
Professor: Shashi Shekhar

Project Report (draft version)
“Spatial Outlier Detection”

Shan Huang, Jisu Oh
Computer Science Department, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, U.S.A.
E-mail: [email protected], [email protected]
http://www-users.cs.umn.edu/~joh/csci8715/HW-list.htm

1. Introduction

A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from the values in its neighborhood. Identifying spatial outliers can lead to the discovery of unexpected, interesting, and useful spatial patterns for further analysis. WEKA is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. WEKA implements basic data mining functions as well as regression, association rule, and clustering algorithms, but these algorithms can only operate on traditional non-spatial databases. The purpose of this project is to build a new class that can detect spatial outliers in a spatial data set.

2. Motivation

Machine learning and data mining discover patterns or structure previously unknown to humans. They enable a computer program to automatically analyze large-scale data and decide what information is most important. We can then use this information to make predictions or to make decisions faster and more accurately.

Many organizations rely on spatial analysis to make business and agency decisions and to conduct research. The main difference between data mining in relational database systems and in spatial database systems is that the attributes of neighboring objects may influence the current object, so neighboring objects have to be considered as well. The explicit location and extension of spatial objects define implicit relations of spatial neighborhood, which are used by spatial data mining algorithms.
Therefore, new techniques are required for effective and efficient spatial data mining. WEKA is a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. WEKA implements basic data mining functions as well as regression, association rule, and clustering algorithms, but these algorithms can only operate on traditional non-spatial databases. The aim of this project is to build new classes and algorithms that can handle spatial data, such as spatial regression, spatial association rules (co-location), and spatial outlier detection.

2. Related works

Detecting spatial outliers is useful in many applications of geographic information systems, including transportation, ecology, public safety, public health, climatology, and location-based services [2]. Shekhar et al. introduced a method for detecting spatial outliers in graph data sets based on the distribution property of the difference between an attribute value and the average attribute value of its neighbors [3]. Shekhar et al. also proposed an algorithm to find all outliers in a data set, which replaces many statistical discordance tests, regardless of any knowledge about the underlying distribution of the attributes [7]. Stephen D. Bay et al. introduced a simple nested-loop algorithm to detect distance-based outliers, which gives near-linear time performance when the data is in random order and a simple pruning rule is used [4]. Existing methods for finding outliers can only deal efficiently with two dimensions/attributes of a data set. A distance-based detection method was introduced by Sridhar Ramaswamy et al., which ranks each point on the basis of its distance to its kth nearest neighbor and declares the top n points in this ranking to be outliers. A highly efficient partition-based algorithm was also introduced in the same paper [6]. Edwin M. Knorr et al.
proposed another distance-based outlier detection method that can be applied efficiently to large data sets and to k-dimensional data sets with large values of k [9].

Spatial outliers are most often represented as point data, but they are also frequently represented as regions, i.e., groups of points. Jiang Zhao et al. proposed a wavelet-analysis-based approach to detect region outliers [5]. Markus M. Breunig et al. took a different approach to detecting outliers: each object is assigned a degree of being an outlier. This degree, called the local outlier factor of the object, depends on how isolated the object is with respect to its surrounding neighborhood [10].

Many spatial statistics software packages are currently available. S-PLUS spatial statistics is the first comprehensive, object-oriented software package for the analysis of spatial data. It includes a fairly wide range of techniques for spatial data analysis. R is a language similar to S for statistical data analysis, based on modern programming concepts and released under the GNU General Public License. It follows a broad outline of existing collections of functions for spatial statistics written for S. Functions for three types of spatial statistics are covered: spatially continuous data, point pattern data, and area data. SAS is another powerful analytical and reporting system. The SAS Bridge for ESRI provides a way to exchange spatial attribute data between SAS and ArcGIS, the market-leading geographic information system (GIS) software from ESRI. This product links spatial, numeric, and textual data through a single interface to improve efficiency, produce more intelligent results, and communicate those results more effectively.

3. Problem Statement

The input data sets used in this project were collected from sensor stations embedded in the Interstate highways surrounding the Twin Cities area in Minnesota, US.
Each station measures the traffic volume and occupancy on a particular stretch of the highway at 5-minute intervals. Each data set consists of 288 rows of 5-minute detector records (one full day), starting from midnight (0:00); each row contains 300 tuples of (volume, occupancy) for 150 stations; each tuple in the row represents the traffic volume and occupancy of a detector within the 5-minute period. Neighbors are defined in terms of topological rather than Euclidean distance.

Our objective is to determine which stations are “outliers” based on the traffic volumes measured at each station. A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from those of other spatially referenced objects in its spatial neighborhood. In this application, an outlier is a station that detects a volume very different from that of its neighboring stations. For instance, if at 1:00 AM station A detects a volume of 250 while its two neighbor stations B and C record only single-digit volumes, then station A is considered a local outlier.

The algorithm used in this project was proposed in the paper “A Unified Approach to Detecting Spatial Outliers” [7]. A location is compared to its neighborhood using the function

$$S(x) = f(x) - E_{y \in N(x)}(f(y)),$$

where
f(x): attribute value for a location x
N(x): set of neighbors of x
$E_{y \in N(x)}(f(y))$: average attribute value for the neighbors of x
S(x): difference between the attribute value of a sensor located at x and the average attribute value of x's neighbors.

The following spatial statistic is used for detecting spatial outliers for normally distributed f(x):

$$Z_{s(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta,$$

where
$\mu_s$: mean value of S(x)
$\sigma_s$: standard deviation of S(x)
$\theta$: threshold for the specified confidence level.

4. Implementation

4.1 Algorithm

The algorithm is divided into two subparts: (1) model construction and (2) outlier detection.
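Before the two parts are described in detail, the following minimal Python sketch illustrates how the statistic defined above can drive an iterative detection loop. It is not the project's WEKA (Java) class: the function names `neighbor_average`, `z_scores`, and `detect_outliers` are hypothetical, the stations are assumed to lie on a single linear stretch so that station i has neighbors i-1 and i+1, the population standard deviation is assumed, and stopping early when no station exceeds the threshold is an assumption about how the threshold and the requested number of outliers combine.

```python
from statistics import mean, pstdev


def neighbor_average(volumes, i):
    """E(x): average volume of the stations adjacent to station i.

    Assumption: stations form one linear chain, so the neighbors of
    station i are i-1 and i+1 (end stations have a single neighbor).
    Requires at least two stations.
    """
    neighbors = [volumes[j] for j in (i - 1, i + 1) if 0 <= j < len(volumes)]
    return mean(neighbors)


def z_scores(volumes):
    """Z_s(x) = |(S(x) - mu_s) / sigma_s| for every station."""
    # S(x) = f(x) - E(x): difference from the neighborhood average.
    s = [v - neighbor_average(volumes, i) for i, v in enumerate(volumes)]
    mu, sigma = mean(s), pstdev(s)
    if sigma == 0:
        # All differences are identical, so no station stands out.
        return [0.0] * len(volumes)
    return [abs((value - mu) / sigma) for value in s]


def detect_outliers(volumes, theta=2.0, max_outliers=10):
    """Return station indices flagged as outliers, one per iteration."""
    volumes = list(volumes)  # work on a copy; flagged values get replaced
    found = []
    for _ in range(max_outliers):
        z = z_scores(volumes)
        worst = max(range(len(z)), key=z.__getitem__)
        if z[worst] <= theta:
            break  # assumption: stop once nothing exceeds the threshold
        found.append(worst)
        # Replace the outlier with its neighborhood average before rerunning.
        volumes[worst] = neighbor_average(volumes, worst)
    return found


# One time slot: station 2 reports 250 while its neighbors report
# single-digit volumes, as in the station A example above.
volumes = [5, 6, 250, 7, 5, 6, 8, 7, 6, 5]
print(detect_outliers(volumes, theta=2.0, max_outliers=10))  # [2]
```

Replacing a flagged value with its neighborhood average keeps one extreme reading from distorting the mean and standard deviation used in later iterations.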
The first part of the algorithm computes $E_{y \in N(x)}(f(y))$, written E(x) below, the average attribute value of the neighbors of x. For each station, its two neighbor stations are retrieved and the average of their volumes is computed.

The second part of the algorithm detects one outlier per iteration. First, S(x) = f(x) - E(x), the function that compares a station with its neighborhood, is computed for each station, together with the mean $\mu_s$ and standard deviation $\sigma_s$ of S(x) over all stations. Then the spatial statistic $Z_{s(x)} = |(S(x) - \mu_s)/\sigma_s|$ is computed and compared to $\theta$, a user-specified value. In the outlier detection program, $\theta$ corresponds to a 68%, 95%, or 99% confidence interval. Once an outlier is identified, its original value is replaced with the average value of its neighborhood, and the algorithm starts over to find the second outlier, and so on. The number of outliers detected depends on the user's specification: for instance, if the user needs to find 10 outliers in a given data set, the algorithm runs for 10 iterations.

4.2 User Interface

The user interface of our application is based on WEKA; in other words, it runs within the WEKA environment. Its interface therefore looks like WEKA's, but adds features for working with spatial outliers. To find outliers, the user specifies three settings: the data file, the confidence interval, and the number of outliers. These settings let users see how the detected outlier set changes with their choices. Users can also run detection repeatedly, finding different outlier sets on the same data set. The system presents detected outliers in three ways: plain text, overall traffic volume for one day, and the neighbor relationship between stations. The ‘Outlier result’ panel displays plain text consisting of detailed information about the time slots of one day, the measured time, the stations, and their volumes.
Users can also see an overview of this information in a single image with two graphs: one shows the average traffic volume at each time and station, and the other shows the detected outliers by time slot and station. Different colors in the graphs indicate different volumes. This helps users get the big picture of the outliers. The last visual result is an image showing the volume of a user-specified station and its neighborhood. With this image, users see three traffic volume graphs and can compare them with each other. This enables users to analyze the relationship between the user-specified station and its neighborhood. For example, suppose we want to see the traffic volume of station 24. The system displays the traffic volumes of stations 23, 24, and 25. From this, users can see that the traffic volume patterns of stations 23 and 25 are very similar but that of station 24 is not, so station 24 is likely to be an outlier.

As described above, the interface of our system consists of several visual components that are easier to use than a command line. This user-centered interface is a major difference from existing systems.

5. Methodology

We will construct several experiments to test how accurately outliers are found using different spatial data.

1) Case study

We will find a set of outliers using different data sets and then analyze how accurately they are found.

6. Contributions

The major contribution of this project is the development of an application for finding spatial outliers within the WEKA system. WEKA provides basic data mining functions, but these work only on non-spatial databases. Building a new class that can detect sets of spatial outliers in a given spatial data set, and incorporating the class into the existing WEKA, will enable the discovery of unexpected, interesting, and useful spatial patterns for further analysis.

7. Conclusion

Still in progress.

8. Future work

- upgrade to allow various file formats and data types
- provide written analysis of outlier information
- experiments to find a more efficient algorithm using different outlier detection algorithms
- a tool to compare and contrast the results obtained with different outlier detection options

References

[1] EXPLORATORY ANALYSIS OF SPATIAL DATA
[2] Chang-Tien Lu, Dechang Chen, Yufeng Kou, “Algorithms for Spatial Outlier Detection”, 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03), November 03 - 05, 2003
[3] Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, “Detecting graph-based spatial outliers: algorithms and applications (a summary of results)”, Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, San Francisco, CA, USA, ACM, 2001
[4] Stephen D. Bay, Mark Schwabacher, “Mining distance-based outliers in near linear time with randomization and a simple pruning rule”, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 29-38, Washington, D.C., ACM, 2003
[5] Jiang Zhao, Chang-Tien Lu, Yufeng Kou, “Detecting region outliers in meteorological data”, Proceedings of the eleventh ACM international symposium on Advances in geographic information systems, pp. 49-55, New Orleans, Louisiana, USA, 2003
[6] Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim, “Efficient algorithms for mining outliers from large data sets”, 2000 ACM SIGMOD international conference on Management of data, pp. 427-438, Dallas, Texas, USA, ACM, 2000
[7] S. Shekhar, C. T. Lu, and P. Zhang, “A Unified Approach to Detecting Spatial Outliers”, GeoInformatica, pp. 139-166, 2003
[8] Edwin M. Knorr, Raymond T. Ng, “A unified approach for mining outliers”, Proceedings of the 1997 conference of the Centre for Advanced Studies on Collaborative research, pp. 11, Toronto, Ontario, Canada, 1997
[9] Edwin M. Knorr, Raymond T. Ng, Vladimir Tucakov, “Distance-based outliers: algorithms and applications”, The VLDB Journal - The International Journal on Very Large Data Bases, Volume 8, Issue 3-4, pp. 237-253, 2000
[10] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander, “LOF: identifying density-based local outliers”, 2000 ACM SIGMOD international conference on Management of data, pp. 93-104, ACM, New York, NY, USA, 2000
[11] Ian H. Witten and Eibe Frank, “Data Mining: Practical machine learning tools with Java implementations”, Morgan Kaufmann, San Fran