Name: Jisu Oh, Shan Huang
Date : April 12, 2004
Course : Csci 8715
Professor : Shashi Shekhar
Project Report (draft version)
“Spatial Outlier Detection”
Shan Huang, Jisu Oh
Computer Science Department, University of Minnesota, 200 Union Street SE,
Minneapolis, MN 55455, U.S.A
E-mail: [email protected], [email protected]
http://www-users.cs.umn.edu/~joh/csci8715/HW-list.htm
1. Introduction
A spatial outlier is a spatially referenced object whose non-spatial attribute values are
significantly different from the values of its neighborhood. Identification of spatial
outliers can lead to the discovery of unexpected, interesting, and useful spatial
patterns for further analysis. WEKA is a collection of machine learning algorithms
for solving real-world data mining problems. It is written in Java and runs on almost
any platform. WEKA implements basic data mining functions as well as regression,
association rule, and clustering algorithms, but these algorithms operate only on
traditional non-spatial databases. The purpose of this project is to build a new
class that detects spatial outliers in a spatial data set.
2. Motivation
Machine learning and data mining discover patterns or structure previously unknown to
humans. They enable a computer program to analyze large-scale data automatically and
decide which information is most important. We can then use this information to make
predictions or decisions faster and more accurately.
Many organizations rely on spatial analysis to make business and agency decisions
and to conduct research. The main difference between data mining in relational
database systems and in spatial database systems is that the attributes of neighboring
objects may influence the current object, so the neighboring objects have to be
considered as well. The explicit location and extent of spatial objects define implicit
relations of spatial neighborhood, which are used by spatial data mining algorithms.
Therefore, new techniques are required for effective and efficient spatial data mining.
As noted in the introduction, WEKA implements basic data mining functions as well as
regression, association rule, and clustering algorithms, but these algorithms operate
only on traditional non-spatial databases.
The aim of this project is to build new classes and algorithms that can handle spatial
data, for tasks such as spatial regression, spatial association rules (co-location), and
spatial outlier detection.
2. Related works
Detecting spatial outliers is useful in many applications of geographic information
systems, including transportation, ecology, public safety, public health, climatology,
and location-based services [2].
Shekhar et al. introduced a method for detecting spatial outliers in graph data sets
based on the distribution property of the difference between an attribute value and the
average attribute value of its neighbors [3]. Shekhar et al. also proposed an algorithm to
find all outliers in a data set, which replaces many statistical discordance tests and
requires no knowledge of the underlying distribution of the attributes [7].
Stephen D. Bay et al. introduced a simple nested loop algorithm to detect distance-based
outliers, which gives near-linear time performance when the data is in random order and a
simple pruning rule is used [4]. Existing methods for finding outliers can deal
efficiently with only two dimensions (attributes) of a data set.
A distance-based detection method was introduced by Sridhar Ramaswamy et al.,
which ranks each point by its distance to its kth nearest neighbor and declares the
top n points in this ranking to be outliers. A highly efficient partition-based
algorithm was also introduced in the same paper [6]. Edwin M. Knorr et al. proposed
another distance-based outlier detection method that works efficiently for large
data sets and for k-dimensional data sets with large values of k [9]. Spatial outliers are
most often represented as point data, but they are also frequently represented as
regions, i.e., groups of points. Jiang Zhao et al. proposed a wavelet-analysis-based
approach to detect region outliers [5].
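The kth-nearest-neighbor ranking described above can be illustrated with a short
sketch. This is a naive quadratic-time version written for clarity, not the
partition-based algorithm of [6]; the function names `kth_nn_distance` and
`top_n_outliers` and the sample points are assumptions made for this illustration.

```python
import math

def kth_nn_distance(points, i, k):
    """Distance from points[i] to its kth nearest neighbor (Euclidean)."""
    if not 1 <= k <= len(points) - 1:
        raise ValueError("k must be between 1 and len(points) - 1")
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def top_n_outliers(points, k, n):
    """Rank points by kth-nearest-neighbor distance; return indices of the top n."""
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    scores.sort(reverse=True)  # largest kth-NN distance first
    return [i for _, i in scores[:n]]

# Illustrative data: four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(points, k=2, n=1))
```

With these sample points, the far-away point at index 4 has by far the largest distance
to its second nearest neighbor, so it is ranked first.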
Markus M. Breunig et al. took a different approach to detecting outliers: each object
is assigned a degree of being an outlier. This degree, called the local outlier factor
(LOF) of an object, depends on how isolated the object is with respect to its
surrounding neighborhood [10].
Many spatial statistics software packages are currently available. S-PLUS spatial
statistics is the first comprehensive, object-oriented software package for the
analysis of spatial data. It includes a fairly wide range of techniques for spatial data
analysis.
R is a language similar to S for statistical data analysis, based on modern
programming concepts and released under the GNU General Public License. It
follows a broad outline of existing collections of functions for spatial statistics written
for S. Functions for three types of spatial statistics are covered: spatially continuous
data, point pattern data, and area data.
SAS is another powerful analytical and reporting system. The SAS Bridge for ESRI
provides a new way to exchange spatial attribute data between SAS and ArcGIS, the
market-leading geographic information system (GIS) software from ESRI. This
product links spatial, numeric, and textual data through a single interface to
improve efficiency, produce more intelligent results, and communicate those results
more effectively.
3. Problem Statement
The input data sets used in this project were collected from sensor stations
embedded in Interstate highways surrounding the Twin Cities area in Minnesota, US.
Each station measures the traffic volume and occupancy on a particular stretch of
highway at 5-minute intervals. Each data set consists of 288 rows of 5-minute detector
records, starting from 0:00 (midnight); each row contains 300 tuples of (volume,
occupancy) for 150 stations; each tuple in the row represents the traffic volume and
occupancy of a detector within the 5-minute period. Neighbors are defined in terms of
topological rather than Euclidean distance. Our objective is to determine the stations
that are “outliers” based on the traffic volume measured at each station.
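As a small illustration of the data layout, the sketch below maps a row index to the
clock time at which its 5-minute interval starts, assuming row 0 corresponds to
midnight as stated above. The helper names `slot_to_time` and `time_to_slot` are
hypothetical and are not part of the project code.

```python
from datetime import time

SLOT_MINUTES = 5
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 288 rows per data set

def slot_to_time(slot):
    """Clock time at which the given 5-minute slot begins (slot 0 = midnight)."""
    if not 0 <= slot < SLOTS_PER_DAY:
        raise ValueError("slot must be in the range 0..287")
    minutes = slot * SLOT_MINUTES
    return time(hour=minutes // 60, minute=minutes % 60)

def time_to_slot(hour, minute):
    """Index of the 5-minute slot that contains the given clock time."""
    return (hour * 60 + minute) // SLOT_MINUTES

print(slot_to_time(12))    # row 12 starts at 01:00
print(time_to_slot(1, 0))  # 1:00 AM falls in row 12
```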
A spatial outlier is a spatially referenced object whose non-spatial attribute values are
significantly different from those of other spatially referenced objects in its spatial
neighborhood. In this application, an outlier is a station that detects a very high
volume compared to its neighboring stations. For instance, if at 1:00 AM station A
detects a volume of 250 while its two neighbor stations B and C record only
single-digit volumes, then station A is considered a local outlier.
The algorithm used in this project was proposed in the paper “A Unified Approach to
Detecting Spatial Outliers” [7]. The location is compared to its neighborhood using
the function:
$S(x) = f(x) - E_{y \in N(x)}(f(y))$, where
f(x) - attribute value for a location x
N(x) - set of neighbors of x
$E_{y \in N(x)}(f(y))$ - average attribute value for the neighbors of x
S(x) - difference between the attribute value of a sensor located at x and the average
attribute value of x’s neighbors.
A spatial statistic is used for detecting spatial outliers for normally distributed f(x):
$Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$
$\mu_s$ - Mean value of S(x)
$\sigma_s$ - Standard deviation of S(x)
$\theta$ - Specified confidence level
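The neighborhood difference S(x) and the test statistic can be sketched in a few
lines. This is a minimal illustration under stated assumptions, not the project's
WEKA class: the stations are assumed to lie on a single road so that each station's
neighbors are the adjacent stations, the population standard deviation is used, and
the names `line_neighbors`, `neighborhood_difference`, and `spatial_z` as well as the
sample volumes are made up for this example.

```python
from statistics import mean, pstdev

def line_neighbors(n):
    """Assumed topology: station i is adjacent to stations i-1 and i+1 on one road."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def neighborhood_difference(values, neighbors):
    """S(x) = f(x) minus the average of f(y) over the neighbors y of x."""
    return {x: values[x] - mean(values[y] for y in neighbors[x]) for x in values}

def spatial_z(values, neighbors):
    """Z_S(x) = |(S(x) - mu_s) / sigma_s| for every location x."""
    s = neighborhood_difference(values, neighbors)
    diffs = list(s.values())
    mu, sigma = mean(diffs), pstdev(diffs)  # population std. dev. (an assumption)
    if sigma == 0:
        return {x: 0.0 for x in s}  # all differences equal: nothing stands out
    return {x: abs((s[x] - mu) / sigma) for x in s}

# Illustrative volumes for one time slot: station 3 reports 250, the rest single digits.
volumes = dict(enumerate([5, 7, 6, 250, 8, 4, 6]))
z = spatial_z(volumes, line_neighbors(len(volumes)))
theta = 2.0  # illustrative threshold
flagged = [x for x in z if z[x] > theta]
print(flagged)
```

In this example only station 3 exceeds the threshold. Its two neighbors also receive
elevated scores, because the value 250 inflates their neighborhood averages, but they
stay below 2.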
4. Implementation
4.1 Algorithm
The algorithm is divided into two parts: (1) model construction and (2) outlier detection.
The first part computes E(x) = $E_{y \in N(x)}(f(y))$, the average attribute value
for the neighbors of x. For each station, its two neighbor stations are retrieved, and the
average of the neighbor stations' volumes is computed. The second part detects one outlier
per iteration. First, S(x) = f(x) - E(x), the function that compares a station with its
neighborhood, is computed for each station, together with the mean and the standard
deviation of S(x) over all stations. Then the spatial statistic
$Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right|$ is computed and compared to
$\theta$, a user-specified value. In the outlier detection program, $\theta$ corresponds
to a 68%, 95%, or 99% confidence interval. Once an outlier is identified, its original
value is replaced with the average value of its neighborhood, and the algorithm starts
over to find the second outlier, and so on. The number of outliers detected depends on the
user's specification; for instance, if the user needs to find 10 outliers in a given data
set, the algorithm runs for 10 iterations.
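The iterative procedure can be sketched as follows. This is a simplified
illustration, not the project's WEKA implementation. It assumes that stations lie on
a single road with adjacent stations as neighbors, that the confidence level is
converted to a two-sided standard-normal threshold, and that the loop stops early
when no remaining station exceeds the threshold. The names `theta_for_confidence`,
`detect_outliers`, and `line_neighbors` are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def line_neighbors(n):
    """Assumed topology: station i is adjacent to stations i-1 and i+1 on one road."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def theta_for_confidence(confidence):
    """Two-sided standard-normal threshold, e.g. 0.95 gives roughly 1.96."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

def detect_outliers(values, neighbors, confidence=0.95, max_outliers=10):
    """Detect at most one outlier per iteration, replacing it by its neighbor average."""
    values = dict(values)  # work on a copy so the caller's data is untouched
    theta = theta_for_confidence(confidence)
    found = []
    for _ in range(max_outliers):
        # Model construction: E(x) and S(x) = f(x) - E(x) for every station.
        avg = {x: mean(values[y] for y in neighbors[x]) for x in values}
        s = {x: values[x] - avg[x] for x in values}
        diffs = list(s.values())
        mu, sigma = mean(diffs), pstdev(diffs)
        if sigma == 0:
            break
        # Outlier detection: the station with the largest statistic is the candidate.
        z = {x: abs((s[x] - mu) / sigma) for x in s if x not in found}
        if not z:
            break
        worst = max(z, key=z.get)
        if z[worst] <= theta:
            break  # early stop (an assumption of this sketch)
        found.append(worst)
        values[worst] = avg[worst]  # replace the outlier before the next iteration
    return found

volumes = dict(enumerate([5, 7, 6, 250, 8, 4, 6]))  # illustrative volumes
neighbors = line_neighbors(len(volumes))
print(detect_outliers(volumes, neighbors, confidence=0.95))
```

Replacing a detected value with its neighborhood average before the next iteration
keeps one extreme station from inflating the averages of its neighbors, which could
otherwise cause normal stations next to it to be flagged.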
4.2 User Interface
The user interface of our application is based on WEKA; in other words, it runs within the
WEKA environment. Its interface therefore looks like WEKA's, but it is tailored to working
with spatial outliers. To find outliers, the user specifies three settings: the data file,
the confidence interval, and the number of outliers. These settings let users explore how
the detected outlier set changes with their choices. Users can also run the detection
repeatedly and obtain different outlier sets on the same data set.
Our system presents the detected outliers in three ways: plain text, overall traffic
volume for one day, and the neighbor relationship between stations. The ‘Outlier result’
panel displays plain text with detailed information: the time slot within the day, the
measured time, the station, and its volume. Users can also see an overview of this
information in one image with two graphs: one shows the average traffic volume at each
time and each station, and the other shows the detected outliers by time slot and station.
Different colors in the graphs indicate different volumes, which helps users get an
overall picture of the outliers. The last visual result is an image showing the volume of
a user-specified station and its neighborhood. In this image, users see three traffic
volume graphs and can compare them with each other, which lets them analyze the
relationship between the user-specified station and its neighborhood. For example, suppose
we want to see the traffic volume of station 24. The system displays the traffic volume of
stations 23, 24, and 25. From this display, users can see that the traffic volume patterns
of stations 23 and 25 are very similar but that of station 24 is not, so station 24 is
likely to be an outlier. In summary, the interface of our system consists of several
visual components that make it easier to use than a command line. This user-centered
interface is a major difference from existing systems.
5. Methodology
We will construct several experiments, using different spatial data sets, to test how
accurately the system finds outliers.
1) Case study
We will find a set of outliers in each of several data sets and then analyze how
accurately they were found.
6. Contributions
The major contribution of this project is the development of an application that finds
spatial outliers using the WEKA system. WEKA provides basic data mining functions, but
these work only on non-spatial databases. Building a new class that detects sets of spatial
outliers in a given spatial data set, and incorporating the class into the existing WEKA
system, will enable the discovery of unexpected, interesting, and useful spatial patterns
for further analysis.
7. Conclusion
(Still in progress.)
8. Future work
- Upgrade the system to support various file formats and data types
- Provide written analysis of the outlier information
- Run experiments with different outlier detection algorithms to find a more efficient
algorithm
- Provide a tool to compare and contrast the results obtained with different outlier
detection options
References
[1]
EXPLORATORY ANALYSIS OF SPATIAL DATA
[2]
Chang-Tien Lu, Dechang Chen, Yufeng Kou, “Algorithms for Spatial Outlier
Detection”, 15th IEEE International Conference on Tools with Artificial
Intelligence (ICTAI'03) November 03 - 05, 2003
[3]
Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang, “Detecting graph-based spatial
outliers: algorithms and applications (a summary of results)”, Proceedings of the
seventh ACM SIGKDD international conference on Knowledge discovery and
data mining, San Francisco, CA, USA. ACM, 2001
[4]
Stephen D. Bay, Mark Schwabacher, “Mining distance-based outliers in near linear
time with randomization and a simple pruning rule”, Proceedings of the ninth ACM
SIGKDD international conference on Knowledge discovery and data mining,
pp. 29-38, Washington, D.C. ACM, 2003
[5]
Jiang Zhao, Chang-Tien Lu, Yufeng Kou, “Detecting region outliers in
meteorological data”, Proceedings of the eleventh ACM international symposium
on Advances in geographic information systems, pp. 49-55, New Orleans,
Louisiana, USA, 2003
[6]
Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim, “Efficient algorithms for
mining outliers from large data sets”, 2000 ACM SIGMOD international
conference on Management of data, pp. 427-438, Dallas, Texas, USA. ACM, 2000
[7]
S. Shekhar, C. T. Lu, and P. Zhang, “A Unified Approach to Detecting Spatial
Outliers”, GeoInformatica, pp. 139-166, 2003
[8]
Edwin M. Knorr, Raymond T. Ng, “A unified approach for mining outliers”,
Proceedings of the 1997 conference of the Centre for Advanced Studies on
Collaborative research, pp. 11, Toronto, Ontario, Canada, 1997
[9]
Edwin M. Knorr, Raymond T. Ng, Vladimir Tucakov, “Distance-based outliers:
algorithms and applications”, The VLDB Journal - The International Journal on
Very Large Data Bases, Volume 8, Issue 3-4, pp. 237-253, 2000
[10]
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander, “LOF:
identifying density-based local outliers”, 2000 ACM SIGMOD international
conference on Management of data, pp. 93-104, ACM, New York, NY, USA,
2000
[11]
Ian H. Witten and Eibe Frank, “Data Mining: Practical machine learning tools
with Java implementations”, Morgan Kaufmann, San Fran