OUTLAW: An Outlier Detection and Visual Analysis Tool Using
Geo-Spatial Associations
V. Janeja, V. Atluri and N. R. Adam
MSIS Department and CIMIC Rutgers University
{vandana,atluri,[email protected]}
Type: Short paper
Demo as well: No
Contact Author:
Prof. Vijay Atluri, MSIS Department and CIMIC, 180 University Avenue
Rutgers University, Newark, NJ 07102, U.S.A.
Telephone: 973-353-1642, Fax: 973-353-5003
Email: [email protected]
1. Introduction
U.S. Customs deals with a huge number of cargo trucks and shipments crossing borders by air, water and
land. For example, in the year 2000, 489 million people, 127 million passenger vehicles, 11.6 million
maritime containers, 11.5 million trucks, 2.2 million railroad cars, 829,000 planes and 211,000 vessels
passed through U.S. border inspection systems [1]. Moreover, one third of the trucks entering the United
States per year come through just four international bridges between the province of Ontario and the states
of Michigan and New York. Typically, a thorough physical inspection of a loaded 40-foot container or an
18-wheel truck at the border takes five inspectors three hours [1]. While recent events call for heightened security
measures at border control, it is not practical to inspect all the cargo crossing the border. Therefore, it is
essential to minimize the required human effort in the inspection process without compromising the quality
of the inspection.
In this paper, we present a system for Outlier analysis by measuring waywardness, called OUTLAW,
which is still under development. OUTLAW enables outlier detection and visual analysis using data mining
techniques that rely on geo-spatial association. Essentially, it distinguishes abnormal and wayward behavior of
cargo or goods from normal behavior. We consider several practical limitations of the existing systems and
address them in OUTLAW. Wayward behavior may appear to be normal when the relevant data are not correlated.
Moreover, we believe that it is essential to consider spatial relationships such as spatial proximity, spatial
correlation, and geo-spatial associations, as the geo-spatial attributes of the shipments and cargo contribute
significantly to the wayward behavior. We combine disparate data, such as (….) in meaningful ways. Thematic
map coloring and geographic visualization of individual variables can be effective in identifying
correlations between the variables, weak spots, loopholes, wayward routes and vagrants. Based on the
correlation of the data, we generate a predictive model to derive an index for measuring the waywardness.
Most traditional outlier detection techniques rely on distance-based, density-based and deviation-based
outlier measures [6, 7, 8]. Another approach for spatial outliers was proposed in [14]. Many of these
methods are a by-product of clustering methods, which consider data points that lie outside of a cluster as
outliers. However, depending on the domain being studied, part of a cluster, or even a whole cluster of data
points, could itself be an outlier. For example, in a network intrusion detection scenario, a sudden emergence
of a number of users at a point in the network or a web site is considered to be an anomaly, whereas in a
general survey of criminal behavior only a few data points will lie outside of a regular cluster. These two
scenarios contradict each other, which indicates that the problem cannot be resolved by using a simple or
singular technique.
Another disadvantage of traditional outlier detection is that outlierness is treated as a binary property, i.e.,
either a data object is an outlier or it is not. In this case, the set of factors contributing to the outlierness is
not considered. Breunig et al. [5] consider the degree of being an outlier, the local outlier factor (LOF).
However, their approach also relies on density-based and deviation-based outlier detection and thus suffers
from the same drawback of being singular in approach.
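The contrast between a binary outlier test and a graded degree of outlierness can be illustrated with a
small sketch. It is not part of OUTLAW. The binary test follows the usual distance-based definition (an
object is an outlier if at least a fraction p of the other objects lie farther than distance d from it), and the
graded score is simply the distance to the k-th nearest neighbour, used here as a stand-in for a degree
measure. The sample points and the parameter values are illustrative assumptions.

```python
import math

def distances_to_others(points, i):
    """Euclidean distances from points[i] to every other point."""
    return [math.dist(points[i], q) for j, q in enumerate(points) if j != i]

def is_db_outlier(points, i, p, d):
    """Binary test: True if at least a fraction p of the other points lie farther than d."""
    dists = distances_to_others(points, i)
    far = sum(1 for x in dists if x > d)
    return far >= p * len(dists)

def knn_score(points, i, k):
    """Graded alternative: distance to the k-th nearest neighbour (larger = more outlying)."""
    return sorted(distances_to_others(points, i))[k - 1]

# Illustrative data: four points in a tight square and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

flags = [is_db_outlier(points, i, p=0.9, d=3.0) for i in range(len(points))]
scores = [knn_score(points, i, k=2) for i in range(len(points))]

print(flags)   # only the last point is flagged
print(scores)  # the last point also has by far the largest score
```

The binary flags say only that the last point is an outlier, while the scores also say by how much, which is
the kind of information a degree-based measure preserves.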
OUTLAW uses a novel technique for analyzing high-dimensional geospatial data to detect outliers. It
uses correlation of several parameters to determine outliers in a specific domain. More specifically, it
utilizes a multitude of factors to predict outliers using various views of the data set. It uses a new measure
for the waywardness of a data object by analyzing the waywardness in each view and computes a unified
measure of level of anomaly. It employs an N-step mechanism to detect vagrants or outliers in the normal
scheme of events.
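The idea of scoring waywardness in each view and then combining the scores into one measure can be
sketched as follows. The paper does not specify how the unified measure is computed, so the min-max
scaling, the weighted mean, the view names and the sample values below are all assumptions made for
illustration only.

```python
def normalize(scores):
    """Min-max scale one view's raw scores to [0, 1]; a constant view maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def unified_waywardness(view_scores, weights=None):
    """Combine per-view scores into one index per object using a weighted mean."""
    names = list(view_scores)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumption: equal weights by default
    total = sum(weights[name] for name in names)
    scaled = {name: normalize(view_scores[name]) for name in names}
    n = len(scaled[names[0]])
    return [
        sum(weights[name] * scaled[name][i] for name in names) / total
        for i in range(n)
    ]

# Hypothetical views over three shipments (illustrative values, not real data).
views = {
    "route_deviation_km": [5.0, 7.0, 120.0],
    "declared_value_gap": [0.1, 0.9, 0.8],
    "hazmat_proximity": [0.0, 0.0, 1.0],
}

index = unified_waywardness(views)
print(index)  # one waywardness value in [0, 1] per shipment
```

A shipment that looks unremarkable in one view can still receive a high unified index when several other
views agree, which is the motivation for using multiple views rather than a single measure.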
2. Cargo data: Sources and description
A preliminary study of the customs domain [9] revealed the need to draw data from various agencies
(Figure 1) in order to determine some level of disparity. For example, data from the EPA and CDC, which
hold the hazardous materials database and its distribution over the entire country, can be valuable.
Figure 1: Database parameters
Figure 2: Qualifiers for anomaly detection
The qualifiers include different attributes of shipping and importer data, and cargo information that includes
spatial data. We believe that in this specific domain, spatial proximity and spatial correlation play a key
role. During the analysis, it was found that apart from the type of cargo, the route taken by the shipment or
cargo is of great importance in finding an anomaly. For example, there could be a correlation between two
shipments from importers carrying some associated HAZMAT. This lays the foundation for the
architecture of OUTLAW.
4. Architecture
The OUTLAW system architecture is shown in Figure 3. OUTLAW conducts the analysis of cargo routes
by correlating different map layers. These map layers, such as HAZMAT distribution over the country or
the world drug distribution (Figure 4), are geocoded maps from various government sources such as the CDC,
EPA, and Census [10,11,12,13]. The various data dimensions are maintained in the form of multiple
shape files in the geospatial repository.
Figure 3: OUTLAW Architecture
Basic outlier detection is done using spatial proximity, convergence and divergence of routes, among
others. Based on this first iteration, appropriate map layers are identified and activated. The routes are then
visually overlaid on top of the activated layers, which assists a human analyst in making an independent
judgement. In addition, the geospatial association of the data is computed again using the new knowledge
and the activated map layers.
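A spatial proximity check of the kind mentioned above can be sketched with great-circle distances. This is
only an illustration under stated assumptions: the function names, the 25 km radius and the layer site are
hypothetical, the city coordinates are approximate, and OUTLAW itself works on map layers stored as shape
files rather than on bare coordinate lists.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def near_layer(route, layer_sites, radius_km):
    """True if any waypoint of the route lies within radius_km of any site in the layer."""
    return any(haversine_km(w, s) <= radius_km for w in route for s in layer_sites)

# Approximate city coordinates (lat, lon).
MEXICALI = (32.62, -115.45)
SAN_FRANCISCO = (37.77, -122.42)
NEW_YORK = (40.71, -74.01)

# Hypothetical site from an activated map layer, placed near Mexicali for illustration.
layer_sites = [(32.70, -115.50)]

route_a = [MEXICALI, SAN_FRANCISCO]
route_b = [NEW_YORK]

print(near_layer(route_a, layer_sites, radius_km=25.0))  # route passes near the site
print(near_layer(route_b, layer_sites, radius_km=25.0))  # route stays far from the site
```

A route that passes near a site in an activated layer would be a candidate for the visual overlay step, where
a person makes the final call.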
It is important to note that the detection is not based on just one correlation but on several rules, each of
which raises a flag. More than “n” flags mean that an outlier has been detected. Figures 5 and 6 show
such correlation examples. Assume a car parts importer has a base at Mexicali in Mexico and one at
Barranquilla, Colombia, and sends two separate shipments from these two locations to New York and
another from Mexicali to San Francisco. The rule engine will be capable of analyzing all the relevant data
and statistics to show a correlation between these shipments and a possible drug disbursement via these
shipments. This does not indicate that every shipment coming from those locations carries drugs; rather,
our aim is to develop a robust system capable of revealing such correlations.
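The flag-counting rule described above (several rules, each raising a flag, with an outlier reported when
more than n flags are raised) can be sketched as follows. The rule names, the shipment fields and the
values of n are hypothetical and chosen only to illustrate the mechanism; the actual rules of the OUTLAW
rule engine are not given in the paper.

```python
def raised_flags(shipment, rules):
    """Return the names of the rules that raise a flag for this shipment."""
    return [name for name, rule in rules.items() if rule(shipment)]

def is_outlier(shipment, rules, n):
    """Report an outlier only when more than n rules raise a flag."""
    return len(raised_flags(shipment, rules)) > n

# Hypothetical rules; each one inspects a single precomputed correlation.
rules = {
    "origin_in_drug_layer": lambda s: s["origin_in_drug_layer"],
    "importer_has_related_shipments": lambda s: s["related_shipments"] >= 2,
    "routes_converge": lambda s: s["routes_converge"],
}

# Illustrative record, not real customs data.
shipment = {
    "origin_in_drug_layer": True,
    "related_shipments": 2,
    "routes_converge": False,
}

flags = raised_flags(shipment, rules)
print(flags)
print(is_outlier(shipment, rules, n=1))  # two flags exceed n = 1
print(is_outlier(shipment, rules, n=2))  # two flags do not exceed n = 2
```

Keeping the list of raised flags, rather than only the final decision, records which factors contributed to
the result.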
Figure 4: World Drug Distribution
Figure 5: Correlating other geographic layers
Acknowledgements
The authors thank Dr. Rey Koslowski, Rutgers University, for his early input and discussions on this
project. This work is motivated by discussions we had about the various efforts under way at U.S. Customs
for innovation in information technology for border security, and by SAP America's initiatives in the same
direction.
References:
1. Stephen Flynn, America the Vulnerable, Foreign Affairs, January/February 2002.
2. Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques.
3. Raymond Ng, Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. 20th
International Conference on Very Large Data Bases, Santiago, Chile, September 12-15, 1994.
4. D. Hawkins, Identification of Outliers, Chapman and Hall, 1980.
5. M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander, LOF: Identifying Density-Based Local Outliers,
Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2000), Dallas, TX, 2000, pp. 93-104.
6. J. Sander, M. Ester, H.-P. Kriegel, X. Xu, Density-Based Clustering in Spatial Databases: The
Algorithm GDBSCAN and its Applications, Data Mining and Knowledge Discovery, an Int.
Journal, Kluwer Academic Publishers, Vol. 2, No. 2, 1998, pp. 169-194.
7. Edwin M. Knorr, Raymond T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets,
Proc. 24th Int. Conf. Very Large Data Bases, VLDB, 1998.
8. Andreas Arning, Rakesh Agrawal, Prabhakar Raghavan, A Linear Method for Deviation Detection in
Large Databases, Knowledge Discovery and Data Mining, 1996.
9. The United States Customs Service, http://www.customs.gov
10. The US Census Bureau, http://www.census.gov/prod/www/titles.html
11. The Massachusetts Institute for Social and Economic Research (MISER), Foreign Trade Database.
12. The Agency for Toxic Substances and Disease Registry (ATSDR), Hazardous Substance Release
and Health Effects Database.
13. The Office of National Drug Control Policy, document NCJ163927, "The National Drug Control
Strategy, 1997: Budget Summary".
14. S. Chawla, S. Shekhar, W. Wu, Predicting Locations Using Map Similarity (PLUMS): A
Framework for Spatial Data Mining, Proc. of the 6th International Conference on Knowledge
Discovery and Data Mining, Boston, MA, 2000.
15. Krzysztof Koperski, Jiawei Han, Junas Adhikary, Spatial Data Mining: Progress and Challenges.