OUTLAW: An Outlier Detection and Visual Analysis Tool Using Geo-Spatial Associations
V. Janeja, V. Atluri and N. R. Adam
MSIS Department and CIMIC, Rutgers University
{vandana,atluri,[email protected]}
Type: Short paper
Demo as well: No
Contact Author: Prof. Vijay Atluri, MSIS Department and CIMIC, 180 University Avenue, Rutgers University, Newark, NJ 07102, U.S.A.
Telephone: 973-353-1642, Fax: 973-353-5003
Email: [email protected]

1. Introduction
U.S. Customs deals with a huge number of cargo trucks and shipments crossing borders by air, water and land. For example, in the year 2000, 489 million people, 127 million passenger vehicles, 11.6 million maritime containers, 11.5 million trucks, 2.2 million railroad cars, 829,000 planes and 211,000 vessels passed through U.S. border inspection systems [1]. Moreover, one third of the trucks entering the United States each year come through just four international bridges between the province of Ontario and the states of Michigan and New York. Typically, a thorough physical inspection of a loaded 40-foot container or an 18-wheel truck at the border takes five inspectors three hours [1]. While recent events call for heightened security measures at border control, it is not practical to inspect all the cargo crossing the border. Therefore, it is essential to minimize the human effort required in the inspection process without compromising the quality of the inspection.

In this paper, we present OUTLAW, a system for Outlier analysis by measuring waywardness, which is still under development. OUTLAW enables outlier detection and visual analysis using data mining techniques that rely on geo-spatial association. Essentially, it distinguishes abnormal and wayward behavior of cargo or goods from normal behavior. We consider several practical limitations of the existing systems and address them in OUTLAW.
A wayward behavior may appear to be normal due to lack of correlation. Moreover, we believe that it is essential to consider spatial relationships such as spatial proximity, spatial correlation, and geo-spatial associations, because the geo-spatial attributes of shipments and cargo contribute significantly to wayward behavior. We combine disparate data, such as (….), in meaningful ways. Thematic map coloring and geographic visualization of individual variables can be effective in identifying correlations between the variables, weak spots, loopholes, wayward routes and vagrants. Based on the correlation of the data, we generate a predictive model that yields an index for measuring waywardness.

Most traditional outlier detection techniques rely on distance-based, density-based and deviation-based outlier measures [6, 7, 8]. Another approach for spatial outliers was proposed in [14]. Many of these methods are a by-product of clustering methods, which treat data points that lie outside of a cluster as outliers. However, depending on the domain being studied, part of a cluster, or a whole cluster, of data points could itself be an outlier. For example, in a network intrusion detection scenario, the sudden emergence of a number of users at a point in the network or at a web site is considered an anomaly, whereas in a general survey of criminal behavior, a few data points will lie outside of a regular cluster. These scenarios are contradictory, which indicates that the problem cannot be resolved with a simple or singular technique. Another disadvantage of traditional outlier detection is that being an outlier is treated as a binary property, i.e., a data object either is an outlier or is not. In this case, the set of factors contributing to the outlierness is not considered. Breunig et al. [5] consider the degree to which an object is an outlier, its local outlier factor.
However, their approach also relies on density-based and deviation-based outlier detection and thus suffers from the same limitation of being singular in approach.

OUTLAW uses a novel technique for analyzing high-dimensional geospatial data to detect outliers. It uses the correlation of several parameters to determine outliers in a specific domain. More specifically, it uses a multitude of factors to predict outliers from various views of the data set. It introduces a new measure of the waywardness of a data object: it analyzes the waywardness in each view and computes a unified measure of the level of anomaly. It employs an N-step mechanism to detect vagrants, or outliers, in the normal scheme of events.

2. Cargo data: Sources and description
A preliminary study of the customs [9] domain revealed the need to draw data from various agencies (Figure 1) in order to determine some level of disparity. For example, data from the EPA and CDC, which maintain hazardous materials databases and the distribution of those materials over the entire country, can be valuable.

Figure 1: Database parameters
Figure 2: Qualifiers for anomaly detection

The qualifiers include different attributes of shipping and importer data, and cargo information that includes spatial data. We believe that in this specific domain, spatial proximity and spatial correlation play a key role. During the analysis, we found that apart from the type of cargo, the route taken by the shipment or cargo is of great importance in finding an anomaly. For example, there could be a correlation between two shipments from importers carrying some associated HAZMAT. This lays the foundation for the architecture of OUTLAW.

4. Architecture
The OUTLAW system architecture is shown in Figure 3. OUTLAW analyzes cargo routes by correlating different map layers.
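As a rough illustration of what correlating a route with map layers by spatial proximity could look like, the sketch below reduces each layer to a list of point sites and reports which layers have a site close to any waypoint of a route. This is not OUTLAW's implementation: the function names, the dictionary-based layer representation, the 25 km radius and all coordinates are assumptions made only for this example, and great-circle distance is used as a simple stand-in for whatever proximity measure the system applies.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def layers_near_route(route, layers, radius_km):
    """Return names of layers that have at least one site within radius_km of the route.

    route  -- list of (lat, lon) waypoints (hypothetical representation)
    layers -- dict mapping a layer name to a list of (lat, lon) sites
    """
    hits = []
    for name, sites in layers.items():
        if any(haversine_km(p, s) <= radius_km for p in route for s in sites):
            hits.append(name)
    return hits


# Illustrative data only: these coordinates are not taken from any real layer.
route = [(32.62, -115.45), (34.05, -118.24), (37.77, -122.42)]
layers = {
    "hazmat_sites": [(34.00, -118.20)],
    "other_layer": [(10.96, -74.80)],
}
nearby = layers_near_route(route, layers, radius_km=25.0)
print(nearby)
```

The layers returned by such a check are the kind of candidates that could then be examined more closely against the route; a real system would work on polygon or raster layers rather than bare points.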
The map layers, such as the HAZMAT distribution over the country or the world drug distribution (Figure 4), are geocoded maps from various government sources such as the CDC, the EPA and the Census Bureau [10,11,12,13]. The various data dimensions are maintained as multiple shape files in the geospatial repository.

Figure 3: OUTLAW Architecture

Basic outlier detection is done using spatial proximity and the convergence and divergence of routes, among other measures. Based on this first iteration, appropriate map layers are identified and activated. The routes are then visually overlaid on top of the activated layer, which helps a human analyst make their own judgment. In addition, the geospatial association of the data is performed again using the new knowledge and the activated map layers. It is important to note that detection is not based on a single correlation but on several rules, each of which raises a flag. More than n flags means that an outlier has been detected. Figures 5 and 6 show such correlation examples.

Assume a car parts importer has one base at Mexicali, Mexico, and another at Barranquilla, Colombia, and sends two separate shipments from these locations to New York and another from Mexicali to San Francisco. The rule engine will be capable of analyzing all the relevant data and statistics to show a correlation between these shipments and a possible drug disbursement via these shipments. This does not imply that every shipment coming from those locations carries drugs; rather, our aim is to develop a system robust enough to detect such correlations.

Figure 4: World Drug Distribution
Figure 5: Correlating other geographic layers

Acknowledgements
The authors thank Dr. Rey Koslowski, Rutgers University, for his early input and discussions on this project.
This work is motivated by discussions we had about the ongoing efforts at U.S. Customs toward innovation in information technology for border security, and by SAP America's initiatives in the same direction.

References:
1. Stephen Flynn, America the Vulnerable, Foreign Affairs, January/February 2002.
2. Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques.
3. Raymond Ng, Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago, Chile.
4. D. Hawkins, Identification of Outliers, Chapman and Hall, 1980.
5. M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander, LOF: Identifying Density-Based Local Outliers, Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD 2000), Dallas, TX, 2000, pp. 93-104.
6. J. Sander, M. Ester, H.-P. Kriegel, X. Xu, Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications, Data Mining and Knowledge Discovery, an Int. Journal, Kluwer Academic Publishers, Vol. 2, No. 2, 1998, pp. 169-194.
7. Edwin M. Knorr, Raymond T. Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets, Proc. 24th Int. Conf. Very Large Data Bases (VLDB), 1998.
8. Andreas Arning, Rakesh Agrawal, Prabhakar Raghavan, A Linear Method for Deviation Detection in Large Databases, Knowledge Discovery and Data Mining, 1996.
9. The United States Customs Service, http://www.customs.gov
10. The US Census Bureau, http://www.census.gov/prod/www/titles.html
11. The Massachusetts Institute for Social and Economic Research (MISER), Foreign Trade Database.
12. The Agency for Toxic Substances and Disease Registry (ATSDR), Hazardous Substance Release and Health Effects Database.
13. The Office of National Drug Control Policy, document NCJ163927, "The National Drug Control Strategy, 1997: Budget Summary".
14. S. Chawla, S. Shekhar, W. Wu, Predicting Locations Using Map Similarity (PLUMS): A Framework for Spatial Data Mining, Proc. of the 6th International Conference on Knowledge Discovery and Data Mining, Boston, MA, 2000.
15. Krzysztof Koperski, Jiawei Han, Junas Adhikary, Spatial Data Mining: Progress and Challenges.