Irvine (CA), November 6, 2008

Finding Regional Co-Location Patterns for Sets of Continuous Variables in Spatial Datasets

Christoph Eick (University of Houston, USA), Rachana Parmar (University of Houston, USA), Wei Ding (University of Massachusetts at Boston, USA), Tomasz Stepinski (Lunar and Planetary Institute, Houston, USA), Jean-Philippe Nicot (Bureau of Economic Geology, University of Texas, Austin, USA)

ACM-GIS08 Data Mining & Machine Learning Group CS@UH

Talk Outline
1. Introduction: Co-location Mining
2. Clustering with Plug-in Fitness Functions
3. An Interestingness Measure for Co-location Mining Involving Continuous Variables
4. Case Study: Arsenic Pollution in the Texas Water Wells
5. CLEVER: A Representative-based Clustering Algorithm
6. Conclusion

1. Introduction
“Spatial co-locations represent the subsets of features which are frequently located together in geographic space” [Shekhar]
Most past research centers on finding categorical co-location patterns that are global.
However, many real-world datasets contain continuous variables, and global knowledge may be inconsistent with regional knowledge.

Regional Co-location Mining
Goal: To discover regional co-location patterns involving continuous variables, in which the continuous variables take values from the wings of their statistical distribution.
A novel framework that operates in the continuous domain is proposed to accomplish this goal.
Dataset: (longitude, latitude, <concentrations>+)

Why is Regional Knowledge Important in Spatial Data Mining?
A special challenge in spatial data mining is that information is usually not uniformly distributed in spatial datasets.
It has been pointed out in the literature that “whole map statistics are seldom useful”, that “most relationships in spatial data sets are geographically regional, rather than global”, and that “there is no average place on the Earth’s surface” [Goodchild03, Openshaw99].
Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional or local scale rather than a global scale.

Related Work
Shekhar et al. discuss several approaches to mining spatial co-location patterns of categorical features.
Huang et al. address the problem of mining co-location patterns with rare features.
Srikant and Agrawal discretize continuous variables into categorical variables, to which classical association rule mining is then applied.
Calders et al. introduce an approach that uses rank correlation to mine quantitative association rules.
Achtert et al. give a method for deriving quantitative, non-spatial models that describe correlation clusters.

2. Clustering with Plug-in Fitness Functions
Motivation:
Finding subgroups in geo-referenced datasets has many applications.
However, in many applications the subgroups being searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation.
Domain knowledge frequently imposes additional requirements concerning what constitutes a “good” subgroup.
Consequently, it is desirable to develop clustering algorithms that provide plug-in fitness functions, which allow domain experts to express the desirable characteristics of the subgroups they are looking for.
Only very few clustering algorithms published in the literature provide plug-in fitness functions; consequently, existing clustering paradigms have to be modified and extended by our research to provide such capabilities.
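
The plug-in idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical and for illustration only: the names `best_clustering`, `high_mean_fitness`, and `low_spread_fitness`, the toy values, and the choice among three hand-made candidate clusterings are assumptions, not one of the algorithms in this talk. It only shows that the same search routine returns different subgroups once a different fitness function is plugged in.

```python
from typing import Callable, List

# Hypothetical types: a cluster is the list of non-spatial values of its objects.
Cluster = List[float]
Clustering = List[Cluster]
Fitness = Callable[[Clustering], float]


def best_clustering(candidates: List[Clustering], fitness: Fitness) -> Clustering:
    """Return the candidate clustering with the highest fitness (hypothetical helper)."""
    return max(candidates, key=fitness)


def high_mean_fitness(clustering: Clustering) -> float:
    """Toy fitness: reward clusters with a high mean value, with a premium on size."""
    return sum((sum(c) / len(c)) * len(c) ** 1.5 for c in clustering)


def low_spread_fitness(clustering: Clustering) -> float:
    """Toy fitness: penalize within-cluster spread (a compactness-style criterion)."""
    total = 0.0
    for c in clustering:
        mean = sum(c) / len(c)
        total -= sum((v - mean) ** 2 for v in c)
    return total


# Three hand-made candidate clusterings of the same four toy values.
candidates = [
    [[1.0, 1.0], [9.0, 9.0]],
    [[1.0, 9.0], [1.0, 9.0]],
    [[1.0, 1.0, 9.0, 9.0]],
]

by_mean = best_clustering(candidates, high_mean_fitness)
by_spread = best_clustering(candidates, low_spread_fitness)
print(by_mean)
print(by_spread)
```

The search routine never inspects what the fitness function measures, which is the property that lets a domain expert swap in a different notion of a "good" subgroup without touching the clustering code.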

Current Suite of Spatial Clustering Algorithms
Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
Grid-based: SCMRG
Agglomerative: MOSAIC
Density-based: SCDE, DCONTOUR (not really plug-in, but some fitness functions can be simulated)
[Figure: clustering algorithms grouped into density-based, grid-based, representative-based, and agglomerative]
Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.

Spatial Clustering Algorithms (cont.)
Datasets are assumed to have the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude; <chemical concentrations>+)
Clusters are found in the subspace of the spatial attributes and are called regions in the following.
The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.
Clustering algorithms are assumed to maximize reward-based fitness functions of the following form:
$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{b}$$
where b is a parameter that determines the premium put on cluster size (larger values of b lead to fewer, larger clusters).

3. An Interestingness Measure for Co-location Mining Involving Continuous Variables
The goal is to discover interesting regions with interesting co-location patterns.
Clustering algorithms that maximize fitness functions of the following form already exist:
$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{b}$$
To use those algorithms for this task, an interestingness measure has to be designed.

Co-location Measure for Continuous Variables
Products of z-scores of continuous variables are used to measure the interestingness of co-location patterns.
Pattern A↑: attribute A has high values
$$z(A{\uparrow}, o) = \begin{cases} \text{z-score}(A, o) & \text{if } \text{z-score}(A, o) > 0 \\ 0 & \text{otherwise} \end{cases}$$
Pattern A↓: attribute A has low values
$$z(A{\downarrow}, o) = \begin{cases} -\text{z-score}(A, o) & \text{if } \text{z-score}(A, o) < 0 \\ 0 & \text{otherwise} \end{cases}$$

Interestingness of a Pattern
Interestingness of a pattern B (e.g. B = {C, D, E}) for an object o:
$$i(B, o) = \prod_{p \in B} z(p, o)$$
Interestingness of a pattern B for a region c:
$$\varphi(B, c) = \frac{\sum_{o \in c} i(B, o)}{|c|} \cdot \mathrm{purity}(B, c)^{\theta}$$
Remark: purity(B, c) measures the percentage of objects o in region c that exhibit pattern B, i.e. for which i(B, o) > 0.

Region Interestingness
Region interestingness is assessed by computing the most prevalent pattern:
$$i(c) = \max_{B \subseteq S \,\wedge\, |B| > 1 \,\wedge\, P(B)} \varphi(B, c)$$
Region interestingness solely depends on the most interesting co-location set for the region.

Example of a Result
All experiments: P(B) = (As↑ ∈ B or As↓ ∈ B) and |B| < 5.
Experiment 1: b = 1.3, θ = 1.0

Top 5 regions, Experiment 1:
Rank  Region Size  Region Reward  Maximum Valued Pattern in the Region  Purity  Average Product for Maximum Valued Pattern
1     23           174.3191       AsMoVF-                               0.83    211.0179
2     40           104.8576       AsMoV                                 0.65    161.3194
3     11           92.9385        AsMoVSO42-                            0.55    170.3873
4     36           89.4068        AsBCl-TDS                             0.58    153.2687
5     7            30.5775        AsMoClTDS                             0.57    53.5107

Summary
Pattern interestingness in a region is evaluated using products of (cut-off) z-scores. In general, products of z-scores measure correlation.
Additionally, purity is considered; its influence is controlled by a parameter θ.
Finally, the parameter b determines how much premium is put on the size of a region when computing region rewards.

4. Case Study

Arsenic Water Pollution Problem
Arsenic pollution is a serious problem in the Texas water supply.
It is hard to explain what causes arsenic pollution to occur.
Several datasets were created using the Ground Water Database (GWDB) of the Texas Water Development Board (TWDB), which tests water wells regularly. One of them was used in the experimental evaluation in the paper:
All wells have non-null samples for arsenic.
Multiple sample values are aggregated using avg/max functions.
Other chemicals may have null values.
Format: (Longitude, Latitude, <z-values of chemical concentrations>)

Interesting Observations
High arsenic is a well-known problem in the Southern Ogallala aquifer in the Texas Panhandle and in the Southern Gulf Coast aquifer. The co-location mining framework was able to identify regions in these areas; for example, for b = 1.3, θ = 1.0:
Rank 1, 2 and 3 regions are in the Ogallala aquifer.
Rank 4 region is in the Gulf Coast aquifer.
The approach not only identified that high arsenic is associated with high vanadium and molybdenum, but was also able to discriminate against companion elements like sulfate and fluoride.

Interesting Observations (cont.)
For b = 1.5, the extent of arsenic contamination in Texas (Ogallala Aquifer, Southern Gulf Coast, and West Texas basins) could be recognized.
For b = 2.0, the loosening of the cluster definition results in a display of the known, often described as sharp, boundaries between high and low arsenic areas in the Ogallala (Ranks 2 and 4) and the Gulf Coast (Ranks 1 and 3) aquifers.
In general, for b = 1.3 and b = 1.5 the discovered regions tend to lie inside Texas aquifers, which is expected because wells inside the same aquifer are connected by water flow.
The algorithm also finds some inconsistent co-location sets. For example, for b = 1.5, the rank 3 region in West Texas has high arsenic co-located with high chloride, while the rank 4 region in South Texas has low arsenic with high chloride, which can be attributed to geographical differences between the regions.
When θ is increased to 5, not surprisingly, all top regions have purities of 90% or above.

Example: Differences in Results, Medium/High Rewards for Purity
Table 5. Top 5 regions ranked by reward (as per formula 8).
All experiments: P(B) = (As↑ ∈ B or As↓ ∈ B) and |B| < 5
Experiment 2: b = 1.5, θ = 1.0
Experiment 4: b = 1.5, θ = 5.0

Exp. No.  Rank  Region Size  Region Reward  Maximum Valued Pattern in the Region  Purity  Average Product for Maximum Valued Pattern
Exp. 2    1     181          61684.5323     AsMoVF-                               0.49    52.1019
Exp. 2    2     80           24040.6315     AsBCl-TDS                             0.48    70.7322
Exp. 2    3     467          1884.8856      AsTDS                                 0.91    0.2047
Exp. 2    4     23           701.7072       AsCl-SO42-TDS                         0.78    8.1287
Exp. 2    5     189          587.9790       AsF-                                  0.78    0.2909
Exp. 4    1     7            11669.7965     AsBCl-TDS                             1.0     630.1097
Exp. 4    2     117          10407.3250     AsVF-                                 0.91    12.8550
Exp. 4    3     4            2203.2526      AsV SO42-TDS                          1.0     275.4066
Exp. 4    4     2            1531.4887      AsMoVB                                1.0     541.4630
Exp. 4    5     530          1426.9140      AsTDS                                 0.90    0.1939

[Figure: high-reward regions for θ = 1 and θ = 5]

Challenges
This is a kind of “seeking a needle in a haystack” problem, because we search for both interesting places and interesting patterns.
The interestingness measure is not anti-monotone: a superset of a co-location set might be more interesting.
Only considering the maximum valued pattern when evaluating regions is somewhat crude (employed solution: use seeded patterns and run the algorithm multiple times).
Observation: different fitness function parameter settings lead to quite different results, many of which are valuable to domain experts.
New challenge: the results of many runs have to be analyzed, which is a lot of manual labor; a tool is needed for that.

Representative-based Clustering
[Figure: four clusters, labeled 1 to 4, in the Attribute1/Attribute2 plane]
Objective: Find a set of objects $O_R$ such that the clustering X obtained by using the objects in $O_R$ as representatives minimizes q(X).
Properties: Cluster shapes are convex polygons.
Popular algorithms: K-means, K-medoids

5. CLEVER (ClustEring using representatiVEs and Randomized hill climbing)
Is a representative-based (sometimes called prototype-based) clustering algorithm.
Uses a variable number of clusters and larger neighborhood sizes to battle premature termination, and randomized hill climbing and adaptive sampling to reduce complexity.
Searches for the optimal number of clusters.

6. Summary
A novel framework for mining co-location patterns involving multiple continuous variables with values from the wings of their statistical distribution is proposed.
Regional co-location mining is approached as a clustering problem in which a reward-based fitness function has to be maximized.
The approach was successfully applied in a real-world case study involving arsenic contamination. The case study revealed known areas of arsenic contamination and also some unknown areas with interesting features. Different parameters lead to characterizations of arsenic patterns at different scales.
In general, the regional co-location mining framework has been valuable to domain experts in that it provides a data-driven approach that suggests promising hypotheses for future research.
A novel prototype-based clustering algorithm named CLEVER was also introduced.

References
S. Shekhar and Y. Huang, “Discovering spatial co-location patterns: A summary of results,” Lecture Notes in Computer Science, vol. 2121, pp. 236+, 2001.
Y. Huang, J. Pei, and H. Xiong, “Mining co-location patterns with rare events from spatial data sets,” Geoinformatica, vol. 10, no. 3, pp. 239–260, 2006.
R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data. New York, NY, USA: ACM, 1996, pp. 1–12.
T. Calders, B. Goethals, and S. Jaroszewicz, “Mining rank-correlated sets of numerical attributes,” in KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2006, pp. 96–105.
E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, and A. Zimek, “Deriving quantitative models for correlation clusters,” in KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2006, pp. 4–13.
C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, “Discovery of interesting regions in spatial datasets using supervised clustering,” in Proceedings of the 10th European conference on Principles of Data Mining and Knowledge discovery, 2006.

Region Discovery Framework
Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets.
Treats region discovery as a clustering problem.

Region Discovery Framework (continued)
The clustering algorithms we currently investigate solve the following problem:
Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates a clustering $X = \{c_1, \ldots, c_k\}$ as follows:
$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} \mathrm{interestingness}(c) \cdot \mathrm{size}(c)^{b} \quad \text{with } b > 1$$
Objective: Find $c_1, \ldots, c_k \subseteq O$ such that:
1. $c_i \cap c_j = \emptyset$ if $i \neq j$
2. $X = \{c_1, \ldots, c_k\}$ maximizes q(X)
3. All clusters $c_i \in X$ are contiguous in the spatial subspace
4. $c_1 \cup \ldots \cup c_k \subseteq O$
5. $c_1, \ldots, c_k$ are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
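
As a rough illustration of how q(X) and the reward-based ranking could be evaluated with the z-score measure from Section 3, here is a minimal Python sketch. It rests on several assumptions: objects are plain dicts of precomputed z-scores, purity enters the pattern score as purity raised to the power θ, the constraint P(B) is left out, and all function names, attribute names, and toy values are hypothetical. It enumerates patterns exhaustively and is not the CLEVER algorithm or the authors' implementation.

```python
from itertools import combinations

# All names below are hypothetical; they are not taken from the talk.


def cutoff_z(z, direction):
    """Cut-off z-score: the part of z that supports a 'high' or 'low' pattern."""
    if direction == "high":
        return z if z > 0 else 0.0
    return -z if z < 0 else 0.0


def object_interestingness(pattern, obj):
    """i(B, o): product of cut-off z-scores over the (attribute, direction) pairs in B."""
    product = 1.0
    for attribute, direction in pattern:
        product *= cutoff_z(obj[attribute], direction)
    return product


def pattern_interestingness(pattern, region, theta):
    """Average of i(B, o) over the region, scaled by purity ** theta (assumed form)."""
    scores = [object_interestingness(pattern, o) for o in region]
    purity = sum(s > 0 for s in scores) / len(region)
    return sum(scores) / len(region) * purity ** theta


def region_interestingness(region, attributes, theta, max_size=4):
    """interestingness(c): score of the best pattern with 2 <= |B| <= max_size."""
    items = [(a, d) for a in attributes for d in ("high", "low")]
    best = 0.0
    for k in range(2, max_size + 1):
        for pattern in combinations(items, k):
            best = max(best, pattern_interestingness(pattern, region, theta))
    return best


def q(clustering, attributes, b, theta):
    """q(X): sum of interestingness(c) * size(c) ** b over all clusters."""
    return sum(region_interestingness(c, attributes, theta) * len(c) ** b
               for c in clustering)


# Toy clustering with made-up z-scores for two attributes.
region_a = [{"As": 2.0, "Mo": 1.0}, {"As": 1.0, "Mo": 3.0},
            {"As": -1.0, "Mo": 0.5}, {"As": 0.5, "Mo": -1.0}]
region_b = [{"As": -0.2, "Mo": 0.1}, {"As": 0.1, "Mo": -0.3}]
X = [region_a, region_b]

rewards = [region_interestingness(c, ["As", "Mo"], theta=1.0) * len(c) ** 1.5
           for c in X]
# Rank clusters by reward, highest first, as in item 5 above.
ranked = sorted(range(len(X)), key=lambda i: rewards[i], reverse=True)
total = q(X, ["As", "Mo"], b=1.5, theta=1.0)
print(rewards, ranked, total)
```

Because the pattern space grows exponentially with the number of attributes, a bound such as `max_size` (mirroring the |B| < 5 restriction used in the experiments) keeps the enumeration tractable in this sketch.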