Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
High Performance Data Mining Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota http://www.cs.umn.edu/~kumar Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Overview Introduction to Data Mining (What, Why, and How?) Issues and Challenges in Designing Parallel Data Mining Algorithms Case Study: Discovery of Patterns in Global Climate Data using Data Mining Summary © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› What is Data Mining? Many Definitions – Non-trivial extraction of implicit, previously unknown and potentially useful information from data – Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns What is (not) Data Mining? What is not Data Mining? What is Data Mining? – Look up phone number in phone directory – Certain names are more prevalent in certain US locations (O’Brien, O’Rurke, O’Reilly… in Boston area) – Query a Web search engine for information about “Amazon” – Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com,) © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused – Web data, e-commerce – purchases at department/ grocery stores – Bank/Credit Card transactions Computers have become cheaper and more powerful Competitive Pressure is Strong – Provide better, customized services for an edge (e.g. in Customer Relationship Management) © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Why Mine Data? Scientific Viewpoint Data collected and stored at enormous speeds (GB/hour) – remote sensors on a satellite – telescopes scanning the skies – microarrays generating gene expression data – scientific simulations generating terabytes of data Traditional techniques infeasible for raw data Data mining may help scientists – in classifying and segmenting data – in Hypothesis Formation Mining Large Data Sets - Motivation There is often information “hidden” in the data that is not readily evident Human analysts may take weeks to discover useful information Much of the data is never analyzed at all 4,000,000 3,500,000 The Data Gap 3,000,000 2,500,000 2,000,000 1,500,000 Total new disk (TB) since 1995 1,000,000 Number of analysts 500,000 0 1995 1996 1997 1998 1999 From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications” Origins of Data Mining Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems Traditional Techniques may be unsuitable due to Statistics/ Machine Learning/ – Enormity of data AI Pattern Recognition – High dimensionality of data Data Mining – Heterogeneous, distributed nature Database systems of data © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Role of Parallel and Distributed Computing Many algorithms use computation time more than O(n) High Performance Computing (HPC) is often critical for scalability to large data sets Sequential computers have limited memory – This may required multiple, expensive I/O passes over data Data may be distributed – due to privacy reasons – physically dispersed over many different geographic locations © Vipin Kumar Statistics/ AI Machine Learning/ Pattern Recognition Data Mining Database systems Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 High Performance Computing ‹#› Data Mining Tasks... Data 10 Milk Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 11 No Married 60K No 12 Yes Divorced 220K No 13 No Single 85K Yes 14 No Married 75K No 15 No Single 90K Yes 60K Predictive Modeling Find a model for class attribute as a function of the values of other attributes Model for predicting tax evasion Married Yes Tid Refund Marital Status Taxable Evade Income 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K NO Income100K Yes Yes Yes Learn Classifier No Income 80K NO NO No YES 10 © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Predictive Modeling: Applications Targeted Marketing Customer Attrition/Churn Classifying Galaxies Early Class: • Stages of Formation Intermediate Attributes: • Image features, • Characteristics of light waves received, etc. Late Sky Survey Data Size: • 72 million stars, 20 million galaxies • Object Catalog: 9 GB • Image Database: 150 GB Courtsey: http://aps.umn.edu Clustering Given a set of data points, find groupings such that – Data points in one cluster are more similar to one another – Data points in separate clusters are less similar to one another © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Clustering: Applications Market Segmentation Gene expression clustering Document Clustering Category Total Articles Correctly Placed 555 364 Foreign 341 260 National 273 36 Metro 943 746 Sports 738 573 Entertainment 354 278 Financial © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Association Rule Discovery Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record TID Items 1 Bread, Coke, Milk 2 3 4 5 Beer, Bread Beer, Coke, Diaper, Milk Beer, Bread, Diaper, Milk Coke, Diaper, Milk Applications – Marketing and Sales Promotion – Supermarket shelf management – Inventory Management © Vipin Kumar Rules Discovered: {Milk} --> {Coke} (s=0.6, c=0.75) {Diaper, Milk} --> {Beer} (s=0.4, c=0.67) Support, s # transacti ons that contain X and Y Total transacti ons Confidence , c # transacti ons that contain X and Y # transacti ons that contain X Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Deviation/Anomaly Detection Detect significant deviations from normal behavior Applications: – Credit Card Fraud Detection – Network Intrusion Detection Typical network traffic at University level may reach over 100 million connections per day © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› General Issues and Challenges in Parallel Data Mining Dense vs. Sparse Structured versus Unstructured Static vs. Dynamic Data mining computations tend to be unstructured, sparse and dynamic. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Specific Issues and Challenges in Parallel Data Mining Disk I/O – Data is often too large to fit in main memory – Spatial locality is critical Hash Tables – Many efficient data mining algorithms require fast access to large hash tables. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Constructing a Decision Tree Tid Refund Marital Status Taxable Income Evade 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 60K Marital Status Single/Divorced Pay: 3 Evade: 3 Refund Married Yes Pay: 4 Evade: 0 Pay: 3 Evade: 0 Pay 10 No Pay: 4 Evade: 3 Evade Refund 3 0 No Refund 4 3 Key Computation © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Constructing a Decision Tree Tid Refund Marital Status Taxable Income Evade 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married No 60K 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Refund: Yes Tid Refund Marital Status Taxable Income Evade 1 Yes Single 125K No 4 Yes Married 120K No 7 Yes Divorced 220K No 10 Refund: No 10 Tid Refund Marital Status Taxable Income Evade 2 No Married 100K No 3 No Single 70K No 5 No Divorced 95K Yes 6 No Married 60K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Constructing a Decision Tree in Parallel m categorical attributes n records Partitioning of data only – global reduction per node is required – large number of classification tree nodes gives high communication cost Pay © Vipin Kumar Evade Refund 3 0 No Refund 4 3 Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Constructing a Decision Tree in Parallel 10,000 training records 7,000 records 2,000 3,000 records 5,000 2,000 1,000 Partitioning of classification tree nodes – natural concurrency – load imbalance as the amount of work associated with each node varies – child nodes use the same data as used by parent node © Vipin Kumar – loss of locality – high data movement cost Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Challenges in Constructing Parallel Classifier Partitioning of data only – large number of classification tree nodes gives high communication cost Partitioning of classification tree nodes – natural concurrency – load imbalance as the amount of work associated with each node varies – child nodes use the same data as used by parent node • – loss of locality – high data movement cost Hybrid algorithms: partition both data and tree © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Experimental Results (Srivastava, Han, Kumar, and Singh, 1999) Data set – function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT’96) – 2 class labels, 3 categorical and 6 continuous attributes IBM SP2 with 128 processors – 66.7 MHz CPU with 256 MB real memory – AIX version 4 – high performance switch © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Speedup Comparison of the Three Parallel Algorithms 0.8 million examples © Vipin Kumar 1.6 million examples Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Splitting Criterion Verification in the Hybrid Algorithm Splitting Criterion Ratio Moving Cost Load Balancing 0.8 million examples on 8 processors © Vipin Kumar Communication Cost 1.6 million examples on 16 processors Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Speedup of the Hybrid Algorithm with Different Size Data Sets © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Scaleup of the Hybrid Algorithm © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Hash Table Access Some efficient decision tree algorithms require random access to large data structures. Example: SPRINT (Shafer, Agrawal, Mehta) Hash Table ID Income ID Age ID Left/ Right 0 25K 2 25 1 Left 2 28K 5 31 2 Left 8 30K 8 33 3 Right 4 30K 1 37 4 Right 5 35K 3 41 5 Right 1 50K 6 52 6 Left 3 520K 4 55 7 Right 6 55K 7 60 8 Left 7 70K 0 61 9 Left 10 10 10 Storing the entire has table on one processor makes the algorithm unscalable. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› ScalParC (Joshi, Karypis, Kumar, 1998) ScalParC is a scalable parallel decision tree construction algorithm – Scales to large number of processors – Scales to large training sets ScalParC is memory efficient – The hash-table is distributed among the processors ScalParC performs minimum amount of communication © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› This Design is Inspired by.. Communication Structure of Parallel Sparse Matrix-Vector Algorithms. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Parallel Runtime (Joshi, Karypis, Kumar, 1998) Runtime (seconds) 120 100 0.2M 0.4M 0.8M 1.6M 3.2M 6.4M 80 60 40 20 0 0 50 100 150 Processors 128 Processor Cray T3D © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Computing Association Patterns 1. Market-basket transactions TID Items 1 Bread, Diaper,Milk 2 3 4 5 Beer, Diaper, Bread, Eggs Beer, Coke, Diaper, Milk Beer, Bread, Diaper, Milk Coke, Bread, Diaper, Milk 3. Generate association rules © Vipin Kumar 2. Find item combinations (itemsets) that occur frequently in data Item Combination Bread Coke Milk … Bread & Coke Bread & Milk … Bread & Milk & Diaper … Count 4 2 4 … 1 3 … 3 … {Diaper, Milk} {Beer} {Bread} {Diaper} Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Computing Association Require Exponential Computation {a} {a,b} {b} {a,c} {a,b,c} {c} {a,d} {a,b,d} {b,c} {d} {b,d} {a,c,d} {c,d} {b,c,d} {a,b,c,d} Given m items, there are 2m-1 possible item combinations © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Handling Exponential Complexity Given n transactions and m different items: m 1 – number of possible association rules: O(m2 ) m O ( nm 2 ) – computation complexity: Systematic search for all patterns, based on support constraint [Agarwal & Srikant]: – If {A,B} has support at least a, then both A and B have support at least a. – If either A or B has support less than a, then {A,B} has support less than a. – Use patterns of n-1 items to find patterns of n items. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Illustrating Apriori Principle (Agrawal and Srikant, 1994) Item Bread Coke Milk Beer Diaper Eggs Count 4 2 4 3 5 1 Items (1-itemset candidates) Minimum Support = 3 If every subset is considered, 6C + 6C + 6C = 41 1 2 3 With support-based pruning, 6 + 6 + 1 = 13 © Vipin Kumar Itemset {Bread,Milk} {Bread,Beer} {Bread,Diaper} {Milk,Beer} {Milk,Diaper} {Beer,Diaper} Count 4 2 3 2 4 3 Pairs (2-itemset candidates) Triplets (3-itemset candidates) Itemset {Bread,Milk,Diaper} Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 Count 3 ‹#› Counting Candidates Frequent Itemsets are found by counting candidates. Simple way: – Search for each candidate in each transaction. Expensive!!! Transactions N © Vipin Kumar Candidates M Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Parallel Formulation of Association Rules (Han, Karypis, and Kumar, 2000) Need: – Huge Transaction Datasets (10s of TB) – Large Number of Candidates. How? – Partition the Transaction Database among processors communication needed for global counts local memory on each processor should be large enough to store the entire hash tree – Partition the Candidates among processors redundant I/O for transactions – Partition both Candidates and Transaction Database © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Parallel Association Rules: Scaleup Results (100K,0.25%) © Vipin Kumar (Han, Karypis, and Kumar, 2000) Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Parallel Association Rules: Response Time (np=64,50K) © Vipin Kumar (Han, Karypis, and Kumar, 2000) Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Discovery of Patterns in the Global Climate System Ocean and Land Temperature (Jan 1982) Research Goals: NPP . Pressure © Vipin Kumar Longitude Find global climate patterns of interest to Earth Scientists Global snapshots of values for a number of variables on land surfaces or water. Monthly over a range of 10 to 50 years. NPP . Pressure . Precipitation Precipitation SST SST # grid points: 67K Land, 40K Ocean Current data size range: 20 – 400 MB Latitude grid cell Time zone Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Importance of Global Climate Patterns and NPP Net Primary Production (NPP) is the net assimilation of atmospheric carbon dioxide (CO2) into organic matter by plants. Keeping track of NPP is important because it includes the food source of humans and all other organisms. NPP is impacted by global climate patterns. © Vipin Kumar Image from http://www.pmel.noaa.gov/co2/gif/globcar.png Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Patterns of Interest Zone Formation – Find regions of the land or ocean which have similar behavior. Associations – Find relations between climate events and land cover. Teleconnections – Teleconnections are the simultaneous variation in climate and related processes over widely separated points on the Earth. – El Nino associated with droughts in Australia and Southern Africa and heavy rainfall along the western coast of South America. Sea Surface Temperature Anomalies off Peru (ANOM 1+2) © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Clustering of Raw NPP and Raw SST (Num clusters = 2) Clusters for Raw SST and Raw NPP 90 60 Land Cluster 2 latitude 30 Land Cluster 1 0 Ice or No NPP -30 Sea Cluster 2 -60 Sea Cluster 1 -90 -180 -150 -120 -90 -60 -30 0 30 60 90 120 150 180 longitude © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 Cluster ‹#› K-Means Clustering of Raw NPP and Raw SST (Num clusters = 2) Land Cluster Cohesion: North = 0.78 South = 0.59 Ocean Cluster Cohesion: North = 0.77 South = 0.80 © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Ocean Climate Indices: Connecting the Ocean and the Land An OCI is a time series of temperature or pressure – Based on Sea Surface Temperature (SST) or Sea Level Pressure (SLP) OCIs are important because – They distill climate variability at a regional or global scale into a single time series. – They are related to well-known climate phenomena such as El Niño. ©©Vipin V. Kumar Kumar Keynote Discovery Talk of Patterns at VECPAR-2002, in the Global Porto, Climate Portugal, System June using 27, 2002 Data Mining ‹#› 46 Ocean Climate Indices – ANOM 1+2 ANOM 1+2 is associated with El Niño and La Niña. El Nino Events Defined as the Sea Surface Temperature (SST) anomalies in a regions off the coast of Peru El Nino is associated with – Droughts in Australia and Southern Africa – Heavy rainfall along the western coast of South America – Milder winters in the Midwest © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Connection of ANOM 1+2 to Land Temp Correlation Between ANOM 1+2 and Land Temp (>0.2) 90 0.8 0.6 60 0.4 30 latitude 0.2 0 0 -0.2 -30 -0.4 -60 -0.6 -0.8 -90 -180 -150 -120 -90 -60 -30 0 30 60 90 120 150 180 longitude OCIs capture teleconnections, i.e., the simultaneous variation in climate and related processes over widely separated points on the Earth. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Ocean Climate Indices - NAO The North Atlantic Oscillation (NAO) is associated with climate variation in Europe and North America. North Atlantic Oscillation 3 Iceland 2 Azores 1 0 -1 -2 82 83 84 85 86 87 88 89 90 91 92 93 94 Normalized pressure differences between Ponta Delgada, Azores and Stykkisholmur, Iceland. Associated with warm and wet winters in Europe and in cold and dry winters in northern Canada and Greenland The eastern US experiences mild and wet winter conditions. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Connection of NAO to Land Temp Correlation Between NAO and Land Temperature (>0.3) 90 1 0.8 60 0.6 0.4 latitude 30 0.2 0 0 -0.2 -30 -0.4 -0.6 -60 -0.8 -1 -90 -180 -150 -120 -90 -60 -30 0 30 60 90 120 150 180 Correlation longitude © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Discovery of Ocean Climate Indices Use clustering to find areas of the oceans that have high density, I.e., relatively homogeneous behavior. – Cluster centroids are potential OCIs. – For SLP pairs of cluster centroids are potential OCIs. Evaluate the “influence” of potential OCIs on land points. Determine if the potential OCI matches a known OCI. For potential OCIs that are not well-known, conduct further evaluation. – Are there land points that have higher correlation for the potential OCI than for known indices? © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› SNN Clustering - Advantages Finding clusters of different shapes and sizes, especially in the presence of noise, is a difficult clustering problem. SNN clustering – Handles problems of varying density, shape and size. – Is resistant to noise. Earth Science data is noisy – SNN clustering finds the number of clusters automatically. Requires O(n2) time – Need to calculate the pairwise similarity matrix – This is a highly parallel operation. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› SST Clusters 107 SNN Clusters for Detrended Monthly Z SST (1958-1998) 90 60 latitude 30 0 -30 -60 -90 -180 © Vipin Kumar -150 -120 -90 -60 -30 0 longitude 30 60 90 Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 120 150 180 ‹#› SST Clusters that Correspond to El Nino Climate Indices EL Nino Related SST Clusters 90 60 latitude 30 0 75 78 67 94 Niño Region Range Longitude Range Latitude 1+2 (94) 90°W-80°W 10°S-0° 3 (67) 150°W-90°W 5°S-5°N 3.4 (78) 170°W-120°W 5°S-5°N 4 (75) 160°E-150°W 5°S-5°N El Nino Regions Defined by Earth Scientists -30 -60 -90 -180 -150 -120 -90 -60 -30 0 30 longitude 60 90 120 150 180 SNN clusters of SST that are highly correlated with El Nino indices, ~ 0.93 correlation. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› SST Clusters Highly Correlated to Known Indices … Examples of some SST clusters that are highly correlated to known OCIs and have high area weighted correlation with land temperature. These indices have a significant correlation with El Nino indices. Cluster 29 Cluster 11 90 90 70 70 50 50 30 30 10 10 -10 -10 -30 -30 -50 -50 -70 -70 -90 -180 -140 -100 -60 -20 20 60 100 140 180 -90 -180 -140 -100 -60 Cluster 31 90 70 70 50 50 30 30 10 10 -10 -10 -30 -30 -50 -50 -70 -70 © Vipin Kumar -140 -100 -60 -20 20 20 60 100 140 180 60 100 140 180 Cluster 37 90 -90 -180 -20 60 100 140 180 -90 -180 -140 -100 -60 -20 20 Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› SST Clusters Highly Correlated to Known Indices… However, there are areas (yellow) where these clusters correlate better. Cluster 11 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20) Cluster 29 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20) 90 90 0.6 0.5 70 70 0.4 50 0.3 0.2 30 0.4 50 30 0.2 0.1 10 10 0 -10 0 -10 -0.1 -30 -0.2 -0.3 -50 -0.2 -30 -50 -0.4 -0.4 -70 -70 -0.5 -0.6 -90 -180 -140 -100 -60 -20 20 60 100 140 -90 -180 180 Cluster 31 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20) -140 -100 -60 -20 20 60 100 140 180 Cluster 37 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20) 90 90 0.4 70 0.4 70 50 50 0.2 30 0.2 30 10 10 0 -10 0 -10 -30 -30 -0.2 -0.2 -50 -50 -70 -70 -0.4 -0.4 -90 -180 -140 © Vipin Kumar -100 -60 -20 20 60 100 140 180 -90 -180 -140 -100 -60 -20 20 Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 60 100 140 180 ‹#› SST Cluster Moderately Correlated to Known Indices Cluster 62 Cluster 62 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20) 90 90 70 70 50 50 30 30 10 10 -10 -10 -30 -30 -50 -50 -70 -70 -90 -180 -90 -180 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -140 -100 © Vipin Kumar -60 -20 20 60 100 140 180 -140 -100 -60 -20 Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 20 60 100 140 180 ‹#› Mining Associations in Earth Science Data (Tan, Steinbach, Kumar, Potter, Klooster, Torregrosa, 2001) (Lat,Long,time) Events (10N,10E,1) {Temp-Hi, Prec-Lo} (10N,10E,2) (10N,11E,2) (10N,11E,5) (10N,11E,10) {Temp-Hi,Prec-Lo,NPP-Lo} {Temp-Hi, NPP-Lo} {Solar-Hi, NPP-Lo} {Prec-Hi, PET-LO} First, transform Earth Science data into transactions. Find patterns using association discovery algorithms. 1 FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%) 2 FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%) 3 FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%) 4 FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%) … 75 FPAR-HI ==> NPP-HI (support count = 216924, confidence = 55.7%) © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Example of Interesting Association Rules FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%) © Vipin Kumar Shrubland areas Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Land Cover Types Barren Croplands 90 90 70 70 50 50 30 30 10 10 -10 -10 -30 -30 -50 -50 -70 -70 -90 -180 -140 -100 -60 -20 20 60 100 140 180 -90 -180 -140 -100 -60 Shrublands/ Grasslands 90 70 70 50 50 30 30 10 10 -10 -10 -30 -30 -50 -50 -70 -70 -20 20 60 20 60 100 140 180 60 100 140 180 Forests 90 -90 -180 -140 -100 -60 -20 100 140 180 -90 -180 -140 -100 -60 -20 20 Example of Interesting Association Rules… Support Count Land Cover • Temp-Hi NPP-Hi tends to occur in the forest and cropland regions in the northern hemisphere (Forests (33.5%), Grassland(8.7%), Cropland (24.5%), Desert (0.4%) ) © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Need for Parallel Computing Satellites are providing measurements of finer granularity. Earth Observing System - EOS AM 1 – Finer spatial grids 1 by 1 grid produces 64,800 data points 0.1 by 0.1 grid produces 6,480,000 data points – More frequent measurements Daily measurements multiply monthly data by a factor of 30 Looking at weather instead of climate requires finer resolution – Detection of movement of fronts © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Need for Parallel Computing SNN clustering analyses require O(n2) comparisons. – Evaluate correlation of every ocean point with every land point. Association rule algorithms can also be very compute intensive. – Potentially very much greater than O(n2) Amount of memory required exceeds for clustering and association rule algorithms can exceed 4GB of traditional sequential servers. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Conclusion Data mining techniques are increasingly being used for discovering useful and previously unknown information from data HPC holds the promise of making data mining applicable for massive datasets HPC challenges – Parallelization of existing data mining algorithms – Development of novel parallel/distributed formulations – Efficient implementation of hash tables in parallel/distributed environment © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Bibliography Large-Scale Parallel Data Mining. Mohammed J. Zaki and Ching-Tien Ho (editors), Springer-Verlag, 2000. Advances in Distributed and Parallel Knowledge Discovery. Hillol Kargupta and Philip Chan (editors), AAAI Press/ MIT Press, 2000. Data Mining for Scientific and Engineering Applications. Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, and Raju Namburu, Kluwer Academic Publishers, October 2001. “Parallel Formulations of Decision-Tree Classification Algorithms,” Anurag Srivastava, Eui-Hong (Sam) Han, Vipin Kumar, and Vineet Singh, Data Mining and Knowledge Discovery: An International Journal, vol. 3, no. 3, pp 237-261, September 1999. “Data Mining for the Discovery of Ocean Climate Indices”, Michael Steinbach, PangNing Tan, Vipin Kumar, Chris Potter, Steven Klooster, Workshop on Mining Scientific Data, SDM 2002 © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#› Bibliography … “ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for Mining Large Datasets.” Mahesh V. Joshi, George Karypis and Vipin Kumar, Proc. of 1998 International Parallel Processing Symposium, April 1998. “Scalable Parallel Data Mining for Association Rules,” Eui-Hong (Sam) Han, George Karypis and Vipin Kumar, IEEE Transactions on Knowledge and Data Engineering, Vol. 12, No. 3, May/June 2000. “Finding Spatio-Temporal Patterns in Earth Science Data,” Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Steven Klooster, Christopher Potter, Alicia Torregrosa, KDD 2001 Workshop on Temporal Data Mining. “Fast algorithms for mining association rules,” R. Agrawal and R. Srikant, In Proc. of the 20th International Conference Very Large Data Bases, pages 487--499. Morgan Kaufmann, 1994. © Vipin Kumar Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002 ‹#›