Scalable Benchmarks and Kernels for Data
Mining and Analytics
Vipin Kumar
University of Minnesota
[email protected]
www.cs.umn.edu/~kumar
 Joint work with Alok Choudhary and Gokhan Memik (Northwestern)
and Michael Steinbach (University of Minnesota)
 Research funded by NSF
Need for High Performance Data Mining
 Today’s digital society has seen enormous data growth in both commercial and scientific databases
 Data Mining is becoming a commonly used tool to extract information from large and complex datasets
 Advances in computing capabilities and technological innovation needed to harvest the available wealth of data
[Images: Biomedical Data, Homeland Security, Internet, Geo-spatial data, Sensor Networks, Computational Simulations]
Data Mining for Climate Data
NASA ESE questions:
 How is the global Earth system changing?
 What are the primary forcings?
 How does the Earth system respond to natural & human-induced changes?
 What are the consequences of changes in the Earth system?
 How well can we predict future changes?
 Global snapshots of values for a number of variables on land surfaces or water
[Figure: latitude × longitude grid of cells, each carrying time series of variables such as NPP, pressure, precipitation, and SST]
NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters,
human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years….
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html
Detection of Ecosystem Disturbances:
This interactive module displays the locations on the Earth's surface where significant disturbance events have been detected.
[Screenshot: Disturbance Viewer]
High Resolution EOS Data:
• EOS satellites provide high resolution measurements
• Finer spatial grids
• A 1 km × 1 km grid produces 694,315,008 data points
• Going from 0.5° × 0.5° data to 1 km × 1 km data results in a 2500-fold increase in the data size
• More frequent measurements
• Multiple instruments
• High resolution data allows us to answer more detailed questions:
• Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
• Finding relationships between leaf area index (LAI) and the topography of a river drainage basin
• Finding relationships between fire frequency and elevation as well as topographic position
• Leads to substantially higher computational and memory requirements
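A rough back-of-the-envelope check on the 2500-fold figure (our own estimate, assuming 0.5° of latitude spans roughly 50 km): each 0.5° × 0.5° cell is replaced by about 50 × 50 = 2500 cells of size 1 km × 1 km, hence roughly a 2500-fold growth in data volume.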
Data Mining for Cyber Security
• Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks
• Traditional Intrusion Detection Systems (IDS) have well-known limitations
– Too many false alarms
– Unable to detect sophisticated and novel attacks
– Unable to detect insider abuse / policy abuse
• Data Mining is well suited to address these challenges
MINDS – Minnesota Intrusion Detection System
Large Scale Data Analysis is needed for
• Correlation of suspicious events across network sites
– Helps detect sophisticated attacks not identifiable by single site analyses
• Analysis of long term data (months/years)
– Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)
• Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
• Helps analyze data from multiple sensors at DoD sites around the country
• Routinely detects Insider Abuse / Policy Violations / Worms / Scans
Data Mining for Biomedical Informatics
 Recent technological advances are helping to generate large amounts of both medical and genomic data
• High-throughput experiments/techniques
- Gene and protein sequences
- Gene-expression data
- Biological networks and phylogenetic profiles
• Electronic Medical Records
- IBM-Mayo Clinic partnership has created a DB of 5 million patients
- NIH Roadmap
 Data mining offers a potential solution for the analysis of large-scale data
• Automated analysis of patient histories for customized treatment
• Design of drugs/chemicals
• Prediction of the functions of anonymous genes
[Figure: Protein Interaction Network]
Role of Benchmarks in Architecture Design
 Benchmarks guide the development of new processor
architectures in addition to measuring the relative
performance of different systems
• SPEC: General purpose architecture
(“Advances in the microprocessor industry would not have been
possible without the SPEC benchmarks” - David Patterson)
• TPC: Database Systems
• SPLASH: Parallel machine architectures
• Mediabench: Media and Communication Processors
• NetBench: Network/Embedded processors
Do We Need Benchmarks Specific to Data Mining?
 Performance metrics of several benchmarks gathered from VTune
• Cache miss ratios, bus usage, page faults, etc.
 Benchmark applications were grouped using Kohonen clustering to spot trends:
[Chart: cluster number assigned to each benchmark application, grouped by suite: SPEC INT (gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser), SPEC FP (apsi, art, equake, lucas, mesa, mgrid, swim, wupwise), MediaBench (rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast), TPC-H (Q17, Q3, Q4, Q6), and MineBench (apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe)]
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
Recently funded NSF project: Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW), V. Kumar and M. Steinbach (UM)
Goal: Establish a comprehensive benchmarking suite for data mining applications.
 Motivate the development of new processor architectures and system designs for data mining
 Motivate the implementation of more sophisticated data mining algorithms that can work with the constraints imposed by current architecture designs
 Improve the productivity of scientists and engineers using data mining applications in a wide variety of domains
[Diagram: dimensions of the suite include Scalability (data-level, processor), Performance (execution time, cache behavior, …), Profiling, Types of storage (memory, disks, …), Types of data (streaming, file I/O), and Types of applications (scientific, bioinformatics, security, …)]
Data Mining Tasks …
Sample data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1  | Yes | Single   | 125K | No
2  | No  | Married  | 100K | No
3  | No  | Single   | 70K  | No
4  | Yes | Married  | 120K | No
5  | No  | Divorced | 95K  | Yes
6  | No  | Married  | 60K  | No
7  | Yes | Divorced | 220K | No
8  | No  | Single   | 85K  | Yes
9  | No  | Married  | 75K  | No
10 | No  | Single   | 90K  | Yes
11 | No  | Married  | 60K  | No
12 | Yes | Divorced | 220K | No
13 | No  | Single   | 85K  | Yes
14 | No  | Married  | 75K  | No
15 | No  | Single   | 90K  | Yes
Key Data Mining Algorithms
 Clustering
• K-means, EM, SOM
• Single link / Group Average hierarchical clustering
• DBSCAN, SNN
 Classification
• Bayes
• SVM
• Decision trees, rule-based systems
 Association Rule Mining
• Apriori, FP-Growth
 Anomaly Detection
• Statistical methods
• Distance-based
• Clustering-based
 Preprocessing
• SVD, PCA
Major Data Mining Kernels
 Counting
• Given a set of data records, count the number of records falling into different categories to build a contingency table
• Count the occurrences of a set of items in a set of transactions
 Pairwise computations
• Given a set of data records, perform pairwise distance/similarity computations
 Linear Algebra operations
• SVD, PCA
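As an illustration of the pairwise-computation kernel, a minimal sketch (ours, not MineBench code) of a dense Euclidean distance-matrix computation; the function name and flat row-major array layout are assumptions:

#include <cmath>
#include <cstddef>
#include <vector>

// Pairwise-computation kernel: full Euclidean distance matrix for n records
// with d attributes each.  data is row-major: data[i*d + j] is attribute j
// of record i.
std::vector<double> pairwise_distances(const std::vector<double>& data,
                                        std::size_t n, std::size_t d) {
    std::vector<double> dist(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < d; ++k) {
                double diff = data[i * d + k] - data[j * d + k];
                sum += diff * diff;
            }
            dist[i * n + j] = dist[j * n + i] = std::sqrt(sum);
        }
    }
    // O(n^2 d) work and O(n^2) memory, which is why spatial locality and
    // memory capacity matter so much for this kernel.
    return dist;
}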
General Characteristics of Data Mining Algorithms
 Dense/Sparse data
 Hash table / Hash tree
 Linked Lists
 Iterative nature
 Data often too large to fit in main memory
• Spatial locality is critical
Constructing a Decision Tree
Training data:
Tid | Employed | Level of Education | # years at present address | Credit Worthy
1  | Yes | Graduate    | 5  | Yes
2  | Yes | High School | 2  | No
3  | No  | Undergrad   | 1  | No
4  | Yes | High School | 10 | Yes
5  | Yes | Graduate    | 2  | No
6  | No  | High School | 2  | No
7  | Yes | Undergrad   | 3  | No
8  | Yes | Graduate    | 8  | Yes
9  | Yes | High School | 4  | Yes
10 | No  | Graduate    | 1  | No

Candidate splits:
Employed?   Yes: Worthy 4, Not Worthy 3   No: Worthy 0, Not Worthy 3
Education?  Graduate: Worthy 2, Not Worthy 2   High School/Undergrad: Worthy 2, Not Worthy 4

Key Computation: the class-count (contingency) table for each candidate split, e.g.
               | Worthy | Not Worthy
Employed = Yes |   4    |   3
Employed = No  |   0    |   3
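A minimal sketch of this key computation (our own illustration, not ScalParC code): build the class-count table for one categorical attribute and score the split with the Gini index:

#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One training record: a categorical attribute value and a binary class label.
struct Record { std::string attr; bool worthy; };

// Build the contingency table (attribute value -> {worthy, not worthy}) and
// return the weighted Gini index of the induced split.
double gini_of_split(const std::vector<Record>& data) {
    std::map<std::string, std::pair<int, int>> counts;
    for (const auto& r : data)
        (r.worthy ? counts[r.attr].first : counts[r.attr].second)++;

    double gini = 0.0;
    for (const auto& [value, c] : counts) {
        double n = c.first + c.second;
        double p1 = c.first / n, p2 = c.second / n;
        gini += (n / data.size()) * (1.0 - p1 * p1 - p2 * p2);
        std::printf("%s -> Worthy: %d, Not Worthy: %d\n",
                    value.c_str(), c.first, c.second);
    }
    return gini;  // smaller is better; compare across candidate attributes
}

int main() {
    // The "Employed" column and class labels from the 10-record table above.
    std::vector<Record> employed = {
        {"Yes", true}, {"Yes", false}, {"No", false}, {"Yes", true},
        {"Yes", false}, {"No", false}, {"Yes", false}, {"Yes", true},
        {"Yes", true}, {"No", false}};
    std::printf("Gini(Employed) = %.3f\n", gini_of_split(employed));
}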
Constructing a Decision Tree
Splitting the training records (table above) on Employed:

Employed = Yes:
Tid | Employed | Level of Education | # years at present address | Credit Worthy
1 | Yes | Graduate    | 5  | Yes
2 | Yes | High School | 2  | No
4 | Yes | High School | 10 | Yes
5 | Yes | Graduate    | 2  | No
7 | Yes | Undergrad   | 3  | No
8 | Yes | Graduate    | 8  | Yes
9 | Yes | High School | 4  | Yes

Employed = No:
Tid | Employed | Level of Education | # years at present address | Credit Worthy
3  | No | Undergrad   | 1 | No
6  | No | High School | 2 | No
10 | No | Graduate    | 1 | No
Constructing a Decision Tree in Parallel
Partitioning of data only
[Illustration: n records with m categorical attributes are distributed across the processors; each processor holds a local (Yes/No) × (Worthy/Not Worthy) count table for the node being split]
– global reduction per node is required
– large number of classification tree nodes gives high communication cost
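A minimal sketch of the communication step in the data-partitioned approach, assuming MPI (our illustration, not the original implementation): each processor counts classes over its local records, and the per-node contingency tables are combined with one global reduction:

#include <mpi.h>
#include <vector>

// Data-partitioned decision tree construction, communication step (sketch).
// Each processor scans only its local records and fills a flat array of
// class counts: one (attribute value x class) contingency table per
// candidate split of the current tree node.
void combine_counts(std::vector<long long>& local_counts) {
    // Sum the tables element-wise across all processors; every processor
    // ends up with the global contingency tables and can evaluate the split
    // criterion (e.g. Gini) redundantly, with no further communication.
    MPI_Allreduce(MPI_IN_PLACE, local_counts.data(),
                  static_cast<int>(local_counts.size()),
                  MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
}
// One such reduction is needed per tree node, which is why a tree with many
// nodes drives up the communication cost of this scheme.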
Constructing a Decision Tree in Parallel
Partitioning of classification tree nodes
[Illustration: 10,000 training records split into 7,000 and 3,000 at the root; the 7,000 split into 2,000 and 5,000, and the 3,000 into 2,000 and 1,000]
– natural concurrency
– load imbalance
– the amount of work associated with each node varies
– limited concurrency on the upper portion of the tree
– child nodes use the same data as used by parent node
– loss of locality
– high data movement cost
Speedup Comparison of the Three Parallel Algorithms
 Data set used in SLIQ paper (Ref: Mehta, Agrawal and Rissanen, 1996)
 IBM SP2 with 128 processors
[Charts: speedup of the hybrid, data-partitioning, and tree-partitioning algorithms for 0.8 million and 1.6 million training examples]
 Dynamic load balancing inspired by parallel sparse Cholesky factorization and parallel tree search
Speedup of the Hybrid Algorithm with Different Size Data Sets
Hash Table Access
• Some efficient decision tree algorithms require
random access to large data structures.
• Example: SPRINT (Ref: Shafer, Agrawal, Mehta, 1996)
[Figure: SPRINT attribute lists for Income and Age, each sorted by attribute value and tagged with record IDs, distributed across processors P0, P1, and P2, together with a hash table mapping every record ID to the Left/Right child it moves to after a split of the training table shown earlier]
Storing the entire hash table on one processor makes the algorithm unscalable
ScalParC (Ref: Joshi, Karypis, Kumar, 1998)
 ScalParC is a scalable parallel decision tree construction algorithm
• Scales to a large number of processors
• Scales to large training sets
 ScalParC is memory efficient
• The hash table is distributed among the processors
 ScalParC performs the minimum amount of communication
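A minimal sketch of the distributed hash table idea (our own illustration under assumed conventions, not the ScalParC source): each processor owns a contiguous block of record IDs, and split decisions destined for other owners are batched for a single collective exchange:

#include <cstdint>
#include <utility>
#include <vector>

// Sketch: the record-ID -> child (Left/Right) hash table is block-distributed,
// so no single processor has to hold all of it.
struct DistributedSplitTable {
    int nprocs;
    std::int64_t nrecords;

    // Owner of a record ID under a simple block distribution.
    int owner(std::int64_t rid) const {
        std::int64_t block = (nrecords + nprocs - 1) / nprocs;
        return static_cast<int>(rid / block);
    }

    // Batch the (record ID, goes-left?) decisions made locally during a split
    // into one buffer per destination processor.  The buffers would then be
    // exchanged with a single collective (e.g. MPI_Alltoallv), mirroring the
    // communication structure of parallel sparse matrix-vector products.
    std::vector<std::vector<std::pair<std::int64_t, bool>>>
    batch_updates(const std::vector<std::pair<std::int64_t, bool>>& local) const {
        std::vector<std::vector<std::pair<std::int64_t, bool>>> out(nprocs);
        for (const auto& upd : local) out[owner(upd.first)].push_back(upd);
        return out;
    }
};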
This ScalParC Design is Inspired by…
 Communication structure of parallel sparse matrix-vector algorithms
[Figure: hash table entries partitioned across processors P0, P1, and P2]
Parallel Runtime (Ref: Joshi, Karypis, Kumar, 1998)
[Chart: runtime (seconds) vs. number of processors for training sets of 0.2M, 0.4M, 0.8M, 1.6M, 3.2M, and 6.4M records on a 128-processor Cray T3D]
Computing Association Patterns
1. Market-basket transactions
TID | Items
1 | Bread, Diaper, Milk
2 | Beer, Diaper, Bread, Eggs
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Bread, Diaper, Milk

2. Find item combinations (itemsets) that occur frequently in data
Item Combination | Count
Bread | 4
Coke | 2
Milk | 4
… | …
Bread & Coke | 1
Bread & Milk | 3
… | …
Bread & Milk & Diaper | 3
… | …

3. Generate association rules
{Diaper, Milk} → {Beer}
{Bread} → {Diaper}
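As a worked example from the tables above (our own illustration): {Beer, Diaper, Milk} occurs in transactions 3 and 4, and {Diaper, Milk} occurs in transactions 1, 3, 4, and 5, so
confidence({Diaper, Milk} → {Beer}) = σ({Beer, Diaper, Milk}) / σ({Diaper, Milk}) = 2 / 4 = 0.5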
Counting Candidates
 Frequent Itemsets are found by counting candidates
 Simple way:
• Search for each of the M candidate itemsets in each of the N transactions
[Figure: N transactions matched against M candidate itemsets, producing a count for each candidate]
Naïve approach requires O(NM) comparisons
Reduce the number of comparisons by using hash tables to store the candidate itemsets
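A minimal sketch of support counting (our own illustration, using a flat hash map rather than the hash tree mentioned above): a candidate's count is incremented whenever it is a subset of a transaction:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

using Itemset = std::vector<std::string>;  // items kept sorted

// Count how many transactions contain each candidate itemset.  Keying the
// counts by a hashed representation of the candidate avoids rescanning a
// candidate list; a hash tree would further prune which candidates are even
// checked against each transaction.
std::unordered_map<std::string, int>
count_support(const std::vector<Itemset>& transactions,
              const std::vector<Itemset>& candidates) {
    std::unordered_map<std::string, int> support;
    for (const auto& t : transactions) {
        for (const auto& c : candidates) {
            // Subset test on sorted itemsets.
            if (std::includes(t.begin(), t.end(), c.begin(), c.end())) {
                std::string key;
                for (const auto& item : c) key += item + ",";
                ++support[key];
            }
        }
    }
    return support;  // still O(N * M) subset tests in this naive form
}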
Parallel Association Rules: Scaleup Results (100K, 0.25%)
(Ref: Han, Karypis, and Kumar, 2000)
[Charts: scaleup of DD (Agrawal & Shafer, 1996), of IDD (Han, Karypis, Kumar, 2000), which adds an efficient implementation of collective communication, and of HD (Han, Karypis, Kumar, 2000), which adds dynamic restructuring of the computation]
Candidates for MineBench
Category | Algorithm | Description
Preprocessing | PCA | Principal component analysis
Preprocessing | ABB | Automatic Branch and Bound
Preprocessing | LVF | A probabilistic feature selection algorithm
Preprocessing | Normalization | Variable transformation
Predictive Modeling | ScalParC | Decision tree classifier
Predictive Modeling | Naïve Bayesian | Statistical classifier based on class conditional independence
Predictive Modeling | RIPPER | Rule-based predictive modeling
Predictive Modeling | SVMlight | Support Vector Machines
Clustering | K-means | Partitioning method
Clustering | Bisecting K-means | Partitioning method
Clustering | Fuzzy K-means | Fuzzy logic based K-means
Clustering | EM Clustering | Partitioning method
Clustering | MAFIA(N) | Multidimensional clustering
Clustering | BIRCH | Hierarchical method
Clustering | AHC | Agglomerative Hierarchical Clustering
Clustering | DBSCAN | Density-based method
Clustering | HOP | Density-based method
Anomaly Detection | LOF | Local Outlier Factor
Anomaly Detection | Outlier Detection | Distance-based outlier detection
ARM | Apriori | Horizontal database, level-wise mining based on Apriori property
ARM | MAFIA(C) | Maximal frequent itemset mining
ARM | Eclat | Vertical database, break large search space into equivalence classes
ARM | FP-growth | Encodes database into a compact FP-tree
(For each algorithm the table also records the implementation language (C, C++, or FORTRAN) and whether a parallel version is available.)
Analysis of Benchmark Algorithms
 Explore the bottlenecks associated with
the current general purpose sequential and
parallel machines
 Explore how different architectural features
impact the performance of data mining
algorithms
Preliminary Evaluation of Some Sample Data Sets
 Example small (S), medium (M), and large (L) data sets

Classification:
Dataset | Parameter | DB Size (MB)
Small  | F26-A32-D125K | 27
Medium | F26-A32-D250K | 54
Large  | F26-A64-D250K | 108

Association Rule Mining (ARM):
Dataset | Parameter | DB Size (MB)
Small  | T10-I4-D1000K | 47
Medium | T20-I6-D2000K | 175
Large  | T20-I6-D4000K | 350

 Execution time for some algorithms in the MineBench suite (P1/P4/P8 = 1, 4, and 8 processors):

Program | S: P1 | S: P4 | S: P8 | M: P1 | M: P4 | M: P8 | L: P1 | L: P4 | L: P8
HOP     | 6.3 | 1.8 | 1.2 | 52.7 | 27.4 | 18.7 | 435.3 | 128.0 | 81.5
K-means | 5.7 | 2.0 | 1.3 | 12.9 | 3.3  | 2.6  | -     | -     | -
(The suite also reports execution times for Fuzzy K-means, BIRCH, ScalParC, Bayesian, Apriori, and Eclat; see the reference below.)
Reference: [Liu Y., Pisharath J., Liao W., Memik G., Choudhary A., Dubey P., 2004]
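A quick read of the execution-time table above (our own arithmetic): on the large data set, HOP drops from 435.3 on one processor to 81.5 on eight processors, a speedup of 435.3 / 81.5 ≈ 5.3.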
Designing Efficient Kernels for Data Mining
 Understanding the bottlenecks in executing DM algorithms on current architectures will help design new, more efficient algorithms
 Focus will be on designing frequently used kernels that dominate the execution time of most DM algorithms
 Both sequential and parallel versions will be developed

Frequency of Kernel Operations in Representative Applications:
Application | Kernel 1 (%) | Kernel 2 (%) | Kernel 3 (%) | Sum (%)
kMeans | distance (68%) | center (21%) | minDist (10%) | 99
Fuzzy kMeans | center (58%) | distance (39%) | fuzzySum (1%) | 98
BIRCH | distance (54%) | variance (22%) | redist. (10%) | 86
HOP | density (39%) | search (30%) | gather (23%) | 92
Naïve Bayesian | probCal (49%) | variance (38%) | dataRead (10%) | 97
ScalParC | classify (37%) | giniCalc (36%) | compare (24%) | 97
Apriori | subset (58%) | dataRead (14%) | increment (8%) | 80
Eclat | intersect (39%) | addClass (23%) | invertC (10%) | 72
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
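To make the kMeans row concrete, here is a minimal sketch of one k-means iteration (our own illustration, not MineBench code), with the three dominant kernels from the table marked in the comments:

#include <cstddef>
#include <limits>
#include <vector>

// One k-means iteration over n points of dimension d with k centroids.
// points and centroids are row-major flat arrays.
void kmeans_iteration(const std::vector<double>& points, std::size_t n,
                      std::size_t d, std::vector<double>& centroids,
                      std::size_t k, std::vector<std::size_t>& assign) {
    std::vector<double> new_c(k * d, 0.0);
    std::vector<std::size_t> count(k, 0);

    for (std::size_t i = 0; i < n; ++i) {
        double best = std::numeric_limits<double>::max();
        std::size_t best_c = 0;
        for (std::size_t c = 0; c < k; ++c) {
            // "distance" kernel: squared Euclidean distance to centroid c.
            double dist = 0.0;
            for (std::size_t j = 0; j < d; ++j) {
                double diff = points[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            // "minDist" kernel: keep the nearest centroid.
            if (dist < best) { best = dist; best_c = c; }
        }
        assign[i] = best_c;
        // "center" kernel (accumulate): sum points per cluster.
        for (std::size_t j = 0; j < d; ++j)
            new_c[best_c * d + j] += points[i * d + j];
        ++count[best_c];
    }
    // "center" kernel (finalize): new centroid = mean of assigned points.
    for (std::size_t c = 0; c < k; ++c)
        if (count[c] > 0)
            for (std::size_t j = 0; j < d; ++j)
                centroids[c * d + j] = new_c[c * d + j] / count[c];
}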
Conclusions
 Data mining applications are becoming
increasingly important
 Current systems design approach not adequate
for DM applications
 MineBench – a new benchmark suite which
encompasses many algorithms found in data
mining
 Initial findings:
• Data mining applications are unique in terms of
performance characteristics
• There exists much room for optimization with regards
to data mining workloads
Bibliography
 Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison-Wesley, April 2005
 Introduction to Parallel Computing (Second Edition), Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
 Data Mining for Scientific and Engineering Applications, edited by R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu, Kluwer Academic Publishers, 2001
 J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, "Emerging Scientific Applications in Data Mining", Communications of the ACM, Volume 45, Number 8, pp. 54-58, August 2002
 C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, V. Genovese, "Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets", Global Change Biology 9(7), 1005-1021, 2003
 Vipin Kumar, "Parallel and Distributed Computing for Cyber Security", an article based on the author's keynote talk at the 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004), DS Online Journal, Volume 6, Number 10, October 2005
 Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, and Pradeep Dubey, "Performance Evaluation and Characterization of Scalable Data Mining Algorithms", in Proceedings of the 16th International Conference on Parallel and Distributed Computing and Systems (PDCS), November 2004
 Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary, "Performance Characterization of Data Mining Applications using MineBench", in Proceedings of the 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), February 2006
 Jayaprakash Pisharath, Joseph Zambreno, Berkin Ozisikyilmaz, and Alok Choudhary, "Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design", in Proceedings of the 9th International Workshop on High Performance and Distributed Mining (HPDM), April 2006