Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
System Support for High Performance Scientific Data Mining Gagan Agrawal Ruoming Jin Raghu Machiraju S. Parthasarathy Department of Computer and Information Sciences Ohio State University Scientific Data Mining Problem Datasets used for scientific data mining are large – particularly from simulations Our understanding of what algorithms and parameters will give desired insights is limited Time required for implementing different algorithms and running them with different parameters on large datasets slows down the scientific data mining process Project Overview FREERIDE (Framework for Rapid Implementation of datamining engines) as the base system Already demonstrated for a variety of standard mining algorithms Working for feature analysis and mining of simulation data currently FREERIDE offers: The ability to rapidly prototype a highperformance mining implementation Distributed memory parallelization Shared memory parallelization Ability to process large and disk-resident datasets Only modest modifications to a sequential implementation for the above three Key Observation from Mining Algorithms While( ) { Popular algorithms have a common canonical loop Can be used as the basis for supporting a common middleware forall( data instances d) { I = process(d) R(I) = R(I) op d } ……. } Performance of Shared Memory Parallelization K-means clustering 1600 1400 1200 full repl 1000 800 opt full locks 600 400 cache sens. Locks 200 0 1 thread 4 threads 16 threads Performance on Cluster of SMPs Apriori Association Mining 70000 60000 50000 40000 1 thread 2 threads 3 threads 30000 20000 10000 0 1 node 2 nodes 4 nodes 8 nodes SPIES On (a) FREERIDE Developed a new communication efficient decision tree construction algorithm – Statistical Pruning of Intervals for Enhanced Scalability (SPIES) Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume Does not require sorting of data, or partitioning and writing-back of records 7000 6000 5000 1 thread 2 threads 3 threads 4000 3000 2000 1000 0 1 node 8 nodes Broader Research Agenda Applying FREERIDE for Scientific Data Mining Focusing on feature extraction, tracking, and mining approach developed by Machiraju et al. A feature is a region of interest in a dataset A suite of algorithms for extracting and tracking them A Feature Analysis Algorithm Data Transform Tour Grid Aggregate Denoise Operator Classify Points Rank Track Catalog ROIs Classify-Aggregate Ongoing Work – Parallelization Using FREERIDE Most of the steps involve generalized reductions supported well in FREERIDE Extensions to FREERIDE required for aggregation and tracking steps Overall, FREERIDE can allow rapid implementation of scalable versions of a variety of steps and algorithms that are part of the feature mining paradigm