Scalable Benchmarks and Kernels for Data
Mining and Analytics
Vipin Kumar
University of Minnesota
[email protected]
www.cs.umn.edu/~kumar
 Joint work with Alok Choudhary and Gokhan Memik (Northwestern)
and Michael Steinbach (University of Minnesota)
 Research funded by NSF
Need for High Performance Data Mining
 Today’s digital society has seen enormous data growth in both commercial and scientific databases
 Data Mining is becoming a commonly used tool to extract information from large and complex datasets
 Advances in computing capabilities and technological innovation needed to harvest the available wealth of data
[Images: Biomedical Data, Homeland Security, Internet, Geo-spatial data, Sensor Networks, Computational Simulations]
Data Mining for Climate Data
NASA ESE questions:
 How is the global Earth system changing?
 What are the primary forcings?
 How does the Earth system respond to natural & human-induced changes?
 What are the consequences of changes in the Earth system?
 How well can we predict future changes?
 Global snapshots of values for a number of variables on land surfaces or water
[Figure: latitude × longitude grid of cells, each carrying time series of variables such as NPP, pressure, precipitation, and SST]
NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters,
human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years….
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html
Detection of Ecosystem Disturbances:
This interactive module displays the locations on the Earth's surface where significant disturbance events have been detected.
[Screenshot: Disturbance Viewer]
High Resolution EOS Data:
• EOS satellites provide high resolution measurements
• Finer spatial grids
• A 1 km × 1 km grid produces 694,315,008 data points
• Going from 0.5° × 0.5° data to 1 km × 1 km data results in a 2500-fold increase in the data size
• More frequent measurements
• Multiple instruments
• High resolution data allows us to answer more detailed questions:
• Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
• Finding relationships between leaf area index (LAI) and the topography of a river drainage basin
• Finding relationships between fire frequency and elevation as well as topographic position
• Leads to substantially higher computational and memory requirements
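A rough back-of-the-envelope check on the 2500-fold figure (our own estimate, assuming 0.5° of latitude spans roughly 50 km): each 0.5° × 0.5° cell is replaced by about 50 × 50 = 2500 cells of size 1 km × 1 km, hence roughly a 2500-fold growth in data volume.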
Data Mining for Cyber Security
• Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks
• Traditional Intrusion Detection Systems (IDS) have well-known limitations
– Too many false alarms
– Unable to detect sophisticated and novel attacks
– Unable to detect insider abuse / policy abuse
• Data Mining is well suited to address these challenges
MINDS – Minnesota Intrusion Detection System
Large Scale Data Analysis is needed for
• Correlation of suspicious events across network sites
– Helps detect sophisticated attacks not identifiable by single site analyses
• Analysis of long term data (months/years)
– Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)
• Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
• Helps analyze data from multiple sensors at DoD sites around the country
• Routinely detects Insider Abuse / Policy Violations / Worms / Scans
Data Mining for Biomedical Informatics
 Recent technological advances are helping to generate large amounts of both medical and genomic data
• High-throughput experiments/techniques
- Gene and protein sequences
- Gene-expression data
- Biological networks and phylogenetic profiles
• Electronic Medical Records
- IBM-Mayo Clinic partnership has created a DB of 5 million patients
- NIH Roadmap
 Data mining offers a potential solution for the analysis of large-scale data
• Automated analysis of patient histories for customized treatment
• Design of drugs/chemicals
• Prediction of the functions of anonymous genes
[Figure: Protein Interaction Network]
Role of Benchmarks in Architecture Design
 Benchmarks guide the development of new processor
architectures in addition to measuring the relative
performance of different systems
• SPEC: General purpose architecture
(“Advances in the microprocessor industry would not have been
possible without the SPEC benchmarks” - David Patterson)
• TPC: Database Systems
• SPLASH: Parallel machine architectures
• Mediabench: Media and Communication Processors
• NetBench: Network/Embedded processors
Do We Need Benchmarks Specific to Data Mining?
 Performance metrics of several benchmarks gathered from VTune
• Cache miss ratios, bus usage, page faults, etc.
 Benchmark applications were grouped using Kohonen clustering to spot trends:
[Chart: cluster number assigned to each benchmark application, grouped by suite: SPEC INT (gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser), SPEC FP (apsi, art, equake, lucas, mesa, mgrid, swim, wupwise), MediaBench (rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast), TPC-H (Q17, Q3, Q4, Q6), and MineBench (apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe)]
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
Recently funded NSF project: Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW), V. Kumar and M. Steinbach (UM)
Goal: Establish a comprehensive benchmarking suite for data mining applications.
 Motivate the development of new processor architectures and system designs for data mining
 Motivate the implementation of more sophisticated data mining algorithms that can work with the constraints imposed by current architecture designs
 Improve the productivity of scientists and engineers using data mining applications in a wide variety of domains
[Diagram: dimensions of the suite include Scalability (data-level, processor), Performance (execution time, cache behavior, …), Profiling, Types of storage (memory, disks, …), Types of data (streaming, file I/O), and Types of applications (scientific, bioinformatics, security, …)]
Data Mining Tasks …
Sample data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1  | Yes | Single   | 125K | No
2  | No  | Married  | 100K | No
3  | No  | Single   | 70K  | No
4  | Yes | Married  | 120K | No
5  | No  | Divorced | 95K  | Yes
6  | No  | Married  | 60K  | No
7  | Yes | Divorced | 220K | No
8  | No  | Single   | 85K  | Yes
9  | No  | Married  | 75K  | No
10 | No  | Single   | 90K  | Yes
11 | No  | Married  | 60K  | No
12 | Yes | Divorced | 220K | No
13 | No  | Single   | 85K  | Yes
14 | No  | Married  | 75K  | No
15 | No  | Single   | 90K  | Yes
Key Data Mining Algorithms
 Clustering
• K-means, EM, SOM
• Single link / Group Average hierarchical clustering
• DBSCAN, SNN
 Classification
• Bayes
• SVM
• Decision trees, rule-based systems
 Association Rule Mining
• Apriori, FP-Growth
 Anomaly Detection
• Statistical methods
• Distance-based
• Clustering-based
 Preprocessing
• SVD, PCA
Major Data Mining Kernels
 Counting
• Given a set of data records, count the number of records falling into different categories to build a contingency table
• Count the occurrences of a set of items in a set of transactions
 Pairwise computations
• Given a set of data records, perform pairwise distance/similarity computations
 Linear Algebra operations
• SVD, PCA
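As an illustration of the pairwise-computation kernel, a minimal sketch (ours, not MineBench code) of a dense Euclidean distance-matrix computation; the function name and flat row-major array layout are assumptions:

#include <cmath>
#include <cstddef>
#include <vector>

// Pairwise-computation kernel: full Euclidean distance matrix for n records
// with d attributes each.  data is row-major: data[i*d + j] is attribute j
// of record i.
std::vector<double> pairwise_distances(const std::vector<double>& data,
                                        std::size_t n, std::size_t d) {
    std::vector<double> dist(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < d; ++k) {
                double diff = data[i * d + k] - data[j * d + k];
                sum += diff * diff;
            }
            dist[i * n + j] = dist[j * n + i] = std::sqrt(sum);
        }
    }
    // O(n^2 d) work and O(n^2) memory, which is why spatial locality and
    // memory capacity matter so much for this kernel.
    return dist;
}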
General Characteristics of Data Mining Algorithms
 Dense/Sparse data
 Hash table / Hash tree
 Linked Lists
 Iterative nature
 Data often too large to fit in main memory
• Spatial locality is critical
Constructing a Decision Tree
Training data:
Tid | Employed | Level of Education | # years at present address | Credit Worthy
1  | Yes | Graduate    | 5  | Yes
2  | Yes | High School | 2  | No
3  | No  | Undergrad   | 1  | No
4  | Yes | High School | 10 | Yes
5  | Yes | Graduate    | 2  | No
6  | No  | High School | 2  | No
7  | Yes | Undergrad   | 3  | No
8  | Yes | Graduate    | 8  | Yes
9  | Yes | High School | 4  | Yes
10 | No  | Graduate    | 1  | No

Candidate splits:
Employed?   Yes: Worthy 4, Not Worthy 3   No: Worthy 0, Not Worthy 3
Education?  Graduate: Worthy 2, Not Worthy 2   High School/Undergrad: Worthy 2, Not Worthy 4

Key Computation: the class-count (contingency) table for each candidate split, e.g.
               | Worthy | Not Worthy
Employed = Yes |   4    |   3
Employed = No  |   0    |   3
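A minimal sketch of this key computation (our own illustration, not ScalParC code): build the class-count table for one categorical attribute and score the split with the Gini index:

#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One training record: a categorical attribute value and a binary class label.
struct Record { std::string attr; bool worthy; };

// Build the contingency table (attribute value -> {worthy, not worthy}) and
// return the weighted Gini index of the induced split.
double gini_of_split(const std::vector<Record>& data) {
    std::map<std::string, std::pair<int, int>> counts;
    for (const auto& r : data)
        (r.worthy ? counts[r.attr].first : counts[r.attr].second)++;

    double gini = 0.0;
    for (const auto& [value, c] : counts) {
        double n = c.first + c.second;
        double p1 = c.first / n, p2 = c.second / n;
        gini += (n / data.size()) * (1.0 - p1 * p1 - p2 * p2);
        std::printf("%s -> Worthy: %d, Not Worthy: %d\n",
                    value.c_str(), c.first, c.second);
    }
    return gini;  // smaller is better; compare across candidate attributes
}

int main() {
    // The "Employed" column and class labels from the 10-record table above.
    std::vector<Record> employed = {
        {"Yes", true}, {"Yes", false}, {"No", false}, {"Yes", true},
        {"Yes", false}, {"No", false}, {"Yes", false}, {"Yes", true},
        {"Yes", true}, {"No", false}};
    std::printf("Gini(Employed) = %.3f\n", gini_of_split(employed));
}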
Constructing a Decision Tree
Splitting the training records (table above) on Employed:

Employed = Yes:
Tid | Employed | Level of Education | # years at present address | Credit Worthy
1 | Yes | Graduate    | 5  | Yes
2 | Yes | High School | 2  | No
4 | Yes | High School | 10 | Yes
5 | Yes | Graduate    | 2  | No
7 | Yes | Undergrad   | 3  | No
8 | Yes | Graduate    | 8  | Yes
9 | Yes | High School | 4  | Yes

Employed = No:
Tid | Employed | Level of Education | # years at present address | Credit Worthy
3  | No | Undergrad   | 1 | No
6  | No | High School | 2 | No
10 | No | Graduate    | 1 | No
Constructing a Decision Tree in Parallel
Partitioning of data only
[Illustration: n records with m categorical attributes are distributed across the processors; each processor holds a local (Yes/No) × (Worthy/Not Worthy) count table for the node being split]
– global reduction per node is required
– large number of classification tree nodes gives high communication cost
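A minimal sketch of the communication step in the data-partitioned approach, assuming MPI (our illustration, not the original implementation): each processor counts classes over its local records, and the per-node contingency tables are combined with one global reduction:

#include <mpi.h>
#include <vector>

// Data-partitioned decision tree construction, communication step (sketch).
// Each processor scans only its local records and fills a flat array of
// class counts: one (attribute value x class) contingency table per
// candidate split of the current tree node.
void combine_counts(std::vector<long long>& local_counts) {
    // Sum the tables element-wise across all processors; every processor
    // ends up with the global contingency tables and can evaluate the split
    // criterion (e.g. Gini) redundantly, with no further communication.
    MPI_Allreduce(MPI_IN_PLACE, local_counts.data(),
                  static_cast<int>(local_counts.size()),
                  MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
}
// One such reduction is needed per tree node, which is why a tree with many
// nodes drives up the communication cost of this scheme.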
Constructing a Decision Tree in Parallel
Partitioning of classification tree nodes
[Illustration: 10,000 training records split into 7,000 and 3,000 at the root; the 7,000 split into 2,000 and 5,000, and the 3,000 into 2,000 and 1,000]
– natural concurrency
– load imbalance
– the amount of work associated with each node varies
– limited concurrency on the upper portion of the tree
– child nodes use the same data as used by parent node
– loss of locality
– high data movement cost
Speedup Comparison of the Three Parallel Algorithms
 Data set used in SLIQ paper (Ref: Mehta, Agrawal and Rissanen, 1996)
 IBM SP2 with 128 processors
[Charts: speedup of the hybrid, data-partitioning, and tree-partitioning algorithms for 0.8 million and 1.6 million training examples]
 Dynamic load balancing inspired by parallel sparse Cholesky factorization and parallel tree search
Speedup of the Hybrid Algorithm with Different Size Data Sets
Hash Table Access
• Some efficient decision tree algorithms require
random access to large data structures.
• Example: SPRINT (Ref: Shafer, Agrawal, Mehta, 1996)
[Figure: SPRINT attribute lists for Income and Age, each sorted by attribute value and tagged with record IDs, distributed across processors P0, P1, and P2, together with a hash table mapping every record ID to the Left/Right child it moves to after a split of the training table shown earlier]
Storing the entire hash table on one processor makes the algorithm unscalable
ScalParC (Ref: Joshi, Karypis, Kumar, 1998)
 ScalParC is a scalable parallel decision tree construction algorithm
• Scales to a large number of processors
• Scales to large training sets
 ScalParC is memory efficient
• The hash table is distributed among the processors
 ScalParC performs the minimum amount of communication
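A minimal sketch of the distributed hash table idea (our own illustration under assumed conventions, not the ScalParC source): each processor owns a contiguous block of record IDs, and split decisions destined for other owners are batched for a single collective exchange:

#include <cstdint>
#include <utility>
#include <vector>

// Sketch: the record-ID -> child (Left/Right) hash table is block-distributed,
// so no single processor has to hold all of it.
struct DistributedSplitTable {
    int nprocs;
    std::int64_t nrecords;

    // Owner of a record ID under a simple block distribution.
    int owner(std::int64_t rid) const {
        std::int64_t block = (nrecords + nprocs - 1) / nprocs;
        return static_cast<int>(rid / block);
    }

    // Batch the (record ID, goes-left?) decisions made locally during a split
    // into one buffer per destination processor.  The buffers would then be
    // exchanged with a single collective (e.g. MPI_Alltoallv), mirroring the
    // communication structure of parallel sparse matrix-vector products.
    std::vector<std::vector<std::pair<std::int64_t, bool>>>
    batch_updates(const std::vector<std::pair<std::int64_t, bool>>& local) const {
        std::vector<std::vector<std::pair<std::int64_t, bool>>> out(nprocs);
        for (const auto& upd : local) out[owner(upd.first)].push_back(upd);
        return out;
    }
};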
This ScalParC Design is Inspired by…
 Communication structure of parallel sparse matrix-vector algorithms
[Figure: hash table entries partitioned across processors P0, P1, and P2]
Parallel Runtime (Ref: Joshi, Karypis, Kumar, 1998)
[Chart: runtime (seconds) vs. number of processors for training sets of 0.2M, 0.4M, 0.8M, 1.6M, 3.2M, and 6.4M records on a 128-processor Cray T3D]
Computing Association Patterns
1. Market-basket transactions
TID | Items
1 | Bread, Diaper, Milk
2 | Beer, Diaper, Bread, Eggs
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Bread, Diaper, Milk

2. Find item combinations (itemsets) that occur frequently in data
Item Combination | Count
Bread | 4
Coke | 2
Milk | 4
… | …
Bread & Coke | 1
Bread & Milk | 3
… | …
Bread & Milk & Diaper | 3
… | …

3. Generate association rules
{Diaper, Milk} → {Beer}
{Bread} → {Diaper}
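As a worked example from the tables above (our own illustration): {Beer, Diaper, Milk} occurs in transactions 3 and 4, and {Diaper, Milk} occurs in transactions 1, 3, 4, and 5, so
confidence({Diaper, Milk} → {Beer}) = σ({Beer, Diaper, Milk}) / σ({Diaper, Milk}) = 2 / 4 = 0.5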
Counting Candidates
 Frequent Itemsets are found by counting candidates
 Simple way:
• Search for each of the M candidate itemsets in each of the N transactions
[Figure: N transactions matched against M candidate itemsets, producing a count for each candidate]
Naïve approach requires O(NM) comparisons
Reduce the number of comparisons by using hash tables to store the candidate itemsets
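A minimal sketch of support counting (our own illustration, using a flat hash map rather than the hash tree mentioned above): a candidate's count is incremented whenever it is a subset of a transaction:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

using Itemset = std::vector<std::string>;  // items kept sorted

// Count how many transactions contain each candidate itemset.  Keying the
// counts by a hashed representation of the candidate avoids rescanning a
// candidate list; a hash tree would further prune which candidates are even
// checked against each transaction.
std::unordered_map<std::string, int>
count_support(const std::vector<Itemset>& transactions,
              const std::vector<Itemset>& candidates) {
    std::unordered_map<std::string, int> support;
    for (const auto& t : transactions) {
        for (const auto& c : candidates) {
            // Subset test on sorted itemsets.
            if (std::includes(t.begin(), t.end(), c.begin(), c.end())) {
                std::string key;
                for (const auto& item : c) key += item + ",";
                ++support[key];
            }
        }
    }
    return support;  // still O(N * M) subset tests in this naive form
}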
Parallel Association Rules: Scaleup Results (100K, 0.25%)
(Ref: Han, Karypis, and Kumar, 2000)
[Charts: scaleup of DD (Agrawal & Shafer, 1996), of IDD (Han, Karypis, Kumar, 2000), which adds an efficient implementation of collective communication, and of HD (Han, Karypis, Kumar, 2000), which adds dynamic restructuring of the computation]
Candidates for MineBench
Category | Algorithm | Description
Preprocessing | PCA | Principal component analysis
Preprocessing | ABB | Automatic Branch and Bound
Preprocessing | LVF | A probabilistic feature selection algorithm
Preprocessing | Normalization | Variable transformation
Predictive Modeling | ScalParC | Decision tree classifier
Predictive Modeling | Naïve Bayesian | Statistical classifier based on class conditional independence
Predictive Modeling | RIPPER | Rule-based predictive modeling
Predictive Modeling | SVMlight | Support Vector Machines
Clustering | K-means | Partitioning method
Clustering | Bisecting K-means | Partitioning method
Clustering | Fuzzy K-means | Fuzzy logic based K-means
Clustering | EM Clustering | Partitioning method
Clustering | MAFIA(N) | Multidimensional clustering
Clustering | BIRCH | Hierarchical method
Clustering | AHC | Agglomerative Hierarchical Clustering
Clustering | DBSCAN | Density-based method
Clustering | HOP | Density-based method
Anomaly Detection | LOF | Local Outlier Factor
Anomaly Detection | Outlier Detection | Distance-based outlier detection
ARM | Apriori | Horizontal database, level-wise mining based on Apriori property
ARM | MAFIA(C) | Maximal frequent itemset mining
ARM | Eclat | Vertical database, break large search space into equivalence classes
ARM | FP-growth | Encodes database into a compact FP-tree
(For each algorithm the table also records the implementation language (C, C++, or FORTRAN) and whether a parallel version is available.)
Analysis of Benchmark Algorithms
 Explore the bottlenecks associated with
the current general purpose sequential and
parallel machines
 Explore how different architectural features
impact the performance of data mining
algorithms
Preliminary Evaluation of Some Sample Data Sets
 Example small (S), medium (M), and large (L) data sets

Classification:
Dataset | Parameter | DB Size (MB)
Small  | F26-A32-D125K | 27
Medium | F26-A32-D250K | 54
Large  | F26-A64-D250K | 108

Association Rule Mining (ARM):
Dataset | Parameter | DB Size (MB)
Small  | T10-I4-D1000K | 47
Medium | T20-I6-D2000K | 175
Large  | T20-I6-D4000K | 350

 Execution time for some algorithms in the MineBench suite (P1/P4/P8 = 1, 4, and 8 processors):

Program | S: P1 | S: P4 | S: P8 | M: P1 | M: P4 | M: P8 | L: P1 | L: P4 | L: P8
HOP     | 6.3 | 1.8 | 1.2 | 52.7 | 27.4 | 18.7 | 435.3 | 128.0 | 81.5
K-means | 5.7 | 2.0 | 1.3 | 12.9 | 3.3  | 2.6  | -     | -     | -
(The suite also reports execution times for Fuzzy K-means, BIRCH, ScalParC, Bayesian, Apriori, and Eclat; see the reference below.)
Reference: [Liu Y., Pisharath J., Liao W., Memik G., Choudhary A., Dubey P., 2004]
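A quick read of the execution-time table above (our own arithmetic): on the large data set, HOP drops from 435.3 on one processor to 81.5 on eight processors, a speedup of 435.3 / 81.5 ≈ 5.3.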
Designing Efficient Kernels for Data Mining
 Understanding the bottlenecks in executing DM algorithms on current architectures will help design new, more efficient algorithms
 Focus will be on designing frequently used kernels that dominate the execution time of most DM algorithms
 Both sequential and parallel versions will be developed

Frequency of Kernel Operations in Representative Applications:
Application | Kernel 1 (%) | Kernel 2 (%) | Kernel 3 (%) | Sum (%)
kMeans | distance (68%) | center (21%) | minDist (10%) | 99
Fuzzy kMeans | center (58%) | distance (39%) | fuzzySum (1%) | 98
BIRCH | distance (54%) | variance (22%) | redist. (10%) | 86
HOP | density (39%) | search (30%) | gather (23%) | 92
Naïve Bayesian | probCal (49%) | variance (38%) | dataRead (10%) | 97
ScalParC | classify (37%) | giniCalc (36%) | compare (24%) | 97
Apriori | subset (58%) | dataRead (14%) | increment (8%) | 80
Eclat | intersect (39%) | addClass (23%) | invertC (10%) | 72
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
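To make the kMeans row concrete, here is a minimal sketch of one k-means iteration (our own illustration, not MineBench code), with the three dominant kernels from the table marked in the comments:

#include <cstddef>
#include <limits>
#include <vector>

// One k-means iteration over n points of dimension d with k centroids.
// points and centroids are row-major flat arrays.
void kmeans_iteration(const std::vector<double>& points, std::size_t n,
                      std::size_t d, std::vector<double>& centroids,
                      std::size_t k, std::vector<std::size_t>& assign) {
    std::vector<double> new_c(k * d, 0.0);
    std::vector<std::size_t> count(k, 0);

    for (std::size_t i = 0; i < n; ++i) {
        double best = std::numeric_limits<double>::max();
        std::size_t best_c = 0;
        for (std::size_t c = 0; c < k; ++c) {
            // "distance" kernel: squared Euclidean distance to centroid c.
            double dist = 0.0;
            for (std::size_t j = 0; j < d; ++j) {
                double diff = points[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            // "minDist" kernel: keep the nearest centroid.
            if (dist < best) { best = dist; best_c = c; }
        }
        assign[i] = best_c;
        // "center" kernel (accumulate): sum points per cluster.
        for (std::size_t j = 0; j < d; ++j)
            new_c[best_c * d + j] += points[i * d + j];
        ++count[best_c];
    }
    // "center" kernel (finalize): new centroid = mean of assigned points.
    for (std::size_t c = 0; c < k; ++c)
        if (count[c] > 0)
            for (std::size_t j = 0; j < d; ++j)
                centroids[c * d + j] = new_c[c * d + j] / count[c];
}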
Conclusions
 Data mining applications are becoming
increasingly important
 Current systems design approach not adequate
for DM applications
 MineBench – a new benchmark suite which
encompasses many algorithms found in data
mining
 Initial findings:
• Data mining applications are unique in terms of
performance characteristics
• There exists much room for optimization with regards
to data mining workloads
Bibliography
 Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison-Wesley, April 2005
 Introduction to Parallel Computing (Second Edition), Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
 Data Mining for Scientific and Engineering Applications, edited by R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu, Kluwer Academic Publishers, 2001
 J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, "Emerging Scientific Applications in Data Mining", Communications of the ACM, Volume 45, Number 8, pp. 54-58, August 2002
 C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, V. Genovese, "Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets", Global Change Biology 9(7), 1005-1021, 2003
 Vipin Kumar, "Parallel and Distributed Computing for Cyber Security", an article based on the author's keynote talk at the 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004), DS Online Journal, Volume 6, Number 10, October 2005
 Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, and Pradeep Dubey, "Performance Evaluation and Characterization of Scalable Data Mining Algorithms", in Proceedings of the 16th International Conference on Parallel and Distributed Computing and Systems (PDCS), November 2004
 Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary, "Performance Characterization of Data Mining Applications using MineBench", in Proceedings of the 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), February 2006
 Jayaprakash Pisharath, Joseph Zambreno, Berkin Ozisikyilmaz, and Alok Choudhary, "Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design", in Proceedings of the 9th International Workshop on High Performance and Distributed Mining (HPDM), April 2006