Computer Science and Engineering
FREERIDE-G: Framework for Developing Grid-Based Data Mining Applications
L. Glimcher, R. Jin, G. Agrawal
Presented by: Leo Glimcher
ICPP’06
FREERIDE-G: Enabling Remote DataMining
Leonid Glimcher
P. 1
Distributed Data-Intensive Science
[Diagram labels: User, Compute Cluster, Data Repository Cluster, "?"]
Challenges for Application Development
• Analysis of large amounts of disk resident data
• Incorporating parallel processing into analysis
• Processing needs to be independent of other elements and easy to specify
• Coordination of storage, network and computing resources required
• Transparency of data retrieval, staging and caching is desired
FREERIDE-G Goals
• Support High-End Processing
– Enable efficient processing of large scale data mining computations
• Ease Use of Parallel Configurations
– Support shared and distributed memory parallelization starting from a common high-level interface
• Hide Details of Data Movement and Caching
– Data staging and caching (when feasible/appropriate) needs to be transparent to the application developer
Presentation Road Map
• Motivation and goals
• System architecture and overview
• Applications used for evaluation
• Experimental evaluation
• Related work in distributed data-intensive science
• Conclusions and future work
FREERIDE-G Architecture
[Architecture diagram: data servers in the data repository (data retrieval, data distribution, communication) connected to compute nodes in the user cluster (communication, data processing, caching/retrieval).]
Data Server Functionality
• Data retrieval:
– data chunks read from repository disks
• Data distribution:
– each chunk assigned a processing node destination in user cluster
• Data communication:
– each chunk forwarded to destination processing node
Data server runs on every on-line data repository node, automating data delivery to the end-user
Compute Node Functionality
• Data communication:
– data chunks received from corresponding data node
• Computation:
– application specific processing performed on each chunk
• Data caching & retrieval:
– for multi-pass algorithms, data is cached locally on the 1st pass and retrieved locally for subsequent passes
Compute server runs on every processing node to receive data and process it in an application specific way
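The caching behavior described on this slide can be sketched roughly as follows. This is a minimal illustration, not FREERIDE-G code: the `ChunkCache` class, the `fetch_remote` callable, and the file naming are all hypothetical names introduced here.

```python
import os
import tempfile


class ChunkCache:
    """Cache chunks in the local file system on the first pass (hypothetical helper)."""

    def __init__(self, fetch_remote, cache_dir):
        self.fetch_remote = fetch_remote  # assumed callable: chunk_id -> bytes
        self.cache_dir = cache_dir
        self.remote_fetches = 0           # counts transfers from the data node

    def _path(self, chunk_id):
        return os.path.join(self.cache_dir, f"chunk_{chunk_id}.bin")

    def get(self, chunk_id):
        path = self._path(chunk_id)
        if os.path.exists(path):
            # Subsequent passes: read the chunk from the local cache.
            with open(path, "rb") as f:
                return f.read()
        # First pass: receive the chunk remotely, then store it locally.
        data = self.fetch_remote(chunk_id)
        self.remote_fetches += 1
        with open(path, "wb") as f:
            f.write(data)
        return data


def run_passes(cache, chunk_ids, passes):
    """Touch every chunk once per pass; return the total bytes processed."""
    total = 0
    for _ in range(passes):
        for chunk_id in chunk_ids:
            total += len(cache.get(chunk_id))
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        cache = ChunkCache(lambda cid: bytes([cid]) * 4, cache_dir)
        run_passes(cache, [0, 1, 2], passes=3)
        print("remote fetches:", cache.remote_fetches)
```

In this sketch each chunk crosses the network once, however many passes the algorithm makes.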
Processing structure of FREERIDE-G
Built on FREERIDE
Key observation: most algorithms follow a canonical loop
Middleware API:
• Subset of data to be processed
• Reduction object
• Local and global reduction operations
• Iterator
Supports:
• Disk resident datasets
• Shared & Distributed Memory
Canonical loop:
While( ) {
    forall( data instances d) {
        I = process(d)
        R(I) = R(I) op d
    }
    …….
}
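As one concrete instance of the canonical loop, here is a rough Python sketch of k-means written as a generalized reduction. It assumes 1-D data points and a reduction object that maps a cluster index to a (sum, count) pair. The function names are illustrative and are not the FREERIDE middleware API.

```python
from collections import defaultdict


def process(d, centers):
    """I = process(d): index of the center nearest to data instance d."""
    return min(range(len(centers)), key=lambda i: abs(d - centers[i]))


def local_reduction(chunk, centers):
    """R(I) = R(I) op d, where op accumulates a (sum, count) pair."""
    R = defaultdict(lambda: [0.0, 0])
    for d in chunk:
        I = process(d, centers)
        R[I][0] += d
        R[I][1] += 1
    return R


def global_reduction(local_objects):
    """Combine the per-node reduction objects into one."""
    total = defaultdict(lambda: [0.0, 0])
    for R in local_objects:
        for I, (s, n) in R.items():
            total[I][0] += s
            total[I][1] += n
    return total


def kmeans(chunks, centers, max_passes=10):
    """Outer while-loop: repeat passes until the centers stop changing."""
    passes = 0
    while passes < max_passes:
        merged = global_reduction(local_reduction(c, centers) for c in chunks)
        new_centers = [
            merged[i][0] / merged[i][1] if merged[i][1] else centers[i]
            for i in range(len(centers))
        ]
        passes += 1
        if new_centers == centers:
            break
        centers = new_centers
    return centers, passes
```

Each element of `chunks` stands in for the data held by one node. Only the small reduction objects are combined, never the raw data.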
Summary of implementation issues
• Managing and communicating remote data:
– 2-way coordination required
• Load distribution:
– needed if the compute cluster is bigger than the data cluster
• Parallel processing on compute cluster:
– FREERIDE-G supports generalized reductions
• Caching:
– benefits multi-pass algorithms
Remote Data Issues
• Managing data communication:
– ADR library used for scheduling and performing data retrieval at repository site
– communication timing coordinated between source and destination
• Caching:
– local file system used for caching
– for a P-pass algorithm, avoids redundant communication of data in P-1 of the P iterations (a fraction (P-1)/P)
Parallel data processing issues
• Load distribution:
– Needed when more compute nodes are available than data nodes
– Hashing on unique chunk ID
• Parallel processing on compute cluster:
– After data is distributed, local reduction performed on every node
– Reduction object gathered at Master node
– Global combination (reduction) performed on Master node
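The load-distribution step can be sketched as below. The slide says only that chunks are hashed on their unique ID. The choice of CRC32 and the names `assign_compute_node` and `distribute` are assumptions made for illustration.

```python
import zlib


def assign_compute_node(chunk_id, num_compute_nodes):
    """Pick a destination compute node by hashing the unique chunk ID.

    CRC32 is an assumed hash; the slides do not name the function used.
    """
    digest = zlib.crc32(str(chunk_id).encode("utf-8"))
    return digest % num_compute_nodes


def distribute(chunk_ids, num_compute_nodes):
    """Return {compute_node: [chunk_ids]} for the chunks held by one data node."""
    plan = {node: [] for node in range(num_compute_nodes)}
    for chunk_id in chunk_ids:
        plan[assign_compute_node(chunk_id, num_compute_nodes)].append(chunk_id)
    return plan
```

Because the destination depends only on the chunk ID, a data node can pick it without coordinating with other data nodes, and a given chunk always maps to the same compute node.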
Application Summary
• Data Mining:
– K-Nearest Neighbor search
– K-means clustering
– EM clustering
• Scientific Feature Mining:
– Vortex detection in the fluid flow dataset
– Molecular defect detection in the molecular dynamics dataset
Goals for Experimental Evaluation
• Evaluating parallel scalability of applications developed:
– Numbers of data and compute nodes kept equal with variable parallel configurations
• Evaluating scalability of compute nodes:
– Number of compute nodes kept independent of number of data nodes
• Evaluating benefits of caching:
– Multi-pass algorithms evaluated
Evaluating Overall Scalability
Equal number of repository and compute nodes
• Cluster of 700 MHz Pentiums
• Connected through Myrinet LANai 7.0 (no access to high bandwidth network)
[Chart: Vortex detection application. Execution time vs. parallel configuration (1, 2, 4, 8) for 260 MB, 710 MB, and 1.85 GB datasets.]
[Chart: Defect detection application. Execution time vs. parallel configuration (1, 2, 4, 8) for 130 MB, 450 MB, and 1.8 GB datasets.]
Overall Scalability
• All 5 applications tested:
– High parallel efficiency
– Good scalability with respect to:
• problem size
• processing node number
[Chart: K-means clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets.]
[Chart: EM clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets.]
Evaluating Scalability of Compute Nodes
Compute cluster size is greater than data repository cluster size.
Applications (single pass only):
1. kNN search,
2. molecular defect detection,
3. vortex detection (next slide).
Parallel configurations:
• Data nodes: 1 to 8
• Compute nodes: 1 to 16.
[Chart: KNN search application. Execution time vs. number of data nodes (1, 2, 4, 8), one series per compute node count (1, 2, 4, 8, 16 cn).]
[Chart: Defect detection application. Execution time vs. number of data nodes (1, 2, 4, 8), one series per compute node count (1, 2, 4, 8, 16 cn).]
Compute Node Scalability
[Chart: Vortex detection application. Execution time vs. number of data nodes (1, 2, 4, 8), one series per compute node count (1, 2, 4, 8, 16 cn).]
• Only data processing work parallelized
• Data retrieval and communication times not affected
• Speedups are sub-linear
Better resource utilization leads to analysis time decrease
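A simple model shows why the speedups on this slide are sub-linear. The sketch below assumes processing time divides evenly across compute nodes while retrieval and communication times stay fixed. That is an idealization, and the numbers in the usage note are invented for illustration, not measurements from the talk.

```python
def execution_time(t_retrieve, t_communicate, t_process, compute_nodes):
    """Total time when only the processing component is parallelized."""
    return t_retrieve + t_communicate + t_process / compute_nodes


def speedup(t_retrieve, t_communicate, t_process, compute_nodes):
    """Speedup relative to a single compute node under the same model."""
    base = execution_time(t_retrieve, t_communicate, t_process, 1)
    scaled = execution_time(t_retrieve, t_communicate, t_process, compute_nodes)
    return base / scaled
```

With hypothetical times of 100 for retrieval, 100 for communication, and 800 for processing, 16 compute nodes give a speedup of 4.0 rather than 16, because the fixed 200 units are never reduced.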
Evaluating effects of caching
• Network bandwidth simulated: 500 KB/sec
• Caching vs. non-caching versions compared
Comparing data communication times (P passes):
– factor of P decrease from caching
Caching benefit depends on:
• application
• network bandwidth
[Chart: K-means clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets, caching (c) vs. non-caching (nc).]
[Chart: EM clustering application. Execution time vs. parallel configuration (1, 2, 4, 8) for 350 MB, 700 MB, and 1.4 GB datasets, caching (c) vs. non-caching (nc).]
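The factor-of-P decrease in communication time noted on this slide can be checked with a small calculation. The sketch assumes the whole dataset crosses the network once per pass without caching and exactly once with caching, and it takes 1 KB as 1000 bytes. The pass count of 10 in the usage note is hypothetical, since the slides do not state it.

```python
def communication_seconds(dataset_bytes, bandwidth_bytes_per_sec, passes, caching):
    """Time spent moving data: one transfer with caching, one per pass without."""
    transfers = 1 if caching else passes
    return transfers * dataset_bytes / bandwidth_bytes_per_sec


def fraction_of_passes_without_transfer(passes):
    """Share of passes served from the local cache: (P - 1) / P."""
    return (passes - 1) / passes
```

For a 350 MB dataset at the simulated 500 KB/sec, one transfer takes 700 seconds. Under the assumed 10 passes, the non-caching version spends 7000 seconds on communication and the caching version 700, a factor of P = 10.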
Related Work
• Support for grid-based data mining:
– Knowledge Grid toolset
– Grid-Miner toolkit
– Discovery Net layer
– DataMiningGrid framework
No interface for easing parallelization and abstracting data movement
• GRIST – support for astronomy related mining on the grid
Specific to the astronomical domain
FREERIDE-G is built directly on top of FREERIDE.
Conclusions
• FREERIDE-G supports remote data analysis from a high-level interface
• Evaluated on a variety of algorithms
• Demonstrated scalability in terms of:
– Even data-compute scale-up
– Compute node scale-up (only processing time)
• Multi-pass algorithms benefit from data caching
Continuing Work on FREERIDE-G
• High bandwidth network evaluation
• Performance prediction based resource selection
• Resource allocation
• More sophisticated caching and data communication mechanisms (SRB)
• Data format issues: wrapper integration
• Higher-level front-end to further ease
development of data analysis tools for the grid