Distributed Linear Programming and Resource
Management for Data Mining in Distributed
Environments
Haimonti Dutta1 and Hillol Kargupta2
1Center for Computational Learning Systems (CCLS), Columbia University, NY, USA.
2University of Maryland, Baltimore County, Baltimore, MD. Also affiliated with Agnik, LLC, Columbia, MD.
Motivation
Support Vector (Kernel) Regression
 Find a function f(x) = y to fit a set of example data points
 The problem can be phrased as a constrained optimization task
 Solved using a standard LP solver
An illustration
Motivation contd. – Knowledge-Based Kernel Regression
 In addition to sample points, give advice
 If (x ≥ 3) and (x ≤ 5) Then (y ≥ 5)
 Rules add constraints about regions
 Constraints are added to the LP, and a new solution (with advice constraints) can be constructed
Fung, Mangasarian and Shavlik, "Knowledge-Based Support Vector Machine Classifiers", NIPS, 2002.
Mangasarian, Shavlik and Wild, "Knowledge-Based Kernel Approximation", JMLR, 5, 1127–1141, 2005.
Figure adapted from Maclin, Shavlik, Walker and Torrey, "Knowledge-based Support Vector Regression for Reinforcement Learning", IJCAI, 2005.
Distributed Data Mining Applications – An Example of Scientific Data Mining in Astronomy
Distributed data and computing resources on the National Virtual Observatory
Need for distributed optimization strategies
P2P data mining on a homogeneously partitioned sky survey
H. Dutta, "Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure", Ph.D. Thesis, UMBC, Maryland, 2007.
Road Map
 Motivation
 Related Work
 Framing a Linear Programming problem
 The simplex algorithm
 The distributed simplex algorithm
 Experimental Results
 Conclusion and Directions of Future Work
Related Work
Resource Discovery in Distributed Environments
 Iamnitchi, "Resource Discovery in Large Resource-Sharing Environments", Ph.D. Thesis, University of Chicago, 2003.
 Raman, Livny and Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", HPDC, 1998.
Optimization Techniques
 Yarmish, "Distributed Implementation of the Simplex Method", Ph.D. Thesis, CIS, Polytechnic University, 2001.
 Hall and McKinnon, "Update Procedures for Parallel Revised Simplex Methods", Tech Report, University of Edinburgh, UK, 1992.
 Craig and Reed, "Hypercube Implementation of the Simplex Algorithm", ACM, pages 1473–1482, 1998.
The Optimization Problem
 Assumptions:
 n nodes in the network
 The network is static
 Dataset Di at node i
 Processing cost at node i: νi per record
 Transportation cost between nodes i and j: μij
 Amount of data transferred from node i to node j: xij
 Cost function: Z = Σij μij xij + νj xij = Σij cij xij
Framing the Linear Programming Problem: An Illustration
Objective Function
 z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45
Constraints
 C(X) = Σij μij xij + νj xij = Σij cij xij, with cij = μij + νj
 x12 + x14 + x15 ≤ 300
 x12 + x25 + x23 ≤ 600
 x15 + x25 + x45 ≤ 300
 x23 + x34 ≤ 300
 0 ≤ x12 ≤ D1; 0 ≤ x14 ≤ D1; 0 ≤ x15 ≤ D1
 0 ≤ x23 ≤ D2; 0 ≤ x25 ≤ D2
 0 ≤ x34 ≤ D3; 0 ≤ x45 ≤ D4
[Figure: a five-node network. Nodes 1, 3, 4 and 5 hold 300 GB each; Node 2 holds 600 GB. Edge labels give the transportation costs: μ12 = 3.8, μ14 = 6.5, μ15 = 2.5, μ23 = 6.1, μ25 = 10.4, μ34 = 7.8, μ45 = 8.3.]
Per-record processing costs:
 Node |  ν
   1  | 1.23
   2  | 2.23
   3  | 2.94
   4  | 1.78
   5  | 4.02
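The objective coefficients can be checked directly against the figure data: each cij is μij + νj, with the processing cost charged at the receiving node j. A short sanity check, using only the numbers from the illustration:

```python
# Verify that each objective coefficient c_ij equals mu_ij + nu_j:
# mu = transportation cost of edge (i, j), nu = per-record processing
# cost at the receiving node j. All numbers come from the illustration.
mu = {(1, 2): 3.8, (1, 4): 6.5, (1, 5): 2.5, (2, 3): 6.1,
      (2, 5): 10.4, (3, 4): 7.8, (4, 5): 8.3}
nu = {1: 1.23, 2: 2.23, 3: 2.94, 4: 1.78, 5: 4.02}
objective = {(1, 2): 6.03, (2, 3): 9.04, (1, 5): 6.52, (1, 4): 8.28,
             (2, 5): 14.42, (3, 4): 9.58, (4, 5): 12.32}

# Round to 2 decimals to absorb float noise (e.g. 3.8 + 2.23).
c = {(i, j): round(mu[i, j] + nu[j], 2) for (i, j) in mu}
assert c == objective   # every coefficient matches, e.g. c_12 = 3.8 + 2.23 = 6.03
```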
The Simplex Algorithm
 Find
 x1 ≥ 0, x2 ≥ 0, …, xn ≥ 0 and
 Min z = c1x1 + c2x2 + … + cnxn
 Satisfying Constraints
 A1x1 + A2x2 + … + Anxn = B
 The Simplex Algorithm
 a11  a12  …  a1n | b1
 a21  a22  …  a2n | b2
  …    …   …   … |  …
 am1  am2  …  amn | bm
  c1   c2  …   cn |  z
The simplex tableau
The Simplex Algorithm – Contd.
The Problem
 Maximize z = x1 + 2x2 – x3
 Subject to
 2x1 + x2 + x3 ≤ 14
 4x1 + 2x2 + 3x3 ≤ 28
 2x1 + 5x2 + 5x3 ≤ 30
The Steps of the Simplex Algorithm (Dantzig)
 Obtain a canonical representation (introduce slack variables)
 Find a column pivot
 Find a row pivot
 Perform Gauss-Jordan elimination
The simplex tableau and iterations
Canonical Representation
 x1  x2  x3  s1  s2  s3 |  B
  2   1   1   1   0   0 | 14
  4   2   3   0   1   0 | 28
  2   5   5   0   0   1 | 30   ← pivot row
 -1  -2   1   0   0   0 |  0
The pivot column is x2 (most negative cost-row entry). Ratio tests: 14/1 = 14, 28/2 = 14, 30/5 = 6; the minimum ratio makes the third row the pivot row.
Simplex Iterations – Contd.
 Perform Gauss-Jordan elimination
Tableau after the first pivot (x2 enters the basis):
  x1   x2  x3  s1  s2   s3  |  B
  8/5   0   0   1   0  -1/5 |  8
 16/5   0   1   0   1  -2/5 | 16
  2/5   1   1   0   0   1/5 |  6
 -1/5   0   3   0   0   2/5 | 12
 The Final Tableau (after x1 enters the basis):
  x1  x2   x3    s1   s2    s3  |  B
   0   0  -1/2    1  -1/2    0  |  0
   1   0   5/16   0   5/16 -1/8 |  5
   0   1   7/8    0  -1/8   1/4 |  4
   0   0  49/16   0   1/16  3/8 | 13
All cost-row entries are now non-negative, so the optimum is reached: x1 = 5, x2 = 4, x3 = 0, z = 13.
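The iterations above can be reproduced with a compact tableau implementation. This is an illustrative sketch (Dantzig's rule, exact rational arithmetic); note that it breaks the degenerate ratio-test tie at the second pivot differently from the tableau shown (it pivots the first tied row rather than the second), so the intermediate tableau differs, but it reaches the same optimum z = 13.

```python
from fractions import Fraction as F

def simplex(T):
    """Minimal Dantzig-rule simplex. The last row of T is the cost row
    (negated objective coefficients), the last column is B; slack
    variables are assumed already added (canonical form)."""
    m = len(T) - 1                               # number of constraint rows
    while True:
        cost = T[-1][:-1]
        col = min(range(len(cost)), key=lambda j: cost[j])
        if cost[col] >= 0:                       # no negative entry: optimal
            return T
        # Row pivot: minimum ratio B_i / T[i][col] over positive entries.
        _, row = min((T[i][-1] / T[i][col], i)
                     for i in range(m) if T[i][col] > 0)
        piv = T[row][col]
        T[row] = [v / piv for v in T[row]]       # scale the pivot row
        for i in range(m + 1):                   # Gauss-Jordan elimination
            if i != row and T[i][col] != 0:
                f = T[i][col]
                T[i] = [a - f * b for a, b in zip(T[i], T[row])]

# Canonical tableau for: maximize z = x1 + 2*x2 - x3 (slacks s1..s3).
T = [[F(2), F(1), F(1), F(1), F(0), F(0), F(14)],
     [F(4), F(2), F(3), F(0), F(1), F(0), F(28)],
     [F(2), F(5), F(5), F(0), F(0), F(1), F(30)],
     [F(-1), F(-2), F(1), F(0), F(0), F(0), F(0)]]
simplex(T)
z = T[-1][-1]                                    # optimal objective: z = 13
```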
Road Map
 Motivation
 Related Work
 Framing a Linear Programming problem
 The simplex algorithm
 The distributed simplex algorithm
 Experimental Results
 Conclusions and Future Work
The Distributed Problem – An Example
Each site observes different constraints, but all sites want to optimize the same objective function:
z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45
Constraints observed locally (each node holds 300 GB, except Node 2 with 600 GB):
 Node 1: x12 + x15 + x14 + 2x25 ≤ 300; x12 + 2x15 - x25 = 2
 Node 2: x12 + x23 + x25 ≤ 600; 2x25 - x12 - x23 = 4
 Node 3: x23 + x34 ≤ 300
 Node 4: x34 + 8x25 ≤ 300
 Node 5: x15 + x25 + x45 ≤ 300; x25 - 2x15 - x45 = 5
Distributed Canonical Representation
 An initialization step
 Number of slack variables to add = total number of constraints in the system
 Build a spanning tree in the network
 Perform a distributed sum-estimation algorithm
 Builds a canonical representation identical to the one obtained if the data were centralized
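The sum-estimation step can be sketched as a converge-cast over the spanning tree. The tree shape below is invented for illustration; the per-node constraint counts follow the distributed example above (two at Nodes 1, 2 and 5, one at Nodes 3 and 4), which yields the eight slack variables s1–s8:

```python
# Converge-cast over a rooted spanning tree: every node reports its local
# constraint count upward; the root learns the total and broadcasts it back,
# so all nodes agree on how many slack variables the canonical form needs.
# The tree shape is an illustrative assumption, not taken from the talk.
children = {1: [2, 5], 2: [3], 3: [], 4: [], 5: [4]}   # rooted at node 1
local_constraints = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2}

def subtree_sum(node):
    # A node's partial sum = its own count plus its children's partial sums.
    return local_constraints[node] + sum(subtree_sum(c) for c in children[node])

total = subtree_sum(1)   # the root now knows the global constraint count: 8
```

The root then sends `total` back down the tree, after which every node adds the same eight slack columns, making all local tableaux consistent with the centralized canonical form.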
The Distributed Algorithm for Solving the LP Problem
 Steps involved:
 Estimate the column pivot
 Estimate the row pivot (requires communication with neighbors)
 Perform Gauss-Jordan elimination
Illustration of the Distributed Algorithm
Each node holds only the rows of the global tableau corresponding to its own constraints, plus the (replicated) cost row; the slack columns s1–s8 cover all eight constraints in the system. Node 1's local tableau:
 x12 x23 x15 x14 x25 x34 x45 | s1 s2 s3 s4 s5 s6 s7 s8 |  B
  1   0   1   1   2   0   0  |  1  0  0  0  0  0  0  0 | 300
  1   0   2   0  -1   0   0  |  0  1  0  0  0  0  0  0 |   2
 -6.03 -9.04 -6.52 -8.28 -14.42 -9.58 -12.32 | 0 0 0 0 0 0 0 0 | 0
Another node's local tableau (its constraint uses slack s7):
 x12 x23 x15 x14 x25 x34 x45 | s1 s2 s3 s4 s5 s6 s7 s8 |  B
  0   0   0   0   8   1   1  |  0  0  0  0  0  0  1  0 | 300
 -6.03 -9.04 -6.52 -8.28 -14.42 -9.58 -12.32 | 0 0 0 0 0 0 0 0 | 0
Column pivot selection is done at each node
Distributed Row Pivot Selection
 Protocol Push-Min (gossip based)
 A minimum-estimation problem
 At iteration t, node i receives the values {mr} sent at iteration t − 1 and sets mi(t) = min({mr}, current local row-pivot candidate)
 Termination: all nodes hold exactly the same minimum value
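Protocol Push-Min can be sketched as a synchronous gossip loop. The ring topology and the per-node ratio-test candidates below are illustrative assumptions; in each round every node pushes its current minimum to one random neighbor, and the loop ends once all nodes agree:

```python
import random

# Push-Min sketch: nodes gossip their current minimum until consensus.
# The topology and candidate values are illustrative placeholders.
neighbors = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}
m = {1: 14.0, 2: 14.0, 3: 6.0, 4: 9.0, 5: 11.0}  # local row-pivot candidates

random.seed(0)
while len(set(m.values())) > 1:             # terminate when all nodes agree
    for i in neighbors:
        j = random.choice(neighbors[i])     # push m_i to one random neighbor
        m[j] = min(m[j], m[i])              # receiver keeps the smaller value
# Every node now holds the global minimum ratio, i.e. the winning row pivot.
```

Since the minimum only ever spreads and never grows, every node converges to the global minimum (6.0 here), which identifies the row pivot for the next Gauss-Jordan step.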
Analysis of Protocol Push-Min
 Based on the spread of an epidemic in a large population
 Susceptible, infected and dead nodes
 The "epidemic" spreads exponentially fast
Comments and Discussions
 Assume η nodes in the network
 Communication complexity is O(number of simplex iterations × η)
 In the worst case, simplex may require an exponential number of iterations
 For most practical problems it takes about λm iterations (λ < 4), where m is the number of constraints
Road Map
 Motivation
 Related Work
 Framing a Linear Programming problem
 The simplex algorithm
 The distributed simplex algorithm
 Experimental Results
 Conclusion and Directions of Future Work
Experimental Results
 Artificial data set
 Simulated constraint matrices at each node
 Used the Distributed Data Mining Toolkit (DDMT), developed at the University of Maryland, Baltimore County (UMBC), to simulate the network structure
 Two metrics for evaluation:
 TCC (Total Communication Cost in the network)
 ACCN (Average Communication Cost per Node)
Communication Cost
 Average Communication Cost per Node versus the number of nodes in the network
More Experimental Results
 TCC versus the number of variables at each node
 TCC versus the number of constraints at each node
Conclusions and Future Work
 Resource management and pattern recognition present formidable challenges on distributed systems
 Presented a distributed algorithm for resource management based on the simplex algorithm
 Tested the algorithm on simulated data
Future Work
 Incorporation of the dynamics of the network
 Testing the algorithm on a real distributed network
 Effect of the size and structure of the network on the mining results
 Examining the trade-off between accuracy and communication cost incurred before and after using distributed simplex on a mining task such as classification or clustering
Selected Bibliography
 G. B. Dantzig, "Linear Programming and Extensions", Princeton University Press, Princeton, NJ, 1963.
 Kargupta and Chan, "Advances in Distributed and Parallel Knowledge Discovery", AAAI Press, Menlo Park, CA, 2000.
 A. L. Turinsky, "Balancing Cost and Accuracy in Distributed Data Mining", Ph.D. Thesis, University of Illinois at Chicago, 2002.
 Haimonti Dutta, "Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure", Ph.D. Thesis, UMBC, 2007.
 Mangasarian, "Mathematical Programming in Data Mining", DMKD, Vol 42, pp. 183–201, 1997.
Questions?