Distributed Linear Programming and Resource Management for Data Mining in Distributed Environments
Haimonti Dutta1 and Hillol Kargupta2
1Center for Computational Learning Systems (CCLS),
Columbia University, NY, USA.
2University of Maryland, Baltimore County, Baltimore, MD.
Also affiliated with Agnik, LLC, Columbia, MD.
Motivation: Support Vector (Kernel) Regression
Find a function f(x) = y that fits a set of example data points
The problem can be phrased as a constrained optimization task
Solved using a standard LP solver
[Figure: an illustration of support vector kernel regression]
Motivation contd.: Knowledge-Based Kernel Regression
In addition to sample points, give advice, e.g.:
If (x ≥ 3) and (x ≤ 5) Then (y ≥ 5)
Rules add constraints about regions
The constraints are added to the LP, and a new solution (with advice constraints) can be constructed
Fung, Mangasarian and Shavlik, "Knowledge-Based Support Vector Machine Classifiers", NIPS, 2002.
Mangasarian, Shavlik and Wild, "Knowledge-Based Kernel Approximation", JMLR, 5, 1127-1141, 2005.
Figure adapted from Maclin, Shavlik, Walker and Torrey, "Knowledge-Based Support Vector Regression for Reinforcement Learning", IJCAI, 2005.
Distributed Data Mining Applications – An Example of Scientific Data Mining in Astronomy
Distributed data and computing resources on the National Virtual Observatory
Need for distributed optimization strategies
P2P data mining on a homogeneously partitioned sky survey
H. Dutta, "Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure", Ph.D. Thesis, UMBC, Maryland, 2007.
Road Map
Motivation
Related Work
Framing a Linear Programming problem
The simplex algorithm
The distributed simplex algorithm
Experimental Results
Conclusion and Directions of Future Work
Related Work
Resource Discovery in Distributed Environments
Iamnitchi, "Resource Discovery in Large Resource-Sharing Environments", Ph.D. Thesis, University of Chicago, 2003.
Livny and Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", HPDC, 1998.
Optimization Techniques
Yarmish, "Distributed Implementation of the Simplex Method", Ph.D. Thesis, CIS, Polytechnic University, 2001.
Hall and McKinnon, "Update Procedures for Parallel Revised Simplex Methods", Tech Report, University of Edinburgh, UK, 1992.
Craig and Reed, "Hypercube Implementation of the Simplex Algorithm", ACM, pages 1473-1482, 1998.
The Optimization Problem
Assumptions:
n nodes in the network
The network is static
Dataset Di at node i
Processing cost at node i – νi per record
Transportation cost between nodes i and j – μij
Amount of data transferred from node i to node j – xij
Cost function: Z = Σij (μij + νj) xij = Σij cij xij, where cij = μij + νj
Framing the Linear Programming Problem: An Illustration

Objective Function
z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45

Cost Function
C(X) = Σij (μij + νj) xij = Σij cij xij, where cij = μij + νj
(e.g., c12 = μ12 + ν2 = 3.8 + 2.23 = 6.03)

Constraints
x12 + x14 + x15 ≤ 300
x12 + x25 + x23 ≤ 600
x15 + x25 + x45 ≤ 300
x23 + x34 ≤ 300
0 ≤ x12 ≤ D1;  0 ≤ x23 ≤ D2;  0 ≤ x15 ≤ D1;  0 ≤ x14 ≤ D1
0 ≤ x25 ≤ D2;  0 ≤ x34 ≤ D3;  0 ≤ x45 ≤ D4

[Figure: a five-node network with storage capacities (Node 1: 300 GB, Node 2: 600 GB, Nodes 3-5: 300 GB each) and per-link transportation costs μ12 = 3.8, μ14 = 6.5, μ15 = 2.5, μ23 = 6.1, μ25 = 10.4, μ34 = 7.8, μ45 = 8.3]

Per-record processing cost at each node:
Node   ν
1      1.23
2      2.23
3      2.94
4      1.78
5      4.02
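The LP framed above can be handed to any off-the-shelf solver. The following is a minimal sketch (not part of the original slides) using scipy.optimize.linprog; it encodes only the constraints listed above, and the bounds D1-D4 are assumed to equal the storage capacities shown in the figure.

```python
# Minimal sketch (not from the original slides): feeding the framed LP to an
# off-the-shelf solver.  The capacities D1..D4 are *assumed* to equal the
# per-node storage shown in the figure (300, 600, 300, 300 GB).
from scipy.optimize import linprog

# Variable order: x12, x23, x15, x14, x25, x34, x45
c = [6.03, 9.04, 6.52, 8.28, 14.42, 9.58, 12.32]

# Capacity constraints A_ub x <= b_ub, one row per node constraint above
A_ub = [
    [1, 0, 1, 1, 0, 0, 0],   # x12 + x14 + x15 <= 300
    [1, 1, 0, 0, 1, 0, 0],   # x12 + x25 + x23 <= 600
    [0, 0, 1, 0, 1, 0, 1],   # x15 + x25 + x45 <= 300
    [0, 1, 0, 0, 0, 1, 0],   # x23 + x34       <= 300
]
b_ub = [300, 600, 300, 300]

# Box constraints 0 <= xij <= Di (assumed values for the Di)
D1, D2, D3, D4 = 300, 600, 300, 300
bounds = [(0, D1), (0, D2), (0, D1), (0, D1), (0, D2), (0, D3), (0, D4)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x, res.fun)
```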
The Simplex Algorithm
Find
x1 ≥ 0, x2 ≥ 0, …, xn ≥ 0 minimizing
z = c1 x1 + c2 x2 + … + cn xn
satisfying the constraints
A1 x1 + A2 x2 + … + An xn = B
The Simplex Algorithm
a11
a12
….
a1n
b1
a21
a22
….
a2n
b2
….
….
….
….
….
am1
am2
…
amn
bm
c1
c2
…
cn
z
The simplex tableau
The Simplex Algorithm – Contd.
The Problem
Maximize z = x1 + 2x2 – x3
Subject to
2x1 + x2 + x3 ≤ 14
4x1 + 2x2 + 3x3 ≤ 28
2x1 + 5x2 + 5x3 ≤ 30
The Steps of the Simplex Algorithm (Dantzig) – see the sketch after the steps:
Obtain a canonical representation (introduce slack variables)
Find a column pivot
Find a row pivot
Perform Gauss-Jordan elimination
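A minimal sketch of these four steps on a dense NumPy tableau (an illustration, not the authors' implementation): pick the most negative entry of the objective row as the column pivot, run the ratio test for the row pivot, and apply Gauss-Jordan elimination until no negative entry remains.

```python
import numpy as np

def simplex_pivot_step(T):
    """One iteration on a tableau whose last row is the objective row and
    whose last column is B.  Returns False once the tableau is optimal."""
    obj = T[-1, :-1]
    if np.all(obj >= 0):
        return False                          # no negative reduced cost: optimal
    col = int(np.argmin(obj))                 # column pivot: most negative entry
    with np.errstate(divide="ignore"):        # ratio test over rows with positive pivot entry
        ratios = np.where(T[:-1, col] > 0, T[:-1, -1] / T[:-1, col], np.inf)
    row = int(np.argmin(ratios))              # row pivot: minimum ratio
    T[row] /= T[row, col]                     # Gauss-Jordan elimination
    for r in range(T.shape[0]):
        if r != row:
            T[r] -= T[r, col] * T[row]
    return True

# Canonical tableau of the example above (slack variables s1, s2, s3 added)
T = np.array([
    [2, 1, 1, 1, 0, 0, 14],
    [4, 2, 3, 0, 1, 0, 28],
    [2, 5, 5, 0, 0, 1, 30],
    [-1, -2, 1, 0, 0, 0, 0],
], dtype=float)

while simplex_pivot_step(T):
    pass
print(T[-1, -1])   # optimal objective value: 13.0
```

Running it on the example above reproduces the optimal objective value z = 13 obtained in the tableaus that follow (ties in the ratio test may be broken differently, so the final basis can differ).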
The simplex tableau and iterations

Canonical representation (slack variables s1, s2, s3 added):

x1   x2   x3   s1   s2   s3   B
2    1    1    1    0    0    14
4    2    3    0    1    0    28
2    5    5    0    0    1    30
-1   -2   1    0    0    0    0

Pivot column: x2 (most negative entry in the objective row).
Ratio test: 14/1 = 14, 28/2 = 14, 30/5 = 6, so the third row is the pivot row.
Simplex iterations contd.

Perform Gauss-Jordan elimination (pivot on x2, third row):

x1    x2   x3   s1   s2   s3    B
8/5   0    0    1    0    -1/5  8
16/5  0    1    0    1    -2/5  16
2/5   1    1    0    0    1/5   6
-1/5  0    3    0    0    2/5   12

The Final Tableau (after pivoting on x1, second row):

x1   x2   x3     s1   s2    s3    B
0    0    -1/2   1    -1/2  0     0
1    0    5/16   0    5/16  -1/8  5
0    1    7/8    0    -1/8  1/4   4
0    0    49/16  0    1/16  3/8   13
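Reading the final tableau, the basic variables are x1 = 5 and x2 = 4 (with x3 = 0), giving z = 13. A short sanity check of that solution against the original constraints (an added illustration, not part of the slides):

```python
# Check (not from the slides) that the solution read off the final tableau,
# x1 = 5, x2 = 4, x3 = 0, attains z = 13 and satisfies all three constraints.
x1, x2, x3 = 5, 4, 0
assert x1 + 2 * x2 - x3 == 13
assert 2 * x1 + x2 + x3 <= 14
assert 4 * x1 + 2 * x2 + 3 * x3 <= 28
assert 2 * x1 + 5 * x2 + 5 * x3 <= 30
```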
Road Map
Motivation
Related Work
Framing a Linear Programming problem
The simplex algorithm
The distributed simplex algorithm
Experimental Results
Conclusions and Future Work
The Distributed Problem – An Example

Each site observes different constraints, but all sites want to optimize the same objective function:
z = 6.03x12 + 9.04x23 + 6.52x15 + 8.28x14 + 14.42x25 + 9.58x34 + 12.32x45

Node 1 (300 GB): x12 + x15 + x14 + 2x25 ≤ 300;  x12 + 2x15 – x25 = 2
Node 2 (600 GB): x12 + x23 + x25 ≤ 600;  2x25 – x12 – x23 = 4
Node 3 (300 GB): x23 + x34 ≤ 300
Node 4 (300 GB): x34 + 8x25 ≤ 300
Node 5 (300 GB): x15 + x25 + x45 ≤ 300;  x25 – 2x15 – x45 = 5
Distributed Canonical Representation
An initialization step:
Number of basic (slack) variables to add = total number of constraints in the system
Build a spanning tree in the network
Perform a distributed sum estimation algorithm (see the sketch below)
Builds a canonical representation identical to the one that would be obtained if the data were centralized
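A minimal sketch of the distributed sum used in this initialization step (assumptions: a static spanning tree given as a parent map, and an idealized converge-cast followed by a broadcast): each node reports its local constraint count up the tree, so every site learns the total number of slack variables to add.

```python
# Minimal sketch (not the authors' implementation): converge-cast a sum of
# local constraint counts up a spanning tree, then broadcast the total back
# down, so every node can build the same canonical representation.
def tree_sum(parent, local_counts):
    """parent: dict node -> parent node (the root maps to itself).
    local_counts: dict node -> number of constraints observed at that node."""
    children = {n: [] for n in parent}
    for n, p in parent.items():
        if n != p:
            children[p].append(n)
    root = next(n for n, p in parent.items() if n == p)

    def subtree_sum(n):                 # converge-cast phase
        return local_counts[n] + sum(subtree_sum(c) for c in children[n])

    total = subtree_sum(root)
    return {n: total for n in parent}   # broadcast phase: every node learns the total

# Example: 5 nodes with the constraint counts from the distributed example
# above; the tree shape (node 1 as root) is an assumption for illustration.
parent = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}
counts = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2}
print(tree_sum(parent, counts))   # every node learns 8 -> slacks s1..s8
```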
The Distributed Algorithm for Solving the LP Problem
Steps involved (a simulation sketch follows):
Estimate the column pivot
Estimate the row pivot (requires communication with neighbors)
Perform Gauss-Jordan elimination
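A minimal, single-process simulation sketch of these three steps (an illustration under simplifying assumptions, not the authors' implementation): each "site" holds only its own constraint rows, the objective row is replicated everywhere, and the gossip-based agreement on the row pivot is replaced by a plain min() over the sites' candidate ratios.

```python
import numpy as np

def distributed_simplex(local_rows, obj):
    """local_rows[i]: the constraint rows held at site i;
    obj: the shared objective row.  Columns: variables, slacks, B."""
    obj = np.asarray(obj, dtype=float)
    local_rows = [np.asarray(r, dtype=float) for r in local_rows]
    while True:
        # Step 1: column pivot -- identical at every site, since obj is replicated.
        if np.all(obj[:-1] >= 0):
            return obj[-1]                       # optimal objective value
        col = int(np.argmin(obj[:-1]))

        # Step 2: row pivot -- each site computes its best local ratio; the global
        # minimum would be agreed on via gossip (Push-Min) in the real network.
        candidates = []                          # (ratio, site, local row index)
        for i, rows in enumerate(local_rows):
            for k, row in enumerate(rows):
                if row[col] > 1e-12:
                    candidates.append((row[-1] / row[col], i, k))
        if not candidates:
            raise ValueError("LP is unbounded")
        _, site, k = min(candidates)
        pivot_row = local_rows[site][k] / local_rows[site][k][col]
        local_rows[site][k] = pivot_row

        # Step 3: Gauss-Jordan elimination -- the pivot row is broadcast and
        # every site updates its own rows and the shared objective row.
        for i, rows in enumerate(local_rows):
            for r in range(len(rows)):
                if not (i == site and r == k):
                    rows[r] = rows[r] - rows[r][col] * pivot_row
        obj = obj - obj[col] * pivot_row

# The earlier centralized example, with its three constraints held at three sites.
local_rows = [
    [[2, 1, 1, 1, 0, 0, 14]],    # site 1
    [[4, 2, 3, 0, 1, 0, 28]],    # site 2
    [[2, 5, 5, 0, 0, 1, 30]],    # site 3
]
obj = [-1, -2, 1, 0, 0, 0, 0]
print(distributed_simplex(local_rows, obj))      # 13.0
```

On that example the simulation returns the same optimum, z = 13, as the centralized tableau iterations shown earlier.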
Illustration of the Distributed Algorithm

Local canonical tableau at Node 1 (its two constraints plus the shared objective row):

x12    x23    x15    x14    x25     x34    x45     s1  s2  s3  s4  s5  s6  s7  s8  B
1      0      1      1      2       0      0       1   0   0   0   0   0   0   0   300
1      0      2      0      -1      0      0       0   1   0   0   0   0   0   0   2
-6.03  -9.04  -6.52  -8.28  -14.42  -9.58  -12.32  0   0   0   0   0   0   0   0   0

Local canonical tableau at Node 4 (its constraint plus the shared objective row):

x12    x23    x15    x14    x25     x34    x45     s1  s2  s3  s4  s5  s6  s7  s8  B
0      0      0      0      8       1      1       0   0   0   0   0   0   1   0   300
-6.03  -9.04  -6.52  -8.28  -14.42  -9.58  -12.32  0   0   0   0   0   0   0   0   0

Column pivot selection is done at each node.
Distributed Row Pivot Selection
Protocol Push-Min (gossip based) – a minimum estimation problem
At iteration t, node i receives the values {mr} sent by its neighbors at iteration t-1 and updates
mit = min({mr}, current row-pivot candidate)
Termination: all nodes hold exactly the same minimum value
(a sketch of the protocol follows)
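A minimal sketch of the Push-Min idea (assumptions: synchronous rounds, a fixed neighbor map, values pushed to all neighbors each round, and a fixed round count in place of the convergence test): every node keeps a running minimum over its own candidate row-pivot ratio and whatever it has received.

```python
# Minimal sketch (not the authors' implementation) of Protocol Push-Min:
# in every round each node pushes its current minimum to its neighbors and
# keeps the smallest value it has seen.  On a connected static network all
# nodes converge to the global minimum (the winning row-pivot ratio).
def push_min(neighbors, local_ratio, rounds):
    """neighbors: dict node -> list of neighboring nodes.
    local_ratio: dict node -> this node's candidate row-pivot ratio."""
    m = dict(local_ratio)                      # m_i^0 = local candidate
    for _ in range(rounds):
        inbox = {n: [] for n in neighbors}
        for n in neighbors:                    # push phase
            for nb in neighbors[n]:
                inbox[nb].append(m[n])
        for n in neighbors:                    # m_i^t = min({m_r}, current value)
            m[n] = min([m[n]] + inbox[n])
    return m

# Example: a 5-node ring (assumed topology).  The ratios 14, 14, 6 come from
# the centralized example; the remaining two are made-up placeholder values.
ring = {1: [2, 5], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 1]}
ratios = {1: 14.0, 2: 14.0, 3: 6.0, 4: 15.0, 5: 9.5}
print(push_min(ring, ratios, rounds=3))   # every node ends up with 6.0
```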
Analysis of Protocol Push-Min
Based on the spread of an epidemic in a large population
Susceptible, infected and dead nodes
The "epidemic" spreads exponentially fast
[Figure: gossip messages spreading among Nodes 1-5]
Comments and Discussion
Assume η nodes in the network
Communication complexity is O(number of simplex iterations × η)
In the worst case, simplex may require an exponential number of iterations; for most practical purposes it takes about λm iterations (λ < 4), where m is the number of constraints
Road Map
Motivation
Related Work
Framing a Linear Programming problem
The simplex algorithm
The distributed simplex algorithm
Experimental Results
Conclusion and Directions of Future Work
Experimental Results
Artificial Data Set
Simulated constraint matrices at each node
Used Distributed Data Mining Toolkit (DDMT) developed at
University of Maryland, Baltimore County (UMBC) for
simulating the network structure
Two different metrics for evaluation:
TCC (Total Communication Cost in the network)
Average Communication Cost per Node (ACCN)
Communication Cost
[Figure: Average Communication Cost per Node (ACCN) versus the number of nodes in the network]
More Experimental Results
[Figure: TCC versus the number of variables at each node]
[Figure: TCC versus the number of constraints at each node]
Conclusions and Future Work
Resource management and pattern recognition present formidable challenges in distributed systems
We present a distributed algorithm for resource management based on the simplex algorithm
We test our algorithm on simulated data
Future Work
Incorporating the dynamics of the network
Testing the algorithm on a real distributed network
Effect of the size and structure of the network on the mining results
Examining the trade-off between accuracy and communication cost incurred before and after using distributed simplex on a mining task such as classification or clustering
Selected Bibliography
G. B. Dantzig, "Linear Programming and Extensions", Princeton University Press, Princeton, NJ, 1963.
Kargupta and Chan, "Advances in Distributed and Parallel Knowledge Discovery", AAAI Press, Menlo Park, CA, 2000.
A. L. Turinsky, "Balancing Cost and Accuracy in Distributed Data Mining", Ph.D. Thesis, University of Illinois at Chicago, 2002.
Haimonti Dutta, "Empowering Scientific Discovery by Distributed Data Mining on the Grid Infrastructure", Ph.D. Thesis, UMBC, 2007.
Mangasarian, "Mathematical Programming in Data Mining", DMKD, Vol. 42, pp. 183-201, 1997.
Questions?