RIOT: I/O-Efficient Numerical Computing in R
Yi Zhang
Herodotos Herodotou
Jun Yang
What is R?
• R: an open-source language/environment
– Statistical computing, graphics
– Comprehensive R Archive Network
• 1639 packages as of Dec 08
– Interpretive execution
– High-level constructs
• Arrays, matrices
• Code example:
a <- 1:100
…
d <- a+b^2+c
• Common to languages for numerical/statistical computing
Big-Data Challenge
• R assumes all data in main memory
– If not, VM starts swapping data from/to disk
– Excessive I/O, poor performance
– Example:
[Figure: a point (x,y) with start S(xs,ys) and end E(xe,ye)]
# n points with coordinates stored in x[1:n], y[1:n]
d <- sqrt((x-xs)^2+(y-ys)^2)+sqrt((x-xe)^2+(y-ye)^2)
s <- sample(n, 100) # draw 100 samples from 1:n
z <- d[s] # extract elements of d whose indices are in s
[Figure: memory and swap/paging file holding x, y, x-xs, (x-xs)^2, …, y-ye, (x-xe)^2, the 1st sqrt, …]
Opportunities
• Avoiding intermediate results
– Multiple large intermediate results are generated
– Can we avoid them without hand-coding loops?
• for (i in 1:n) { d[i] <- sqrt((x[i]-xs)^2+…)+… }
• Deferred and selective evaluation
– Each expression is evaluated in full immediately
– Can we defer evaluation until really necessary?
• Just compute the 100 elements from d picked by s
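The second opportunity can be illustrated in plain R. The sketch below is not RIOT; it only contrasts evaluating d in full with computing the 100 sampled elements directly. The value of n, the coordinates xs, ys, xe, ye, and the helper name dist_via are assumptions made for illustration.

```r
# Illustrative sizes and coordinates (assumptions, not from the slides).
set.seed(42)
n  <- 100000
x  <- runif(n)
y  <- runif(n)
xs <- 0; ys <- 0; xe <- 1; ye <- 1

# Total distance S -> (px, py) -> E for any vector of points.
dist_via <- function(px, py) {
  sqrt((px - xs)^2 + (py - ys)^2) + sqrt((px - xe)^2 + (py - ye)^2)
}

s <- sample(n, 100)

# Eager: materializes all n elements of d (plus temporaries), then keeps 100.
d <- dist_via(x, y)
z_eager <- d[s]

# Selective: index first, so only 100 elements are ever computed.
z_selective <- dist_via(x[s], y[s])

stopifnot(isTRUE(all.equal(z_eager, z_selective)))
```

The selective form is what a user would have to write by hand today; the point of deferring evaluation is to get the same effect from the unmodified eager code.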
Existing Solutions
• Rewrite and hand-optimize code
– Tedious, not quite reusable
• Use I/O-efficient libraries
– SOLAR [Toledo’96], DRA [Nieplocha’96], etc.
– But efficient individual operations are not enough
• Build/extend a DB
– RasDaMan [Baumann’99], AML [Marathe’02], ASAP [Stonebraker’07], …
– Must rewrite using a new language (often SQL)
– Explicit boundary between DB and host language
R with I/O Transparency
• Attain I/O efficiency without explicit user intervention
• Run legacy code with no or minimal modification
• No need to learn new languages/libraries
• No boundary between host language and backend processing
RIOT
• Implemented as an R package
– New types, same interfaces: dbvector, dbmatrix, …
– Uses R’s generics mechanism for transparency
New class definition:
setClass("dbvector",
  representation(size="numeric",…))
Implementation:
SEXP add_dbvectors(SEXP e1, SEXP e2){
  …
}
Method overloading:
setMethod("+",signature(e1="dbvector",e2="dbvector"),
  function(e1,e2) {
    .Call("add_dbvectors",e1,e2)
  }
)
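To see the generics mechanism at work without a C backend, here is a minimal, self-contained sketch. The class toyvector and its constructor are hypothetical stand-ins for dbvector: the data stays in an ordinary in-memory slot, and the method body does the addition in R rather than through .Call.

```r
library(methods)

# Hypothetical stand-in for dbvector; the data is simply kept in memory.
setClass("toyvector", representation(size = "numeric", data = "numeric"))

# Convenience constructor (an assumption, not part of RIOT).
toyvector <- function(v) new("toyvector", size = length(v), data = v)

# Overload "+" so that unmodified code such as a + b reaches our method.
setMethod("+", signature(e1 = "toyvector", e2 = "toyvector"),
  function(e1, e2) {
    stopifnot(e1@size == e2@size)
    toyvector(e1@data + e2@data)
  }
)

a <- toyvector(c(1, 2, 3))
b <- toyvector(c(10, 20, 30))
r <- a + b   # dispatches to the method defined above
```

Because dispatch is driven by the argument classes, the calling code keeps the same interface as for ordinary vectors.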
RIOT-DB: Hidden DB Backend
• A strawman solution: Map large arrays to DB tables
– e.g. vector: V(i,v); matrix: M(i,j,v)
– Computation → query:
a+b → SELECT A.I,A.V+B.V FROM A,B WHERE A.I=B.I
– Leverages power of DB only at intra-operation level!
• Key: Translate operations to view definitions
d<-sqrt((x-xs)^2+(y-ys)^2)+…
…
z <- d[s]
CREATE VIEW T1(I,V) AS SELECT X.I,X.V-xs FROM X;
CREATE VIEW T2(I,V) AS SELECT T1.I, POW(T1.V,2) FROM T1;
…
CREATE VIEW D(I,V) AS SELECT T6.I,
  T6.V+T12.V FROM T6,T12 WHERE T6.I=T12.I;
CREATE VIEW Z(I,V) AS SELECT S.I, D.V
  FROM D,S WHERE D.I=S.V;
Resulting query for z:
SELECT S.I,
  SQRT(POW(X.V-xs,2)+POW(Y.V-ys,2))
  +
  SQRT(POW(X.V-xe,2)+POW(Y.V-ye,2))
FROM X,Y,S
WHERE X.I=Y.I AND X.I=S.V
– Build up larger and larger views a step at a time
– Evaluate only when needed → deferred evaluation
– Query optimization → selective evaluation + more
– Iterator-style execution → no intermediate results
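The view idea can be mimicked in plain R without a database. The sketch below is an analogy, not RIOT-DB's implementation: a hypothetical S3 class lazyvec stores a length and an index-to-value function, operators only compose such functions, and values are computed when force_all is called. The counter touched and all names here are assumptions for illustration.

```r
touched <- 0  # counts base elements actually read

# A lazy vector is a length plus a function mapping indices to values.
lazy <- function(len, at) structure(list(len = len, at = at), class = "lazyvec")
as_lazy <- function(v) {
  lazy(length(v), function(i) { touched <<- touched + length(i); v[i] })
}

# Arithmetic builds a new "view" over its inputs; nothing is computed yet.
Ops.lazyvec <- function(e1, e2) {
  op  <- get(.Generic)
  f1  <- if (inherits(e1, "lazyvec")) e1$at else function(i) e1
  f2  <- if (inherits(e2, "lazyvec")) e2$at else function(i) e2
  len <- if (inherits(e1, "lazyvec")) e1$len else e2$len
  lazy(len, function(i) op(f1(i), f2(i)))
}
# sqrt and other Math functions are deferred in the same way.
Math.lazyvec <- function(x, ...) {
  f  <- get(.Generic)
  at <- x$at
  lazy(x$len, function(i) f(at(i)))
}
# Subsetting composes index mappings, still deferring evaluation.
`[.lazyvec` <- function(x, i) {
  at <- x$at
  lazy(length(i), function(k) at(i[k]))
}
force_all <- function(x) x$at(seq_len(x$len))

set.seed(7)
n  <- 100000
xr <- runif(n); yr <- runif(n)
xs <- 0; ys <- 0; xe <- 1; ye <- 1
x <- as_lazy(xr); y <- as_lazy(yr)

d <- sqrt((x - xs)^2 + (y - ys)^2) + sqrt((x - xe)^2 + (y - ye)^2)
s <- sample(n, 100)
z <- d[s]
touched_before <- touched   # nothing has been read so far
zv <- force_all(z)          # only the sampled elements are computed

eager <- sqrt((xr - xs)^2 + (yr - ys)^2) + sqrt((xr - xe)^2 + (yr - ye)^2)
stopifnot(isTRUE(all.equal(zv, eager[s])))
```

The user-level statements for d and z are the same as in the eager version; only the types of x and y differ.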
RIOT-DB Demo
• RIOT-DB built using MySQL with MyISAM engine
Performance of RIOT-DB
• Plain R
• RIOT-DB variants
– RIOT-DB/Strawman: use DB to store arrays and execute individual ops; no use of views to defer evaluation
– RIOT-DB/MatNamed: use views, but compute/materialize every named object
– RIOT-DB: full version; defer/optimize across statements
Lessons Learned
• DB-style inter-operation optimization is really the key!
• Can we do better?
– DB arrays carry too much overhead (ASAP [Stonebraker’07])
• Extra columns in V(i, v), M(i, j, v), …; more for higher dims
– SQL & relational algebra may not be the right abstraction
• Advanced data layouts and complex ops are awkward
RIOT: The Next Generation
– A new expression algebra closer to numerical computation
– Flexible array storage/layout options
– Optimizations better tailored for numerical computation
– … and more
RIOT Expression Algebra
• Analogous to the view mechanism, but more flexible
• Operators
– +, –, *, /, [, …
– A[idxRange]<-newVals: turn updates into functional ops
• Instead of in-place updates, log them & define Anew over (Aold,log)
– X%*%Y (matrix multiply) etc.: built-in, for high-level opt.
• E.g. matrix chain multiplication: (XY)Z or X(YZ)?
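The matrix chain question can be made concrete with a small cost count. This sketch assumes the textbook cost model of p*q*r scalar multiplications for a (p x q) by (q x r) product; the matrix shapes and the helper mult_cost are hypothetical, chosen only to make the gap visible.

```r
# Cost (scalar multiplications) of multiplying a p x q by a q x r matrix.
mult_cost <- function(p, q, r) p * q * r

# Hypothetical shapes chosen for illustration.
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
Y <- matrix(rnorm(5 * 200), 5, 200)
Z <- matrix(rnorm(200 * 5), 200, 5)

# (XY)Z: builds a 200 x 200 intermediate.
cost_xy_z <- mult_cost(nrow(X), ncol(X), ncol(Y)) +
             mult_cost(nrow(X), ncol(Y), ncol(Z))

# X(YZ): builds only a 5 x 5 intermediate.
cost_x_yz <- mult_cost(nrow(Y), ncol(Y), ncol(Z)) +
             mult_cost(nrow(X), ncol(X), ncol(Z))

# Both orders give the same product, up to floating-point rounding.
left  <- (X %*% Y) %*% Z
right <- X %*% (Y %*% Z)
stopifnot(isTRUE(all.equal(left, right)))
```

Keeping %*% as a built-in operator in the algebra is what lets an optimizer see the whole chain and pick the cheaper order.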
Processing/Layout Optimization
• Matrix multiplication T = A (n1×n2) × B (n2×n3), with fixed memory size M
[Figure: block diagrams of T = A x B for each algorithm]
R: Plain algorithm
  For each row i of A:
    For each column j of B:
      T[i,j] <- sum(A[i,] * B[,j])
RIOT-DB: Hashjoin-sort-aggregate
Optimal I/O cost: n1·n2·n3/(B·M^(1/2))
BNLJ-inspired algorithm
  Read as many rows of A as possible:
    Use one block to scan B in column-major order:
      Update elements in T
Blocked algorithm
  Divide memory into 3 equal parts
  Divide each matrix into square blocks
  For each chunk (i,j) in T:
    For k=1…p:
      Read chunk (i,k) from A and chunk (k,j) from B
      chunk T(i,j) += A(i,k) %*% B(k,j)
    Write chunk T(i,j)
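The blocked algorithm's loop structure can be sketched in R as follows. This is an in-memory illustration only: subsetting stands in for reading a chunk from disk and assignment for writing it back, and the function name blocked_matmul and block size bs are assumptions.

```r
# Blocked matrix multiply: out = A %*% B, processed in bs x bs chunks.
blocked_matmul <- function(A, B, bs) {
  n1 <- nrow(A); n2 <- ncol(A); n3 <- ncol(B)
  stopifnot(nrow(B) == n2, bs >= 1)
  out <- matrix(0, n1, n3)
  for (i0 in seq(1, n1, by = bs)) {
    I <- i0:min(i0 + bs - 1, n1)
    for (j0 in seq(1, n3, by = bs)) {
      J <- j0:min(j0 + bs - 1, n3)
      acc <- matrix(0, length(I), length(J))   # chunk (i,j) of the result
      for (k0 in seq(1, n2, by = bs)) {
        K <- k0:min(k0 + bs - 1, n2)
        # "Read" chunk (i,k) of A and chunk (k,j) of B, then accumulate.
        acc <- acc + A[I, K, drop = FALSE] %*% B[K, J, drop = FALSE]
      }
      out[I, J] <- acc                         # "write" chunk (i,j) once
    }
  }
  out
}

set.seed(3)
A <- matrix(rnorm(7 * 5), 7, 5)
B <- matrix(rnorm(5 * 6), 5, 6)
res <- blocked_matmul(A, B, bs = 3)
stopifnot(isTRUE(all.equal(res, A %*% B)))
```

Each result chunk is written once, while chunks of A and B are read once per inner iteration, which is where the saving over the row-by-column loop comes from.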
Conclusion
• I/O efficiency can be added transparently
– Ditch SQL at user level for broader impact!
• DB-style inter-operation optimization is critical
– Need to go beyond developing I/O-efficient algorithms and libraries
• Integration of DB and programming languages
– Lots of interesting analogies and new opportunities
Q&A
RIOT photos by Zack Gold (www.zackgold.com)