A Unified Programming Model
and Platform for Big Data Machine
Learning & Data Mining
Yihua Huang, Ph.D., Professor
Email:[email protected]
NJU-PASA Lab for Big Data Processing
Department of Computer Science and Technology
Nanjing University
May 29, 2015, India
PASA Big Data Lab at Nanjing University
Our lab studies Parallel Algorithms, Systems, and Applications (PASA) for Big Data Processing
• We are the earliest big data lab in China, having entered the big data research area in 2009
• We are now contributors to Apache Spark and Tachyon
Parallel Computing Models and Frameworks & Hadoop/Spark Performance Optimization
– Hadoop job and resource scheduling optimization
– Spark RDD persisting optimization
Big Data Storage and Query
– Tachyon optimization
– Performance benchmarking tools for Tachyon and DFS
– HBase secondary indexing (HBase + In-memory) and query system
Large-Scale Semantic Data Storage and Query
– Large-scale RDF semantic data storage and query system (HBase + In-memory)
– RDFS/OWL semantic reasoning engines on Hadoop and Spark
Machine Learning Algorithms and Systems for Big Data Analytics
– Parallel MLDM algorithm design with diversified parallel computing platforms
– Unified programming model and platform for MLDM algorithm design
Contents
Part 1.
Parallel Algorithm Design for
Machine Learning and Data Mining
Part 2.
Unified Programming Model and Platform
for Big Data Analytics
Part 1.
Parallel Algorithm Design for
Machine Learning and Data Mining
• A variety of Big Data parallel computing platforms (Hadoop, Spark, MPI, etc.) are emerging
• Serial machine learning algorithms cannot finish computation on large-scale datasets in acceptable time
• Existing serial algorithms do not fit any of the existing parallel computing platforms, so they need to be rewritten in parallel for each platform
• Our lab entered the Big Data area in 2009, starting by writing a variety of parallel Machine Learning algorithms on Hadoop, Spark, etc.
• Frequent Itemset Mining (FIM) is one of the most important and most often used algorithms in data mining
• The Apriori algorithm is the most established algorithm for finding frequent itemsets in a transactional dataset
Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2011), pp. 252-257, 2011.
Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014, Phoenix, USA.
Suppose I is an itemset consisting of items from the transaction database D
Let N be the number of transactions in D
Let M be the number of transactions that contain all the items of I
M/N is referred to as the support of I in D
Example
Here, N = 4; let I = {I1, I2}, then M = 2, because I = {I1, I2} is contained in transactions T100 and T400, so the support of I is 0.5 (2/4 = 0.5)
If sup(I) is no less than a user-defined threshold, then I is referred to as a frequent itemset
Goal of frequent itemset mining
To find all frequent k-itemsets in a transaction database (k = 1, 2, 3, ...)
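To make the definition concrete, here is a minimal R sketch (not from the slides) that computes the support of an itemset over an in-memory transaction list; the contents of the four example transactions are hypothetical, chosen only to match the numbers above (N = 4, two transactions containing both I1 and I2).

support <- function(itemset, transactions) {
  # count the transactions that contain every item of the itemset
  M <- sum(sapply(transactions, function(t) all(itemset %in% t)))
  M / length(transactions)
}

transactions <- list(
  T100 = c("I1", "I2", "I5"),   # hypothetical contents
  T200 = c("I2", "I4"),
  T300 = c("I2", "I3"),
  T400 = c("I1", "I2", "I4")
)
support(c("I1", "I2"), transactions)   # 0.5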
Apriori algorithm
• A classic frequent itemset mining algorithm
• Needs multiple passes over the database
• In the first pass, all frequent 1-itemsets are discovered
• In each subsequent pass, frequent (k+1)-itemsets are discovered, using the frequent k-itemsets found in the previous pass as the seed for generating candidate itemsets
• Repeat until no more frequent itemsets can be found
• Apriori Algorithm [1] (a serial sketch of the passes is shown after the reference):
[1] Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499
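The following serial R sketch (not the authors' implementation, and without the candidate-pruning step of [1]) illustrates the iterative passes described above; it reuses the hypothetical support() function and transactions list from the earlier sketch. The MapReduce and Spark versions discussed next parallelize exactly this loop.

apriori <- function(transactions, min_sup) {
  items <- sort(unique(unlist(transactions)))
  # pass 1: frequent 1-itemsets
  Lk <- Filter(function(i) support(i, transactions) >= min_sup,
               lapply(items, c))
  frequent <- Lk
  while (length(Lk) > 1) {
    # generate (k+1)-candidates by joining pairs of frequent k-itemsets
    cand <- unique(do.call(c, lapply(Lk, function(a)
      lapply(Lk, function(b) sort(unique(c(a, b)))))))
    k1 <- length(Lk[[1]]) + 1
    cand <- Filter(function(x) length(x) == k1, cand)
    # keep the candidates whose support reaches the threshold
    Lk <- Filter(function(x) support(x, transactions) >= min_sup, cand)
    frequent <- c(frequent, Lk)
  }
  frequent
}
apriori(transactions, min_sup = 0.5)

On the toy transactions with min_sup = 0.5, this returns the frequent 1-itemsets {I1}, {I2}, {I4} and the frequent 2-itemsets {I1, I2} and {I2, I4}.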
• The FIM process is both data-intensive and computing-intensive:
– transactional datasets are becoming larger and larger
– iteratively trying all combinations from 1-itemsets to k-itemsets is time-consuming
– FIM needs to scan the dataset iteratively, many times
Apriori in MapReduce:
Experimental results
PSON achieves a large speedup compared to the original SON algorithm
• The parallel Apriori algorithm with MapReduce needs to run the MapReduce job iteratively
• It needs to scan the dataset iteratively and store all the intermediate data in HDFS
• As a result, the parallel Apriori algorithm with MapReduce is not efficient enough
• YAFIM, the Apriori algorithm implemented in the Spark model, gains about an 18x speedup in our experiments
• YAFIM contains two phases to find all frequent itemsets:
– Phase I: load the transaction dataset as a Spark RDD object and generate the frequent 1-itemsets;
– Phase II: iteratively generate the frequent (k+1)-itemsets from the frequent k-itemsets.
Figure: YAFIM workflow. In Phase I all transaction data is loaded into an RDD; in Phase II the transaction data resides in the RDD and is scanned iteratively in memory.
• Methods to speed up performance:
– In-memory computing with RDDs: we make full use of RDDs and complete the entire computation in memory
– Sharing data with broadcast: we adopt Spark's broadcast variable abstraction to reduce data transfer among tasks
• We ran experiments with both programs on four benchmarks [3] with different characteristics:
– MushRoom
– T10I4D100K
– Chess
– Pumsb_star
Achieving about an 18x speedup with Spark compared to the algorithm with MapReduce
We also applied YAFIM to a medical text semantic analysis application and achieved a 25x speedup.
Figure: execution time of YAFIM vs. MApriori over passes of iteration (1 to 6) for the medicine text semantic analysis application.
Parallel K-Means Clustering
Basic Algorithm
Input: a dataset of N data points to be clustered into K clusters
Output: K clusters
Choose K initial cluster centers Centers[K]
Loop:
  for each data point P in the dataset:
  {
    calculate the distance between P and each Centers[i];
    assign P to the nearest cluster center
  }
  recalculate the new Centers[K]
Repeat the loop until the cluster centers converge
Pseudo code for MapReduce
class Mapper
  setup(…)
  { read the K cluster centers into Centers[K]; }
  map(key, p) // p is a data point
  {
    minDis = Double.MAX_VALUE;
    index = -1;
    for i = 0 to Centers.length - 1
    { dis = ComputeDist(p, Centers[i]);
      if dis < minDis
      { minDis = dis;
        index = i;
      }
    }
    emit(Centers[index].ClusterID, (p, 1)); // assign p to its nearest center
  }
Pseudo code for MapReduce
To reduce data I/O and network transfer, a Combiner can be used to cut down the number of key-value pairs emitted from each Map node
class Combiner
  reduce(ClusterID, [(p1,1), (p2,1), …])
  {
    pm = 0.0;
    n = total number of data points in the list [(p1,1), (p2,1), …];
    for i = 0 to n - 1
      pm += p[i];
    pm = pm / n; // calculate the local average of the points assigned to this cluster
    emit(ClusterID, (pm, n)); // pass the local average and count to the Reducer
  }
Pseudo code for MapReduce
class Reducer
  reduce(ClusterID, valueList = [(pm1,n1), (pm2,n2), …])
  {
    pm = 0.0; n = 0;
    k = length of valueList for this ClusterID;
    for i = 0 to k - 1
    { pm += pm[i] * n[i]; n += n[i]; }
    pm = pm / n; // calculate the new center of the cluster
    emit(ClusterID, (pm, n)); // output the new center of the cluster
  }
In the main() function of the MapReduce job, loop the job until the centers converge
Scala code (driver loop; data is the RDD of points, kPoints holds the current centers, and closestPoint/squaredDist are helper functions)
while (tempDist > convergeDist && tempIter < MaxIter) {
  // determine the nearest center for each point p
  val closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
  // sum the points and counts per cluster, then average to get the new centers
  val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
  val newPoints = pointStats.map { pair => (pair._1, pair._2._1 / pair._2._2) }.collectAsMap()
  // measure how far the centers moved to decide convergence
  tempDist = 0.0
  for (i <- 0 until K)
    tempDist += kPoints(i).squaredDist(newPoints(i))
  // update the centers
  for (newP <- newPoints)
    kPoints(newP._1) = newP._2
  tempIter = tempIter + 1
}
Spark achieves about a 4-5x speedup compared to MapReduce
Figure: k-means execution time (s) vs. number of nodes, for the 1st iteration and subsequent iterations.
Peng Liu, Jiayu Teng, Yihua Huang. Study of K-means Algorithm Parallelization Performance Based on Spark. CCF Big Data 2014.
Parallel Naive Bayes Classification
Basic Idea
Given m classes from the training dataset, {C_1, C_2, ..., C_m}, predict which class a test sample X = (x_1, ..., x_n) belongs to:
    c_{map} = \arg\max_{1 \le i \le m} P(C_i \mid X)
By Bayes' rule,
    P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}
Since P(X) is the same for every class, we only need to calculate P(X \mid C_i) P(C_i).
Assuming the features x_k are conditionally independent given the class,
    P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
Thus both P(x_k \mid C_i) and P(C_i) can be estimated by counting over the training samples.
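As a small illustration of the formula above, the following R sketch scores one test sample by combining a class prior table P(Ci) with per-feature conditional tables P(xj|Ci); the probability values are hypothetical, and the code is serial and independent of the MapReduce/SparkR implementations below.

# Hypothetical tables with two classes and two binary features
prior <- c(C0 = 0.6, C1 = 0.4)                     # P(Ci)
lik   <- list(                                     # P(xj = value | Ci)
  C0 = list(x1 = c(`0` = 0.7, `1` = 0.3), x2 = c(`0` = 0.4, `1` = 0.6)),
  C1 = list(x1 = c(`0` = 0.2, `1` = 0.8), x2 = c(`0` = 0.5, `1` = 0.5))
)
naive_bayes_predict <- function(x) {
  scores <- sapply(names(prior), function(ci) {
    # P(Ci) * prod_j P(xj | Ci)
    prior[[ci]] * prod(mapply(function(feat, val) lik[[ci]][[feat]][[as.character(val)]],
                              names(x), x))
  })
  names(which.max(scores))                         # class with the largest score
}
naive_bayes_predict(c(x1 = 1, x2 = 0))             # "C1" for these tables

For the hypothetical tables above, the sample (x1 = 1, x2 = 0) scores 0.072 for C0 and 0.16 for C1, so C1 is predicted.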
Training Map pseudo code to calculate P(xj|Ci) and P(Ci)
class Mapper
  map(key, tr) // tr is a training sample
  {
    tr → trid, X, Ci        // parse the sample into its id, feature vector X, and class label Ci
    emit(Ci, 1)
    for j = 0 to X.length - 1
    { X[j] → xnj & xvj      // xnj: name of xj, xvj: value of xj
      emit(<Ci, xnj, xvj>, 1)
    }
  }
Training Reduce pseudo code to calculate P(xj|Ci) and P(Ci)
class Reducer
  reduce(key, value_list) // key is either Ci or <Ci, xnj, xvj>
  {
    sum = 0; // count used for P(xj|Ci) or P(Ci)
    while (value_list.hasNext())
      sum += value_list.next().get();
    emit(key, sum)
  }
// Convert the counts into the P(xj|Ci) and P(Ci) tables and save them in HDFS
Predict Map pseudo code to predict test samples
class Mapper
  setup(…)
  { load the P(xj|Ci) and P(Ci) tables from the training stage:
    FC = { (Ci, P(Ci)) }, FxC = { (<Ci, xnj, xvj>, P(xj|Ci)) }
  }
  map(key, ts) // ts is a test sample
  { ts → tsid, X
    MaxF = MIN_VALUE; idx = -1;
    for (i = 0 to FC.length - 1)
    { FXCi = 1.0; Ci = FC[i].Ci; FCi = FC[i].P(Ci)
      for (j = 0 to X.length - 1)
      { xnj = X[j].xnj; xvj = X[j].xvj
        look up <Ci, xnj, xvj> in FxC to get P(xj|Ci)
        FXCi = FXCi * P(xj|Ci);
      }
      if (FXCi * FCi > MaxF) { MaxF = FXCi * FCi; idx = i; }
    }
    emit(tsid, FC[idx].Ci)  // output the most probable class for this test sample
  }
Training SparkR code to calculate P(xj|Ci) and P(Ci)
parseVector <- function(line) { # map a line to list(Ci, list(1, features)) }
sc <- sparkR.init(master, "NaiveBayes")    # init Spark
file <- textFile(sc, dataFile)             # read the training text data file => RDD
lines <- lapply(file, parseVector)         # do map
# sum up to count the occurrences of Ci and xj
aggre <- reduceByKey(lines, function(p1, p2) {
  list(p1[[1]] + p2[[1]], p1[[2]] + p2[[2]]) }, 2L)
cltaggr <- collect(aggre)                  # localize the dataset
C <- length(cltaggr)                       # total number of classes
# calculate the total count of each Ci from cltaggr
lapply(cltaggr, function(p) { # calculate and save P(xj|Ci) and P(Ci) })
Predict SparkR code
predict <- function(d) {
  dataMatrix <- as.matrix(d)
  result <- P(Ci) + P(xj|Ci) %*% dataMatrix  # combine the prior and likelihood terms (placeholder names from the slide)
  which.max(result) - 1                      # return the class with the maximum score; Ci starts from 0
}
predictRDD <- function(data) { map(data, predict) }
tFile <- textFile(sc, dataFile)
testData <- map(tFile, function(p) { as.double(strsplit(p, " ")[[1]]) })
ClassLabel <- collect(predictRDD(testData))
# save the predicted class labels to a file
Naive Bayes with MapReduce and SparkR

Training dataset size (thousands) | Hadoop | SparkR | Speedup
250                               | 35 s   | 13 s   | 2.69
500                               | 40 s   | 14 s   | 2.85
1000                              | 49 s   | 16 s   | 3.06
2000                              | 66 s   | 18 s   | 3.67
SVM and Logistic Regression with
MapReduce and SparkR
Iterations | Hadoop | SparkR (no cache) | SparkR (cache) | Speedup
10         | 374 s  | 103 s             | 43 s           | 8.7
20         | 720 s  | 183 s             | 68 s           | 10.6
30         | 1065 s | 274 s             | 94 s           | 11.3
Zhiqiang Liu, Rong Gu, Yihua Huang.
The Parallelization of Classification Algorithms Based on SparkR.
CCF Big Data 2014, Beijing
Large Scale Deep Learning on Intel Xeon Phi
Manycore Coprocessor with OpenMP
                                       | 60 cores | 30 cores
Baseline                               | 16024 s  | 15960 s
OpenMP                                 | 892 s    | 2122 s
OpenMP + MKL                           | 97 s     | 120 s
Improved OpenMP + MKL                  | 53 s     | 81 s
Speedup (fully optimized vs. baseline) | 302x     | 197x
Lei Jin, Rong Gu, Chunfeng Yuan and Yihua Huang. Large Scale Deep Learning on Xeon Phi Many-core Coprocessor. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014, Phoenix, USA.
Large Scale Learning to Rank based on Gradient Boosting Decision Tree (GBDT) with MPI
Research grant from Baidu
The parallel GBDT algorithm implemented with MPI achieves a 1.5x speedup compared with Baidu's existing GBDT implementation
Customized Lightweight Parallel Computing Platform for Large Scale Neural Network Training
Rong Gu, Furao Shen, and Yihua Huang. A Parallel Computing Platform for Training Large Scale Neural Networks. Proceedings of the IEEE International Conference on Big Data (IEEE BigData 2013), pp. 376-384, Santa Clara, CA, USA, Oct. 6-9, 2013.
• Existing parallel computing platforms provide useful means for Big Data machine learning and data analytics; however, they are not easy for data analysts to learn and use
• When switching to a different parallel computing platform, all machine learning algorithms have to be rewritten, which is a heavy burden even for professional parallel programmers
• As a result, we need an easy-to-use and unified programming model and platform for Big Data machine learning and data analytics
Part 2
Unified Programming Model and
Platform for Big Data Analytics
Motivation
Big Data Processing Platforms: from not available to available, from slow to fast, and from not easy to use to easy to use
Motivation
Figure: data analysts and analytic tools model problems with matrices and vectors, while Big Data processing platforms expose low-level programming models (MPI with Fortran/C++ and ScaLAPACK; GPU with CUDA and BIDMach; Spark RDD with Scala; Hadoop MapReduce); there is a big gap between the two.
Motivation
Problem for data analysts:
a big gap between data analysts and the parallel computing platforms (MPI, Spark, MapReduce, …)
Motivation
What do we do about this?
We provide a unified and easy-to-use programming model and platform to bridge the gap between data analysts and the parallel computing platforms (MPI, Spark, MapReduce, …)
Motivation
Problem for professional parallel programmers:
several parallel computing platforms (MPI, Spark, MapReduce, …) multiplied by dozens of machine learning algorithms means a lot of duplicated work: the burden of rewriting hundreds of ML algorithms across the different platforms
Motivation
What do we do about this?
We provide a unified programming model and platform that lets parallel programmers write their MLDM algorithms once and run them anywhere (MPI, Spark, MapReduce, …)!
Octopus:
A Unified Programming Model and Platform
Basic Idea
• Most machine learning and data mining (MLDM) algorithms can be expressed as matrix computations, so we adopt the matrix as a unified abstraction for representing a variety of MLDM algorithms (a small matrix-only sketch follows this list)
• Provide a high-level MLDM programming model based on matrices
• Provide a unified programming language and software framework for MLDM programming
• Implement plug-ins for each underlying parallel computing platform, mapping high-level matrix-based MLDM programs onto that platform
• Implement optimized large-scale matrix operations for each underlying platform to speed up computation and improve performance
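To make the matrix abstraction concrete, here is a minimal single-node R sketch (plain R, not Octopus code) of logistic regression training written only with matrix operations; under Octopus the same operations (%*%, t, element-wise arithmetic) would be issued on distributed OctMatrix objects instead of local R matrices.

# Gradient descent for logistic regression using only matrix operations
# X: n x d feature matrix, y: n x 1 label vector (0/1), w: d x 1 weights
logistic_regression <- function(X, y, iters = 100, lr = 0.1) {
  w <- matrix(0, ncol(X), 1)
  for (i in 1:iters) {
    p <- 1 / (1 + exp(-(X %*% w)))      # predictions
    grad <- t(X) %*% (p - y) / nrow(X)  # gradient of the log-loss
    w <- w - lr * grad                  # update step
  }
  w
}

# Toy data (hypothetical), just to show the call
X <- cbind(1, matrix(rnorm(200), 100, 2))
y <- matrix(as.numeric(X[, 2] + X[, 3] > 0), 100, 1)
w <- logistic_regression(X, y)

Nothing in this loop needs anything beyond matrix operations, which is exactly what OctMatrix provides in distributed form.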
Octopus:
A Unified Programming Model and Platform
• We initiated the Octopus research project to develop a cross-platform, unified MLDM programming model, framework, and platform
• A high-level and unified programming model and platform for big data analytics and mining
• Allows data analysts and big data application programmers to easily design and implement machine learning and data mining algorithms for big data analytics
• Works transparently on top of various distributed computing frameworks
Octopus:
A Unified Programming Model and Platform
• Design and implement distributed matrix computation packages with Spark, MapReduce, and MPI
• Adopt R as the unified programming language for both data analysts and parallel programmers
• Design and implement the whole framework so that matrix-based MLDM algorithms run transparently on top of Spark, MapReduce, and MPI without any code changes
• Design and provide a parallel MLDM algorithm library
Octopus:
A Unified Programming Model and Platform
Architectural Overview
Figure: layered architecture of Octopus.
– Demo applications: LR, SVM, Deep Learning, and other ML algorithms
– OctMatrix: an R package and APIs for distributed matrix operations, including a matrix execution optimization module and a connection model for the underlying matrix libraries
– Underlying matrix libraries (developed by us): SparkMatrix (Marlin), MR-Matrix, MPI-Matrix, and R-Matrix
– Execution engines (open source): Spark, MapReduce, MPI, and single-node R
– Matrix data representation and storage: Tachyon and HDFS
OctMatrix: Distributed Matrix Computation Lib
> OctMatrix is an R package that provides APIs for high-level, platform-independent distributed matrix operations, allowing the underlying Matrix Lib to be called from the R language
> The OctMatrix APIs cover:
  * loading and managing large-scale matrix data
  * calling the Matrix Lib for large-scale matrix computation, with automated partitioning into sub-matrices and scheduling for distributed execution
  * calling the R-Matrix lib for small matrices that can be handled on a single machine
OctMatrix: Distributed Matrix Computation Lib
Code Structure and API of OctMatrix
Figure: class structure of OctMatrix. The user-facing OctMatrix class carries the attributes Mat_Type, Storage_Location, and Support_NativeTachyon and is implemented by the platform-specific classes Spark_MatRef, MR_MatRef, MPI_MatRef, R_MatRef, and NativeTachyon_Ref. The exposed methods include:
  initialization();    // initialize a matrix from a local/HDFS/Tachyon file, an R matrix/vector, or a two-dimensional array; special matrices (zeros, ones) are also supported
  matrixOperations();  // matrix functions such as decomposition, transposition, sum, etc.
  matrixOperator();    // matrix operators, i.e. the various kinds of add, subtract, multiply, and divide
  apply(); toLocalRMatrix(); toArray(); saveToTachyon(); enableNativeTachyon();
  sample(); dim(); getRow(); getElement(); getSubMatrix(); delete(); …
OctMatrix: Distributed Matrix Computation Lib
OctMatrix APIs provided to Users
• Matrix Initialization/Exportation
  – initialize an OctMatrix from the local file system/HDFS/Tachyon
  – save an OctMatrix from/to the local file system/HDFS/Tachyon
  – convert an OctMatrix from/to a native R matrix
  – construct special matrices, API: ones, zeros
  – …
• Matrix Operators
  – element-wise/numeric matrix multiply, add, minus, division (API: *, +, -, /)
  – matrix multiply (API: %*%)
  – bind x and y by columns (API: cbind2)
  – …
• Matrix Operations
  – get the rows and cols of a matrix, API: dim
  – the inverse of an OctMatrix, API: inv
  – statistical functions, API: max, min, mean, sum
  – matrix transposition, API: t
  – matrix decomposition, API: lu, svd, etc.
  – apply a function to a matrix, API: apply(OctMatrix, MARGIN, FUN)
  – functions contained in R matrix, such as rep and split
  – get a sub-matrix
  – …
A small usage sketch with these APIs follows below.
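As a usage illustration, here is a hedged sketch of what a short OctMatrix session could look like. The operator and function names (ones, zeros, %*%, t, dim, max, apply) come from the API list above; the constructor argument order and everything else are assumptions for illustration, not the package's confirmed API.

# Hypothetical OctMatrix session; argument order (rows, cols) is assumed
A <- ones(1000, 500)          # special-matrix constructor (API: ones)
B <- zeros(500, 200)          # special-matrix constructor (API: zeros)
C <- A %*% B                  # distributed matrix multiply (API: %*%)
dim(C)                        # rows and cols of the result (API: dim)
m <- max(C)                   # statistical function (API: max)
At <- t(A)                    # matrix transposition (API: t)
rs <- apply(A, 1, sum)        # apply a function to each row in parallel (API: apply)

With the engine type switched (as demonstrated later), the same lines would execute on Spark, MapReduce, MPI, or a single R node.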
OctMatrix: Distributed Matrix Computation Lib
Partitioning and Parallel Execution of Distributed Matrices
Figure: automated large-scale matrix partitioning and optimized execution; sub-matrices are scheduled and dispatched to the server nodes of the Spark cluster.
OctMatrix: Distributed Matrix Computation Lib
Optimized Distributed Matrix Multiplication
Three types of matrix representations:
Local Matrix: a moderately sized matrix that can be stored and computed on the local machine
Broadcast Matrix: a small matrix that can be broadcast to every machine node
Distributed Matrix: a large matrix that needs to be partitioned and stored across distributed machine nodes
A Distributed Matrix is further divided into two types:
– Row Matrix
– Block Matrix (a serial sketch of block-wise multiplication follows below)
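To illustrate the block representation, here is a minimal serial R sketch of block-wise matrix multiplication: both operands are split into blocks, matching blocks are multiplied, and the partial products are accumulated. It only illustrates the blocking idea, not Marlin's distributed scheduling.

# Multiply A (m x k) and B (k x n) block-by-block with block size bs
block_multiply <- function(A, B, bs = 2) {
  m <- nrow(A); k <- ncol(A); n <- ncol(B)
  C <- matrix(0, m, n)
  row_splits <- split(1:m, ceiling((1:m) / bs))
  col_splits <- split(1:n, ceiling((1:n) / bs))
  mid_splits <- split(1:k, ceiling((1:k) / bs))
  for (I in row_splits)
    for (J in col_splits)
      for (K in mid_splits)
        # accumulate the product of block (I, K) of A and block (K, J) of B
        C[I, J] <- C[I, J] + A[I, K, drop = FALSE] %*% B[K, J, drop = FALSE]
  C
}

A <- matrix(rnorm(6 * 4), 6, 4)
B <- matrix(rnorm(4 * 5), 4, 5)
all.equal(block_multiply(A, B), A %*% B)   # TRUE

Marlin chooses among blocking and broadcasting strategies (HAMA Blocking, CARMA Blocking, Broadcasting) according to the matrix shapes and sizes, as described below.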
OctMatrix: Distributed Matrix Computation Lib
Optimized Distributed Matrix Multiplication
Execution Strategies
Define the following optimized execution strategies
for matrix multiplication:
OctMatrix: Distributed Matrix Computation Lib
Marlin: Optimized Distributed Matrix Multiplication with Spark
OctMatrix: Distributed Matrix Computation Lib
Optimized Distributed Matrix Multiplication
> For large-scale matrix multiplication, how the matrices are partitioned is critical to computation performance
> We developed an algorithm that automatically chooses the matrix partitioning and execution strategy according to the shapes and sizes of the matrices, and then schedules the sub-tasks for parallel execution
Partitioning strategies: HAMA Blocking, CARMA Blocking, and Broadcasting
OctMatrix: Distributed Matrix Computation Lib
Marlin: Optimized Distributed Matrix Multiplication with Spark
Figure: execution time for multiplying a big matrix by a small matrix and for multiplying two big matrices.
Figure: Marlin achieves a 4-5x speedup compared to SparkR (matrix multiply, 96 partitions, 10 GB executor memory, except case 3_5 which uses 20 GB).
OctMatrix Data Representation and Storage
> Matrix data can be stored in local files, HDFS, and Tachyon, and can be read from and written to these file systems from R programs
> Matrix data is organized and stored with the following directory structure:
\Octopus_HOME
  \user-session-id1\
    \matrix-a
      info
      row_index
      \row-data
        par1.data
        …
        parN.data
      col_index
      \col-data
        par1.data
        …
        parN.data
    \matrix-b
    \matrix-c
  \user-session-id2\
    …
  \user-session-id3\
    …
Machine Learning Lib built with OctMatrix
• Classification and regression
– Linear Regression
– Logistic Regression
– Softmax
– Linear Support Vector Machine (SVM)
• Clustering
– K-Means
• Feature extraction
– Deep Neural Network (Auto-Encoder)
• More MLDM algorithms to come
How Octopus Works
> Uses the standard R programming platform and lets users write and implement code for a variety of MLDM algorithms based on the large-scale matrix computation model
> Octopus has been integrated with Spark, Hadoop MapReduce, and MPI, allowing seamless switching and execution on top of the underlying platforms
Figure: Octopus runs on top of Spark, Hadoop MapReduce, MPI, or a single machine.
Octopus Features Summary
• Easy-to-use, High-level User APIs
  – high-level matrix operator and operation APIs
  – similar to the Matrix/Vector operation APIs in the standard R language
  – requires no low-level knowledge of distributed systems or distributed programming skills
• Write Once, Run Anywhere
  – programs written with Octopus can transparently run on top of different computing engines such as Spark, Hadoop MapReduce, or MPI
  – the OctMatrix APIs can be used with small data on a single-machine R engine for testing, and the same program then runs on large-scale data without modifying the code
  – supports a number of I/O sources, including Tachyon, HDFS, and local file systems
Octopus Features Summary
• Distributed R apply() Functions
  – offers the apply() function on OctMatrix; the parameter function is executed on each element/row/column of the OctMatrix on the cluster in parallel
  – the parameter functions passed to apply() can be any R functions, including UDFs
• Machine Learning Algorithm Library
  – implements a set of scalable machine learning algorithms and demo applications built on top of OctMatrix
• Seamless Integration with the R Ecosystem
  – offers its features in an R package called OctMatrix
  – naturally takes advantage of the rich resources of the R ecosystem
Demonstrations
Read/Write Octopus Matrix
Demonstrations
A Variety of R Functions on Octopus
Demonstrations
Logistic Regression
Training
Testing
Predicting
Changing the "enginetype" parameter switches execution to a different underlying platform without modifying any other code
Demonstrations
K-Means
Algorithm
Testing
Demonstrations
Linear Regression
Algorithm
Testing
Demonstrations
Code Style Comparison between R and Octopus
LR Codes with
Standard R
LR Codes with
Octopus
Demonstrations
Code Style Comparison between R and Octopus
K-Means Codes with Standard R
K-Means Codes with Octopus
Demonstrations
Algorithm with MPI and Hadoop MapReduce
Start an MPI daemon to run MPI-Matrix in the background
Linear algebra running with MPI
Demonstrations
Algorithm with MPI and Hadoop MapReduce
Linear Algebra running
with Hadoop MapReduce
Octopus Project Website and Documents
http://pasa-bigdata.nju.edu.cn/octopus/
Project Team
Yihua Huang, Rong Gu, Zhaokang Wang,
Yun Tang, Haipeng Zhan
Contact Information
Dr. Yihua Huang, Professor
NJU-PASA Big Data Lab
http://pasa-bigdata.nju.edu.cn
Department of Computer Science and Technology
Nanjing University, Nanjing, P.R.China
Email:[email protected]