GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
A review presented by
Pavan Kumar Behara
Tensor Decomposition
• A tensor is an N-dimensional array.
• Much real-world data is naturally represented as tensors.
• Tensors allow representing higher-order relationships.
(Figure: Spearman’s two-factor theory of intelligence, 1927†)
• Multi-way data analysis requires tensor decomposition (TD):
• latent concept discovery
• trend analysis
• clustering
• anomaly detection
† Courtesy: Exploring temporal graph data with Python, Andre Panisson (url)
Example – Topic Modeling
• Goal: to characterize observed data in terms of a much smaller set of unobserved topics.
• LDA / hidden Markov / Gaussian mixture models can be used.
• These models have tensor structure in their low-order observable moments (typically 2nd or 3rd)†.
• Expectation maximization and Markov chain Monte Carlo methods do not scale to large datasets.
• TD helps with parameter estimation for these models.
†A. Anandkumar, et al., Journal of Machine Learning Research 15 (2014).
Blei, David M. "Probabilistic topic models."
Communications of the ACM 55.4 (2012): 77-84.
Tools for Tensor Decomposition
• Tensor Toolbox for Matlab by Kolda et al.
• N-way Toolbox for Matlab by Rasmus Bro et al.
• Scikit-Tensor by Maximilian Nickel et al.
• BIGtensor for Hadoop by U Kang et al. (the GigaTensor authors).
• FlexiFaCT for Hadoop by Alex Beutel et al.
• FTensor, a C++ library by W. Landry.
• ITensor, a C++ library by Steven R. White et al.
Handling billion-scale tensors
• Large-scale TDs are limited by memory and compute time.
• GigaTensor (2012) is the first scalable distributed TD algorithm.
• GigaTensor is a distributed implementation of the PARAFAC (PARAllel FACtors) decomposition on MapReduce.
• GigaTensor has since been improved further and incorporated into BIGtensor (2016), a tensor-mining package for the Hadoop platform by the same authors.
Some preliminaries
• Kronecker product: for A (I × J) and B (K × L), A ⊗ B is the IK × JL block matrix whose (i, j) block is A(i, j)·B.
• Khatri-Rao product: the column-wise Kronecker product; for A (K × R) and B (J × R), A ⊙ B = [a1 ⊗ b1, …, aR ⊗ bR] is of size JK × R.
• Tensor unfolding/matricization: reordering an N-way array into a matrix (here only n-mode matricization is considered).
• A tensor X of size I × J × K is unfolded into X(1) (I × JK), X(2) (J × IK), X(3) (K × IJ).
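To make these preliminaries concrete, here is a minimal NumPy sketch (array names and sizes are illustrative, not from the paper) of the Khatri-Rao product and the mode-1 unfolding, using the column ordering j + J·k so that the two are consistent with each other:

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product: C (K x R) ⊙ B (J x R) -> (J*K x R)."""
    K, R = C.shape
    J = B.shape[0]
    return np.einsum('kr,jr->kjr', C, B).reshape(K * J, R)

def unfold_mode1(X):
    """Mode-1 matricization: I x J x K tensor -> I x JK matrix (column index j + J*k)."""
    I, J, K = X.shape
    return X.transpose(0, 2, 1).reshape(I, J * K)

# np.kron is the full Kronecker product; the Khatri-Rao product takes it column by column.
C, B = np.random.rand(5, 2), np.random.rand(3, 2)
assert np.allclose(khatri_rao(C, B)[:, 0], np.kron(C[:, 0], B[:, 0]))
```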
Matrix decom. to Tensor decom.
• Bilinear decomposition:
• Let X be an I × J matrix with rank R.
• Writing X = a1b1ᵀ + a2b2ᵀ + … + aRbRᵀ = ABᵀ, where the columns of A and B are ar, br, 1 ≤ r ≤ R.
• Truncating X = a1b1ᵀ + a2b2ᵀ + … + arbrᵀ + … + aRbRᵀ at r ≪ R yields a low-rank approximation of X (Eckart & Young, 1936).
• PARAFAC is a higher-order generalization of the above: it factorizes a tensor into a sum of rank-one component tensors.
• Rank-one tensor: an Nth-order tensor that can be expressed as the outer product of N vectors. If Y of size I × J × K equals a ◦ b ◦ c, i.e., Y(i, j, k) = a(i) b(j) c(k), then Y is a third-order rank-one tensor.
• For a three-way tensor X of size I × J × K, the PARAFAC model is
  X ≈ ∑r=1..R λr ar ◦ br ◦ cr
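As a small worked illustration (hypothetical sizes, reusing khatri_rao and unfold_mode1 from the sketch above), the model can be built directly from outer products, and the mode-1 unfolding identity used later by ALS falls out of it:

```python
import numpy as np  # khatri_rao and unfold_mode1 as defined in the earlier sketch

I, J, K, R = 4, 3, 5, 2
lam = np.random.rand(R)
A, B, C = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)

# X = sum_r lam_r * a_r ◦ b_r ◦ c_r  (a sum of R rank-one tensors)
X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)

# Mode-1 unfolding identity: X(1) = A diag(lam) (C ⊙ B)^T
assert np.allclose(unfold_mode1(X), (A * lam) @ khatri_rao(C, B).T)
```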
PARAFAC Decomposition
X ≈ ∑r=1..R λr ar ◦ br ◦ cr
• The factor matrices A, B, C collect the vectors from the rank-one components, i.e., A = [a1, a2, …, aR] and so on.
• Using these, X can be expressed mode-wise as X(1) ≈ AΛ(C ⊙ B)ᵀ, X(2) ≈ BΛ(C ⊙ A)ᵀ, X(3) ≈ CΛ(B ⊙ A)ᵀ, where Λ = diag(λ1, …, λR).
• The goal is to compute a PARAFAC decomposition with R components that best approximates X, i.e., to find A, B, C that minimize the objective ‖X − ∑r=1..R λr ar ◦ br ◦ cr‖F.
Alternating Least Squares approach
• Start with an initial guess for A, B, C. Fix two of the matrices and solve for the third.
• With all but one matrix fixed, the problem reduces to a linear least-squares problem.
• For example, with B and C fixed, the problem reduces to
  min over Â of ‖X(1) − Â(C ⊙ B)ᵀ‖F ,
  where Â = A·diag(λ1, …, λR).
• The optimal solution is then given by Â = X(1) [(C ⊙ B)ᵀ]†.
• The pseudoinverse of a Khatri-Rao product has a special form, which allows rewriting this as Â = X(1)(C ⊙ B)(CᵀC * BᵀB)†.
• The same procedure applies for B and C.
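A minimal dense, in-memory NumPy sketch of one ALS sweep (illustrative only; GigaTensor performs these updates on sparse data in MapReduce, and the λr weights are simply absorbed into the factors here):

```python
import numpy as np
# khatri_rao as defined in the earlier sketch

def als_sweep(X, A, B, C):
    """One ALS sweep for a 3-way PARAFAC model.
    Each mode update has the form  A <- X(1) (C ⊙ B) (C^T C * B^T B)^+ ."""
    I, J, K = X.shape
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)   # mode-1 unfolding
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)   # mode-2 unfolding
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)   # mode-3 unfolding

    A = X1 @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
    B = X2 @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
    C = X3 @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C
```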
Algorithm for PARAFAC
1. Order of Matrix Multiplication
• The update is a product of three matrices: either (PQ)R or P(QR).
• Assuming C ⊙ B is already available, the first ordering requires 2mR + 2IR² flops whereas the second requires 2mR + 2JKR² flops (m = number of non-zero elements of the tensor).
• So the factor-matrix update is computed as A ← (X(1)(C ⊙ B)) (CᵀC * BᵀB)†, i.e., the sparse product first, then the small R × R matrix.
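A back-of-the-envelope comparison with purely hypothetical sizes (not the NELL-1 figures) shows why the (PQ)R ordering is the one to use:

```python
# Hypothetical sizes for illustration only.
I = J = K = 10**6          # mode lengths
R = 10                     # decomposition rank
m = 10**8                  # non-zeros in the tensor

flops_PQ_first = 2 * m * R + 2 * I * R**2        # (X(1)(C ⊙ B)) first
flops_QR_first = 2 * m * R + 2 * J * K * R**2    # ((C ⊙ B)(C^T C * B^T B)^+) first
print(flops_PQ_first)   # ~2.2e9 flops
print(flops_QR_first)   # ~2.0e14 flops, about five orders of magnitude more
```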
2. Intermediate Data Explosion
• The naïve computation materializes very large intermediate matrices.
• For the NELL-1 knowledge-base dataset with 26 million noun phrases, the intermediate matrix C ⊙ B explodes to 676 trillion rows.
• C is of size K × R and B is J × R, so C ⊙ B is JK × R (huge!).
2. Solution for Data explosion
• X(1)(C ⊙ B) can be computed without explicitly forming (C ⊙ B).
• Decoupling the computation as follows makes the largest dense matrix either B or C, not (C ⊙ B) as in the naïve case:
  X(1)(C ⊙ B)(:, r) = ( X(1) * (1I (C(:, r) ⊗ 1J)ᵀ) * (1I (1K ⊗ B(:, r))ᵀ) ) 1JK
• Here 1p is the all-one vector of size p and ‘ * ’ is the element-wise (Hadamard) product; the two expanded factors simply replicate C(k, r) and B(j, r) across the columns of X(1).
• Computed this way, the cost and the intermediate data size for X(1)(C ⊙ B) are far smaller than in the naïve approach.
2. Algorithm for X(1)(C ⊙ B)
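The paper gives this step as a MapReduce algorithm; as a hedged single-machine sketch of the same idea, the product can be accumulated directly over the non-zeros of the tensor, never forming the JK × R matrix (coordinate format and variable names are illustrative assumptions):

```python
import numpy as np

def mttkrp_mode1(ii, jj, kk, vals, B, C, I):
    """Compute M = X(1)(C ⊙ B) without materializing C ⊙ B.
    (ii, jj, kk, vals) hold the non-zeros of the I x J x K tensor in coordinate form;
    M[i, r] = sum over non-zeros of X(i, j, k) * B[j, r] * C[k, r]."""
    R = B.shape[1]
    M = np.zeros((I, R))
    np.add.at(M, ii, vals[:, None] * B[jj] * C[kk])   # scatter-add per non-zero
    return M

# Example with three non-zeros (hypothetical sizes):
ii, jj, kk = np.array([0, 2, 2]), np.array([1, 0, 1]), np.array([0, 3, 3])
vals = np.array([1.0, 2.0, 0.5])
B, C = np.random.rand(2, 4), np.random.rand(4, 4)
M = mttkrp_mode1(ii, jj, kk, vals, B, C, I=3)          # shape (3, 4)
```

The largest dense objects here are B, C, and M, which is exactly the point of the decoupling.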
3. MapReduce for matrix operations
• MapReduce is a programming model in which users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
• Computing the decoupled product in a MapReduce way.
• Given: C as tuples < j, r, C(j, r) >, B as tuples < j, r, B(j, r) >, and X(1) as tuples < i, j, X(1)(i, j) >.
• To calculate: the element-wise scaling X(1)(i, j)·C(j, r) over the non-zeros of X(1) (the same pattern handles the B factor).
• Map: key X(1) and C on the column index j, so that tuples with the same key are shuffled to the same reducer in the form < j, (C(j, r), {(i, X(1)(i, j)) ∀ i ∈ Qj}) >, where Qj is the set of row indices of the non-zeros in column j of X(1).
• Reduce: take < j, (C(j, r), {(i, X(1)(i, j)) ∀ i ∈ Qj}) > and emit < i, j, X(1)(i, j)·C(j, r) > for each i ∈ Qj.
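A hedged local simulation of this join-on-j step (plain Python dictionaries standing in for the shuffle; names are illustrative, not the BIGtensor API):

```python
from collections import defaultdict

def map_reduce_column_scale(x1_nonzeros, c_col):
    """Join the non-zeros of X(1) with one column of C on the shared key j.
    x1_nonzeros: iterable of (i, j, value); c_col: dict j -> C(j, r).
    Emits (i, j, X(1)(i, j) * C(j, r)), mimicking the shuffle-on-j map/reduce step."""
    # Map phase: group X(1) entries by the key j (C is already keyed by j).
    shuffled = defaultdict(list)
    for i, j, v in x1_nonzeros:
        shuffled[j].append((i, v))
    # Reduce phase: each reducer sees one j, its C value, and all matching X(1) entries.
    out = []
    for j, entries in shuffled.items():
        c_jr = c_col[j]
        for i, v in entries:
            out.append((i, j, v * c_jr))
    return out

# Example: two non-zeros share column 0, one sits in column 2.
print(map_reduce_column_scale([(0, 0, 2.0), (3, 0, 5.0), (1, 2, 1.0)],
                              {0: 0.5, 2: 3.0}))
```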
4. Parallel Outer Products
• We have covered the order of computation and avoiding the intermediate data explosion.
• The next step is the efficient calculation of (CᵀC * BᵀB)†.
• CᵀC = ∑k C(k, :)ᵀ ◦ C(k, :), i.e., a sum of outer products of rows, which is implemented in MapReduce.
• Comparing the cost with the naïve implementation: d is the number of mappers; d = 50 and R = 10 were used for the NELL-1 example.
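A minimal NumPy sketch of this step (in the MapReduce setting each mapper sums the outer products of its own share of rows; here they are simply summed locally):

```python
import numpy as np

def gram_by_outer_products(C):
    """Accumulate C^T C as a sum of outer products of the rows of C,
    which is the per-row work that the mappers parallelize."""
    R = C.shape[1]
    G = np.zeros((R, R))
    for row in C:                     # rows would be split across d mappers
        G += np.outer(row, row)
    return G

B, C = np.random.rand(7, 3), np.random.rand(5, 3)
small = np.linalg.pinv(gram_by_outer_products(C) * gram_by_outer_products(B))  # (C^T C * B^T B)^+
assert np.allclose(gram_by_outer_products(C), C.T @ C)
```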
5. Final step – Distributed matrix multiplication
• The first matrix is I × R; the second matrix is a very small R × R.
• The second matrix is broadcast to all mappers, which process the first one to carry out the multiplication in a distributed way.
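A hedged sketch of the idea (a local process pool standing in for the mappers; the small R × R matrix is handed to every worker, while the tall matrix is split row-wise):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def multiply_chunk(args):
    chunk, small = args                      # every worker receives the "broadcast" small matrix
    return chunk @ small

def distributed_multiply(M, small, n_workers=4):
    """Multiply a tall I x R matrix by a small R x R matrix by splitting M row-wise."""
    chunks = np.array_split(M, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(multiply_chunk, [(c, small) for c in chunks]))
    return np.vstack(parts)

if __name__ == "__main__":
    M, small = np.random.rand(1000, 10), np.random.rand(10, 10)
    assert np.allclose(distributed_multiply(M, small), M @ small)
```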
• In summary, GigaTensor is built upon:
• Careful choice of the order of computations
• Avoiding intermediate data explosion
• Parallel outer products
• Distributed cache multiplication
Scalability
• GigaTensor can handle tensors of size at least 10⁹, whereas the Tensor Toolbox fails beyond 10⁷. It can also handle tensors with billions of non-zero elements.
• Running time scales up linearly with the number of machines.
• A few other distributed algorithms that came after this are HaTen2 and SCouT from the same authors (2015, 2016), FlexiFaCT (A. Beutel et al., 2014), and DFacTo (J. H. Choi et al., 2014).
BIGtensor
• GigaTensor has been further improved, and distributed implementations of other tensor decomposition algorithms (e.g., Tucker) are packaged with it as BIGtensor by U Kang’s research group.
• Ease of use: Users do not need to know the map() and reduce() functions.
References
• Kolda, T. G. and B. W. Bader (2009). "Tensor Decompositions and Applications." SIAM Review 51(3): 455-500.
• BIGtensor: Mining Billion-Scale Tensor Made Easy, Namyong Park, Byungsoo Jeon, Jungwoo Lee, U Kang. 25th
ACM International Conference on Information and Knowledge Management (CIKM) 2016, Indianapolis, United
States.
• SCouT: Scalable Coupled Matrix-Tensor Factorization - Algorithms and Discoveries. ByungSoo Jeon, Inah Jeon, Sael
Lee, and U Kang. 32nd IEEE International Conference on Data Engineering (ICDE) 2016, Helsinki, Finland.
• HaTen2: Billion-scale Tensor Decompositions. Inah Jeon, Evangelos E. Papalexakis, U Kang, and Christos Faloutsos.
31st IEEE International Conference on Data Engineering (ICDE) 2015, Seoul, Korea.
• DFacTo: Distributed Factorization of Tensors, Joon Hee Choi, S. V. N. Vishwanathan, arXiv:1406.4519 [stat.ML].
• FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop, Alex Beutel, Partha Pratim Talukdar,
Abhimanu Kumar, Christos Faloutsos, Evangelos E. Papalexakis, and Eric P. Xing, Proceedings of the 2014 SIAM
International Conference on Data Mining. 2014, 109-117.
Thank you