Tensor Factorization · Time-aware Factorization Models · Factorization Machines

Factorization Models for Recommender Systems and Other Applications
Part II

Lars Schmidt-Thieme, Steffen Rendle
Tutorial at KDD Conference, 12th August 2012, Beijing
Social Network Analysis, University of Konstanz
Outline

Tensor Factorization
  Problem Setting
  Models
  Learning
  Examples for Applications
  Summary
Time-aware Factorization Models
Factorization Machines
Problem Setting

- Predictor variables: m variables with categorical domains I1, ..., Im.
- Target y: real-valued (regression), binary (classification), or scores (ranking).
- Supervised task: a set of observations S = {(i1, ..., im, y), ...}
Example: Social Tagging

[Figure: a tagging graph linking users (u1, u2), tags (t1, t2, t3), and items (i1, i2), together with the corresponding binary user-specific tag × item matrices.]

Tagging can be expressed as a function over three categorical domains:
  y : U × I × T → {0, 1}

(Example data: http://last.fm)
Example: Querying Incomplete RDF-Graphs

- Task: answer queries about subject-predicate pairs, e.g. "What is McCartney a member of?"
- An RDF graph can be expressed as a function over three categorical domains:
    y : S × P × O → {0, 1}
Notation: Tensors and Functions

Models in this setting are functions:
  ŷ : I1 × ... × Im → Y

All possible targets and predictions can be written equivalently as an m-order tensor / multiway array:
  Y ∈ Y^(|I1| × ... × |Im|),   Ŷ ∈ Y^(|I1| × ... × |Im|)
where
  y(i1, ..., im) = y_{i1,...,im},   ŷ(i1, ..., im) = ŷ_{i1,...,im}
Notation: Tensor-Matrix Product

- Let T ∈ R^(k1 × ... × km) be an m-order tensor and M ∈ R^(n × kl) a matrix.
- The mode-l tensor-matrix product ×l is defined as:
    (T ×l M)_{i1,...,i(l−1),j,i(l+1),...,im} := Σ_{il=1}^{kl} t_{i1,...,im} · m_{j,il}
- The result is a tensor T* of dimension R^(k1 × ... × k(l−1) × n × k(l+1) × ... × km).
- The size of the l-th mode changes from kl to n.

[Figure: multiplying the l-th mode of T (size kl) by M ∈ R^(n × kl) yields T* with the l-th mode resized to n.]
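As an illustrative sketch (my own, not from the tutorial), the mode-l product above can be computed in NumPy by contracting the l-th axis of T with the columns of M and moving the new axis back into place:

```python
import numpy as np

def mode_l_product(T, M, l):
    """Mode-l tensor-matrix product: contracts the l-th mode of T
    (size k_l) with the columns of M (shape n x k_l), so the l-th
    mode of the result has size n."""
    # tensordot appends M's remaining axis (size n) at the end;
    # move it back to position l.
    out = np.tensordot(T, M, axes=([l], [1]))
    return np.moveaxis(out, -1, l)

T = np.random.rand(2, 3, 4)   # k1=2, k2=3, k3=4
M = np.random.rand(5, 3)      # n=5, k2=3
Tstar = mode_l_product(T, M, l=1)
print(Tstar.shape)            # the second mode is resized from 3 to 5
```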
Parallel Factor Analysis (PARAFAC)

[Figure: Y ≈ C ×1 V(1) ×2 V(2) ×3 V(3), where C is the k × k × k identity tensor.]

m-order PARAFAC in tensor product notation:
  Ŷ := C ×1 V(1) ×2 ... ×m V(m)
with model parameters
  V(l) ∈ R^(|Il| × k),   ∀l ∈ {1, ..., m}
and where C is the identity tensor:
  C ∈ R^(k × ... × k),   c_{j1,...,jm} := δ(j1 = ... = jm)

[Harshman 1970, Carroll 1970]
Parallel Factor Analysis (PARAFAC)

m-order PARAFAC in element-wise notation:
  ŷ(i1, ..., im) := Σ_{f=1}^{k} v(1)_{i1,f} · ... · v(m)_{im,f} = Σ_{f=1}^{k} Π_{l=1}^{m} v(l)_{il,f}
with model parameters
  V(l) ∈ R^(|Il| × k),   ∀l ∈ {1, ..., m}

[Harshman 1970, Carroll 1970]
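A minimal sketch (my own construction) of the element-wise PARAFAC prediction: pick one factor row per mode, multiply them element-wise, and sum over the k factors:

```python
import numpy as np

def parafac_predict(factors, idx):
    """PARAFAC prediction ŷ(i1,...,im) = Σ_f Π_l V[l][i_l, f].
    factors: list of m factor matrices V[l] of shape (|I_l|, k).
    idx: tuple of m indices (i1, ..., im)."""
    rows = [V[i] for V, i in zip(factors, idx)]  # one length-k row per mode
    return np.prod(rows, axis=0).sum()

# Toy example: m = 3 modes, k = 2 factors.
rng = np.random.default_rng(0)
V = [rng.standard_normal((4, 2)),
     rng.standard_normal((5, 2)),
     rng.standard_normal((3, 2))]
print(parafac_predict(V, (1, 2, 0)))
```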
Parallel Factor Analysis (PARAFAC)

Notes
- For an m = 2-order tensor (i.e. a matrix), PARAFAC is the same as matrix factorization.
- Sometimes a modified PARAFAC with a diagonal core c_{f,...,f} =: λf and factors of unit length is used:
    ŷ(i1, ..., im) := Σ_{f=1}^{k} λf · v(1)_{i1,f} · ... · v(m)_{im,f} = Σ_{f=1}^{k} λf Π_{l=1}^{m} v(l)_{il,f}
- Other constraints, e.g. non-negativity or symmetry, can be imposed.
- PARAFAC is also called Canonical Decomposition (CANDECOMP).

[e.g. Kolda et al. 2009, Cichocki et al. 2009]
Tucker Decomposition (TD)

[Figure: Y ≈ C ×1 V(1) ×2 V(2) ×3 V(3), with a free core tensor C ∈ R^(k1 × k2 × k3).]

m-order Tucker Decomposition in tensor product notation:
  Ŷ := C ×1 V(1) ×2 ... ×m V(m)
where C and the V are model parameters:
  C ∈ R^(k1 × ... × km),   V(l) ∈ R^(|Il| × kl),   ∀l ∈ {1, ..., m}

[Tucker 1966]
Tucker Decomposition (TD)

m-order Tucker Decomposition in element-wise notation:
  ŷ(i1, ..., im) := Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} · v(1)_{i1,f1} · ... · v(m)_{im,fm} = Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} Π_{l=1}^{m} v(l)_{il,fl}
with model parameters:
  C ∈ R^(k1 × ... × km),   V(l) ∈ R^(|Il| × kl),   ∀l ∈ {1, ..., m}

[Tucker 1966]
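A sketch (my own, not from the slides) of the element-wise Tucker prediction for m = 3 modes, contracting the core with one factor row per mode via np.einsum:

```python
import numpy as np

def tucker_predict(core, factors, idx):
    """Tucker prediction for m = 3:
    ŷ(i1,i2,i3) = Σ_{f1,f2,f3} c_{f1,f2,f3} · V1[i1,f1] · V2[i2,f2] · V3[i3,f3]."""
    r1, r2, r3 = (V[i] for V, i in zip(factors, idx))  # one row per mode
    return np.einsum('abc,a,b,c->', core, r1, r2, r3)

rng = np.random.default_rng(1)
C = rng.standard_normal((2, 3, 2))   # free core, k1=2, k2=3, k3=2
V = [rng.standard_normal((4, 2)),
     rng.standard_normal((5, 3)),
     rng.standard_normal((6, 2))]
print(tucker_predict(C, V, (0, 4, 2)))
```

Note the O(k1 · k2 · k3) cost of the contraction, versus O(k · m) for PARAFAC.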
Tucker Decomposition (TD)

Notes
- For an m = 2-order tensor (i.e. a matrix), TD is different from matrix factorization:
    ŷ_TD(i1, i2) = Σ_{f1=1}^{k1} Σ_{f2=1}^{k2} c_{f1,f2} · v(1)_{i1,f1} · v(2)_{i2,f2}  ≠  Σ_{f=1}^{k} v(1)_{i1,f} · v(2)_{i2,f} = ŷ_MF(i1, i2)
- Sometimes orthogonality constraints on the V are imposed.
- Other constraints, e.g. non-negativity or symmetry, can be imposed.

[e.g. Kolda et al. 2009, Cichocki et al. 2009]
PARAFAC vs. TD

- PARAFAC:
    ŷ(i1, ..., im) := Σ_{f=1}^{k} v(1)_{i1,f} · ... · v(m)_{im,f} = Σ_{f=1}^{k} Π_{l=1}^{m} v(l)_{il,f}
- TD:
    ŷ(i1, ..., im) := Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} · v(1)_{i1,f1} · ... · v(m)_{im,fm} = Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} Π_{l=1}^{m} v(l)_{il,fl}
- TD is more general, as C is free.
- Computational complexity:
  - PARAFAC: O(k · m)
  - TD: O(k^m) if k1 = ... = km =: k
Tensor Factorization as Machine Learning Models

- PARAFAC and TD model m-ary interactions directly.
- PARAFAC and TD have problems when the number of observations for some levels is small:
  - E.g. if there are no observations for a level l, then the estimated factors are v_l = 0 (in the case of L2 regularization), and thus all predictions involving this level will be 0 as well (for PARAFAC and TD).
  - Similar problems can occur if the number of observations of a level is small.
- Standard L2 regularization alone cannot solve this problem.
- If an m-ary interaction cannot be estimated reliably, often a lower-level interaction (e.g. an (m − 1)-ary one) can be estimated reliably.
TF with Lower-level Interactions

Model equation of m-ary tensor factorization with nested lower-level interactions:
  ŷ_LLTF(i1, ..., im) := c + Σ_{l=1}^{m} w(l)_{il} + Σ_{l1=1}^{m} Σ_{l2>l1} ŷ_TF(i_{l1}, i_{l2}) + ... + ŷ_TF(i1, ..., im)

Model parameters:
  c ∈ R,   w(l) ∈ R^(|Il|),   ...,   V(l) ∈ R^(|Il| × k)

- Estimating a lower-level effect (e.g. a pairwise one) reliably is easier than estimating a higher-level one.
- Often lower-level effects can explain the data sufficiently, and higher-level ones can be dropped completely.

[e.g. Rendle et al. 2010; Cai et al. 2011]
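A minimal sketch (my own, under simplifying assumptions) of the nested model truncated after the pairwise terms, i.e. bias + unary effects + factorized pairwise interactions; sharing one factor matrix per mode across all pairs is an assumption here, per-pair factors are also common:

```python
import numpy as np

def lltf_predict(c, w, V, idx):
    """Bias + unary + factorized pairwise interactions for m modes.
    c: global bias; w: list of bias vectors w[l] of length |I_l|;
    V: list of factor matrices V[l] of shape (|I_l|, k), shared across
    pairs (simplifying assumption); idx: tuple (i1, ..., im)."""
    m = len(idx)
    y = c + sum(w[l][idx[l]] for l in range(m))
    for l1 in range(m):
        for l2 in range(l1 + 1, m):
            # factorized pairwise effect ŷ_TF(i_l1, i_l2) as an inner product
            y += V[l1][idx[l1]] @ V[l2][idx[l2]]
    return y

rng = np.random.default_rng(2)
w = [rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(3)]
V = [rng.standard_normal((4, 2)),
     rng.standard_normal((5, 2)),
     rng.standard_normal((3, 2))]
print(lltf_predict(0.1, w, V, (1, 0, 2)))
```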
Standard Fitting Algorithms

Standard algorithms assume:
- Y is observed completely, i.e. for all combinations (i1, ..., im) ∈ I1 × ... × Im, y_{i1,...,im} is known.
  - Missing values are imputed.
  - In ML problems most elements are missing (often > 99.9%).
- Optimization is done with respect to least squares:
    argmin_Θ Σ_{(i1,...,im) ∈ I1 × ... × Im} (y_{i1,...,im} − ŷ_{i1,...,im})²
  - ML: other losses are also of interest, e.g. for classification, ranking, ...
- No regularization / prior assumptions.
  - ML: prior knowledge should be included.
Example: Higher-Order SVD (HOSVD)

HOSVD is one such approximate fitting algorithm:
- Loss: least-squares loss without regularization; no missing-value treatment.
- Model: Tucker decomposition.
- Algorithm:
  - For each mode l:
    - Unfold Y to matrix form.
    - Compute the SVD.
    - V(l) are the left singular vectors of the SVD.
  - Compute the core tensor C = Y ×1 (V(1))ᵀ ×2 (V(2))ᵀ ×3 ... ×m (V(m))ᵀ.
- Additional Alternating Least-Squares (ALS) steps can improve the fit.

[Tucker 1966, Lathauwer et al. 2000]
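The algorithm above can be sketched in NumPy (a truncated HOSVD for a 3-mode tensor; my own illustration, not the tutorial's code):

```python
import numpy as np

def hosvd(Y, ranks):
    """Truncated HOSVD: per mode, unfold Y, take the top-k left singular
    vectors as V(l); then contract Y with the (V(l))ᵀ to get the core C."""
    factors = []
    for l, k in enumerate(ranks):
        # mode-l unfolding: mode l becomes the rows
        unfolding = np.moveaxis(Y, l, 0).reshape(Y.shape[l], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :k])            # V(l): top-k left singular vectors
    C = Y
    for l, V in enumerate(factors):         # C = Y ×1 (V(1))ᵀ ×2 ... ×m (V(m))ᵀ
        C = np.moveaxis(np.tensordot(C, V, axes=([l], [0])), -1, l)
    return C, factors

Y = np.random.default_rng(3).standard_normal((6, 5, 4))
C, Vs = hosvd(Y, ranks=(3, 3, 2))
print(C.shape)  # core of size 3 x 3 x 2
```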
Machine Learning with TF Models

- Optimize only w.r.t. the observed elements of Y.
  - Comparable to MF: Weighted Low-Rank Approximations [Srebro et al. 2003]
- Choose the loss / likelihood according to the target variable / task.
  - E.g. logit for classification, pairwise classification for ranking, etc.
- Add priors / regularization to the model parameters.
  - E.g. L2 / Gaussian priors.
- Model lower-level interactions.
  - E.g. add factorized pairwise interactions [Rendle et al. 2010]

TF models are multilinear ⇒ simple SGD or ALS algorithms can be used for optimization.
Personalized Tag Recommendation

Task: recommend to a user a (personalized) list of tags for a specific item.

[Figure: per-user tag × item matrices, with observed tag assignments marked "+".]

- U ... users
- I ... items
- T ... tags
- S ⊆ U × I × T ... observed tags
- P_S = {(u, i) | ∃t ∈ T : (u, i, t) ∈ S} ... observed tagging posts

[Hotho et al. 2006]
Evaluation: Prediction Quality

[Figure: Top-n F-measure (n = 2, ..., 10) on BibSonomy (k = 64) and Last.fm (k = 128) for BPR-PITF, BPR-CD, RTF-TD, FolkRank, PageRank, and HOSVD.]

- PageRank / FolkRank: PageRank adapted for tag recommendation [Hotho et al. 2006]
- HOSVD: TD fitted by least squares, no missing values, no regularization [Symeonidis et al. 2008]
- RTF-TD: TD model optimized for regularized ranking [Rendle et al. 2009]
- BPR-PITF, BPR-CD: PITF / PARAFAC models optimized for regularized ranking [Rendle et al. 2010]

[Rendle et al. 2010]
Evaluation: Learning Runtime

[Figure: Last.fm, prediction quality (Top-3 F-measure) vs. learning runtime for BPR-PITF 64, BPR-CD 64, and RTF-TD 64; left panel in days (0-30), right panel in minutes (0-120).]

[Rendle et al. 2010]
ECML/PKDD Discovery Challenge 2009

Rank  Method                                          Top-5 F-Measure
1     BPR-PITF + adaptive list size                   0.35594
–     BPR-PITF (not submitted)                        0.345
2     Relational Classification [Marinho et al. 09]   0.33185
3     Content-based [Lipczak et al. 09]               0.32461
4     Content-based [Zhang et al. 09]                 0.32230
5     Content-based [Ju and Hwang 09]                 0.32134
6     Personomy translation [Wetzker et al. 09]       0.32124
...   ...                                             ...

Task 2: ECML/PKDD Challenge 2009,
http://www.kde.cs.uni-kassel.de/ws/dc09/results

[Rendle et al. 2010]
Querying Incomplete RDF-Graphs

- Task: answer queries about subject-predicate pairs, e.g. "What is McCartney a member of?"
- An RDF graph can be expressed as a function over three categorical domains:
    y : S × P × O → {0, 1}

[Franz et al. 2009, Drumond et al. 2012]
Prediction Quality

[Figure: prediction quality on RDF query answering for the methods below.]

- CD Dense: PARAFAC optimized for least squares, no missing values, no regularization.
- CD-BPR: PARAFAC optimized for regularized ranking.
- PITF-BPR: PITF (pairwise interactions) optimized for regularized ranking.

[Drumond et al. 2012]
Other Applications: Examples

- Multiverse Recommendation [Karatzoglou et al. 2010]
  - Task: context-aware rating prediction.
  - Model: Tucker Decomposition.
  - Missing values are handled.
  - Loss: task dependent, e.g. MAE, RMSE.
  - Regularization: L1, L2.
  - Algorithm: Stochastic Gradient Descent (SGD).
- CubeSVD [Sun et al. 2005]
  - Task: clickthrough prediction.
  - Approach: HOSVD.
Summary

- Prediction functions over m categorical variables can be modeled with tensor factorization.
- Parallel Factor Analysis (PARAFAC) generalizes matrix factorization to m modes.
- Tucker Decomposition allows a free core tensor (high computational complexity!).
- Lower-order interactions, e.g. pairwise ones, should be integrated for better prediction quality in sparse settings.
- For learning, missing values, the loss/likelihood, and regularization/priors should be considered.

Problem: only categorical variables can be handled.
Outline

Tensor Factorization
Time-aware Factorization Models
  Models
  Summary
Factorization Machines
Time-Aware: Problem Setting

- 3 predictor variables:
  - two variables with categorical domains I and J,
  - one numerical variable (time) t ∈ R.
- Target y: real-valued (regression), binary (classification), or scores (ranking).
- Supervised task: a set of observations S = {(i, j, t, y), ...}
- Modelling: a function ŷ : I × J × R → Y.

[Figure: observations over I and J arranged along the time axis.]
Tensor Factorization
Time-aware Factorization Models
Factorization Machines
Tensor Factorization

1. Discretize the time variable, e.g. by binning ⇒ 3 categorical domains I, J, T:

   b : R → T,   e.g. b(t) := ⌊t / (24 · 60 · 60)⌋

2. Apply tensor factorization, e.g. Tucker Decomposition or PARAFAC:

   ŷ(i, j, t) := Σ_{f=1}^k  v^I_{i,f} · v^J_{j,f} · v^T_{b(t),f}

3. Smooth the time factors V^T, s.th. nearby points in time have similar factors, e.g. by regularization:

   v^T_{t+1,f} ~ N(v^T_{t,f}, 1/λ_T),   ∀t ∈ T, f ∈ {1, ..., k}

For learning/inference, e.g. an MCMC sampler can be used.

[Xiong et al. 2010]
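The discretization and PARAFAC steps above can be sketched in a few lines. This is a minimal sketch with toy random factors; the day-sized bin (24 · 60 · 60 seconds) follows the slide, everything else (names, sizes) is illustrative.

```python
import numpy as np

def time_bin(t, bin_seconds=24 * 60 * 60):
    """Discretize a timestamp into a day index: b(t) = floor(t / 86400)."""
    return int(t // bin_seconds)

def parafac_score(V_I, V_J, V_T, i, j, t):
    """PARAFAC prediction sum_f v^I_{i,f} * v^J_{j,f} * v^T_{b(t),f}."""
    return float(np.sum(V_I[i] * V_J[j] * V_T[time_bin(t)]))

# toy factors: 3 entities in I, 4 in J, 2 time bins, k = 2
rng = np.random.default_rng(0)
V_I, V_J, V_T = rng.normal(size=(3, 2)), rng.normal(size=(4, 2)), rng.normal(size=(2, 2))
score = parafac_score(V_I, V_J, V_T, i=0, j=1, t=90000)  # t = 90000 s falls into bin 1
```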
Time-Aware Matrix Factorization

ŷ(i, j, t) := Σ_{f=1}^k  w_{i,f}(t) · h_{j,f}(t)

where the factor matrices W and H depend on the time t:

W : R → R^{|I|×k},   H : R → R^{|J|×k}

[Koren 2009]
Time-Aware Matrix Factorization

Modeling time-dependent factors, e.g. for W:

- Constant:
  w_{i,f}(t) := w̃_{i,f},   W̃ ∈ R^{|I|×k}
- Linear:
  w_{i,f}(t) := w̃_{i,f} + z_{i,f} · t,   W̃ ∈ R^{|I|×k}, Z ∈ R^{|I|×k}
- Binning with function b:
  w_{i,f}(t) := w̃_{i,f,b(t)},   W̃ ∈ R^{|I|×k×|img(b)|}
- Spline with m_i predefined control points at positions t_{i,1}, ..., t_{i,m_i}:
  w_{i,f}(t) := [ Σ_{l=1}^{m_i} w̃_{i,f,l} exp(−γ|t − t_{i,l}|) ] / [ Σ_{l=1}^{m_i} exp(−γ|t − t_{i,l}|) ],   W̃ ∈ R^{|I|×k×m_i}
- Linear combinations of the functions above.

[Koren 2009]
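The spline-based factor function above is a kernel-smoothed average over the control-point factors. A minimal sketch (the function name, gamma, and the control points are illustrative, not from the slides):

```python
import numpy as np

def spline_factor(t, control_times, control_factors, gamma=1.0):
    """w_{i,f}(t): weighted average of control-point factors, with weights
    exp(-gamma * |t - t_l|) normalized to sum to one."""
    weights = np.exp(-gamma * np.abs(t - np.asarray(control_times, dtype=float)))
    weights /= weights.sum()
    return float(weights @ np.asarray(control_factors, dtype=float))
```

With a large gamma, the value near a control point approaches that point's factor; with a small gamma, it approaches the plain average of all control factors.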
Time-Aware Matrix Factorization

Choices for the timeSVD++ model for the Netflix challenge:

- User factors W: linear combination of
  - constant
  - linear effect
  - binning with bin size 1
- Item factors H: constant
- Additional (time-unaware) implicit indicators (from SVD++ [Koren 2008])

For learning, e.g. an SGD algorithm can be used.

[Koren 2009]
Comparison

- Time-aware MF with binning (TAMF) and tensor factorization with discretization (TF) treat the time variable similarly:

  ŷ^TAMF(i, j, t) := Σ_{f=1}^k  w_{i,f,b(t)} · h_{j,f}

  ŷ^TF(i, j, t) := Σ_{f=1}^k  w_{i,f} · h_{j,f} · z_{b(t),f}

- Main difference:
  - In tensor factorization, the (i,t)-interaction is factorized.
  - In time-aware MF, the (i,t)-interaction is modeled unfactorized.
Discussion

- Binning and splines cannot make use of time for future events:
  - Future bins are empty, so their variables cannot be estimated.
  - Variables at (future) control points of splines cannot be estimated.
- Seasonal time indicators can help, e.g. weekday, holiday, Christmas, etc.
- Other approach: use qualitative/sequential information.
Sequential Prediction

[Figure: four users' basket sequences B_{t−3}, B_{t−2}, B_{t−1} over items {a, b, c, d, e}; the next basket B_t is unknown ("?").]

- Task: Which items will be selected next?

[e.g. Zimdars et al. 2001, Rendle et al. 2010]
Markov Chains

Markov chain of order 1:   p(j_t | l_{t−1})

- t is a sequential index.
- l_{t−1} is the item selected previously.
- The Markov chain is defined by a transition matrix A ∈ R^{|J|×|J|} (rows: from item, columns: to item):

        to A  to B  to C
  A      ?     ?     ?
  B      ?     ?     ?
  C      ?     ?     ?

- The model is (weakly) personalized by taking the last item selected by a user into account.
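The transition matrix of a first-order Markov chain can be estimated by maximum likelihood from observed item sequences. A minimal sketch (items are assumed to be 0-based integer indices; rows with no observed transitions are left as zeros):

```python
import numpy as np

def transition_matrix(sequences, n_items):
    """MLE of A[l, j] = p(j_t = j | l_{t-1} = l): count transitions, then
    normalize each row to a probability distribution."""
    counts = np.zeros((n_items, n_items))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

A = transition_matrix([[0, 1, 2], [0, 1, 1]], n_items=3)
# from item 0 the next item was always 1, so A[0, 1] == 1.0
```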
Factorized Personalized Markov Chain

Model equation:

  ŷ(i, j, t) := ẑ(i, j, s(i, t))

where s(i, t) is the entity previously (w.r.t. t) selected by i.

- ẑ can be modeled by TD, PARAFAC, PITF, ...
- For product recommendation, i is the user and j the current item.
- If a set of items was selected previously, one can average over this set:

  ŷ(i, j, t) := (1 / |s(i, t)|) Σ_{l ∈ s(i,t)} ẑ(i, j, l)

For learning, e.g. an SGD algorithm can be used.

[Rendle et al. 2010]
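A minimal sketch of the averaged score, assuming ẑ is modeled PITF-style as in the FPMC paper: a user-item term plus an item-previous-item term averaged over the previous basket. The factor-matrix names V_UI, V_IU, V_IL, V_LI are illustrative, not from the slides.

```python
import numpy as np

def fpmc_score(V_UI, V_IU, V_IL, V_LI, user, item, basket):
    """FPMC-style score: <v^UI_user, v^IU_item> plus the mean over the
    previous basket of <v^IL_item, v^LI_l>."""
    ui = float(V_UI[user] @ V_IU[item])
    il = float(np.mean([V_IL[item] @ V_LI[l] for l in basket]))
    return ui + il
```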
Outline

- Tensor Factorization
- Time-aware Factorization Models
  - Models
  - Summary
- Factorization Machines
Summary

- Time can be taken into account by:
  - discretization and applying tensor factorization,
  - time-variant factors, e.g. binning, linear effects, splines, ...
  - sequential indicators, e.g. the last item selected.
- With time variables, the dataset split should be considered:
  - Random split: absolute time can be modeled.
  - Time split: binning is not effective; time transformations that are predictive for future points in time should be chosen, e.g. seasonal or sequential ones.
Outline

- Tensor Factorization
- Time-aware Factorization Models
- Factorization Machines
  - Problem Setting
  - Standard Models
  - Factorization Machines
  - Applications
  - Summary
Motivation

All the presented factorization models work empirically very well, but:

- For each new problem, a new model, a new learning algorithm, and a new implementation are necessary.
- For some of the models, dozens of improved learning algorithms have been proposed (that work only with this particular model).
- For non-experts in factorization models, this is not practicable.
- How does this relate to standard models?
Data and Variable Representation

Many standard ML approaches work with real-valued input data (a design matrix). It allows one to represent, e.g.:

- any number of variables
- categorical domains, by using dummy indicator variables
- numerical domains
- set-categorical domains, by using dummy indicator variables

Using this representation allows a wide variety of standard models to be applied (e.g. linear regression, SVMs, etc.).
Data and Variable Representation: Example

2 categorical variables (User, Movie) with a rating target:

  User     | Movie        | Rating
  ---------|--------------|-------
  Alice    | Titanic      | 5
  Alice    | Notting Hill | 3
  Alice    | Star Wars    | 1
  Bob      | Star Wars    | 4
  Bob      | Star Trek    | 5
  Charlie  | Titanic      | 1
  Charlie  | Star Wars    | 5
  ...      | ...          | ...

Encoded as |U| + |I| real-valued variables: each feature vector x concatenates a dummy indicator for the user (A, B, C, ...) with a dummy indicator for the movie (TI, NH, SW, ST, ...), and the rating becomes the target y. E.g. the first case (Alice, Titanic, 5) becomes

  x(1) = (1, 0, 0, ... | 1, 0, 0, 0, ...),   y(1) = 5
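The dummy-indicator encoding described above can be sketched as follows. The helper name one_hot_design is hypothetical; the layout (user indicators first, then movie indicators) matches the example.

```python
import numpy as np

def one_hot_design(cases, users, movies):
    """Build a design matrix with one dummy indicator per user plus one per
    movie, and the rating as the target vector y."""
    u_idx = {u: i for i, u in enumerate(users)}
    m_idx = {m: i for i, m in enumerate(movies)}
    X = np.zeros((len(cases), len(users) + len(movies)))
    y = np.zeros(len(cases))
    for r, (user, movie, rating) in enumerate(cases):
        X[r, u_idx[user]] = 1.0
        X[r, len(users) + m_idx[movie]] = 1.0
        y[r] = rating
    return X, y

cases = [("Alice", "Titanic", 5), ("Bob", "Star Wars", 4)]
X, y = one_hot_design(cases, ["Alice", "Bob"], ["Titanic", "Star Wars"])
# X[0] = [1, 0, 1, 0], y[0] = 5
```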
Problem Setting

- Predictor variables: p variables of real-valued domain, X_1, ..., X_p ∈ R.
- Target y: real-valued (regression), binary (classification), scores (ranking).
- Supervised task: set of observations S = {(x_1, ..., x_p, y), ...}

This is the most common machine learning task.
Outline

- Tensor Factorization
- Time-aware Factorization Models
- Factorization Machines
  - Problem Setting
  - Standard Models
  - Factorization Machines
  - Applications
  - Summary
Standard Machine Learning Models

- Categorical variables can be represented with real-valued ones.
- There are many well-studied standard ML models that work with real-valued variables.
- Why shouldn't we work with them? Why do we need factorization models?
Linear Regression

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation:

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i

- Model parameters:

  w_0 ∈ R,   w ∈ R^p

  O(p) model parameters.
Polynomial Regression

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation (degree 2):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j≥i} w_{i,j} x_i x_j

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   W ∈ R^{p×p}

  O(p²) model parameters.
Application to Large Categorical Domains

[Design matrix as in the example above: user and movie dummy indicators, rating target.]

Applying regression models to this data (user u, movie i) leads to:

Linear regression:

  ŷ(x) = w_0 + w_u + w_i

Polynomial regression:

  ŷ(x) = w_0 + w_u + w_i + w_{u,i}

Matrix factorization (with biases):

  ŷ(u, i) = w_0 + w_u + h_i + ⟨w_u, h_i⟩
Application to Large Categorical Domains

For the recommender data of the example:

- Linear regression has no user-item interaction.
  - ⇒ Linear regression is not expressive enough.
- Polynomial regression includes pairwise interactions but cannot estimate them from the data.
  - n ≪ p²: the number of cases is much smaller than the number of model parameters.
  - The maximum-likelihood estimator for a pairwise effect is:

      w_{u,i} = y − w_0 − w_u − w_i,  if (u, i, y) ∈ S;   not defined, else.

  - Polynomial regression cannot generalize to any unobserved pairwise effect.
Factorization Models and Real-valued Variables

- Factorization models work well for categorical variables of large domain.
- Standard models are more flexible, as they allow real-valued predictor variables that can encode several kinds of variables.
- How can these advantages be combined?
Outline

- Tensor Factorization
- Time-aware Factorization Models
- Factorization Machines
  - Problem Setting
  - Standard Models
  - Factorization Machines
  - Applications
  - Summary
Factorization Machine (FM)

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation (degree 2):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   V ∈ R^{p×k}

Compared to polynomial regression:

- Model equation (degree 2):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j≥i} w_{i,j} x_i x_j

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   W ∈ R^{p×p}

[Rendle 2010, Rendle 2012]
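The degree-2 FM model equation can be computed directly from its definition. This naive form costs O(p²k); a sketch (the rows of V are the factor vectors v_i):

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Degree-2 FM: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    evaluated with an explicit double loop over variable pairs."""
    p = len(x)
    y = w0 + float(w @ x)
    for i in range(p):
        for j in range(i + 1, p):
            y += float(V[i] @ V[j]) * x[i] * x[j]
    return y
```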
Factorization Machine (FM)

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation (degree 3):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j
        + Σ_{i=1}^p Σ_{j>i} Σ_{l>j} Σ_{f=1}^k v^(3)_{i,f} v^(3)_{j,f} v^(3)_{l,f} x_i x_j x_l

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   V ∈ R^{p×k},   V^(3) ∈ R^{p×k}

[Rendle 2010, Rendle 2012]
Factorization Machines: Discussion

- FMs work with real-valued input.
- FMs include variable interactions like polynomial regression.
- Model parameters for interactions are factorized.
- The number of model parameters is O(k p) (instead of O(p²) for polynomial regression).
- How are FMs related to the factorization models we have seen so far?
Matrix Factorization and Factorization Machines

Two categorical variables encoded with real-valued predictor variables:

[Design matrix: each row x has one user indicator (A, B, C, ...) and one movie indicator (TI, NH, SW, ST, ...) set to 1.]

With this data, the FM is identical to MF with biases:

  ŷ(x) = w_0 + w_u + w_i + ⟨v_u, v_i⟩,   where ⟨v_u, v_i⟩ is the MF part.
Tag-Recommendation with Factorization Machines

Three categorical variables encoded with real-valued predictor variables:

[Design matrix: each row x has one user indicator (A, B, C, ...), one song indicator (S1, S2, ...), and one tag indicator (T1, T2, ...) set to 1.]

With this data, the FM is a tensor factorization model with lower-order interactions (here up to pairwise ones):

  ŷ(x) := w_0 + w_i + w_u + w_t + ⟨v_u, v_t⟩ + ⟨v_i, v_t⟩ + ⟨v_u, v_i⟩
Time with Factorization Machines

Two categorical variables and time as a linear predictor:

[Design matrix: each row x has a user indicator, a movie indicator, and one real-valued time column (e.g. 0.2, 0.6, 0.61, ...).]

The FM model then corresponds to:

  ŷ(x) := w_0 + w_i + w_u + t · w_time + ⟨v_u, v_i⟩ + t · ⟨v_u, v_time⟩ + t · ⟨v_i, v_time⟩
Time with Factorization Machines

Two categorical variables and time discretized into bins (b(t)):

[Design matrix: each row x has a user indicator, a movie indicator, and a time-bin indicator (T1, T2, T3).]

With this data, a third-order FM includes the time-aware tensor factorization model described before:

  ŷ(x) := w_0 + w_i + w_u + w_{b(t)} + ⟨v_u, v_i⟩ + ⟨v_u, v_{b(t)}⟩ + ⟨v_i, v_{b(t)}⟩
        + Σ_{f=1}^k v^(3)_{u,f} v^(3)_{i,f} v^(3)_{b(t),f}

  (the last sum is the time tensor factorization model)
Time with Factorization Machines

Two categorical variables and time discretized into bins (b(t)):

[Design matrix: the user and time bin are crossed into one indicator (AT1, AT2, ..., CT3), next to the movie indicator.]

With this data, an FM includes the time-aware matrix factorization model with binned user-time interactions:

  ŷ(x) := w_0 + w_i + w_{u,b(t)} + ⟨v_{u,b(t)}, v_i⟩

  (the last term is MF with time-variant factors)

[Koren 2009]
SVD++

[Design matrix: each row x has a user indicator, a movie indicator, and, for the other movies the user rated (N_u), indicator columns with value 1/|N_u| each (e.g. 0.3 for three rated movies, 0.5 for two).]

With this data, the FM is identical to:

  ŷ(x) = w_0 + w_u + w_i + ⟨v_u, v_i⟩ + (1/√|N_u|) Σ_{l ∈ N_u} ⟨v_i, v_l⟩     (= SVD++)
       + (1/√|N_u|) Σ_{l ∈ N_u} ( w_l + ⟨v_u, v_l⟩ )
       + (1/|N_u|) Σ_{l ∈ N_u} Σ_{l' ∈ N_u, l' > l} ⟨v_l, v_{l'}⟩

  (the first line is the SVD++ model; the remaining terms are additional interactions the FM captures)

[Koren 2008]
Factorization Machines: Discussion II

- Representing categorical variables with real-valued variables and applying FMs is comparable to the factorization models that have been derived individually before (e.g. (biased) MF, tensor factorization, SVD++).
- FMs are much more flexible and can also handle non-categorical variables.
- Applying FMs is simple, as only data preprocessing has to be done (defining the real-valued predictor variables).
Computation Complexity

Factorization Machine model equation:

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j

- Trivial computation: O(p² k)
- Efficient computation can be done in O(p k).
- Making use of the many zeros in x, even in O(N_z(x) k), where N_z(x) is the number of non-zero elements in the vector x.
Efficient Computation

The model equation of an FM can be computed in O(p k).

Proof:

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j
        = w_0 + Σ_{i=1}^p w_i x_i + (1/2) Σ_{f=1}^k [ ( Σ_{i=1}^p x_i v_{i,f} )² − Σ_{i=1}^p ( x_i v_{i,f} )² ]

- In the sums over i, only the non-zero x_i elements have to be summed up ⇒ O(N_z(x) k).
- (The complexity of polynomial regression is O(N_z(x)²).)
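The O(p k) reformulation of the pairwise term can be implemented with two per-factor sums. A sketch (V rows are the factor vectors v_i; dense x for simplicity, the sparse variant would iterate only over non-zero entries):

```python
import numpy as np

def fm_predict_fast(x, w0, w, V):
    """O(p k) FM prediction: linear part plus
    0.5 * sum_f [ (sum_i x_i v_{i,f})^2 - sum_i (x_i v_{i,f})^2 ]."""
    x = np.asarray(x, dtype=float)
    linear = w0 + float(w @ x)
    s = V.T @ x                      # per-factor sums, shape (k,)
    s2 = (V.T ** 2) @ (x ** 2)       # per-factor sums of squares, shape (k,)
    return linear + 0.5 * float(np.sum(s ** 2 - s2))
```

For any input, this agrees with the O(p²k) double-loop evaluation of the model equation, which is what the proof above establishes.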
Multilinearity
FMs are multilinear:
∀θ ∈ Θ = {w0 , w, V} :
ŷ (x, θ) = h(θ) (x) θ + g(θ) (x)
where g(θ) and h(θ) do not depend on the value of θ.
E.g. for second-order effects (\theta = v_{l,f}):

\hat{y}(x, v_{l,f}) := \underbrace{w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \sum_{\substack{f'=1 \\ (f' \neq f) \vee (l \notin \{i,j\})}}^{k} v_{i,f'} v_{j,f'} x_i x_j}_{g_{(v_{l,f})}(x)} + v_{l,f} \underbrace{x_l \sum_{i=1, i \neq l}^{p} v_{i,f} x_i}_{h_{(v_{l,f})}(x)}
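Multilinearity is easy to check numerically: holding everything else fixed, the prediction is an affine function of any single parameter. A minimal sketch (the sizes and the chosen parameter V[l, f] are arbitrary, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 4, 2
w0, w, V = 0.3, rng.normal(size=p), rng.normal(size=(p, k))
x = rng.normal(size=p)

def predict():
    # Plain O(p^2 k) FM equation, good enough for a tiny check.
    pair = sum(float(V[i] @ V[j]) * x[i] * x[j]
               for i in range(p) for j in range(i + 1, p))
    return w0 + float(w @ x) + pair

l, f = 2, 1                        # theta = V[l, f]

def y_at(theta):
    V[l, f] = theta
    return predict()

y0, y1 = y_at(0.0), y_at(1.0)
h, g = y1 - y0, y0                 # slope h_(theta)(x) and offset g_(theta)(x)
assert abs(y_at(2.5) - (h * 2.5 + g)) < 1e-9   # affine in V[l, f]
```

The check passes because the pairwise sum runs over j > i only, so no parameter ever appears squared.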
Learning
Using these properties, learning algorithms can be developed:

- L2-regularized regression and classification:
  - Stochastic gradient descent [Rendle, 2010]
  - Alternating least squares / coordinate descent [Rendle et al., 2011; Rendle, 2012]
  - Markov chain Monte Carlo (for Bayesian FMs) [Freudenthaler et al., 2011; Rendle, 2012]
- L2-regularized ranking:
  - Stochastic gradient descent [Rendle, 2010]

All the proposed learning algorithms have a runtime of O(k N_z(X) i), where i is the number of iterations and N_z(X) is the number of non-zero elements in the design matrix X.
Stochastic Gradient Descent (SGD)
- For each training case (x, y) ∈ S, SGD updates the FM model parameter \theta using:

  \theta' = \theta - \alpha \left( (\hat{y}(x) - y)\, h_{(\theta)}(x) + \lambda_{(\theta)}\, \theta \right)

- \alpha is the learning rate / step size.
- \lambda_{(\theta)} is the regularization value of the parameter \theta.
- SGD can easily be applied to other loss functions.
[Rendle, 2010]
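A single SGD step for squared loss can be sketched as follows, using the coefficients h_(θ)(x) that multilinearity provides as the gradients. Names are illustrative, and a single shared regularization value `lam` stands in for the per-parameter λ_(θ):

```python
import numpy as np

def sgd_step(w0, w, V, x, y, alpha=0.01, lam=0.01):
    """One SGD update on a single case (x, y) for squared loss (sketch)."""
    xv = V.T @ x                            # xv[f] = sum_i x_i v_{i,f}
    y_hat = w0 + w @ x + 0.5 * np.sum(xv ** 2 - (x[:, None] * V) ** 2)
    err = y_hat - y                         # (y_hat - y) factor of the gradient
    w0 = w0 - alpha * (err + lam * w0)      # h_{w0}(x) = 1
    w = w - alpha * (err * x + lam * w)     # h_{w_i}(x) = x_i
    # h_{v_{i,f}}(x) = x_i * xv[f] - v_{i,f} * x_i^2
    h_V = x[:, None] * xv[None, :] - V * (x ** 2)[:, None]
    V = V - alpha * (err * h_V + lam * V)
    return w0, w, V
```

For a small enough α, one such step reduces the squared error on the training case.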
Alternating Least Squares (ALS)
- Elementwise ALS updates each FM model parameter \theta using:

  \theta' = - \frac{\sum_{(x,y) \in S} \left( g_{(\theta)}(x) - y \right) h_{(\theta)}(x)}{\sum_{(x,y) \in S} h_{(\theta)}(x)^{2} + \lambda_{(\theta)}}

- Using caches of intermediate results, the runtime for updating all model parameters is O(k N_z(X)).
- The advantage of ALS compared to SGD is that no learning rate has to be specified.
- ALS can be extended to classification [Rendle, 2012].
[Rendle et al., 2011]
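The closed-form elementwise update can be written directly from the formula. In this sketch the per-case values h_(θ)(x) and g_(θ)(x) − y are passed in as precomputed arrays; the O(k N_z(X)) caching scheme of the paper is omitted:

```python
import numpy as np

def als_update(h, g_minus_y, lam):
    """theta' = - sum((g - y) * h) / (sum(h^2) + lambda), elementwise ALS."""
    return -np.sum(g_minus_y * h) / (np.sum(h ** 2) + lam)

# Sanity check on a one-parameter linear model y ~ theta * x,
# where h_(theta)(x) = x and g_(theta)(x) = 0:
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta = als_update(h=x, g_minus_y=-y, lam=0.0)   # least-squares slope: 2.0
```

With λ_(θ) = 0 and a single linear parameter, the update is the ordinary least-squares solution, which is why no step size appears anywhere.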
Bayesian FMs (BFM)
w , w
 v , v
Factorization Machines
0 , 0
 , 
w
v
w
j=1,...,p
wj
j=1,...,p
vj
vj
wj
x ij
yi
w0
w , w
0
0
x ij
yi
w0
i=1,...,n

v
w , w
0
i=1,...,n
0

 0 , 0
w0 ∼ N (µw0 , 1/λw0 ),
µw ∼ N (µ0 , γ0 λw ),
∀j ∈ {1, . . . , p} : wj ∼ N (µw , 1/λw ),
λw ∼ Γ(αλ , βλ ),
µv ,f ∼ N (µ0 , γ0 λv ,f ),
vj ∼ N (µv , Λ−1
v )
λv ,f ∼ Γ(αλ , βλ )
[Freudenthaler et al., 2011]
Bayesian FMs (BFM)
w , w
 v , v
wj
vj
0 , 0
 , 
w
v
wj
vj
w
j=1,...,p
j=1,...,p
x ij
yi
w0
w , w
0
0
x ij
yi
w0
i=1,...,n

v
w , w
0
i=1,...,n
0

 0 , 0
- The SGD and ALS models correspond to the left model.
- The right side is a two-level model that integrates priors.
[Freudenthaler et al., 2011]
Bayesian FMs (BFM): Inference
- For Bayesian inference, an efficient Gibbs sampler can be derived.
- The Gibbs posterior distribution for each model parameter \theta is related to the ALS update.
- Sampling all model parameters once can be done in O(k N_z(X)) as well.
- Introducing hyperpriors and integrating over the priors has the advantage over ALS that the values of the priors are found 'automatically'.
- BFMs can be extended to classification [Rendle, 2012].
[Freudenthaler et al., 2011]
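The relation to ALS can be made concrete with a sketch of the conditional draw for one parameter. Assuming a Gaussian likelihood with noise precision α (not shown on the slides) and a normal prior, the conditional posterior is normal; the form below is the standard conjugate-normal result, assumed here rather than taken from the paper:

```python
import numpy as np

def gibbs_draw(h, g_minus_y, alpha, lam, mu_prior, rng):
    """Draw one FM parameter from its normal conditional posterior (sketch).

    precision = alpha * sum(h^2) + lambda
    mean      = (alpha * sum((y - g) * h) + lambda * mu_prior) / precision
    """
    prec = alpha * np.sum(h ** 2) + lam
    mean = (-alpha * np.sum(g_minus_y * h) + lam * mu_prior) / prec
    return rng.normal(mean, 1.0 / np.sqrt(prec))
```

As λ → 0 the posterior mean is exactly the ALS update, and with a very peaked likelihood the draw concentrates on it, which is the sense in which the Gibbs posterior is "related to the ALS".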
Outline
Tensor Factorization
Time-aware Factorization Models
Factorization Machines
Problem Setting
Standard Models
Factorization Machines
Applications
Summary
Applications
FMs are especially suited for ML problems where:
- Categorical variables have large domains.
- The number of predictor variables is large.
- Interactions between predictor variables are of interest.
- Several variables are involved.
(Context-aware) Recommender Systems
- Main variables:
  - User ID (categorical)
  - Item ID (categorical)
- Additional variables:
  - time
  - mood
  - user profile
  - item meta data
  - ...
- Examples: Netflix prize, Movielens, KDDCup 2011

[Illustration: User + Song + Time + Mood]
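Such a case is fed to an FM as one real-valued feature vector: each categorical variable becomes a block of indicator features. A tiny sketch with made-up sizes (dense here for readability; real data would be stored sparsely):

```python
import numpy as np

def one_hot_row(user, item, n_users, n_items):
    """Concatenated one-hot blocks for a (user, item) case; extra
    context such as time or mood would be appended the same way."""
    x = np.zeros(n_users + n_items)
    x[user] = 1.0                  # user block occupies indices [0, n_users)
    x[n_users + item] = 1.0        # item block follows it
    return x

x = one_hot_row(user=2, item=0, n_users=3, n_items=4)
# -> [0., 0., 1., 1., 0., 0., 0.]
```

On such an encoding, the FM's factorized pairwise term recovers exactly a matrix-factorization-style user-item interaction.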
Clickthrough Prediction
- Main variables:
  - User ID
  - Query ID
  - Ad/Link ID
- Additional variables:
  - query tokens
  - user profile
  - ...
- Example: KDDCup 2012 Track 2 (FM placed 3rd/171)

[Illustration: User + Query + Ad/Link]
Student Performance Prediction
- Main variables:
  - Student ID
  - Question ID
- Additional variables:
  - question hierarchy
  - sequence of questions
  - skills required
  - ...
- Examples: KDDCup 2010, Grockit Challenge (http://www.kaggle.com/c/WhatDoYouKnow) (FM placed 1st/241)

[Illustration: Student + Question]
Link Prediction in Social Networks
- Main variables:
  - Actor A ID
  - Actor B ID
- Additional variables:
  - profiles
  - actions
  - ...
- Example: KDDCup 2012 Track 1 (FM placed 2nd/658)

[Illustration: Actor A + Actor B]
libFM Software
libFM is an implementation of FMs:
- Model: second-order FMs
- Learning/inference: SGD, ALS, MCMC
- Classification and regression
- Uses the same data format as LIBSVM, LIBLINEAR [Lin et al.], SVMlight [Joachims].
- Supports variable grouping.
- Available with source code.
[http://www.libfm.org/]
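The LIBSVM-style input can be produced directly from categorical ids: each variable gets a contiguous index block and contributes one `index:1` entry per line. The offsets and ids below are made up for illustration; `libfm_line` is not part of libFM itself:

```python
def libfm_line(y, categories, offsets):
    """One LIBSVM-format line, 'target index:value ...', with one
    active indicator per categorical variable (sketch)."""
    feats = " ".join(f"{off + c}:1" for off, c in zip(offsets, categories))
    return f"{y} {feats}"

# 3 users, 5 items, 7 weekdays -> index blocks start at 0, 3, 8
line = libfm_line(5, categories=(1, 4, 2), offsets=(0, 3, 8))
# -> "5 1:1 7:1 10:1"
```

Real-valued context (e.g. a normalized timestamp) would simply use a value other than 1 at its index.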
Outline
Tensor Factorization
Time-aware Factorization Models
Factorization Machines
Problem Setting
Standard Models
Factorization Machines
Applications
Summary
Summary
- Real-valued predictor variables can encode information from variables of other domains, e.g. categorical variables.
- Applying linear regression to large categorical domains results in too little expressiveness; applying polynomial regression results in too much expressiveness.
- Factorization Machines (FMs) are a polynomial regression model with factorized interaction parameters.
- FMs bring together the generality of standard machine learning methods with the prediction quality of factorization models.
- FMs are multilinear and can be computed efficiently.
Y. Cai, M. Zhang, D. Luo, C. Ding, and S. Chakravarthy.
Low-order tensor decompositions for social tagging recommendation.
In Proceedings of the fourth ACM international conference on Web
search and data mining, WSDM ’11, pages 695–704, New York, NY,
USA, 2011. ACM.
J. Carroll and J. Chang.
Analysis of individual differences in multidimensional scaling via an
n-way generalization of eckart-young decomposition.
Psychometrika, 35:283–319, 1970.
A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari.
Nonnegative Matrix and Tensor Factorizations: Applications to
Exploratory Multi-way Data Analysis and Blind Source Separation.
Wiley Publishing, 2009.
L. Drumond, S. Rendle, and L. Schmidt-Thieme.
Predicting RDF triples in incomplete knowledge bases with tensor factorization.
In Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC '12, pages 326–331, New York, NY, USA, 2012. ACM.
T. Franz, A. Schultz, S. Sizov, and S. Staab.
Triplerank: Ranking semantic web data by tensor decomposition.
In Proceedings of the 8th International Semantic Web Conference,
ISWC ’09, pages 213–228, Berlin, Heidelberg, 2009. Springer-Verlag.
C. Freudenthaler, L. Schmidt-Thieme, and S. Rendle.
Bayesian factorization machines.
In Workshop on Sparse Representation and Low-rank Approximation,
NIPS 2011, 2011.
R. A. Harshman.
Foundations of the PARAFAC procedure: models and conditions for an
'exploratory' multimodal factor analysis.
UCLA Working Papers in Phonetics, pages 1–84, 1970.
A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme.
Information retrieval in folksonomies: Search and ranking.
In Y. Sure and J. Domingue, editors, The Semantic Web: Research and Applications, volume 4011 of Lecture Notes in Computer Science, pages 411–426, Heidelberg, June 2006. Springer.
A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver.
Multiverse recommendation: n-dimensional tensor factorization for
context-aware collaborative filtering.
In RecSys ’10: Proceedings of the fourth ACM conference on
Recommender systems, pages 79–86, New York, NY, USA, 2010.
ACM.
T. G. Kolda and B. W. Bader.
Tensor decompositions and applications.
SIAM Review, 51(3):455–500, September 2009.
Y. Koren.
Factorization meets the neighborhood: a multifaceted collaborative
filtering model.
In KDD ’08: Proceeding of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 426–434,
New York, NY, USA, 2008. ACM.
Y. Koren.
Collaborative filtering with temporal dynamics.
In KDD ’09: Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 447–456,
New York, NY, USA, 2009. ACM.
L. D. Lathauwer, B. D. Moor, and J. Vandewalle.
A multilinear singular value decomposition.
SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000.
S. Rendle.
Factorization machines.
In Proceedings of the 10th IEEE International Conference on Data
Mining. IEEE Computer Society, 2010.
S. Rendle.
Factorization machines with libFM.
ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme.
Factorizing personalized Markov chains for next-basket recommendation.
In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 811–820, New York, NY, USA, 2010. ACM.
S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme.
Fast context-aware recommendations with factorization machines.
In Proceedings of the 34th ACM SIGIR Conference on Research and
Development in Information Retrieval. ACM, 2011.
S. Rendle and L. Schmidt-Thieme.
Pairwise interaction tensor factorization for personalized tag
recommendation.
In WSDM ’10: Proceedings of the third ACM international
conference on Web search and data mining, pages 81–90, New York,
NY, USA, 2010. ACM.
J.-T. Sun, H.-J. Zeng, H. Liu, Y. Lu, and Z. Chen.
CubeSVD: a novel approach to personalized web search.
In WWW ’05: Proceedings of the 14th international conference on
World Wide Web, pages 382–390, New York, NY, USA, 2005. ACM.
L. Tucker.
Some mathematical notes on three-mode factor analysis.
Psychometrika, 31:279–311, 1966.
L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. G. Carbonell.
Temporal collaborative filtering with bayesian probabilistic tensor
factorization.
In Proceedings of SIAM Data Mining, 2010.
A. Zimdars, D. M. Chickering, and C. Meek.
Using temporal data for making recommendations.
In UAI ’01: Proceedings of the 17th Conference in Uncertainty in
Artificial Intelligence, pages 580–588, San Francisco, CA, USA,
2001. Morgan Kaufmann Publishers Inc.