Tensor Factorization · Time-aware Factorization Models · Factorization Machines

Factorization Models for Recommender Systems and Other Applications
Part II

Lars Schmidt-Thieme, Steffen Rendle
Tutorial at KDD Conference, 12th August 2012, Beijing
Social Network Analysis, University of Konstanz
Outline

Tensor Factorization
  Problem Setting
  Models
  Learning
  Examples for Applications
  Summary
Time-aware Factorization Models
Factorization Machines
Problem Setting

- Predictor variables: m variables with categorical domains I1, ..., Im.
- Target y: real-valued (regression), binary (classification), or scores (ranking).
- Supervised task: a set of observations S = {(i1, ..., im, y), ...}
Example: Social Tagging

[Figure: a tagging graph linking users (u1, u2), tags (t1, t2, t3), and items (i1, i2), together with the corresponding binary user-specific tag × item matrices.]

Tagging can be expressed as a function over three categorical domains:
  y : U × I × T → {0, 1}

(Example data: http://last.fm)
Example: Querying Incomplete RDF-Graphs

- Task: answer queries about subject-predicate pairs, e.g. "What is McCartney a member of?"
- An RDF graph can be expressed as a function over three categorical domains:
    y : S × P × O → {0, 1}
Notation: Tensors and Functions

Models in this setting are functions:
  ŷ : I1 × ... × Im → Y

All possible targets and predictions can be written equivalently as an m-order tensor / multiway array:
  Y ∈ Y^(|I1| × ... × |Im|),   Ŷ ∈ Y^(|I1| × ... × |Im|)
where
  y(i1, ..., im) = y_{i1,...,im},   ŷ(i1, ..., im) = ŷ_{i1,...,im}
Notation: Tensor-Matrix Product

- Let T ∈ R^(k1 × ... × km) be an m-order tensor and M ∈ R^(n × kl) a matrix.
- The mode-l tensor-matrix product ×l is defined as:
    (T ×l M)_{i1,...,i(l−1),j,i(l+1),...,im} := Σ_{il=1}^{kl} t_{i1,...,im} · m_{j,il}
- The result is a tensor T* of dimension R^(k1 × ... × k(l−1) × n × k(l+1) × ... × km).
- The size of the l-th mode changes from kl to n.

[Figure: multiplying the l-th mode of T (size kl) by M ∈ R^(n × kl) yields T* with the l-th mode resized to n.]
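As an illustrative sketch (my own, not from the tutorial), the mode-l product above can be computed in NumPy by contracting the l-th axis of T with the columns of M and moving the new axis back into place:

```python
import numpy as np

def mode_l_product(T, M, l):
    """Mode-l tensor-matrix product: contracts the l-th mode of T
    (size k_l) with the columns of M (shape n x k_l), so the l-th
    mode of the result has size n."""
    # tensordot appends M's remaining axis (size n) at the end;
    # move it back to position l.
    out = np.tensordot(T, M, axes=([l], [1]))
    return np.moveaxis(out, -1, l)

T = np.random.rand(2, 3, 4)   # k1=2, k2=3, k3=4
M = np.random.rand(5, 3)      # n=5, k2=3
Tstar = mode_l_product(T, M, l=1)
print(Tstar.shape)            # the second mode is resized from 3 to 5
```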
Parallel Factor Analysis (PARAFAC)

[Figure: Y ≈ C ×1 V(1) ×2 V(2) ×3 V(3), where C is the k × k × k identity tensor.]

m-order PARAFAC in tensor product notation:
  Ŷ := C ×1 V(1) ×2 ... ×m V(m)
with model parameters
  V(l) ∈ R^(|Il| × k),   ∀l ∈ {1, ..., m}
and where C is the identity tensor:
  C ∈ R^(k × ... × k),   c_{j1,...,jm} := δ(j1 = ... = jm)

[Harshman 1970, Carroll 1970]
Parallel Factor Analysis (PARAFAC)

m-order PARAFAC in element-wise notation:
  ŷ(i1, ..., im) := Σ_{f=1}^{k} v(1)_{i1,f} · ... · v(m)_{im,f} = Σ_{f=1}^{k} Π_{l=1}^{m} v(l)_{il,f}
with model parameters
  V(l) ∈ R^(|Il| × k),   ∀l ∈ {1, ..., m}

[Harshman 1970, Carroll 1970]
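A minimal sketch (my own construction) of the element-wise PARAFAC prediction: pick one factor row per mode, multiply them element-wise, and sum over the k factors:

```python
import numpy as np

def parafac_predict(factors, idx):
    """PARAFAC prediction ŷ(i1,...,im) = Σ_f Π_l V[l][i_l, f].
    factors: list of m factor matrices V[l] of shape (|I_l|, k).
    idx: tuple of m indices (i1, ..., im)."""
    rows = [V[i] for V, i in zip(factors, idx)]  # one length-k row per mode
    return np.prod(rows, axis=0).sum()

# Toy example: m = 3 modes, k = 2 factors.
rng = np.random.default_rng(0)
V = [rng.standard_normal((4, 2)),
     rng.standard_normal((5, 2)),
     rng.standard_normal((3, 2))]
print(parafac_predict(V, (1, 2, 0)))
```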
Parallel Factor Analysis (PARAFAC)

Notes
- For an m = 2-order tensor (i.e. a matrix), PARAFAC is the same as matrix factorization.
- Sometimes a modified PARAFAC with a diagonal core c_{f,...,f} =: λf and factors of unit length is used:
    ŷ(i1, ..., im) := Σ_{f=1}^{k} λf · v(1)_{i1,f} · ... · v(m)_{im,f} = Σ_{f=1}^{k} λf Π_{l=1}^{m} v(l)_{il,f}
- Other constraints, e.g. non-negativity or symmetry, can be imposed.
- PARAFAC is also called Canonical Decomposition (CANDECOMP).

[e.g. Kolda et al. 2009, Cichocki et al. 2009]
Tucker Decomposition (TD)

[Figure: Y ≈ C ×1 V(1) ×2 V(2) ×3 V(3), with a free core tensor C ∈ R^(k1 × k2 × k3).]

m-order Tucker Decomposition in tensor product notation:
  Ŷ := C ×1 V(1) ×2 ... ×m V(m)
where C and the V are model parameters:
  C ∈ R^(k1 × ... × km),   V(l) ∈ R^(|Il| × kl),   ∀l ∈ {1, ..., m}

[Tucker 1966]
Tucker Decomposition (TD)

m-order Tucker Decomposition in element-wise notation:
  ŷ(i1, ..., im) := Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} · v(1)_{i1,f1} · ... · v(m)_{im,fm} = Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} Π_{l=1}^{m} v(l)_{il,fl}
with model parameters:
  C ∈ R^(k1 × ... × km),   V(l) ∈ R^(|Il| × kl),   ∀l ∈ {1, ..., m}

[Tucker 1966]
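A sketch (my own, not from the slides) of the element-wise Tucker prediction for m = 3 modes, contracting the core with one factor row per mode via np.einsum:

```python
import numpy as np

def tucker_predict(core, factors, idx):
    """Tucker prediction for m = 3:
    ŷ(i1,i2,i3) = Σ_{f1,f2,f3} c_{f1,f2,f3} · V1[i1,f1] · V2[i2,f2] · V3[i3,f3]."""
    r1, r2, r3 = (V[i] for V, i in zip(factors, idx))  # one row per mode
    return np.einsum('abc,a,b,c->', core, r1, r2, r3)

rng = np.random.default_rng(1)
C = rng.standard_normal((2, 3, 2))   # free core, k1=2, k2=3, k3=2
V = [rng.standard_normal((4, 2)),
     rng.standard_normal((5, 3)),
     rng.standard_normal((6, 2))]
print(tucker_predict(C, V, (0, 4, 2)))
```

Note the O(k1 · k2 · k3) cost of the contraction, versus O(k · m) for PARAFAC.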
Tucker Decomposition (TD)

Notes
- For an m = 2-order tensor (i.e. a matrix), TD is different from matrix factorization:
    ŷ_TD(i1, i2) = Σ_{f1=1}^{k1} Σ_{f2=1}^{k2} c_{f1,f2} · v(1)_{i1,f1} · v(2)_{i2,f2}  ≠  Σ_{f=1}^{k} v(1)_{i1,f} · v(2)_{i2,f} = ŷ_MF(i1, i2)
- Sometimes orthogonality constraints on the V are imposed.
- Other constraints, e.g. non-negativity or symmetry, can be imposed.

[e.g. Kolda et al. 2009, Cichocki et al. 2009]
PARAFAC vs. TD

- PARAFAC:
    ŷ(i1, ..., im) := Σ_{f=1}^{k} v(1)_{i1,f} · ... · v(m)_{im,f} = Σ_{f=1}^{k} Π_{l=1}^{m} v(l)_{il,f}
- TD:
    ŷ(i1, ..., im) := Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} · v(1)_{i1,f1} · ... · v(m)_{im,fm} = Σ_{f1=1}^{k1} ... Σ_{fm=1}^{km} c_{f1,...,fm} Π_{l=1}^{m} v(l)_{il,fl}
- TD is more general, as C is free.
- Computational complexity:
  - PARAFAC: O(k · m)
  - TD: O(k^m) if k1 = ... = km =: k
Tensor Factorization as Machine Learning Models

- PARAFAC and TD model m-ary interactions directly.
- PARAFAC and TD have problems when the number of observations for some levels is small:
  - E.g. if there are no observations for a level l, then the estimated factors are v_l = 0 (in the case of L2 regularization), and thus all predictions involving this level will be 0 as well (for PARAFAC and TD).
  - Similar problems can occur if the number of observations of a level is small.
- Standard L2 regularization alone cannot solve this problem.
- If an m-ary interaction cannot be estimated reliably, often a lower-level interaction (e.g. an (m − 1)-ary one) can be estimated reliably.
TF with Lower-level Interactions

Model equation of m-ary tensor factorization with nested lower-level interactions:
  ŷ_LLTF(i1, ..., im) := c + Σ_{l=1}^{m} w(l)_{il} + Σ_{l1=1}^{m} Σ_{l2>l1} ŷ_TF(i_{l1}, i_{l2}) + ... + ŷ_TF(i1, ..., im)

Model parameters:
  c ∈ R,   w(l) ∈ R^(|Il|),   ...,   V(l) ∈ R^(|Il| × k)

- Estimating a lower-level effect (e.g. a pairwise one) reliably is easier than estimating a higher-level one.
- Often lower-level effects can explain the data sufficiently, and higher-level ones can be dropped completely.

[e.g. Rendle et al. 2010; Cai et al. 2011]
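A minimal sketch (my own, under simplifying assumptions) of the nested model truncated after the pairwise terms, i.e. bias + unary effects + factorized pairwise interactions; sharing one factor matrix per mode across all pairs is an assumption here, per-pair factors are also common:

```python
import numpy as np

def lltf_predict(c, w, V, idx):
    """Bias + unary + factorized pairwise interactions for m modes.
    c: global bias; w: list of bias vectors w[l] of length |I_l|;
    V: list of factor matrices V[l] of shape (|I_l|, k), shared across
    pairs (simplifying assumption); idx: tuple (i1, ..., im)."""
    m = len(idx)
    y = c + sum(w[l][idx[l]] for l in range(m))
    for l1 in range(m):
        for l2 in range(l1 + 1, m):
            # factorized pairwise effect ŷ_TF(i_l1, i_l2) as an inner product
            y += V[l1][idx[l1]] @ V[l2][idx[l2]]
    return y

rng = np.random.default_rng(2)
w = [rng.standard_normal(4), rng.standard_normal(5), rng.standard_normal(3)]
V = [rng.standard_normal((4, 2)),
     rng.standard_normal((5, 2)),
     rng.standard_normal((3, 2))]
print(lltf_predict(0.1, w, V, (1, 0, 2)))
```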
Standard Fitting Algorithms

Standard algorithms assume:
- Y is observed completely, i.e. for all combinations (i1, ..., im) ∈ I1 × ... × Im, y_{i1,...,im} is known.
  - Missing values are imputed.
  - In ML problems most elements are missing (often > 99.9%).
- Optimization is done with respect to least squares:
    argmin_Θ Σ_{(i1,...,im) ∈ I1 × ... × Im} (y_{i1,...,im} − ŷ_{i1,...,im})²
  - ML: other losses are also of interest, e.g. for classification, ranking, ...
- No regularization / prior assumptions.
  - ML: prior knowledge should be included.
Example: Higher-Order SVD (HOSVD)

HOSVD is one such approximate fitting algorithm:
- Loss: least-squares loss without regularization; no missing-value treatment.
- Model: Tucker decomposition.
- Algorithm:
  - For each mode l:
    - Unfold Y to matrix form.
    - Compute the SVD.
    - V(l) are the left singular vectors of the SVD.
  - Compute the core tensor C = Y ×1 (V(1))ᵀ ×2 (V(2))ᵀ ×3 ... ×m (V(m))ᵀ.
- Additional Alternating Least-Squares (ALS) steps can improve the fit.

[Tucker 1966, Lathauwer et al. 2000]
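The algorithm above can be sketched in NumPy (a truncated HOSVD for a 3-mode tensor; my own illustration, not the tutorial's code):

```python
import numpy as np

def hosvd(Y, ranks):
    """Truncated HOSVD: per mode, unfold Y, take the top-k left singular
    vectors as V(l); then contract Y with the (V(l))ᵀ to get the core C."""
    factors = []
    for l, k in enumerate(ranks):
        # mode-l unfolding: mode l becomes the rows
        unfolding = np.moveaxis(Y, l, 0).reshape(Y.shape[l], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :k])            # V(l): top-k left singular vectors
    C = Y
    for l, V in enumerate(factors):         # C = Y ×1 (V(1))ᵀ ×2 ... ×m (V(m))ᵀ
        C = np.moveaxis(np.tensordot(C, V, axes=([l], [0])), -1, l)
    return C, factors

Y = np.random.default_rng(3).standard_normal((6, 5, 4))
C, Vs = hosvd(Y, ranks=(3, 3, 2))
print(C.shape)  # core of size 3 x 3 x 2
```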
Machine Learning with TF Models

- Optimize only w.r.t. the observed elements of Y.
  - Comparable to MF: Weighted Low-Rank Approximations [Srebro et al. 2003]
- Choose the loss / likelihood according to the target variable / task.
  - E.g. logit for classification, pairwise classification for ranking, etc.
- Add priors / regularization to the model parameters.
  - E.g. L2 / Gaussian priors.
- Model lower-level interactions.
  - E.g. add factorized pairwise interactions [Rendle et al. 2010]

TF models are multilinear ⇒ simple SGD or ALS algorithms can be used for optimization.
Personalized Tag Recommendation

Task: recommend to a user a (personalized) list of tags for a specific item.

[Figure: per-user tag × item matrices, with observed tag assignments marked "+".]

- U ... users
- I ... items
- T ... tags
- S ⊆ U × I × T ... observed tags
- P_S = {(u, i) | ∃t ∈ T : (u, i, t) ∈ S} ... observed tagging posts

[Hotho et al. 2006]
Evaluation: Prediction Quality

[Figure: Top-n F-measure (n = 2, ..., 10) on BibSonomy (k = 64) and Last.fm (k = 128) for BPR-PITF, BPR-CD, RTF-TD, FolkRank, PageRank, and HOSVD.]

- PageRank / FolkRank: PageRank adapted for tag recommendation [Hotho et al. 2006]
- HOSVD: TD fitted by least squares, no missing values, no regularization [Symeonidis et al. 2008]
- RTF-TD: TD model optimized for regularized ranking [Rendle et al. 2009]
- BPR-PITF, BPR-CD: PITF / PARAFAC models optimized for regularized ranking [Rendle et al. 2010]

[Rendle et al. 2010]
Evaluation: Learning Runtime

[Figure: Last.fm, prediction quality (Top-3 F-measure) vs. learning runtime for BPR-PITF 64, BPR-CD 64, and RTF-TD 64; left panel in days (0-30), right panel in minutes (0-120).]

[Rendle et al. 2010]
ECML/PKDD Discovery Challenge 2009

Rank  Method                                          Top-5 F-Measure
1     BPR-PITF + adaptive list size                   0.35594
–     BPR-PITF (not submitted)                        0.345
2     Relational Classification [Marinho et al. 09]   0.33185
3     Content-based [Lipczak et al. 09]               0.32461
4     Content-based [Zhang et al. 09]                 0.32230
5     Content-based [Ju and Hwang 09]                 0.32134
6     Personomy translation [Wetzker et al. 09]       0.32124
...   ...                                             ...

Task 2: ECML/PKDD Challenge 2009,
http://www.kde.cs.uni-kassel.de/ws/dc09/results

[Rendle et al. 2010]
Querying Incomplete RDF-Graphs

- Task: answer queries about subject-predicate pairs, e.g. "What is McCartney a member of?"
- An RDF graph can be expressed as a function over three categorical domains:
    y : S × P × O → {0, 1}

[Franz et al. 2009, Drumond et al. 2012]
Prediction Quality

[Figure: prediction quality on RDF query answering for the methods below.]

- CD Dense: PARAFAC optimized for least squares, no missing values, no regularization.
- CD-BPR: PARAFAC optimized for regularized ranking.
- PITF-BPR: PITF (pairwise interactions) optimized for regularized ranking.

[Drumond et al. 2012]
Other Applications: Examples

- Multiverse Recommendation [Karatzoglou et al. 2010]
  - Task: context-aware rating prediction.
  - Model: Tucker Decomposition.
  - Missing values are handled.
  - Loss: task dependent, e.g. MAE, RMSE.
  - Regularization: L1, L2.
  - Algorithm: Stochastic Gradient Descent (SGD).
- CubeSVD [Sun et al. 2005]
  - Task: clickthrough prediction.
  - Approach: HOSVD.
Summary

- Prediction functions over m categorical variables can be modeled with tensor factorization.
- Parallel Factor Analysis (PARAFAC) generalizes matrix factorization to m modes.
- Tucker Decomposition allows a free core tensor (high computational complexity!).
- Lower-order interactions, e.g. pairwise ones, should be integrated for better prediction quality in sparse settings.
- For learning, missing values, the loss/likelihood, and regularization/priors should be considered.

Problem: only categorical variables can be handled.
Outline

Tensor Factorization
Time-aware Factorization Models
  Models
  Summary
Factorization Machines
Time-Aware: Problem Setting

- 3 predictor variables:
  - two variables with categorical domains I and J,
  - one numerical variable (time) t ∈ R.
- Target y: real-valued (regression), binary (classification), or scores (ranking).
- Supervised task: a set of observations S = {(i, j, t, y), ...}
- Modelling: a function ŷ : I × J × R → Y.

[Figure: observations over I and J arranged along the time axis.]
Tensor Factorization
Time-aware Factorization Models
Factorization Machines
Tensor Factorization

1. Discretize the time variable, e.g. by binning ⇒ 3 categorical domains I, J, T:

   b : R → T,   e.g. b(t) := ⌊t / (24 · 60 · 60)⌋

2. Apply tensor factorization, e.g. Tucker Decomposition or PARAFAC:

   ŷ(i, j, t) := Σ_{f=1}^k  v^I_{i,f} · v^J_{j,f} · v^T_{b(t),f}

3. Smooth the time factors V^T, s.th. nearby points in time have similar factors, e.g. by regularization:

   v^T_{t+1,f} ~ N(v^T_{t,f}, 1/λ_T),   ∀t ∈ T, f ∈ {1, ..., k}

For learning/inference, e.g. an MCMC sampler can be used.

[Xiong et al. 2010]
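The discretization and PARAFAC steps above can be sketched in a few lines. This is a minimal sketch with toy random factors; the day-sized bin (24 · 60 · 60 seconds) follows the slide, everything else (names, sizes) is illustrative.

```python
import numpy as np

def time_bin(t, bin_seconds=24 * 60 * 60):
    """Discretize a timestamp into a day index: b(t) = floor(t / 86400)."""
    return int(t // bin_seconds)

def parafac_score(V_I, V_J, V_T, i, j, t):
    """PARAFAC prediction sum_f v^I_{i,f} * v^J_{j,f} * v^T_{b(t),f}."""
    return float(np.sum(V_I[i] * V_J[j] * V_T[time_bin(t)]))

# toy factors: 3 entities in I, 4 in J, 2 time bins, k = 2
rng = np.random.default_rng(0)
V_I, V_J, V_T = rng.normal(size=(3, 2)), rng.normal(size=(4, 2)), rng.normal(size=(2, 2))
score = parafac_score(V_I, V_J, V_T, i=0, j=1, t=90000)  # t = 90000 s falls into bin 1
```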
Time-Aware Matrix Factorization

ŷ(i, j, t) := Σ_{f=1}^k  w_{i,f}(t) · h_{j,f}(t)

where the factor matrices W and H depend on the time t:

W : R → R^{|I|×k},   H : R → R^{|J|×k}

[Koren 2009]
Time-Aware Matrix Factorization

Modeling time-dependent factors, e.g. for W:

- Constant:
  w_{i,f}(t) := w̃_{i,f},   W̃ ∈ R^{|I|×k}
- Linear:
  w_{i,f}(t) := w̃_{i,f} + z_{i,f} · t,   W̃ ∈ R^{|I|×k}, Z ∈ R^{|I|×k}
- Binning with function b:
  w_{i,f}(t) := w̃_{i,f,b(t)},   W̃ ∈ R^{|I|×k×|img(b)|}
- Spline with m_i predefined control points at positions t_{i,1}, ..., t_{i,m_i}:
  w_{i,f}(t) := [ Σ_{l=1}^{m_i} w̃_{i,f,l} exp(−γ|t − t_{i,l}|) ] / [ Σ_{l=1}^{m_i} exp(−γ|t − t_{i,l}|) ],   W̃ ∈ R^{|I|×k×m_i}
- Linear combinations of the functions above.

[Koren 2009]
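The spline-based factor function above is a kernel-smoothed average over the control-point factors. A minimal sketch (the function name, gamma, and the control points are illustrative, not from the slides):

```python
import numpy as np

def spline_factor(t, control_times, control_factors, gamma=1.0):
    """w_{i,f}(t): weighted average of control-point factors, with weights
    exp(-gamma * |t - t_l|) normalized to sum to one."""
    weights = np.exp(-gamma * np.abs(t - np.asarray(control_times, dtype=float)))
    weights /= weights.sum()
    return float(weights @ np.asarray(control_factors, dtype=float))
```

With a large gamma, the value near a control point approaches that point's factor; with a small gamma, it approaches the plain average of all control factors.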
Time-Aware Matrix Factorization

Choices for the timeSVD++ model for the Netflix challenge:

- User factors W: linear combination of
  - constant
  - linear effect
  - binning with bin size 1
- Item factors H: constant
- Additional (time-unaware) implicit indicators (from SVD++ [Koren 2008])

For learning, e.g. an SGD algorithm can be used.

[Koren 2009]
Comparison

- Time-aware MF with binning (TAMF) and tensor factorization with discretization (TF) treat the time variable similarly:

  ŷ^TAMF(i, j, t) := Σ_{f=1}^k  w_{i,f,b(t)} · h_{j,f}

  ŷ^TF(i, j, t) := Σ_{f=1}^k  w_{i,f} · h_{j,f} · z_{b(t),f}

- Main difference:
  - In tensor factorization, the (i,t)-interaction is factorized.
  - In time-aware MF, the (i,t)-interaction is modeled unfactorized.
Discussion

- Binning and splines cannot make use of time for future events:
  - Future bins are empty, so their variables cannot be estimated.
  - Variables at (future) control points of splines cannot be estimated.
- Seasonal time indicators can help, e.g. weekday, holiday, Christmas, etc.
- Other approach: use qualitative/sequential information.
Sequential Prediction

[Figure: four users' basket sequences B_{t−3}, B_{t−2}, B_{t−1} over items {a, b, c, d, e}; the next basket B_t is unknown ("?").]

- Task: Which items will be selected next?

[e.g. Zimdars et al. 2001, Rendle et al. 2010]
Markov Chains

Markov chain of order 1:   p(j_t | l_{t−1})

- t is a sequential index.
- l_{t−1} is the item selected previously.
- The Markov chain is defined by a transition matrix A ∈ R^{|J|×|J|} (rows: from item, columns: to item):

        to A  to B  to C
  A      ?     ?     ?
  B      ?     ?     ?
  C      ?     ?     ?

- The model is (weakly) personalized by taking the last item selected by a user into account.
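The transition matrix of a first-order Markov chain can be estimated by maximum likelihood from observed item sequences. A minimal sketch (items are assumed to be 0-based integer indices; rows with no observed transitions are left as zeros):

```python
import numpy as np

def transition_matrix(sequences, n_items):
    """MLE of A[l, j] = p(j_t = j | l_{t-1} = l): count transitions, then
    normalize each row to a probability distribution."""
    counts = np.zeros((n_items, n_items))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

A = transition_matrix([[0, 1, 2], [0, 1, 1]], n_items=3)
# from item 0 the next item was always 1, so A[0, 1] == 1.0
```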
Factorized Personalized Markov Chain

Model equation:

  ŷ(i, j, t) := ẑ(i, j, s(i, t))

where s(i, t) is the entity previously (w.r.t. t) selected by i.

- ẑ can be modeled by TD, PARAFAC, PITF, ...
- For product recommendation, i is the user and j the current item.
- If a set of items was selected previously, one can average over this set:

  ŷ(i, j, t) := (1 / |s(i, t)|) Σ_{l ∈ s(i,t)} ẑ(i, j, l)

For learning, e.g. an SGD algorithm can be used.

[Rendle et al. 2010]
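A minimal sketch of the averaged score, assuming ẑ is modeled PITF-style as in the FPMC paper: a user-item term plus an item-previous-item term averaged over the previous basket. The factor-matrix names V_UI, V_IU, V_IL, V_LI are illustrative, not from the slides.

```python
import numpy as np

def fpmc_score(V_UI, V_IU, V_IL, V_LI, user, item, basket):
    """FPMC-style score: <v^UI_user, v^IU_item> plus the mean over the
    previous basket of <v^IL_item, v^LI_l>."""
    ui = float(V_UI[user] @ V_IU[item])
    il = float(np.mean([V_IL[item] @ V_LI[l] for l in basket]))
    return ui + il
```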
Outline

- Tensor Factorization
- Time-aware Factorization Models
  - Models
  - Summary
- Factorization Machines
Summary

- Time can be taken into account by:
  - discretization and applying tensor factorization,
  - time-variant factors, e.g. binning, linear effects, splines, ...
  - sequential indicators, e.g. the last item selected.
- With time variables, the dataset split should be considered:
  - Random split: absolute time can be modeled.
  - Time split: binning is not effective; time transformations that are predictive for future points in time should be chosen, e.g. seasonal or sequential ones.
Outline

- Tensor Factorization
- Time-aware Factorization Models
- Factorization Machines
  - Problem Setting
  - Standard Models
  - Factorization Machines
  - Applications
  - Summary
Motivation

All the presented factorization models work empirically very well, but:

- For each new problem, a new model, a new learning algorithm, and a new implementation are necessary.
- For some of the models, dozens of improved learning algorithms have been proposed (that work only with this particular model).
- For non-experts in factorization models, this is not practicable.
- How does this relate to standard models?
Data and Variable Representation

Many standard ML approaches work with real-valued input data (a design matrix). It allows one to represent, e.g.:

- any number of variables
- categorical domains, by using dummy indicator variables
- numerical domains
- set-categorical domains, by using dummy indicator variables

Using this representation allows a wide variety of standard models to be applied (e.g. linear regression, SVMs, etc.).
Data and Variable Representation: Example

2 categorical variables (User, Movie) with a rating target:

  User     | Movie        | Rating
  ---------|--------------|-------
  Alice    | Titanic      | 5
  Alice    | Notting Hill | 3
  Alice    | Star Wars    | 1
  Bob      | Star Wars    | 4
  Bob      | Star Trek    | 5
  Charlie  | Titanic      | 1
  Charlie  | Star Wars    | 5
  ...      | ...          | ...

Encoded as |U| + |I| real-valued variables: each feature vector x concatenates a dummy indicator for the user (A, B, C, ...) with a dummy indicator for the movie (TI, NH, SW, ST, ...), and the rating becomes the target y. E.g. the first case (Alice, Titanic, 5) becomes

  x(1) = (1, 0, 0, ... | 1, 0, 0, 0, ...),   y(1) = 5
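The dummy-indicator encoding described above can be sketched as follows. The helper name one_hot_design is hypothetical; the layout (user indicators first, then movie indicators) matches the example.

```python
import numpy as np

def one_hot_design(cases, users, movies):
    """Build a design matrix with one dummy indicator per user plus one per
    movie, and the rating as the target vector y."""
    u_idx = {u: i for i, u in enumerate(users)}
    m_idx = {m: i for i, m in enumerate(movies)}
    X = np.zeros((len(cases), len(users) + len(movies)))
    y = np.zeros(len(cases))
    for r, (user, movie, rating) in enumerate(cases):
        X[r, u_idx[user]] = 1.0
        X[r, len(users) + m_idx[movie]] = 1.0
        y[r] = rating
    return X, y

cases = [("Alice", "Titanic", 5), ("Bob", "Star Wars", 4)]
X, y = one_hot_design(cases, ["Alice", "Bob"], ["Titanic", "Star Wars"])
# X[0] = [1, 0, 1, 0], y[0] = 5
```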
Problem Setting

- Predictor variables: p variables of real-valued domain, X_1, ..., X_p ∈ R.
- Target y: real-valued (regression), binary (classification), scores (ranking).
- Supervised task: set of observations S = {(x_1, ..., x_p, y), ...}

This is the most common machine learning task.
Outline

- Tensor Factorization
- Time-aware Factorization Models
- Factorization Machines
  - Problem Setting
  - Standard Models
  - Factorization Machines
  - Applications
  - Summary
Standard Machine Learning Models

- Categorical variables can be represented with real-valued ones.
- There are many well-studied standard ML models that work with real-valued variables.
- Why shouldn't we work with them? Why do we need factorization models?
Linear Regression

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation:

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i

- Model parameters:

  w_0 ∈ R,   w ∈ R^p

  O(p) model parameters.
Polynomial Regression

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation (degree 2):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j≥i} w_{i,j} x_i x_j

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   W ∈ R^{p×p}

  O(p²) model parameters.
Application to Large Categorical Domains

[Design matrix as in the example above: user and movie dummy indicators, rating target.]

Applying regression models to this data (user u, movie i) leads to:

Linear regression:

  ŷ(x) = w_0 + w_u + w_i

Polynomial regression:

  ŷ(x) = w_0 + w_u + w_i + w_{u,i}

Matrix factorization (with biases):

  ŷ(u, i) = w_0 + w_u + h_i + ⟨w_u, h_i⟩
Application to Large Categorical Domains

For the recommender data of the example:

- Linear regression has no user-item interaction.
  - ⇒ Linear regression is not expressive enough.
- Polynomial regression includes pairwise interactions but cannot estimate them from the data.
  - n ≪ p²: the number of cases is much smaller than the number of model parameters.
  - The maximum-likelihood estimator for a pairwise effect is:

      w_{u,i} = y − w_0 − w_u − w_i,  if (u, i, y) ∈ S;   not defined, else.

  - Polynomial regression cannot generalize to any unobserved pairwise effect.
Factorization Models and Real-valued Variables

- Factorization models work well for categorical variables of large domain.
- Standard models are more flexible, as they allow real-valued predictor variables that can encode several kinds of variables.
- How can these advantages be combined?
Outline

- Tensor Factorization
- Time-aware Factorization Models
- Factorization Machines
  - Problem Setting
  - Standard Models
  - Factorization Machines
  - Applications
  - Summary
Factorization Machine (FM)

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation (degree 2):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   V ∈ R^{p×k}

Compared to polynomial regression:

- Model equation (degree 2):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j≥i} w_{i,j} x_i x_j

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   W ∈ R^{p×p}

[Rendle 2010, Rendle 2012]
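The degree-2 FM model equation can be computed directly from its definition. This naive form costs O(p²k); a sketch (the rows of V are the factor vectors v_i):

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Degree-2 FM: w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j,
    evaluated with an explicit double loop over variable pairs."""
    p = len(x)
    y = w0 + float(w @ x)
    for i in range(p):
        for j in range(i + 1, p):
            y += float(V[i] @ V[j]) * x[i] * x[j]
    return y
```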
Factorization Machine (FM)

- Let x ∈ R^p be an input vector with p predictor variables.
- Model equation (degree 3):

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j
        + Σ_{i=1}^p Σ_{j>i} Σ_{l>j} Σ_{f=1}^k v^(3)_{i,f} v^(3)_{j,f} v^(3)_{l,f} x_i x_j x_l

- Model parameters:

  w_0 ∈ R,   w ∈ R^p,   V ∈ R^{p×k},   V^(3) ∈ R^{p×k}

[Rendle 2010, Rendle 2012]
Factorization Machines: Discussion

- FMs work with real-valued input.
- FMs include variable interactions like polynomial regression.
- Model parameters for interactions are factorized.
- The number of model parameters is O(k p) (instead of O(p²) for polynomial regression).
- How are FMs related to the factorization models we have seen so far?
Matrix Factorization and Factorization Machines

Two categorical variables encoded with real-valued predictor variables:

[Design matrix: each row x has one user indicator (A, B, C, ...) and one movie indicator (TI, NH, SW, ST, ...) set to 1.]

With this data, the FM is identical to MF with biases:

  ŷ(x) = w_0 + w_u + w_i + ⟨v_u, v_i⟩,   where ⟨v_u, v_i⟩ is the MF part.
Tag-Recommendation with Factorization Machines

Three categorical variables encoded with real-valued predictor variables:

[Design matrix: each row x has one user indicator (A, B, C, ...), one song indicator (S1, S2, ...), and one tag indicator (T1, T2, ...) set to 1.]

With this data, the FM is a tensor factorization model with lower-order interactions (here up to pairwise ones):

  ŷ(x) := w_0 + w_i + w_u + w_t + ⟨v_u, v_t⟩ + ⟨v_i, v_t⟩ + ⟨v_u, v_i⟩
Time with Factorization Machines

Two categorical variables and time as a linear predictor:

[Design matrix: each row x has a user indicator, a movie indicator, and one real-valued time column (e.g. 0.2, 0.6, 0.61, ...).]

The FM model then corresponds to:

  ŷ(x) := w_0 + w_i + w_u + t · w_time + ⟨v_u, v_i⟩ + t · ⟨v_u, v_time⟩ + t · ⟨v_i, v_time⟩
Time with Factorization Machines

Two categorical variables and time discretized into bins (b(t)):

[Design matrix: each row x has a user indicator, a movie indicator, and a time-bin indicator (T1, T2, T3).]

With this data, a third-order FM includes the time-aware tensor factorization model described before:

  ŷ(x) := w_0 + w_i + w_u + w_{b(t)} + ⟨v_u, v_i⟩ + ⟨v_u, v_{b(t)}⟩ + ⟨v_i, v_{b(t)}⟩
        + Σ_{f=1}^k v^(3)_{u,f} v^(3)_{i,f} v^(3)_{b(t),f}

  (the last sum is the time tensor factorization model)
Time with Factorization Machines

Two categorical variables and time discretized into bins (b(t)):

[Design matrix: the user and time bin are crossed into one indicator (AT1, AT2, ..., CT3), next to the movie indicator.]

With this data, an FM includes the time-aware matrix factorization model with binned user-time interactions:

  ŷ(x) := w_0 + w_i + w_{u,b(t)} + ⟨v_{u,b(t)}, v_i⟩

  (the last term is MF with time-variant factors)

[Koren 2009]
SVD++

[Design matrix: each row x has a user indicator, a movie indicator, and, for the other movies the user rated (N_u), indicator columns with value 1/|N_u| each (e.g. 0.3 for three rated movies, 0.5 for two).]

With this data, the FM is identical to:

  ŷ(x) = w_0 + w_u + w_i + ⟨v_u, v_i⟩ + (1/√|N_u|) Σ_{l ∈ N_u} ⟨v_i, v_l⟩     (= SVD++)
       + (1/√|N_u|) Σ_{l ∈ N_u} ( w_l + ⟨v_u, v_l⟩ )
       + (1/|N_u|) Σ_{l ∈ N_u} Σ_{l' ∈ N_u, l' > l} ⟨v_l, v_{l'}⟩

  (the first line is the SVD++ model; the remaining terms are additional interactions the FM captures)

[Koren 2008]
Factorization Machines: Discussion II

- Representing categorical variables with real-valued variables and applying FMs is comparable to the factorization models that have been derived individually before (e.g. (biased) MF, tensor factorization, SVD++).
- FMs are much more flexible and can also handle non-categorical variables.
- Applying FMs is simple, as only data preprocessing has to be done (defining the real-valued predictor variables).
Computation Complexity

Factorization Machine model equation:

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j

- Trivial computation: O(p² k)
- Efficient computation can be done in O(p k).
- Making use of the many zeros in x, even in O(N_z(x) k), where N_z(x) is the number of non-zero elements in the vector x.
Efficient Computation

The model equation of an FM can be computed in O(p k).

Proof:

  ŷ(x) := w_0 + Σ_{i=1}^p w_i x_i + Σ_{i=1}^p Σ_{j>i} ⟨v_i, v_j⟩ x_i x_j
        = w_0 + Σ_{i=1}^p w_i x_i + (1/2) Σ_{f=1}^k [ ( Σ_{i=1}^p x_i v_{i,f} )² − Σ_{i=1}^p ( x_i v_{i,f} )² ]

- In the sums over i, only the non-zero x_i elements have to be summed up ⇒ O(N_z(x) k).
- (The complexity of polynomial regression is O(N_z(x)²).)
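The O(p k) reformulation of the pairwise term can be implemented with two per-factor sums. A sketch (V rows are the factor vectors v_i; dense x for simplicity, the sparse variant would iterate only over non-zero entries):

```python
import numpy as np

def fm_predict_fast(x, w0, w, V):
    """O(p k) FM prediction: linear part plus
    0.5 * sum_f [ (sum_i x_i v_{i,f})^2 - sum_i (x_i v_{i,f})^2 ]."""
    x = np.asarray(x, dtype=float)
    linear = w0 + float(w @ x)
    s = V.T @ x                      # per-factor sums, shape (k,)
    s2 = (V.T ** 2) @ (x ** 2)       # per-factor sums of squares, shape (k,)
    return linear + 0.5 * float(np.sum(s ** 2 - s2))
```

For any input, this agrees with the O(p²k) double-loop evaluation of the model equation, which is what the proof above establishes.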
Multilinearity
FMs are multilinear:
∀θ ∈ Θ = {w0 , w, V} :
ŷ (x, θ) = h(θ) (x) θ + g(θ) (x)
where g(θ) and h(θ) do not depend on the value of θ.
E.g. for second-order effects (\theta = v_{l,f}):

\hat{y}(x, v_{l,f}) := \underbrace{w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \sum_{\substack{f'=1 \\ (f' \neq f) \vee (l \notin \{i,j\})}}^{k} v_{i,f'} v_{j,f'} x_i x_j}_{g_{(v_{l,f})}(x)} + v_{l,f} \underbrace{x_l \sum_{i=1, i \neq l}^{p} v_{i,f} x_i}_{h_{(v_{l,f})}(x)}
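Multilinearity is easy to check numerically: holding everything else fixed, the prediction is an affine function of any single parameter. A minimal sketch (the sizes and the chosen parameter V[l, f] are arbitrary, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 4, 2
w0, w, V = 0.3, rng.normal(size=p), rng.normal(size=(p, k))
x = rng.normal(size=p)

def predict():
    # Plain O(p^2 k) FM equation, good enough for a tiny check.
    pair = sum(float(V[i] @ V[j]) * x[i] * x[j]
               for i in range(p) for j in range(i + 1, p))
    return w0 + float(w @ x) + pair

l, f = 2, 1                        # theta = V[l, f]

def y_at(theta):
    V[l, f] = theta
    return predict()

y0, y1 = y_at(0.0), y_at(1.0)
h, g = y1 - y0, y0                 # slope h_(theta)(x) and offset g_(theta)(x)
assert abs(y_at(2.5) - (h * 2.5 + g)) < 1e-9   # affine in V[l, f]
```

The check passes because the pairwise sum runs over j > i only, so no parameter ever appears squared.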
Learning
Using these properties, learning algorithms can be developed:

- L2-regularized regression and classification:
  - Stochastic gradient descent [Rendle, 2010]
  - Alternating least squares / coordinate descent [Rendle et al., 2011; Rendle, 2012]
  - Markov chain Monte Carlo (for Bayesian FMs) [Freudenthaler et al., 2011; Rendle, 2012]
- L2-regularized ranking:
  - Stochastic gradient descent [Rendle, 2010]

All the proposed learning algorithms have a runtime of O(k N_z(X) i), where i is the number of iterations and N_z(X) is the number of non-zero elements in the design matrix X.
Stochastic Gradient Descent (SGD)
- For each training case (x, y) ∈ S, SGD updates the FM model parameter \theta using:

  \theta' = \theta - \alpha \left( (\hat{y}(x) - y)\, h_{(\theta)}(x) + \lambda_{(\theta)}\, \theta \right)

- \alpha is the learning rate / step size.
- \lambda_{(\theta)} is the regularization value of the parameter \theta.
- SGD can easily be applied to other loss functions.
[Rendle, 2010]
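A single SGD step for squared loss can be sketched as follows, using the coefficients h_(θ)(x) that multilinearity provides as the gradients. Names are illustrative, and a single shared regularization value `lam` stands in for the per-parameter λ_(θ):

```python
import numpy as np

def sgd_step(w0, w, V, x, y, alpha=0.01, lam=0.01):
    """One SGD update on a single case (x, y) for squared loss (sketch)."""
    xv = V.T @ x                            # xv[f] = sum_i x_i v_{i,f}
    y_hat = w0 + w @ x + 0.5 * np.sum(xv ** 2 - (x[:, None] * V) ** 2)
    err = y_hat - y                         # (y_hat - y) factor of the gradient
    w0 = w0 - alpha * (err + lam * w0)      # h_{w0}(x) = 1
    w = w - alpha * (err * x + lam * w)     # h_{w_i}(x) = x_i
    # h_{v_{i,f}}(x) = x_i * xv[f] - v_{i,f} * x_i^2
    h_V = x[:, None] * xv[None, :] - V * (x ** 2)[:, None]
    V = V - alpha * (err * h_V + lam * V)
    return w0, w, V
```

For a small enough α, one such step reduces the squared error on the training case.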
Alternating Least Squares (ALS)
- Elementwise ALS updates each FM model parameter \theta using:

  \theta' = - \frac{\sum_{(x,y) \in S} \left( g_{(\theta)}(x) - y \right) h_{(\theta)}(x)}{\sum_{(x,y) \in S} h_{(\theta)}(x)^{2} + \lambda_{(\theta)}}

- Using caches of intermediate results, the runtime for updating all model parameters is O(k N_z(X)).
- The advantage of ALS compared to SGD is that no learning rate has to be specified.
- ALS can be extended to classification [Rendle, 2012].
[Rendle et al., 2011]
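The closed-form elementwise update can be written directly from the formula. In this sketch the per-case values h_(θ)(x) and g_(θ)(x) − y are passed in as precomputed arrays; the O(k N_z(X)) caching scheme of the paper is omitted:

```python
import numpy as np

def als_update(h, g_minus_y, lam):
    """theta' = - sum((g - y) * h) / (sum(h^2) + lambda), elementwise ALS."""
    return -np.sum(g_minus_y * h) / (np.sum(h ** 2) + lam)

# Sanity check on a one-parameter linear model y ~ theta * x,
# where h_(theta)(x) = x and g_(theta)(x) = 0:
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
theta = als_update(h=x, g_minus_y=-y, lam=0.0)   # least-squares slope: 2.0
```

With λ_(θ) = 0 and a single linear parameter, the update is the ordinary least-squares solution, which is why no step size appears anywhere.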
Bayesian FMs (BFM)
w , w
 v , v
Factorization Machines
0 , 0
 , 
w
v
w
j=1,...,p
wj
j=1,...,p
vj
vj
wj
x ij
yi
w0
w , w
0
0
x ij
yi
w0
i=1,...,n

v
w , w
0
i=1,...,n
0

 0 , 0
w0 ∼ N (µw0 , 1/λw0 ),
µw ∼ N (µ0 , γ0 λw ),
∀j ∈ {1, . . . , p} : wj ∼ N (µw , 1/λw ),
λw ∼ Γ(αλ , βλ ),
µv ,f ∼ N (µ0 , γ0 λv ,f ),
vj ∼ N (µv , Λ−1
v )
λv ,f ∼ Γ(αλ , βλ )
[Freudenthaler et al., 2011]
Bayesian FMs (BFM)
w , w
 v , v
wj
vj
0 , 0
 , 
w
v
wj
vj
w
j=1,...,p
j=1,...,p
x ij
yi
w0
w , w
0
0
x ij
yi
w0
i=1,...,n

v
w , w
0
i=1,...,n
0

 0 , 0
- The SGD and ALS models correspond to the left model.
- The right side is a two-level model that integrates priors.
[Freudenthaler et al., 2011]
Bayesian FMs (BFM): Inference
- For Bayesian inference, an efficient Gibbs sampler can be derived.
- The Gibbs posterior distribution for each model parameter \theta is related to the ALS update.
- Sampling all model parameters once can be done in O(k N_z(X)) as well.
- Introducing hyperpriors and integrating over the priors has the advantage over ALS that the values of the priors are found 'automatically'.
- BFMs can be extended to classification [Rendle, 2012].
[Freudenthaler et al., 2011]
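The relation to ALS can be made concrete with a sketch of the conditional draw for one parameter. Assuming a Gaussian likelihood with noise precision α (not shown on the slides) and a normal prior, the conditional posterior is normal; the form below is the standard conjugate-normal result, assumed here rather than taken from the paper:

```python
import numpy as np

def gibbs_draw(h, g_minus_y, alpha, lam, mu_prior, rng):
    """Draw one FM parameter from its normal conditional posterior (sketch).

    precision = alpha * sum(h^2) + lambda
    mean      = (alpha * sum((y - g) * h) + lambda * mu_prior) / precision
    """
    prec = alpha * np.sum(h ** 2) + lam
    mean = (-alpha * np.sum(g_minus_y * h) + lam * mu_prior) / prec
    return rng.normal(mean, 1.0 / np.sqrt(prec))
```

As λ → 0 the posterior mean is exactly the ALS update, and with a very peaked likelihood the draw concentrates on it, which is the sense in which the Gibbs posterior is "related to the ALS".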
Outline
Tensor Factorization
Time-aware Factorization Models
Factorization Machines
Problem Setting
Standard Models
Factorization Machines
Applications
Summary
Applications
FMs are especially suited for ML problems where:
- Categorical variables have large domains.
- The number of predictor variables is large.
- Interactions between predictor variables are of interest.
- Several variables are involved.
(Context-aware) Recommender Systems
- Main variables:
  - User ID (categorical)
  - Item ID (categorical)
- Additional variables:
  - time
  - mood
  - user profile
  - item meta data
  - ...
- Examples: Netflix prize, Movielens, KDDCup 2011

[Illustration: User + Song + Time + Mood]
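Such a case is fed to an FM as one real-valued feature vector: each categorical variable becomes a block of indicator features. A tiny sketch with made-up sizes (dense here for readability; real data would be stored sparsely):

```python
import numpy as np

def one_hot_row(user, item, n_users, n_items):
    """Concatenated one-hot blocks for a (user, item) case; extra
    context such as time or mood would be appended the same way."""
    x = np.zeros(n_users + n_items)
    x[user] = 1.0                  # user block occupies indices [0, n_users)
    x[n_users + item] = 1.0        # item block follows it
    return x

x = one_hot_row(user=2, item=0, n_users=3, n_items=4)
# -> [0., 0., 1., 1., 0., 0., 0.]
```

On such an encoding, the FM's factorized pairwise term recovers exactly a matrix-factorization-style user-item interaction.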
Clickthrough Prediction
- Main variables:
  - User ID
  - Query ID
  - Ad/Link ID
- Additional variables:
  - query tokens
  - user profile
  - ...
- Example: KDDCup 2012 Track 2 (FM placed 3rd/171)

[Illustration: User + Query + Ad/Link]
Student Performance Prediction
- Main variables:
  - Student ID
  - Question ID
- Additional variables:
  - question hierarchy
  - sequence of questions
  - skills required
  - ...
- Examples: KDDCup 2010, Grockit Challenge (http://www.kaggle.com/c/WhatDoYouKnow) (FM placed 1st/241)

[Illustration: Student + Question]
Link Prediction in Social Networks
- Main variables:
  - Actor A ID
  - Actor B ID
- Additional variables:
  - profiles
  - actions
  - ...
- Example: KDDCup 2012 Track 1 (FM placed 2nd/658)

[Illustration: Actor A + Actor B]
libFM Software
libFM is an implementation of FMs:
- Model: second-order FMs
- Learning/inference: SGD, ALS, MCMC
- Classification and regression
- Uses the same data format as LIBSVM, LIBLINEAR [Lin et al.], SVMlight [Joachims].
- Supports variable grouping.
- Available with source code.
[http://www.libfm.org/]
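The LIBSVM-style input can be produced directly from categorical ids: each variable gets a contiguous index block and contributes one `index:1` entry per line. The offsets and ids below are made up for illustration; `libfm_line` is not part of libFM itself:

```python
def libfm_line(y, categories, offsets):
    """One LIBSVM-format line, 'target index:value ...', with one
    active indicator per categorical variable (sketch)."""
    feats = " ".join(f"{off + c}:1" for off, c in zip(offsets, categories))
    return f"{y} {feats}"

# 3 users, 5 items, 7 weekdays -> index blocks start at 0, 3, 8
line = libfm_line(5, categories=(1, 4, 2), offsets=(0, 3, 8))
# -> "5 1:1 7:1 10:1"
```

Real-valued context (e.g. a normalized timestamp) would simply use a value other than 1 at its index.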
Outline
Tensor Factorization
Time-aware Factorization Models
Factorization Machines
Problem Setting
Standard Models
Factorization Machines
Applications
Summary
Summary
- Real-valued predictor variables can encode information from variables of other domains, e.g. categorical variables.
- Applying linear regression to large categorical domains results in too little expressiveness; applying polynomial regression results in too much expressiveness.
- Factorization Machines (FMs) are a polynomial regression model with factorized interaction parameters.
- FMs bring together the generality of standard machine learning methods with the prediction quality of factorization models.
- FMs are multilinear and can be computed efficiently.
Y. Cai, M. Zhang, D. Luo, C. Ding, and S. Chakravarthy.
Low-order tensor decompositions for social tagging recommendation.
In Proceedings of the fourth ACM international conference on Web
search and data mining, WSDM ’11, pages 695–704, New York, NY,
USA, 2011. ACM.
J. Carroll and J. Chang.
Analysis of individual differences in multidimensional scaling via an
n-way generalization of eckart-young decomposition.
Psychometrika, 35:283–319, 1970.
A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari.
Nonnegative Matrix and Tensor Factorizations: Applications to
Exploratory Multi-way Data Analysis and Blind Source Separation.
Wiley Publishing, 2009.
L. Drumond, S. Rendle, and L. Schmidt-Thieme.
Predicting RDF triples in incomplete knowledge bases with tensor factorization.
In Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC '12, pages 326–331, New York, NY, USA, 2012. ACM.
T. Franz, A. Schultz, S. Sizov, and S. Staab.
Triplerank: Ranking semantic web data by tensor decomposition.
In Proceedings of the 8th International Semantic Web Conference,
ISWC ’09, pages 213–228, Berlin, Heidelberg, 2009. Springer-Verlag.
C. Freudenthaler, L. Schmidt-Thieme, and S. Rendle.
Bayesian factorization machines.
In Workshop on Sparse Representation and Low-rank Approximation,
NIPS 2011, 2011.
R. A. Harshman.
Foundations of the PARAFAC procedure: models and conditions for an
'exploratory' multimodal factor analysis.
UCLA Working Papers in Phonetics, pages 1–84, 1970.
A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme.
Information retrieval in folksonomies: Search and ranking.
In Y. Sure and J. Domingue, editors, The Semantic Web: Research and Applications, volume 4011 of Lecture Notes in Computer Science, pages 411–426, Heidelberg, June 2006. Springer.
A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver.
Multiverse recommendation: n-dimensional tensor factorization for
context-aware collaborative filtering.
In RecSys ’10: Proceedings of the fourth ACM conference on
Recommender systems, pages 79–86, New York, NY, USA, 2010.
ACM.
T. G. Kolda and B. W. Bader.
Tensor decompositions and applications.
SIAM Review, 51(3):455–500, September 2009.
Y. Koren.
Factorization meets the neighborhood: a multifaceted collaborative
filtering model.
In KDD ’08: Proceeding of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 426–434,
New York, NY, USA, 2008. ACM.
Y. Koren.
Collaborative filtering with temporal dynamics.
In KDD ’09: Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 447–456,
New York, NY, USA, 2009. ACM.
L. D. Lathauwer, B. D. Moor, and J. Vandewalle.
A multilinear singular value decomposition.
SIAM J. Matrix Anal. Appl., 21(4):1253–1278, 2000.
S. Rendle.
Factorization machines.
In Proceedings of the 10th IEEE International Conference on Data
Mining. IEEE Computer Society, 2010.
S. Rendle.
Factorization machines with libFM.
ACM Trans. Intell. Syst. Technol., 3(3):57:1–57:22, May 2012.
S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme.
Factorizing personalized Markov chains for next-basket recommendation.
In WWW '10: Proceedings of the 19th International Conference on World Wide Web, pages 811–820, New York, NY, USA, 2010. ACM.
S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme.
Fast context-aware recommendations with factorization machines.
In Proceedings of the 34th ACM SIGIR Conference on Research and
Development in Information Retrieval. ACM, 2011.
S. Rendle and L. Schmidt-Thieme.
Pairwise interaction tensor factorization for personalized tag
recommendation.
In WSDM ’10: Proceedings of the third ACM international
conference on Web search and data mining, pages 81–90, New York,
NY, USA, 2010. ACM.
J.-T. Sun, H.-J. Zeng, H. Liu, Y. Lu, and Z. Chen.
CubeSVD: a novel approach to personalized web search.
In WWW ’05: Proceedings of the 14th international conference on
World Wide Web, pages 382–390, New York, NY, USA, 2005. ACM.
L. Tucker.
Some mathematical notes on three-mode factor analysis.
Psychometrika, 31:279–311, 1966.
L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. G. Carbonell.
Temporal collaborative filtering with bayesian probabilistic tensor
factorization.
In Proceedings of SIAM Data Mining, 2010.
A. Zimdars, D. M. Chickering, and C. Meek.
Using temporal data for making recommendations.
In UAI ’01: Proceedings of the 17th Conference in Uncertainty in
Artificial Intelligence, pages 580–588, San Francisco, CA, USA,
2001. Morgan Kaufmann Publishers Inc.