Web Mining – a predictive approach

Prof. Michalis Vazirgiannis
DB-NET (Research Team on Data & Web Mining), Dept. of Informatics,
Athens University of Economics & Business, GREECE
WWW: http://www.db-net.aueb.gr/
&
Digiteo Chair @ LIX, Ecole Polytechnique, Paris, France
Tutorial Outline
• Introduction, Motivation
 – Web Data nature (graph, time evolving)
 – Examples of data sets (SNAP – our data set)
 – Aspects: content, structure and behavior mining
• Data Preprocessing
 – Content preprocessing (text to matrix)
 – Dimensionality reduction algorithms
 – Feature selection fundamentals
• Ranking Web data
 – Graph based ranking – PageRank, HITS
 – Text based ranking
• Unsupervised mining algorithms and their foundations:
 – Graph based clustering
• Predictive learning
 – Markov Models
 – Regression models, variants of regression with probabilistic clustering and Principal Components Analysis (PCA)
 – EM based approach with Bayesian learning
• Top-k similarity measures
• Predictive learning case studies
 – K-Sim, OSim, NDCG, Spearman correlation
Web Mining - Introduction
• The Web looks like a “bowtie” [Kumar+1999]
• IN, SCC, OUT, ‘tendrils’
• Disconnected components
• Power laws apply for count vs. (in-/out-links, user clicks)
• The plot is linear in log-log scale: freq ∝ degree^(−2.15)
Web Mining - I
The three aspects
• Web Content/ Structure / Usage Mining
Web content Mining
• Metadata: Keywords, Phrases, Filename, URL, Size etc.
• Feature extraction techniques from web contents
• Web document representation models: Vector space
model, ontology based ones
• Web document clustering: Similarity measures, Web
document clustering algorithms
• Web document classification techniques
• Content based ranking web documents – for queries
Web Mining - II
Web Structure Mining
• Link analysis theory (Hubs, Authorities, Co-citations, …)
• Mining web communities
• The role of hyperlinks in web crawling
• Ranking based on link analysis: HITS, PageRank
Web Usage Mining
• Web usage logs – click streams
• Data Preprocessing: cleaning, session identification, User
identification (anonymous user profiles vs. registered
users)
• Pattern discovery: Web usage-mining algorithms:
Classification, Clustering, Association Rules, Sequential
pattern discovery
Web Mining - III
• Graph of web pages (mainly text)
• Huge: Google (~15 bn pages < 2% of the real web
volume)
• Rapidly evolving: each week +8% pages, 25% of the links
change!!
• Web search is a dominant activity! (Google as verb in
dictionaries…)
• Ranking is an essential add-on towards meaningful
results
• Further knowledge extraction and web organization is
necessary
Web Mining – The value of predictions [1]
Web search patterns are good predictors:
(A) for box-office movie revenue
(D) correlation between predicted and actual outcomes when predictions are based on query data t weeks prior to the event
[Figure panels (A)–(D) omitted]
[1] Predicting consumer behavior with Web search. Sharad Goel, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, Duncan J. Watts. Microeconomics and Social Systems, Yahoo! Research. www.pnas.org/cgi/doi/10.1073/pnas.1005962107
Content preprocessing
[Figure: an n × p data matrix, e.g. rows of measurements such as (2.3, −1.5, …, −1.3)]
• Rows = objects
• Columns = measurements on objects
 – Represent each row as a p-dimensional vector, where p is the dimensionality
• In effect, we embed our objects in a p-dimensional vector space
• Often useful, but not always appropriate
• Both n and p can be very large in certain data mining applications
Sparse Matrix (Text) Data
[Figure: sparse document–term matrix, ca. 500 text documents × 200 word IDs; only a small fraction of the entries are non-zero]
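To make the text-to-matrix step concrete, here is a minimal Python sketch; the toy documents and whitespace tokenization are invented for illustration:

    import numpy as np

    docs = ["web text mining",
            "data mining text clustering",
            "frequent itemset mining web"]

    # Build the vocabulary: one column per distinct term.
    vocab = sorted({t for d in docs for t in d.split()})
    col = {t: j for j, t in enumerate(vocab)}

    # n x p matrix: one row per document, one column per term.
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for t in d.split():
            X[i, col[t]] += 1

    print(vocab)
    print(X)   # mostly zeros: document-term matrices are sparse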
Dim. Reduction – Linear Algorithms
• Principal Components Analysis (PCA)
• Singular Value Decomposition (SVD)
• Multidimensional Scaling (MDS)
• Latent Semantic Indexing (LSI)
Linear Algebra
Basic Principles
Dim. Reduction – Eigenvectors
Let A be an n×n (square) matrix:
• eigenvalues λ: |A − λI| = 0
• eigenvectors x: Ax = λx
• Matrix rank: # of linearly independent rows or columns
• A real symmetric matrix A (n×n) can be expressed as A = UΛUᵀ
 – U's columns are A's eigenvectors
 – Λ's diagonal contains A's eigenvalues
• A = UΛUᵀ = λ₁x₁x₁ᵀ + λ₂x₂x₂ᵀ + … + λₙxₙxₙᵀ
 – xᵢxᵢᵀ represents the projection onto xᵢ (λᵢ eigenvalue, xᵢ eigenvector)
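A short NumPy sketch of this decomposition (the symmetric matrix is an arbitrary example) checking A = UΛUᵀ and the rank-one expansion:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 3.0]])   # a real symmetric matrix
    lam, U = np.linalg.eigh(A)               # eigenvalues + orthonormal eigenvectors

    # A = U diag(lam) U^T
    assert np.allclose(A, U @ np.diag(lam) @ U.T)

    # Equivalently, the sum of rank-one projections lam_i * x_i x_i^T
    B = sum(l * np.outer(x, x) for l, x in zip(lam, U.T))
    assert np.allclose(A, B)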
Linear Independence
• Linear independence: given vectors x₁, x₂, …, xₙ and the equation λ₁x₁ + λ₂x₂ + … + λₙxₙ = 0: if the only solution is λᵢ = 0 for all i, the vectors are linearly independent; if there are more solutions, the vectors are linearly dependent.
• Linear dependence: one of the vectors is a linear combination of one or more of the remaining vectors.
• Basis of a vector space V: a set of linearly independent vectors from which all elements of V are produced as linear combinations.
 – e.g. B = {e₁, e₂, …, eₙ} with eᵢ = (0, …, 0, 1, 0, …, 0) is a basis of Rⁿ.
 – The same holds for eᵢ = (1, 1, …, 1, 0, …, 0).
• Dimension of a vector space V: the cardinality of its basis, dim V = n.
 – Note: a space V may have more than one basis; however, all bases have the same cardinality.
Singular Value Decomposition (SVD)
• Decomposition into eigenvalues and eigenvectors applies to square matrices. Data matrices are usually non-square; in this case we apply the Singular Value Decomposition.
• Spectral theorem: a normal matrix M can be unitarily diagonalized using a basis of eigenvectors.
• Singular value: a non-negative real number σ is a singular value for M if there exist unit-length vectors u in Kᵐ and v in Kⁿ such that
 Mv = σu and Mᵀu = σv
 u and v are called left-singular and right-singular vectors for σ, respectively.
• Singular value decomposition: M = UΣVᵀ
 – the diagonal entries of Σ are the singular values of M
 – the columns of U / V are left-/right-singular vectors for the corresponding singular values
• An m × n matrix M has at least one and at most p = min(m, n) distinct singular values.
Singular Value Decomposition (SVD) - I
Relation to eigenvectors/eigenvalues
– Let A be an m×n matrix; it can be expressed as A = ULVᵀ
– U: m×m, its columns are the eigenvectors of AAᵀ
– U, V define orthogonal bases: UUᵀ = VVᵀ = I
– L: m×n, contains A's singular values (the square roots of the AAᵀ eigenvalues)
– V: n×n, its columns are the eigenvectors of AᵀA
Singular Value Decomposition (SVD) - II
Matrix approximation
• The best rank-r approximation M' of a matrix M (minimizing the Frobenius norm) is
 M' = UΣ'Vᵀ (Σ' keeps the r largest singular values from Σ)
• Thus ||M − M'||F is minimal over all rank-r matrices
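A minimal NumPy sketch of the rank-r approximation (the random matrix and r = 2 are arbitrary choices):

    import numpy as np

    M = np.random.rand(6, 4)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    r = 2
    Mr = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # keep the r largest singular values

    # By the Eckart-Young result, no rank-r matrix is closer to M in Frobenius norm.
    print(np.linalg.norm(M - Mr, "fro"))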
Multidimensional Scaling (MDS)
• Initially we place the vectors at random positions
• Iteratively reposition them in order to minimize the Stress:
 – Stress = Σᵢⱼ (f(dᵢⱼ) − d'ᵢⱼ)² / Σᵢⱼ f(dᵢⱼ)² : the total error of the distance mapping
 – Complexity O(N³) (N: number of vectors)
• Result: a mapping of the data into a lower-dimensional space
• Usually implemented by:
 – eigendecomposition of the inner-product matrix and projection on the k eigenvectors that correspond to the k largest eigenvalues
Multidimensional Scaling
• Data is given as rows of X
 – C = XXᵀ (inner products of xᵢ with xⱼ)
 – Eigendecomposition C = ULU⁻¹
 – Eventually X' = U_k L_k^(1/2), where k is the projection dimension
[Worked example omitted: SVD of C = XXᵀ and the 2-D projection X' = U₂L₂^(1/2)]
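The eigendecomposition route to MDS as a NumPy sketch (the data matrix is arbitrary, projected to k = 2; the max(·, 0) guard handles tiny negative eigenvalues from round-off):

    import numpy as np

    X = np.random.rand(10, 5)
    X = X - X.mean(axis=0)              # center the data

    C = X @ X.T                         # inner-product (Gram) matrix
    lam, U = np.linalg.eigh(C)          # eigenvalues in ascending order

    k = 2
    idx = np.argsort(lam)[::-1][:k]     # the k largest eigenvalues
    Xk = U[:, idx] * np.sqrt(np.maximum(lam[idx], 0))   # X' = U_k L_k^(1/2)
    print(Xk.shape)                     # (10, 2): points embedded in 2-D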
Principal Components Analysis
• The main concept behind Principal Components Analysis is dimensionality reduction while preserving as much of the data's variance as possible.
• Variance: V(X) = σ² = E[(X − µ)²]
• For N objects with known mean value m, it is approximated as: σ̂² = (1/N) Σᵢ (xᵢ − m)²
• In a sample of N objects with unknown mean value: s² = (1/(N−1)) Σᵢ (xᵢ − x̄)²
Dimensionality reduction based on variance maintenance
[Figure: 2-D point cloud with the axis maximizing the variance]
Principal Components Analysis
• "A linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on …" (Wikipedia)
• Let the data be n-dimensional, with dimensions x₁, …, xₙ
• The objective is to project the data to k dimensions via some linear decomposition:
 y₁ = a₁x₁ + … + aₙxₙ
 …
 y_k = b₁x₁ + … + bₙxₙ
• The projection should maintain the variance of the original data
Covariance Matrix
• Let X = (X₁, …, Xₙ) be the data matrix, where the Xᵢ are variable vectors
• The covariance matrix Σ is the matrix whose (i, j) entry is the covariance cov(Xᵢ, Xⱼ) = E[(Xᵢ − µᵢ)(Xⱼ − µⱼ)]
• Also: cov(X) = XXᵀ (for mean-centered data)
Principal Components Analysis (PCA)
• The basic idea of PCA is the maximization of the variance.
 – Variance depicts the deviation of a random variable from the mean:
 – σ² = Σᵢ₌₁ⁿ (xᵢ − µ)² / n
• Method:
 – Assumption: the data is described by p variables and contained as rows in a matrix A (n×p)
 – We subtract the column means: A' = (A − M)
 – Calculate the covariance matrix W = A'ᵀA'
Principal Components Analysis (PCA) – (2)
• Calculation of the covariance matrix W:
 – a matrix whose cell (i, j) holds the covariance of Xᵢ, Xⱼ
 – covariance depicts the dependency between variables:
  • if cov(Xᵢ, Xⱼ) = 0, then Xᵢ, Xⱼ are uncorrelated (independent)
  • if cov(Xᵢ, Xⱼ) < 0, then Xᵢ, Xⱼ are negatively dependent (when one increases the other decreases and vice versa)
  • if cov(Xᵢ, Xⱼ) > 0, then Xᵢ, Xⱼ are positively dependent
 – Simply: cov(Xᵢ, Xⱼ) = Σ_{f=1}^n (X_if − X̄ᵢ)(X_jf − X̄ⱼ) / (n − 1)
• Calculate the eigenvalues and eigenvectors of W (X, D)
• Retain the k largest eigenvalues and corresponding eigenvectors
 – k: input parameter, chosen so that the retained variance Σ_{j=1}^k λⱼ / Σ_{j=1}^p λⱼ exceeds a threshold
• Projection: A'X_k
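The PCA recipe above as a short NumPy sketch (random data and k = 2 are illustrative choices):

    import numpy as np

    A = np.random.rand(100, 5)           # n objects described by p = 5 variables
    Ac = A - A.mean(axis=0)              # subtract the column means

    W = np.cov(Ac, rowvar=False)         # p x p covariance matrix
    lam, X = np.linalg.eigh(W)           # eigenvalues in ascending order

    order = np.argsort(lam)[::-1]        # sort descending
    lam, X = lam[order], X[:, order]

    k = 2                                # keep the top-k components
    print("retained variance:", lam[:k].sum() / lam.sum())
    proj = Ac @ X[:, :k]                 # the projection A'X_k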
Principal Components Analysis – worked example
[Worked example: X holds five 4-dimensional points x1…x5 with column means (3, 4, 5, 6). After subtracting the means (X' with entries xᵢⱼ − mⱼ), the covariance matrix Cov = Σ_{f=1}^5 (x_if − mᵢ)(x_jf − mⱼ)/4 is computed and decomposed as Cov = ULUᵀ, with eigenvalues L = diag(4.38, 2.89, 1.02, …). Keeping k = 2 eigenvectors U_k, the data is projected as X'U_k.]
PCA, example
[Figure: data cloud with the axes corresponding to the first and second principal components]
PCA Applications
• It is a dimensionality reduction method
• Nominal complexity O(np² + p³)
 – n: number of data points
 – p: number of initial space dimensions
• The new space sufficiently maintains the data variance
• Preprocessing step preceding the application of data mining algorithms (such as clustering)
• Data visualization
• Noise reduction
Latent Structure in documents
• Documents are represented using the Vector Space Model, built from the keywords they contain.
• In many cases baseline keyword matching performs poorly: it is not able to detect synonyms.
• Therefore document clustering is problematic.
• Example of keyword matching with the query "IDF in computer-based information look-up":
[Table: term–document incidence of the terms access, document, retrieval, theory, database, information, indexing, computer over Doc1–Doc3; exact keyword matching misses relevant documents]
Indexing by Latent Semantic Analysis. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Journal of the American Society for Information Science, 1990.
Latent Semantic Indexing (LSI) - I
• Finding similarity with exact keyword matching is problematic.
• Using SVD we process the initial document-term matrix.
• Then we choose the k largest singular values. The resulting matrix has rank k and is closer to the original one (in the Frobenius norm) than any other rank-k matrix.
Latent Semantic Indexing (LSI) - II
• The initial matrix is SVD-decomposed as A = ULVᵀ
• Choosing the top-k singular values from L we have A_k = U_kL_kV_kᵀ, where:
 – L_k is the square k×k matrix containing the top-k singular values of the diagonal of L,
 – U_k is the m×k matrix containing the first k columns of U (left singular vectors),
 – V_kᵀ is the k×n matrix containing the first k rows of Vᵀ (right singular vectors).
• Typical values: k ≈ 200–300 (empirically chosen, based on experiments in the bibliography)
LSI capabilities
• Term-to-term similarity: A_kA_kᵀ = U_kL_k²U_kᵀ
• Document-to-document similarity: A_kᵀA_k = V_kL_k²V_kᵀ
• Term-to-document similarity (an element of the transformed term–document matrix)
• Extended query capabilities, transforming the initial query q to q_n:
 q_n = qᵀU_kL_k⁻¹
• Thus q_n can be regarded as a row of matrix V_k
LSI – an example [1]
LSI application on a term – document matrix
C1: Human machine Interface for Lab ABC computer application
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurements
M1: The generation of random, binary unordered trees
M2: The intersection graph of path in trees
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
• The dataset consists of 2 classes: the 1st on "human–computer interaction" (c1–c5), the 2nd related to graphs (m1–m4). After feature extraction the titles are represented as shown on the next slide.
[1] Indexing by Latent Semantic Analysis. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Journal of the American Society for Information Science, 1990.
LSI – an example

 term      | C1 C2 C3 C4 C5 M1 M2 M3 M4
 human     |  1  0  0  1  0  0  0  0  0
 interface |  1  0  1  0  0  0  0  0  0
 computer  |  1  1  0  0  0  0  0  0  0
 user      |  0  1  1  0  1  0  0  0  0
 system    |  0  1  1  2  0  0  0  0  0
 response  |  0  1  0  0  1  0  0  0  0
 time      |  0  1  0  0  1  0  0  0  0
 EPS       |  0  0  1  1  0  0  0  0  0
 survey    |  0  1  0  0  0  0  0  0  1
 trees     |  0  0  0  0  0  1  1  1  0
 graph     |  0  0  0  0  0  0  1  1  1
 minors    |  0  0  0  0  0  0  0  1  1
LSI – an example
A = ULVᵀ, where A is the 12×9 term–document matrix above, U is the 12×9 matrix of left singular vectors, L = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36) holds the singular values, and V is the 9×9 matrix of right singular vectors. [Full numeric factors omitted; the rank-2 truncation used in the example follows.]
LSI – an example
Choosing the 2 largest singular values we have:

 Uk =
  0.22 −0.11
  0.20 −0.07
  0.24  0.04
  0.40  0.06
  0.64 −0.17
  0.27  0.11
  0.27  0.11
  0.30 −0.14
  0.21  0.27
  0.01  0.49
  0.04  0.62
  0.03  0.45

 Lk = diag(3.34, 2.54)

 Vkᵀ =
   0.20  0.61  0.46  0.54  0.28  0.00  0.02  0.02  0.08
  −0.06  0.17 −0.13 −0.23  0.11  0.19  0.44  0.62  0.53
LSI (2 singular values)
Ak = UkLkVkᵀ =

 term      |   C1    C2    C3    C4    C5    M1    M2    M3    M4
 human     |  0.16  0.40  0.38  0.47  0.18 −0.05 −0.12 −0.16 −0.09
 interface |  0.14  0.37  0.33  0.40  0.16 −0.03 −0.07 −0.10 −0.04
 computer  |  0.15  0.51  0.36  0.41  0.24  0.02  0.06  0.09  0.12
 user      |  0.26  0.84  0.61  0.70  0.39  0.03  0.08  0.12  0.19
 system    |  0.45  1.23  1.05  1.27  0.56 −0.07 −0.15 −0.21 −0.05
 response  |  0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
 time      |  0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
 EPS       |  0.22  0.55  0.51  0.63  0.24 −0.07 −0.14 −0.20 −0.11
 survey    |  0.10  0.53  0.23  0.21  0.27  0.14  0.31  0.44  0.42
 trees     | −0.06  0.23 −0.14 −0.27  0.14  0.24  0.55  0.77  0.66
 graph     | −0.06  0.34 −0.15 −0.30  0.20  0.31  0.69  0.98  0.85
 minors    | −0.04  0.25 −0.10 −0.21  0.15  0.22  0.50  0.71  0.62
LSI Example
• The query "human computer interaction" retrieves documents c1, c2, c4 but not c3 and c5.
• If we submit the same query (via the transformation shown before) against the transformed matrix, we retrieve (using cosine similarity) all of c1–c5, even though c3 and c5 share no keyword with the query.
• According to the transformation for queries we have:
Query transformation
q = (1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)ᵀ
(1 for the query terms "human" and "computer", 0 for the remaining ten index terms)
Query transformation
qᵀU_k = (0.22, −0.11) + (0.24, 0.04) = (0.46, −0.07)
L_k⁻¹ = diag(1/3.34, 1/2.54) ≈ diag(0.30, 0.39)
q_n = qᵀU_kL_k⁻¹ ≈ (0.138, −0.0273)
Query transformation
Scaling by L_k gives the document coordinates V_kL_k and the query coordinates q_nL_k:

  V_k row          V_kL_k row
 ( 0.20, −0.06) → ( 0.67, −0.15)   C1
 ( 0.61,  0.17) → ( 2.04,  0.43)   C2
 ( 0.46, −0.13) → ( 1.54, −0.33)   C3
 ( 0.54, −0.23) → ( 1.80, −0.58)   C4
 ( 0.28,  0.11) → ( 0.94,  0.28)   C5
 ( 0.00,  0.19) → ( 0.00,  0.48)   M1
 ( 0.01,  0.44) → ( 0.03,  1.12)   M2
 ( 0.02,  0.62) → ( 0.07,  1.57)   M3
 ( 0.08,  0.53) → ( 0.27,  1.35)   M4

q_nL_k = (0.138, −0.0273) · diag(3.34, 2.54) = (0.46, −0.069)
Query document similarity in LSI space
[Figure: query q and documents C1–C5, M1–M4 plotted in the 2-D LSI space]
Cosine similarity of the query to each document:
 C1: 0.99, C2: 0.94, C3: 0.99, C4: 0.99, C5: 0.90, M1: −0.14, M2: −0.13, M3: −0.11, M4: 0.05
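The whole worked example can be reproduced in a few lines of NumPy; a sketch (the matrix is the Deerwester term–document matrix above; signs of the SVD factors may flip, but the cosine values are unaffected):

    import numpy as np

    A = np.array([  # rows: human, interface, computer, user, system, response,
                    #       time, EPS, survey, trees, graph, minors; cols: c1-c5, m1-m4
        [1,0,0,1,0,0,0,0,0], [1,0,1,0,0,0,0,0,0], [1,1,0,0,0,0,0,0,0],
        [0,1,1,0,1,0,0,0,0], [0,1,1,2,0,0,0,0,0], [0,1,0,0,1,0,0,0,0],
        [0,1,0,0,1,0,0,0,0], [0,0,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,1],
        [0,0,0,0,0,1,1,1,0], [0,0,0,0,0,0,1,1,1], [0,0,0,0,0,0,0,1,1]], float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, Lk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    q = np.zeros(12); q[0] = q[2] = 1      # query "human computer interaction"
    qn = q @ Uk @ np.linalg.inv(Lk)        # fold the query into the LSI space

    docs = Vk @ Lk                         # document coordinates V_k L_k
    qs = qn @ Lk                           # query coordinates q_n L_k
    cos = docs @ qs / (np.linalg.norm(docs, axis=1) * np.linalg.norm(qs))
    print(np.round(cos, 2))                # c1-c5 high, m1-m4 near zero or negative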
Why link analysis?
• The web is not just a collection of documents
– its hyperlinks are important!
• A link from page A to page B may indicate:
– A is related to B, or
– A is recommending, citing or endorsing B
• Links are either
 – referential: click here and get back home, or
 – informational: click here to get more detail
HITS - Kleinberg’s Algorithm
• HITS – Hypertext Induced Topic Selection
• For each vertex v ∈ V in a subgraph of interest:
 – a(v): the authority of v
 – h(v): the hubness of v
• A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less-important sites.
• Hubness shows the importance of a site: a good hub is a site that links to many authoritative sites.
Authority and Hubness
[Figure: node 1 with in-links from nodes 2, 3, 4 and out-links to nodes 5, 6, 7]
a(1) = h(2) + h(3) + h(4)    h(1) = a(5) + a(6) + a(7)
Authority and Hubness Convergence
• Recursive dependency:
 a(v) = Σ_{w ∈ pa[v]} h(w)
 h(v) = Σ_{w ∈ ch[v]} a(w)
• Using linear algebra, we can prove that a(v) and h(v) converge
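A minimal power-iteration sketch of the recursion (Python/NumPy; the adjacency matrix is an arbitrary toy graph):

    import numpy as np

    # adj[i, j] = 1 if page i links to page j (toy graph)
    adj = np.array([[0,1,1,0], [0,0,1,0], [1,0,0,1], [0,0,1,0]], float)

    a, h = np.ones(4), np.ones(4)
    for _ in range(50):
        a = adj.T @ h               # a(v) = sum of h(w) over parents w of v
        h = adj @ a                 # h(v) = sum of a(w) over children w of v
        a /= np.linalg.norm(a)      # normalize so the iteration converges
        h /= np.linalg.norm(h)

    print(np.round(a, 3), np.round(h, 3))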
HITS Example
Find a base subgraph:
• Start with a root set R = {1, 2, 3, 4}: nodes relevant to the topic
• Expand the root set R to include all the children and a fixed number of parents of nodes in R
→ a new set S (base subgraph)
HITS Example Results
[Figure: authority and hubness weights for nodes 1–15]
Markov Chain
A Markov chain has two components:
• A network structure much like a web site, where each node is called a state.
• A transition probability of traversing a link given that the chain is in a state.
 – For each state the sum of outgoing probabilities is one.
A sequence of steps through the chain is called a random walk.
[Figure: example chain with states a1, b1–b4, c1, d1–d2, e1–e2]
PageRank - Motivation
• A link from page A to page B is a vote of the author of A for B, or a recommendation of the page.
• The number of incoming links to a page is a measure of the importance and authority of the page.
• Also take into account the quality of the recommendation: a page is more important if the sources of its incoming links are important.
The Random Surfer
• Assume the web is a Markov chain.
• Surfers randomly click on links, where the probability
of an outlink from page A is 1/m, where m is the
number of outlinks from A.
• The surfer occasionally gets bored and is teleported
to another web page, say B, where B is equally likely
to be any page.
• Using the theory of Markov chains it can be shown
that if the surfer follows links for long enough, the
PageRank of a web page is the probability that the
surfer will visit that page.
Dangling Pages
[Figure: pages A and B without outlinks]
• Problem: A and B have no outlinks.
Solution: Assume A and B have links to all web pages
with equal probability.
Rank Sink
• Problem: Pages in a loop accumulate rank but do not distribute it.
• Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.
PageRank
• Introduced by Page et al (1998)
– The weight is assigned by the rank of parents
• Difference with HITS
– HITS takes Hubness & Authority weights
– The page rank is proportional to its parents’ rank, but
inversely proportional to its parents’ outdegree
PageRank (PR) – Definition
PR(P) = d/N + (1 − d) · ( PR(P₁)/O(P₁) + PR(P₂)/O(P₂) + … + PR(Pₙ)/O(Pₙ) )

• P is a web page
• Pᵢ are the web pages that have a link to P
• O(Pᵢ) is the number of outlinks from Pᵢ
• d is the teleportation probability
• N is the size of the web
Example Web Graph
[Figure: three pages with links A → B, A → C, B → C and C → A, as used in the equations below]
Iteratively Computing PageRank
• Replace d/N in the definition of PR(P) by d, so PR will take values between 1 and N.
• d is normally set to 0.15, but for simplicity let's set it to 0.5.
• Set the initial PR values to 1.
• Solve the following equations iteratively:
 PR(A) = 0.15/3 + 0.85·PR(C)
 PR(B) = 0.15/3 + 0.85·(PR(A)/2)
 PR(C) = 0.15/3 + 0.85·(PR(A)/2 + PR(B))
Example Computation of PR

 Iteration |   PR(A)    |   PR(B)    |   PR(C)
     0     | 1          | 1          | 1
     1     | 0.75       | 1.125      | 1
     2     | 0.765625   | 1.1484375  | 1.0625
     3     | 0.76855469 | 1.15283203 | 1.07421875
     4     | 0.76910400 | 1.15365601 | 1.07641602
     5     | 0.76920700 | 1.15381050 | 1.07682800
     …     | …          | …          | …
    12     | 0.76923077 | 1.15384615 | 1.07692308
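A sketch of the iteration in Python, using the three equations exactly as printed above (note the fixed point reached depends on the chosen teleportation constant):

    # PR equations for the 3-page example (Gauss-Seidel style updates)
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}
    for _ in range(20):
        pr["A"] = 0.15 / 3 + 0.85 * pr["C"]
        pr["B"] = 0.15 / 3 + 0.85 * (pr["A"] / 2)
        pr["C"] = 0.15 / 3 + 0.85 * (pr["A"] / 2 + pr["B"])
    print(pr)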
Matrix Notation
Adjacency matrix A = [matrix figure omitted]
* http://www.kusatro.kyoto-u.com
Matrix Notation
• Matrix notation: r = αBr = Mr
 – α: eigenvalue
 – r: eigenvector of B
 (cf. Ax = λx, |A − λI| = 0)
• B = [matrix figure omitted]
• Finding PageRank = finding the eigenvector of B with the associated eigenvalue α
Matrix Notation
• PageRank: the eigenvector corresponding to the maximum eigenvalue
• B = PDP⁻¹
 – D: diagonal matrix of eigenvalues {λ₁, …, λₙ}
 – P: regular matrix that consists of the eigenvectors
• PageRank = r₁ (the principal eigenvector), normalized
Matrix Notation
• Confirming the result: rank follows the number of inlinks from highly ranked pages (harder to explain for nodes 5 & 2 and 6 & 7)
• Interesting topic: how do you get your homepage highly ranked?
PageRank Algorithm
[Algorithm figure omitted – see Page et al., 1998]
Pagerank computation methods
• G = (V, E); xᵢ: PageRank of page i; dⱼ: out-degree of page j
• In matrix form: x = Ax, where A_ij = 1/dⱼ if page j links to page i, and 0 otherwise
• The power method (recursive) converges to the principal eigenvector (the probability of visiting the page)
• If many pages have no inlinks, the eigenvector is mostly 0's; thus a correction (e.g. teleportation) is needed
____________________________________________________________________________
“PageRank Computation and the Structure of the Web: Experiments and Algorithms”, Arvind Arasu, Jasmine Novak, Andrew Tomkins & John Tomlin
Pagerank computation methods
• Jacobi iteration
• Gauss–Seidel method (improved convergence speed)
[Iteration formulas omitted]
The Largest Matrix Computation
in the World
• Computing PageRank can be done via matrix multiplication, where the matrix has several billion rows and columns.
• The matrix is sparse, as the average number of outlinks is between 7 and 8.
• Setting d = 0.15 or below requires at most 100 iterations to convergence.
• Researchers are still trying to speed up the computation.
Text based ranking - Overview
• (Typical) Search Process
• Boolean Retrieval Model
• Vector Space Model
• BM25 Ranking Function
• Bayesian and Combinatorial Models
• Ranking with Document Fields
• Ranking with Proximity
• Learning to Rank
(Typical) Search process
• A User performs a task and has an information need, from
which a query is formulated.
• A search engine typically
processes textual queries and
retrieves a (ranked) list
of documents
• The user obtains the results
and may reformulate
the query
Boolean Retrieval Model
• Documents are represented as sets of terms
• Queries are Boolean expressions with precise semantics
• Retrieval is precise and the answer is a set of documents
– Documents are not ranked
– Simplest retrieval model
• Example:
D1 = {web, text, mining}
D2 = {data, mining, text, clustering}
D3 = {frequent, itemset, mining, web}
• For query q = web ∧ mining answer is { D1, D3 }
• For query q = text ∧ ( clustering ∨ web) answer is { D1, D2 }
• Extended Boolean Retrieval Model (Salton et al., 1983)
• Ranks documents according to similarity to query
• Matches documents that satisfy the Boolean query partially
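The Boolean model in a few lines of Python (a sketch over the three documents from the slide, using set operations):

    D1 = {"web", "text", "mining"}
    D2 = {"data", "mining", "text", "clustering"}
    D3 = {"frequent", "itemset", "mining", "web"}
    docs = {"D1": D1, "D2": D2, "D3": D3}

    # q = web AND mining
    print([n for n, d in docs.items() if {"web", "mining"} <= d])      # ['D1', 'D3']

    # q = text AND (clustering OR web)
    print([n for n, d in docs.items()
           if "text" in d and ({"clustering", "web"} & d)])            # ['D1', 'D2']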
Vector Space Model
• Document d and query q are represented as k-dimensional vectors d = (w₁,d, …, w_k,d) and q = (w₁,q, …, w_k,q)
 – Each dimension corresponds to a term from the collection vocabulary
 – Independence between terms
 – w_i,q is the weight of the i-th vocabulary word in q
 – (Salton & McGill, 1983; Baeza-Yates & Ribeiro-Neto, 1999)
• The degree of similarity between d and q is the cosine of the angle between the two vectors:
 sim(d, q) = (d · q) / (‖d‖ ‖q‖) = Σ_{t=1}^k w_{t,d} w_{t,q} / ( √(Σ_{t=1}^k w_{t,d}²) · √(Σ_{t=1}^k w_{t,q}²) )
• Generalized Vector Space Model (Wong et al., 1985)
 – Incorporates dependencies between terms
 – Using semantics (Tsatsaronis & Panagiotopoulou, 2009)
Term weighting in VSM
• Term weighting with tf-idf (and variations): w_{t,d} = tf_{t,d} × idf_t
• tf models the importance of a term in a document:
 tf_{t,d} = f_{t,d}  or  tf_{t,d} = f_{t,d} / max_s f_{s,d}
 – f_{t,d} is the frequency of term t in document d
• idf models the importance of a term in the document collection:
 idf_t = log( N / n_t )  or  idf_t = log( (N − n_t + 0.5) / (n_t + 0.5) )
 – N is the total number of documents, n_t is the document frequency of term t
 – The logarithm base is not important
 – Theoretically derived from probabilistic IR and information theory (Robertson, 2004; Papineni, 2001; Metzler, 2008)
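A sketch of tf-idf weighting plus cosine ranking in pure Python (toy documents; as noted above, the log base does not matter):

    import math
    from collections import Counter

    docs = [d.split() for d in
            ["web text mining",
             "data mining text clustering",
             "frequent itemset mining web"]]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))        # document frequency n_t

    def tfidf(doc):
        tf = Counter(doc)
        return {t: f * math.log(N / df[t]) for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = tfidf("text mining".split())
    for d in docs:
        print(d, round(cosine(q, tfidf(d)), 3))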
BM25 Ranking Function
• tf-idf-like ranking function assuming a bag-of-words document representation (Robertson et al., 1994):
 score(d, q) = Σ_{t ∈ d∩q} idf_t · tf_{t,d} · (k₁ + 1) / ( tf_{t,d} + k₁ · (1 − b + b · len_d / avglen) )
 where idf_t = log( (N − n_t + 0.5) / (n_t + 0.5) )
 – len_d is the length of document d
 – avglen is the average document length in the collection
• The values of parameters k₁ and b depend on the collection/task:
 – k₁ controls term frequency saturation
 – b controls length normalization
 – Default values: k₁ = 1.2 and b = 0.75
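The BM25 formula as a direct Python transcription (a sketch on toy token lists, with k₁ and b at their default values; on such a tiny collection the idf term can go negative, which real systems usually clamp):

    import math
    from collections import Counter

    docs = [d.split() for d in
            ["web text mining",
             "data mining text clustering",
             "frequent itemset mining web"]]
    N = len(docs)
    avglen = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))

    def bm25(doc, query, k1=1.2, b=0.75):
        tf = Counter(doc)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avglen)
            score += idf * tf[t] * (k1 + 1) / norm
        return score

    for d in docs:
        print(d, round(bm25(d, ["web", "mining"]), 3))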
Bayesian & Combinatorial Models
• Language Models for IR, based on a Bayesian approach (Ponte & Croft, 1998; Hiemstra, 1998; Lafferty & Zhai, 2001)
 – Generate a language model M_d for every document d
 – Rank documents according to the probability P(q | M_d) that they generate the query q
 – Principled integration of the importance of documents (e.g. PageRank) as a prior probability P(d)
• Divergence From Randomness, based on a combinatorial approach (Amati, 2003)
 – Information-theoretic framework for generating ranking functions:
  w_{d,q} = Σ_{t ∈ q∩d} (−log P₁) · (1 − P₂)
 – Models consist of 3 components:
  • a randomness model computes the probability P₁ that t appears with the given frequency in d
  • a risk/information-gain model computes the probability P₂ that t with the given frequency in d will occur one more time
  • term frequency normalisation adjusts the term frequency with respect to the document length
Ranking with Document fields
• Documents often have content in different fields
– Metadata in PDF or MS Office files
– Title, headings, emphasized text, anchor text of incoming hyperlinks in
Web pages
• Rank by
– Scoring each field separately and combining the scores
• Difficult to interpret: combining non-linear transformations of term
frequencies
– Computing weighted sum of field term-frequencies, then compute
score
• Per-field term frequency normalization: e.g. no need to discount anchor
text frequencies, because each anchor is a vote for the Web page
• Extension of BM25 (Zaragoza et al. 2004) and DFR models
(Macdonald et al., 2005) with per-field normalization
Ranking with Proximity
• Query terms that appear near each other contribute more to the document’s score
 – Requires position information in the search engine index
 – Score both individual query terms and pairs of query terms:
  score(d, q) = Σ_{t ∈ q} score(d, t) + Σ_{t_i, t_j ∈ q, t_i ≠ t_j} score(d, t_i, t_j)
• Count an occurrence of a pair of terms if they occur within a window of w terms
• Markov Random Fields to model dependencies (Metzler & Croft, 2005)
 – Sequential Dependence: only consecutive terms
 – Full Dependence: all possible pairs of terms
• Extensions to BM25 and DFR (Büttcher & Clarke, 2006; Peng et al., 2007)
• Using global counts for pairs of terms is expensive to compute and store
 – Not always effective (Macdonald & Ounis, 2010)
Learning To Rank
• Web search engines use hundreds of features to rank Web pages
– Text-based features (e.g. BM25 scores), term proximity, document structure,
– Link-based features (e.g. PageRank),
– spam metrics, freshness, click-through rates
• Limitations of typical IR models
– Parameter tuning can be difficult
– Cannot combine large number of features to obtain more effective ranking
• Machine Learning from
– Points: relevance judgment for query-document pair (Crammer & Singer,
2001)
• Doc A is relevant, Doc B is not relevant, Doc C is somehow relevant
– Pairs: preferences between pairs of documents (Burges et al., 2005)
• Doc A is better than Doc B
– Lists: maximize average evaluation measure for many queries (Yue et al.,
2007)
• Hard problem, evaluation measures are not always continuous, approximations are required
Graph Clusters & Communities
• There is no widely accepted definition
• In general the graph has to be relatively sparse
• If the graph is too dense, there is no point in searching for clusters using the structural properties of the graph
• Many clustering algorithms, or problems related to clustering, are NP-hard
Basic Methodologies
• Hierarchical Clustering
• Graph Partitioning
• Partitional Clustering
• Spectral Clustering
• Divisive Algorithms
• Modularity Based Methodologies
Hierarchical Clustering
• Either top-down (divisive) or bottom-up (agglomerative)
• In agglomerative methodologies small clusters are merged into bigger ones based on a measure of similarity
• The individual merging steps can be used to create a hierarchy of domains in a dendrogram
• Starting point: each node is a cluster
• Ending point: arbitrary, e.g. determined by the size/number of the clusters or by a quality measure such as modularity
Hierarchical Clustering
• Similarity functions:
 – Jaccard coefficient [1]: if A and B are two groupings (in graphs, the sets of vertices connected to two different nodes), the Jaccard coefficient is J(A, B) = |A ∩ B| / |A ∪ B|
 – In the ROCK algorithm [2], the goodness measure is
  g(Cᵢ, Cⱼ) = link[Cᵢ, Cⱼ] / ( (nᵢ + nⱼ)^(1+2f(θ)) − nᵢ^(1+2f(θ)) − nⱼ^(1+2f(θ)) )
  where:
  • f(θ) is a function that depends on the data set as well as on the kind of clusters we are interested in
  • link[Cᵢ, Cⱼ] is the number of links between the two clusters
  • nᵢ^f(θ) approximates the links each vertex has inside Cᵢ
• By taking into account the linkage both outside and inside each cluster we avoid situations where big clusters “swallow” smaller, but well-defined, ones
Graph Partitioning
• The general problem is to find k groups of vertices such that the number of edges between them is minimal
• k is given as an argument; otherwise we end up with each vertex as its own cluster
• Most common algorithms use the Ford–Fulkerson max-flow min-cut theorem [4][5]
Max-Flow Min-Cut
• By choosing a source and a sink (or by adding them, depending on the implementation and problem) we send the maximum flow between these two nodes
• The edges that belong to the path with the maximum flow and have their respective flow saturated (maximized) are the edges that belong to the minimum cut
• A common implementation that is generally fast on sparse networks is the Edmonds–Karp algorithm [6]
• [Figure: with the source on the left and the sink on the right, the two clusters are easily separated]
Partitional Clustering
• The number of clusters is predefined
• For each pair of vertices we define a distance
• Clusters are formed by maximizing or
minimizing a cost function based on the
distance
• Most common algorithm is k-Means
Partitional Clustering
• Common cost functions:
 – Minimum k-clustering: the cost function is the diameter of a cluster, i.e. the largest distance between two points of a cluster. The points are classified such that the largest of the k cluster diameters is the smallest possible, trying to keep the clusters very “compact”.
 – k-clustering sum: same as minimum k-clustering, but the diameter is replaced by the average distance between all pairs of points of a cluster.
 – k-center: for each cluster i we define a reference point xᵢ (centroid) and compute the maximum dᵢ of the distances of each cluster point from the centroid. The clusters and centroids are chosen to minimize the largest value of dᵢ.
 – k-median: same as k-center, but the maximum distance from the centroid is replaced by the average distance.
Spectral Clustering
• A similarity matrix is needed → the adjacency matrix A
• We use eigenvectors to partition the graph into clusters
• Most common methodologies use the Laplacian matrix L = D − A
 – where D is the diagonal matrix whose element Dᵢᵢ equals the degree of vertex i
• We decompose L and use the k eigenvectors corresponding to the k smallest eigenvalues
• If k = 2 we use the second eigenvector
Spectral Clustering (k = 2)
[Worked example: six points x1…x6 with a block-structured similarity matrix – strong similarities within {x1, x2, x3} and within {x4, x5, x6}, weak cross-block entries (0.2, 0.1). The eigendecomposition yields an eigenvector whose entries are ≈ 0.2 for x1, x2, x3 and negative (−0.4, −0.7, −0.7) for x4, x5, x6; splitting at the threshold 0 separates the two clusters.]
Spectral Clustering (Unnormalized)
• Input: similarity matrix S ∈ Rⁿˣⁿ, number k of clusters to construct
• Construct a similarity graph by one of the ways defined before; let W be its (weighted) adjacency matrix
• Compute the Laplacian L
• Compute the first k eigenvectors u₁, …, u_k of L
• Let U ∈ Rⁿˣᵏ be the matrix containing the vectors u₁, …, u_k as columns
• For i = 1, …, n, let yᵢ ∈ Rᵏ be the vector corresponding to the i-th row of U
• Cluster the points (yᵢ), i = 1, …, n, in Rᵏ with the k-means algorithm into clusters C₁, …, C_k
• Output: clusters A₁, …, A_k with Aᵢ = {j | yⱼ ∈ Cᵢ} [8]
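A sketch of the unnormalized algorithm in NumPy (toy graph of two loosely connected triangles; for k = 2 the split can be read off the sign of the second eigenvector, so no k-means is needed):

    import numpy as np

    # weighted adjacency of two triangles joined by a weak 0.1 edge
    W = np.array([[0,1,1,0,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0.1,0,0],
                  [0,0,0.1,0,1,1],
                  [0,0,0,1,0,1],
                  [0,0,0,1,1,0]], float)

    D = np.diag(W.sum(axis=1))
    L = D - W                        # unnormalized Laplacian

    lam, U = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = U[:, 1]                # second-smallest eigenvector
    print(fiedler > 0)               # sign split -> the two clusters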
Divisive Algorithms
• The idea seems similar to hierarchical divisive algorithms (HDA)
• In HDA we remove edges based on the similarity between vertices; here, instead, we remove edges based only on properties of the edge itself
• Edge betweenness: the number of shortest paths between all vertex pairs that run along the edge; it is higher for edges that connect communities
Divisive Algorithms
• Girvan and Newman algorithm [9]:
 1. Compute the centrality of all edges
 2. Remove the edge with the largest centrality; in case of ties with other edges, one of them is picked at random
 3. Recalculate the centralities on the running graph
 4. Iterate the cycle from step 2
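A sketch of one Girvan–Newman pass using NetworkX's edge betweenness (the toy graph and the stopping rule – stop when the graph splits – are illustrative choices):

    import networkx as nx

    # two triangles joined by a single bridge edge (2, 3)
    G = nx.Graph([(0,1), (1,2), (0,2), (3,4), (4,5), (3,5), (2,3)])

    while nx.number_connected_components(G) == 1:
        eb = nx.edge_betweenness_centrality(G)    # step 1: centrality of all edges
        G.remove_edge(*max(eb, key=eb.get))       # step 2: remove the largest
        # steps 3/4: recompute and iterate

    print(list(nx.connected_components(G)))       # [{0, 1, 2}, {3, 4, 5}]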
Modularity
• Modularity is a metric that can be used to evaluate the quality of a clustering:
 Q = (1/2m) Σᵢⱼ [ Aᵢⱼ − kᵢkⱼ/(2m) ] δ(cᵢ, cⱼ)
• Where:
 – m is the number of links in the whole graph
 – A the adjacency matrix
 – kᵢ the degree of vertex i
 – cᵢ the cluster vertex i belongs to
 – δ(cᵢ, cⱼ) = 1 if cᵢ == cⱼ, otherwise δ(cᵢ, cⱼ) = 0
• The value of the modularity lies in the range [−1, 1]
• It is positive if the number of edges within groups exceeds the number expected on the basis of chance
Modularity
• Modularity can also be used as an objective function in algorithms in order to extract clusters
• Example: simulated annealing
• In the most common implementation there are two moves:
 1. Moving a vertex at random from one cluster to another
 2. Merging / splitting clusters
World Wide Web (WWW)
• A highly dynamic structure continuously changing: Web pages
and hyperlinks are created, deleted or modified
• Important activity: Searching the content that matches users’
keyword-based queries
• Ranking of the results enables users to effectively retrieve
relevant and important information
• Up-to-date ranking involves a huge infrastructure (intense
crawling, index updates …)
• A robust predictive framework enables:
– Saving infrastructure resources for search engines
– Identifying promising web pages / users in social networks to advertise
on!!
Proposed Framework
• Goal: Predict the future ranking of Web pages
based on previous ranking positions
• Framework:
– Computation of ranking trends
– Definition of the predictive models:
• Linear/2nd order Regression, Principal Components
Regression, EM based clustering.
– Models’ training and predictions
– Evaluation of predictions’ quality using various
similarity measures
Normalized ranking
• Ranking across graph snapshots:
 – As the size of the graph increases, the measure should be comparable across different graph sizes (a page p is more important when it occupies the 5th position out of 1000 pages rather than out of 100 pages)
• Definition of normalized rank (nrank) (Vlachou, Vazirgiannis [WAW2006]):
 nrank(p, tᵢ) = 2 · rank(p, tᵢ) / n_{tᵢ}², where n_{tᵢ} = |G_{tᵢ}|
Rank Change Rate (Racer)
• Intuitively, an appropriate measure is the rank change rate between two snapshots
• It represents the rate and relative importance of a rank change:
 racer(p, tᵢ) = ( nrank(p, tᵢ) − nrank(p, tᵢ₋₁) ) / nrank(p, tᵢ₋₁)
• Changes in top rank positions are more important than those in lower positions
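Both measures in a couple of Python lines (a sketch with made-up ranks and snapshot sizes):

    def nrank(rank, n):
        return 2.0 * rank / n ** 2           # n = |G_t|, pages in the snapshot

    def racer(nr_now, nr_prev):
        return (nr_now - nr_prev) / nr_prev

    # a page moves from rank 50 of 1000 pages to rank 5 of 1200 pages
    prev, now = nrank(50, 1000), nrank(5, 1200)
    print(racer(now, prev))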
Markov Models (MMs)
• Well suited for modeling and predicting values in various applications where frequent state transitions occur
• An m-order MM is defined as:
 – a set of states S = {s₁, …, sₙ} and
 – a matrix of transition probabilities T
• Markov property: future values depend only on the m previous values (m = the MM's order):
 P(x₁x₂…x_k) = P(x₁) · Π_{i=2}^{k} P(xᵢ | xᵢ₋ₘ … xᵢ₋₁)
• Other MM properties: reducibility, periodicity, recurrence, ergodicity
MMs’ Learning
• MMs learn from racer sequences:
 – racer values are real numbers → their high precision makes the recognition of patterns impossible
• Solution: discretize the racer values; the states cardinality matters:
 – large: high computational complexity, low coverage, overfitting of the model → the prediction ability degrades
 – small: fine-grained trends cannot be represented → accuracy falls
• (Jagadish et al. 1998): partition the data space into n non-overlapping equiprobable adjacent partitions

 URL | t1      | t2      | t3
 P1  | 0.83233 | 0.93323 | 0.13233
 P2  | 0.83234 | 0.93434 | 0.23233
 P3  | 0.83235 | 0.93536 | 0.33233
 P4  | 0.83236 | 0.93644 | 0.43233
 …   | …       | …       | …
Definition of MMs’ States
• If R is the set of d distinct values and F the corresponding frequencies:
 – Repeatedly partition R until there are n partitions:
  • choose the partition Rᵢ with the highest sum of frequencies
  • divide Rᵢ into two equi-probable partitions
 – For each partition, calculate the weighted mean:
  sᵢ = Σ_{r ∈ Rᵢ} f_r · r / Σ_{r ∈ Rᵢ} f_r
• The set S of all weighted mean values defines the n states of the MM; n = 20 (after conducting a number of experiments)
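A sketch of the equiprobable-partition state construction in pure Python (the tie-handling details are my own simplifications; the example values are invented):

    from collections import Counter

    def mm_states(values, n):
        """Discretize values into n roughly equiprobable states (sketch)."""
        freq = Counter(values)
        parts = [sorted(freq)]                  # start with one partition of R
        while len(parts) < n:
            splittable = [p for p in parts if len(p) > 1]
            if not splittable:
                break
            # choose the splittable partition with the highest total frequency
            p = max(splittable, key=lambda r: sum(freq[v] for v in r))
            # split it into two (roughly) equi-probable halves
            half, acc, cut = sum(freq[v] for v in p) / 2, 0, 0
            for i, v in enumerate(p[:-1]):      # keep both halves non-empty
                acc += freq[v]
                cut = i
                if acc >= half:
                    break
            parts.remove(p)
            parts += [p[:cut + 1], p[cut + 1:]]
        # each state is the frequency-weighted mean of its partition
        return [sum(freq[v] * v for v in p) / sum(freq[v] for v in p)
                for p in parts]

    print(mm_states([0.1, 0.1, 0.2, 0.4, 0.4, 0.4, 0.9], 3))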
MMs’ Predictions
• Assuming m+1 successive crawls:
 – Construct an m-order MM
 – Compute transition probabilities for every path using the chain rule:
  P(a₁ … aₘ) = P(a₁) · Π_{i=2}^{m} P(aᵢ | aᵢ₋₁, …, aᵢ₋ₘ)
  (where aᵢ = ranking(pⱼ, tᵢ))
 – The predicted value (and therefore the ranking position) is:
  X* = argmax_X P(a₁ … aₘ X)
Predictive Models
• Markov Models
• Linear Regression
• Principal Components Regression
• Bayesian Clustering
Linear Regression
• Predicts future nrank values by fitting a line to the values of the previous times
• A separate linear predictor is defined for each page p
• Given the nrank values for m snapshots, the prediction for a snapshot tᵢ is:
 nrank(p, tᵢ) = a·tᵢ + b + εᵢ
 where εᵢ is Gaussian noise with zero mean and known variance σ²
Linear Regression
• Parameters a, b are found by solving the least squares problem (sums over i = 1, …, m):

 [ Σᵢ tᵢ²  Σᵢ tᵢ ] [a]   [ Σᵢ tᵢxᵢ ]
 [ Σᵢ tᵢ   m     ] [b] = [ Σᵢ xᵢ   ]

• Advantages: simple and flexible
• Disadvantages:
 – does not take into account the correlation among Web pages
 – computationally expensive for high dimensions
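The least-squares fit in NumPy (a sketch with an invented nrank history; np.polyfit solves the same normal equations, and degree 2 gives the second-order variant of the next slide):

    import numpy as np

    t = np.arange(1.0, 7.0)                         # 6 snapshot times
    x = np.array([0.9, 0.8, 0.72, 0.6, 0.55, 0.5])  # nrank history of one page

    a, b = np.polyfit(t, x, deg=1)                  # fit nrank(p, t) = a*t + b
    print("linear prediction for t=7:", a * 7 + b)

    a2, b2, c2 = np.polyfit(t, x, deg=2)            # second order: a*t^2 + b*t + c
    print("2nd order prediction:", a2 * 49 + b2 * 7 + c2)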
Second Order Prediction
• Assumes that Web pages do not follow a linear behavior through time
• Given m available snapshots, future nrank values are estimated as:
 xᵢ = a·tᵢ² + b·tᵢ + c + εᵢ
• Parameters a, b and c are given by solving the system (sums over i = 1, …, m):

 [ Σᵢ tᵢ⁴  Σᵢ tᵢ³  Σᵢ tᵢ² ] [a]   [ Σᵢ tᵢ²xᵢ ]
 [ Σᵢ tᵢ³  Σᵢ tᵢ²  Σᵢ tᵢ  ] [b] = [ Σᵢ tᵢxᵢ  ]
 [ Σᵢ tᵢ²  Σᵢ tᵢ   m      ] [c]   [ Σᵢ xᵢ    ]
Principal Components Regression
(PCR)
• Principal Component Analysis (PCA) is applied prior to estimating the linear coefficients:
 – identifies temporal evolution patterns in the pages' rankings
 – highlights Web pages' similarities and differences
 – reduces the number of dimensions with controlled loss of information
• Input: a data matrix whose columns represent the variables
(time in our case)
• Output: a transformed matrix of lower dimension containing
only uncorrelated variables
PCA: Method
• Subtract the mean from each dimension (DataAdjust): (A − M) → A'
• Calculate the covariance matrix: A'A'ᵀ → W
• Calculate the eigenvectors and eigenvalues of the covariance matrix: W = ULVᵀ
• Choose the most important components and form a feature vector (FeatureVector): W_k = U_kL_kV_kᵀ
• Derive the new data set: FeatureVector' × DataAdjust' (A_k = U_kL_k)
PCA + Regression (I)
X_k = Az_k + µ + ε
• X_k: the rank time series/pattern (an n-dimensional vector); A an n×q matrix; z_k a latent vector; µ the mean of the X_k's. The noise vector ε is drawn from a zero-mean isotropic Gaussian N(0, σ²Iₙ).
• The latent vector z_k is drawn from N(0, I_q).
• By marginalization, the unconditional probability distribution of X_k is:
 p(X_k) = N(µ, AAᵀ + σ²I)
• The probability distribution is Gaussian with full covariance matrix AAᵀ + σ²I; thus the correlation of different Web pages in the nrank vector X_k is represented.
PCA + Regression (II)
Two-stage learning process:
• Apply PCA to X to find the q principal components A
• Form the low-dimensional projection of the data by computing the low-dimensional latent vectors:
 Z_k = Aᵀ(X_k − µ), k = 1, …, m
• Train q independent linear regression models in the low-dimensional space:
 – Z is a q×m matrix that collects together all the latent vectors; a linear regression model is constructed for each row j of Z:
  z_jk = a_j t_k + b_j + ε, k = 1, …, m
 – The parameters (a_j, b_j) for each j are obtained by solving a linear system using least squares
October 2010
M. Vazirgiannis “Web Mining - a predictive approach “- BDA 2010
116
58
10/10/2010
PCA + Regression (III)
• To predict the future nrank values X*:
  – Estimate the latent vector z* at time t*: z*_j = a_j·t* + b_j, j = 1, . . . , q
  – Prediction: X* = A z* + µ
• Choose q out of m eigenvectors such that the variation loss/error satisfies:

$$
\frac{\sum_{j=q+1}^{m} \lambda_j}{\sum_{l=1}^{m} \lambda_l} \le 15\%
$$
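A sketch of the full PCR pipeline under the assumptions above (X is an n×m matrix whose columns are the snapshot vectors X_k; function names are illustrative):

```python
import numpy as np

def pcr_fit(X, times, q):
    """Two-stage PCR sketch. X: n x m matrix whose columns are the
    snapshot vectors X_k; times: the m snapshot time stamps."""
    mu = X.mean(axis=1, keepdims=True)              # mean pattern µ
    Xc = X - mu
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :q]                                    # n x q loading matrix
    Z = A.T @ Xc                                    # q x m latent vectors Z_k
    coeffs = [np.polyfit(times, Z[j], deg=1) for j in range(q)]
    return A, mu, coeffs

def pcr_predict(A, mu, coeffs, t_star):
    """X* = A z* + µ, with z*_j = a_j t* + b_j per latent dimension."""
    z_star = np.array([aj * t_star + bj for aj, bj in coeffs])
    return A @ z_star + mu.ravel()
```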
Predictive Models
• Markov Models
• Linear Regression
• Principal Components Regression
• Bayesian Clustering
Bayesian Clustering
• We assume that Web pages follow one of J different behaviour/evolution types (clusters)
• Clusters correspond to different dominant Web page behaviors
• Clustering is viewed as training a mixture model:
  – A cluster type j is first chosen with probability π_j for each Web page
Bayesian Clustering -I
• Assume each page's n-rank series falls into one of J different clusters
• Clustering can be viewed as training a mixture probability model:

  x_ik = a_j·t_k + b_j + ε_k,  k = 1, . . . , m

  where x_ik is the n-rank value of the i-th Web page at time t_k, cluster type j is chosen with probability π_j, and ε is Gaussian noise N(0, σ_j²)
• n-rank values are drawn from the following product of Gaussians:

$$
p(x_i \mid j) = \prod_{k=1}^{m} N(x_{ik} \mid a_j t_k + b_j, \sigma_j^2)
$$

• By marginalization, the unconditional probability of x_i is:

$$
p(x_i) = \sum_{j=1}^{J} \pi_j \, p(x_i \mid j)
$$
Bayesian Clustering: Training
• Objective: train the model and estimate the parameters
  θ = {π_j, a_j, b_j, σ_j²}, j = 1, . . . , J
• through maximization of the log likelihood of a dataset consisting of N Web pages:

$$
L(\theta) = \sum_{i=1}^{N} \log p(x_i)
$$

• The maximization of L(θ) is achieved via the Expectation–Maximization (EM) algorithm
Bayesian Clustering: Training with
Expectation Maximization
• E-Step computes the posterior probabilities (j: clusters, i: pages, k: time):

$$
R_{ij} = \frac{\pi_j \, p(x_i \mid j)}{\sum_{l=1}^{J} \pi_l \, p(x_i \mid l)}
$$

• M-Step updates the parameters (t: the vector containing all time stamps, 1: the all-ones m-vector, N_j = Σ_{i=1}^{N} R_{ij}):

$$
\pi_j = \frac{1}{N} \sum_{i=1}^{N} R_{ij}, \qquad
\sigma_j^2 = \frac{\sum_{i=1}^{N} R_{ij} \sum_{k=1}^{m} (x_{ik} - a_j t_k - b_j)^2}{m N_j}
$$

$$
\begin{pmatrix} a_j \\ b_j \end{pmatrix} =
\frac{1}{N_j}
\begin{pmatrix} t^T t & t^T \mathbf{1} \\ t^T \mathbf{1} & m \end{pmatrix}^{-1}
\begin{pmatrix} \sum_{i=1}^{N} R_{ij} \, x_i^T t \\ \sum_{i=1}^{N} R_{ij} \, x_i^T \mathbf{1} \end{pmatrix}
$$
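A self-contained EM sketch for this mixture of per-cluster linear trends, following the E- and M-step updates above (a minimal illustration under these assumptions, not the authors' implementation; all names are illustrative):

```python
import numpy as np

def em_rank_clustering(X, t, J, n_iter=100, seed=0):
    """EM sketch for the mixture of per-cluster linear trends.
    X: N x m n-rank values (pages x snapshots); t: the m time stamps.
    Returns mixing weights pi, slopes a, intercepts b, variances s2
    and the posterior responsibilities R (N x J)."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    pi = np.full(J, 1.0 / J)
    a = rng.normal(size=J)
    b = rng.normal(size=J)
    s2 = np.full(J, X.var() + 1e-6)
    one = np.ones(m)
    M = np.array([[t @ t, t @ one], [t @ one, m]])   # 2x2 system matrix
    for _ in range(n_iter):
        # E-step: responsibilities R_ij, computed in the log domain
        log_p = np.empty((N, J))
        for j in range(J):
            resid = X - (a[j] * t + b[j])            # N x m residuals
            log_p[:, j] = (np.log(pi[j])
                           - 0.5 * m * np.log(2 * np.pi * s2[j])
                           - 0.5 * (resid ** 2).sum(axis=1) / s2[j])
        log_p -= log_p.max(axis=1, keepdims=True)
        R = np.exp(log_p)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update pi_j, (a_j, b_j) and sigma_j^2 as on the slide
        Nj = R.sum(axis=0)
        pi = Nj / N
        for j in range(J):
            rhs = np.array([R[:, j] @ (X @ t), R[:, j] @ (X @ one)])
            a[j], b[j] = np.linalg.solve(M, rhs) / Nj[j]
            resid = X - (a[j] * t + b[j])
            s2[j] = (R[:, j] @ (resid ** 2).sum(axis=1)) / (m * Nj[j])
    return pi, a, b, s2, R

def predict_next(R, a, b, t_star):
    """Point prediction: each page uses its max-posterior cluster."""
    j = R.argmax(axis=1)
    return a[j] * t_star + b[j]
```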
Bayesian Clustering: Prediction
• To predict the nrank x_i* of page i at time t*, based on a history of observations x_i, we express the posterior probability p(x_i* | x_i) according to Bayes' rule:

$$
p(x_i^* \mid x_i) = \sum_{j=1}^{J} R_{ij} \, N(x_i^* \mid a_j t^* + b_j, \sigma_j^2)
$$

• Thus x_i* = a_j·t* + b_j, where j = argmax_l R_il is the cluster maximizing the posterior probability.
Evaluating the predictions
• Boils down to top-k lists similarity (predicted vs. actual)
• OSim(A, B): indicates the degree of overlap between the top-k elements of two lists:

  OSim(A, B) = |A ∩ B| / k

• KSim(A, B): based on Kendall's distance; indicates the degree to which the relative orderings of two top-k lists are in concordance:

  KSim(A, B) = |{(u, v) : A′, B′ agree on order}| / (|A ∪ B| · (|A ∪ B| − 1))
Top-k lists similarity measures - OSim
• Let Α, Β be two top-k lists of web pages; then we define:

  OSim(A, B) = |A ∩ B| / k

i.e. the portion of common pages weighted over the size of the lists.
Example: Consider the top-6 lists Α = {5, 1, 2, 4, 10, 8} and Β = {4, 3, 9, 1, 5, 10}, where each number represents a web page id (order represents the ranking). Then Α ∩ Β = {5, 1, 4, 10}, so |Α ∩ Β| = 4 and k = 6:

  OSim(A, B) = 4/6 = 2/3
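A direct transcription of OSim (an illustrative sketch):

```python
def osim(A, B):
    """OSim: overlap of two top-k lists (order-insensitive)."""
    assert len(A) == len(B)
    return len(set(A) & set(B)) / len(A)

# Slide example: osim([5, 1, 2, 4, 10, 8], [4, 3, 9, 1, 5, 10]) == 4/6
```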
Top-k lists similarity measures - KSim
  KSim(A, B) = |{(u, v): A′, B′ agree on order}| / (|A ∪ B| · (|A ∪ B| − 1))

where A′ = A ∪ (B − A) and B′ = B ∪ (A − B), i.e. each list extended with the other's missing elements; the numerator counts the pairs that maintain their relative ordering in both extended lists.
Example: Let the top-6 lists Α = {5, 1, 2, 4, 10, 8} and Β = {4, 3, 9, 1, 5, 10}; then A′ = {5, 1, 2, 4, 10, 8, 3, 9} and B′ = {4, 3, 9, 1, 5, 10, 2, 8}. The pairs maintaining their relative order in A′, B′ are: (5,2), (5,10), (5,8), (1,2), (1,10), (1,8), (2,8), (4,10), (4,8), (4,3), (4,9), (10,8), (3,9), i.e. 13 in all. Hence

  KSim(A, B) = 13/56
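A sketch of KSim following the definition above (the extended lists A′, B′ and the pair counting are as described; names are illustrative):

```python
from itertools import combinations

def ksim(A, B):
    """KSim: Kendall-distance-based similarity of two top-k lists."""
    A_ext = list(A) + [x for x in B if x not in A]   # A' = A ∪ (B − A)
    B_ext = list(B) + [x for x in A if x not in B]   # B' = B ∪ (A − B)
    pos_a = {x: i for i, x in enumerate(A_ext)}
    pos_b = {x: i for i, x in enumerate(B_ext)}
    union = set(A) | set(B)
    # Count unordered pairs whose relative order agrees in A' and B'
    agree = sum(
        1 for u, v in combinations(union, 2)
        if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) > 0
    )
    n = len(union)
    return agree / (n * (n - 1))

# Slide example: ksim([5, 1, 2, 4, 10, 8], [4, 3, 9, 1, 5, 10]) == 13/56
```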
Top-k lists similarity measures - NDCG
NDCG (Normalized Discounted Cumulative Gain):
i)   Gain: if page p has real position i ≤ k and predicted position j ≤ k, its score is G(i) = k − |i − j| ≤ k
ii)  CG(i) = G(i) for i = 1, and CG(i) = CG(i−1) + G(i) for i > 1
iii) DCG(i) = G(i) for i < b, and DCG(i) = DCG(i−1) + G(i) / log_b(i) for i ≥ b
iv)  IDCG (ideal DCG): the DCG_k obtained with the optimal (descending) ordering of the gains (the maximum attainable)

Let DCG_k = Σ_{i=1}^{k} DCG(i); then we define NDCG = DCG_k / IDCG_k.

Example: Let the top-5 lists Α = {4, 2, 3, 1, 5} (real) and B = {2, 1, 5, 4, 3} (predicted); then (k = 5) G = (3, 4, 3, 2, 3), the gains listed by page id. For b = 2: CG = (3, 7, 10, 12, 15), DCG = (3, 5, 8.89, 12, 13.89) and IDCG = (4, 7, 8.89, 11.5, 13.86).
Thus DCG_5 = 42.786 and IDCG_5 = 45.254; hence NDCG ≈ 0.9455.
Observation: to compute IDCG and IDCG_5, we assumed the gains sorted in descending order: G = (4, 3, 3, 3, 2).
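A sketch transcribing the NDCG definitions above literally (gains indexed by page id, b = 2 by default); note that the slide's worked intermediate values appear to use slightly different rounding, so this transcription may not reproduce them digit for digit:

```python
import math

def ndcg_topk(real, predicted, b=2):
    """NDCG over two top-k lists per the definitions above: per-page
    gain G = k - |real pos - predicted pos|, logarithmic discount from
    position b onward, and DCG_k summed over positions 1..k."""
    k = len(real)
    rpos = {p: i + 1 for i, p in enumerate(real)}
    ppos = {p: i + 1 for i, p in enumerate(predicted)}
    gains = [k - abs(rpos[p] - ppos[p]) for p in sorted(rpos)]  # by page id

    def dcg_sum(g):
        dcg, total = 0.0, 0.0
        for i, gi in enumerate(g, start=1):
            dcg = gi if i < b else dcg + gi / math.log(i, b)
            total += dcg
        return total

    # Ideal DCG uses the gains in descending order
    return dcg_sum(gains) / dcg_sum(sorted(gains, reverse=True))
```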
Top-k lists similarity measures – Spearman's rank correlation coefficient (ρ)
i)  Based on the correlation between two ranking variables X and Y
ii) Perfect Spearman correlation: one variable can be represented as a monotone function of the other (ρ = ±1, e.g. x_i = a·y_i + b)
iii) If d_i = x_i − y_i, i = 1, 2, …, k, then we define:

  ρ = 1 − (6 · Σ_{i=1}^{k} d_i²) / (k · (k² − 1))

  (ρ = 1: identical orderings; ρ = −1: exactly reversed orderings)
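A minimal sketch of Spearman's ρ for two top-k lists ranking the same k pages:

```python
def spearman_rho(A, B):
    """Spearman's rho for two top-k lists over the same pages."""
    k = len(A)
    rank_a = {p: i + 1 for i, p in enumerate(A)}
    rank_b = {p: i + 1 for i, p in enumerate(B)}
    d2 = sum((rank_a[p] - rank_b[p]) ** 2 for p in rank_a)
    return 1 - 6 * d2 / (k * (k ** 2 - 1))
```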
Top-k lists similarity measures– Kendall tau
rank correlation
• Let Α = {a1, a2, …, ak} and Β = {b1, b2, …, bk} be two top-k lists, and (x1, y1), (x2, y2), …, (xk, yk) the set of pairs, where xi and yi are unique and represent the rank of page pi in the Α and Β lists respectively.
i.  The pairs (xi, yi) and (xj, yj) agree on the relative order (concordant) iff (xi < xj and yi < yj) or (xi > xj and yi > yj).
ii. (xi, yi) and (xj, yj) have a different relative order (discordant) iff (xi < xj and yi > yj) or (xi > xj and yi < yj).
iii. Then we define the Kendall tau rank correlation (τ):

  τ = (# concordant pairs − # discordant pairs) / (k · (k − 1) / 2)
Top-k lists similarity measures– Kendall tau
rank correlation (2)
Example: Let two top-4 lists Α = {2, 4, 3, 1} and Β = {1, 4, 3, 2}, where xi, yi denote the rank of element i in Α and Β respectively.
Then x1 = 4, x2 = 1, x3 = 3, x4 = 2 and y1 = 1, y2 = 4, y3 = 3, y4 = 2, giving the pairs (xi, yi) = {(4,1), (1,4), (3,3), (2,2)} (k = 4).
Comparing all the possible pairs (xi, yi) we find:
# concordant pairs = 1: {(3,3), (2,2)}
# discordant pairs = 5
Therefore
τ = (1 − 5) / [0.5 · (4 · 3)] = −4/6 = −2/3
This measure represents the correlation between two ranked lists
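A sketch of Kendall's τ as defined above; it reproduces the value −2/3 for this example:

```python
from itertools import combinations

def kendall_tau(A, B):
    """Kendall tau between two top-k lists of the same k pages."""
    rank_a = {p: i + 1 for i, p in enumerate(A)}
    rank_b = {p: i + 1 for i, p in enumerate(B)}
    conc = disc = 0
    for u, v in combinations(list(rank_a), 2):
        s = (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v])
        if s > 0:
            conc += 1      # same relative order in both lists
        elif s < 0:
            disc += 1      # opposite relative order
    k = len(rank_a)
    return (conc - disc) / (k * (k - 1) / 2)

# Slide example: kendall_tau([2, 4, 3, 1], [1, 4, 3, 2]) == -2/3
```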
Top-k lists similarity measures – Kendall tau
rank correlation (3)
Observations:
1) −1 ≤ τ ≤ 1
2) We may instead define it as:

   τ = (# concordant pairs) / (k · (k − 1) / 2)

   in which case 0 ≤ τ ≤ 1.
3) With this definition, the example of the previous slide gives τ = 1/6.
Top-k lists similarity measures – RSim
RSim(A, B) = 1 − CPS(A, B) / CPSmax(A, B), where

  CPS(A, B) = Σ_{i=1}^{k} |A_i − B_i| · (k + 1 − A_i)

and A_i is the real and B_i the predicted ranking of page i.
Observations:
– A_i, B_i ∈ {1, 2, …, k}
– RSim(A, B) = 1 is the best case; RSim(A, B) → 0 the worst
– Assume (w.l.o.g.) A_i = i, i ∈ {1, 2, …, k}; then
  CPS(A, B) ≤ Σ_{i=1}^{k} |i − B_i| · (k + 1 − i) ≤ Σ_{i=1}^{k} (k + 1 − i)² = Σ_{i=1}^{k} i²
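A sketch of RSim under the slide's definitions; the rankings are assumed to be permutations of 1..k, passed as page-to-rank dicts (an illustrative choice of representation):

```python
def rsim(A, B):
    """RSim for two top-k rankings. A[p], B[p]: real and predicted
    rank of page p (both assumed permutations of 1..k)."""
    k = len(A)
    cps = sum(abs(A[p] - B[p]) * (k + 1 - A[p]) for p in A)
    cps_max = sum(i * i for i in range(1, k + 1))  # Σ (k+1−i)² = Σ i²
    return 1 - cps / cps_max

# e.g. rsim({'p1': 1, 'p2': 2}, {'p1': 2, 'p2': 1}) == 0.4 for k = 2
```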
Datasets Description
• Internet Archive (global rankings):
  – Comprises 500,000 pages from weekly collections of UK government websites
  – 24 snapshots obtained between March 2004 and January 2006
• Wikipedia (query-based rankings):
  – A very dynamic, large-scale collection covering Wikipedia's English version (1,467,982 articles in the last snapshot)
  – 55 consecutive monthly snapshots, 1/2002 to 7/2006
  – Query-based top-k results: for each selected query (selection criteria follow), the top-k results are maintained for each of the 55 monthly snapshots
Wikipedia: Query Classification
• Frequency: the proportion of documents containing a given term (TF_{w_i}(t_j), where w_i is a specific snapshot)
  – Wide queries: correspond to frequent terms
• Dynamism: the rate of increase (or decrease) in the number of documents containing the query term as time progresses:

  SD_{w_i}(t_j) = (TF_{w_i}(t_j) − TF_{w_{i−1}}(t_j)) / TF_{w_i}(t_j)

  – Dynamic positive queries: SD(t_j) > 0
  – Dynamic negative queries: SD(t_j) < 0
  – Least dynamic queries: SD(t_j) ≈ 0
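A one-line sketch of the dynamism score (the term-frequency values are assumed precomputed per snapshot):

```python
def dynamism(tf_curr, tf_prev):
    """SD: relative change in a term's document frequency between
    two consecutive snapshots."""
    return (tf_curr - tf_prev) / tf_curr

# SD > 0: dynamic positive, SD < 0: dynamic negative, SD ≈ 0: least dynamic
```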
Ranking Function for Querying Wikipedia
– tf/idf(t, d): the tf/idf value of term t for page d
– title(t, d) = 2 if t appears in the title of d, 1 otherwise
– pr(d): the PageRank score of page d

  score_t(d) = (tf/idf(t, d) · title(t, d))^1.5 · pr(d)
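A direct transcription of this scoring function (a sketch; the tf/idf and PageRank values are assumed precomputed elsewhere):

```python
def score(tfidf, in_title, pagerank):
    """Wikipedia query ranking function from the slide.
    in_title: True if the term appears in the page title."""
    title_boost = 2.0 if in_title else 1.0
    return (tfidf * title_boost) ** 1.5 * pagerank
```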
Yahoo! dataset
• Rankings were collected for a set of 24 queries:
  – Random queries from Wikipedia
  – Popular queries that appeared in Google Trends
  – Popular current queries (such as "euro 2008")
• 37 consecutive daily top-1000 ranked lists, from April 2008 to May 2008, computed with the Yahoo! Search Web Services
Data Sets Preprocessing
• Choice of the appropriate input values for the regression-based models: nrank vs. racer
• Distribution of the Mean Square Error (MSE):
  – Quantifies the divergence of an estimator from the true value being estimated
  – Large values indicate less accurate predictions
  – Defined as:

$$
MSE(\hat{\theta}) = \frac{1}{n} \sum_{j=1}^{n} (\hat{\theta}_j - \theta_j)^2
$$
MSE – Yahoo!
[Figure: MSE distributions on the Yahoo! dataset, for racer values (left) and nrank values (right)]
Data Sets Preprocessing – Bayesian Clustering
• Tested cluster sizes from 2 to 10 for each query
• Each time, chose the number of clusters that maximized the overall clustering quality, defined as bc/wc:
  – bc (between-cluster variation): measured by the distance between cluster centers
  – wc (within-cluster variation): the sum of squared distances from each point to the center of the cluster it belongs to
Experimental Evaluation
• Two baseline schemes are used:
  – Static: simply returns the top-k list of the previous snapshot
  – Random Markov State: assumes that the racer value of a Web page is uniformly chosen among the available states of the learned Markov Models
• All experiments were run using 10-fold cross validation: 10% of the dataset as test and 90% as training
• Results were evaluated using all similarity measures: OSim, KSim and RSim
Internet Archive
[Figure: OSim, KSim and RSim results of the prediction methods on the Internet Archive dataset]
Wikipedia (total queries)
[Figure: OSim, KSim and RSim results over all Wikipedia queries]
Wikipedia – Wide queries
[Figure: OSim, KSim and RSim results for wide queries]
Wikipedia – Dynamic/Positive queries
[Figure: OSim, KSim and RSim results for dynamic positive queries]
Wikipedia – Dynamic/Negative queries
[Figure: OSim, KSim and RSim results for dynamic negative queries]
Wikipedia – Least Dynamic queries
[Figure: OSim, KSim and RSim results for least dynamic queries]
Yahoo!
[Figure: OSim, KSim and RSim results on the Yahoo! dataset]
Conclusions
• Web Mining is a hot research area with prominent industrial applications (DIGITEO CHAIR GRANT, France)
  – Predicting the ranking of web pages and social network users
  – Advertising!
• Promising directions:
  – Graph mining and relevant metrics for social networks and intranet information systems
  – Monitoring and measuring the temporal evolution of Web and social graph dynamics
  – Dynamic personalization algorithms
Thank you!
For more information check:
http://www.db-net.aueb.gr/michalis/
Text Ranking - References
R. Baeza-Yates and B. Ribeiro-Neto (1999). Modern Information Retrieval. Addison Wesley.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender (2005). Learning
to rank using gradient descent. In Proc. 22nd ICML, pp. 89-96.
S. Büttcher, C. Clarke, B. Lushman. Term proximity scoring for ad-hoc retrieval on very large text
collections. In Procs SIGIR 2006, pp 621–622.
K. Crammer, Y. Singer (2001). Pranking with ranking. In Advances in Neural Information Processing
Systems 14, Vol. 14, pp. 641-647.
D. Hiemstra (1998). A linguistically motivated probabilistic model of information retrieval. In Proc.
ECDL'98, pp. 569-584.
J. Lafferty, C. Zhai (2001). Document language models, query models, and risk minimization for
information retrieval. In Proc. SIGIR, pp. 111-119
C. Macdonald, V. Plachouras, B. He, C. Lioma, I. Ounis (2005). University of Glasgow at WebCLEF
2005: Experiments in per-field normalisation and language specific stemming. In Proceedings
of the Cross Language Evaluation Forum (CLEF) 2005.
D. Metzler, W.B. Croft (2005). A Markov Random Field for Term Dependencies. In Proc of ACM
SIGIR’05.
Text Ranking – References I
D. Metzler (2008). Generalized Inverse Document Frequency. In Proceedings of the 17th CIKM’08,
pp 399-408.
K. Papineni (2001). Why inverse document frequency? In Proc 2nd North American Chapter of the
Assn. for Computational Linguistics on Language Technologies, pp. 1–8.
J. Peng, C. Macdonald, B. He, V. Plachouras, I. Ounis. Incorporating term dependency in the DFR
framework. In Proc. SIGIR 2007, pp. 843–844.
J.M. Ponte, W.B. Croft (1998). A language modeling approach to information retrieval. In Proc.
SIGIR, pp. 275-281.
S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford (1994). Okapi at TREC-3. In
Proc of the 3rd Text REtrieval Conference (TREC-1994).
S. Robertson (2004). Understanding Inverse Document Frequency: On theoretical arguments for
IDF. Journal of Documentation, 60(5):503-520.
G. Salton and M.J. McGill (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
G. Salton, E.A. Fox, H. Wu (1983). Extended Boolean information retrieval. CACM 26(11):1022-1036.
Text Ranking – References II
G. Tsatsaronis and V. Panagiotopoulou (2009). A Generalized Vector Space Model for Text
Retrieval Based on Semantic Relatedness. In Proc EACL 2009 Student Research Workshop, pp
70–78.
S.K.M. Wong, W. Ziarko, P.C.N. Wong (1985). Generalized vector space model in information
retrieval. In Proc 8th ACM SIGIR, pp 18-25.
Y. Yue, T. Finley, F. Radlinski, T. Joachims (2007). A support vector method for optimizing average
precision. In Proc. 30th SIGIR, pp. 271-278.
H. Zaragoza, N. Craswell, M.J. Taylor, S. Saria, S. Robertson (2004). Microsoft Cambridge at TREC
13: Web and Hard Tracks. In Proc of TREC 2004.
Graph Clustering - References
[1] http://en.wikipedia.org/wiki/Jaccard_index
[2] ROCK: A Robust Clustering Algorithm for Categorical Attributes Sudipto
Guha,Rajeev Rastogi,Kyuseok Shim
[3] http://en.wikipedia.org/wiki/Gomory%E2%80%93Hu_tree
[4] http://en.wikipedia.org/wiki/Minimum_cut
[5] http://en.wikipedia.org/wiki/Cut_%28graph_theory%29
[6] http://en.wikipedia.org/wiki/Edmonds%E2%80%93Karp_algorithm
[7] http://mathworld.wolfram.com/LaplacianMatrix.html
[8] A Tutorial on Spectral Clustering Ulrike von Luxburg
[9] Community structure in social and biological networks M. Girvan and M. E. J.
Newman
[10] Community detection in graphs Santo Fortunato
[11] Modularity from Fluctuations in Random Graphs and Complex Networks. Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral
Top-k similarity measures – References
1. Chen, Y.-Y., Gan, Q., & Suel, T. (2004). Local Methods for Estimating PageRank Values. Proceedings of CIKM. Washington, USA.
2. Deshpande, M., & Karypis, G. (2004). Selective Markov models for predicting Web page accesses. ACM Transactions on Internet Technology, 163-184.
3. Sayyadi, H., & Getoor, L. (2009). FutureRank: Ranking Scientific Articles by Predicting their Future PageRank. 2009 SIAM International Conference on Data Mining (SDM09) (pp. 533-544). Sparks, Nevada: SIAM.
4. Vazirgiannis, M., Drosos, D., Senellart, P., & Vlachou, A. (2008). Web Page Rank Prediction with Markov Models. WWW poster. Beijing, China.
5. Zacharouli, P., Titsias, M., & Vazirgiannis, M. (2009). Web Page Rank Prediction with PCA and EM Clustering. Proceedings of the 6th International Workshop on Algorithms and Models for the Web-Graph (pp. 104-115). Barcelona, Spain: Springer-Verlag.
6. Järvelin, K., & Kekäläinen, J. (2002). Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, 20(4), pp. 422-446.
7. Teevan, J., Dumais, S. T., & Horvitz, E. (2007). Characterizing the value of personalizing search. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam.
References - Pagerank prediction & Web Dynamics
1. Baeza-Yates, R. A., Castillo, C., & Saint-Jean, F. (2004). Web Dynamics, Structure, and Page Quality. In M. Levene & A. Poulovasillis (Eds.), Web Dynamics (pp. 93-112). Springer.
2. Broder, A. Z., Lempel, R., Maghoul, F., & Pedersen, J. (2006). Efficient PageRank approximation via graph aggregation.
Information Retrieval , 9 (2), 123-138.
3. Chen, Y.-Y., Gan, Q., & Suel, T. (2004). Local Methods for Estimating PageRank Values. Proceedings CIKM. Washington, USA.
4. Chien, S., Dwork, C., Kumar, R., Simon, D. R., & Sivakumar, D. (2003). Link Evolution: Analysis and Algorithms. Internet
Mathematics , 1 (3), 277-304.
5. Davis, J. V., & Dhillon, I. S. (2006). Estimating the global PageRank of Web communities. Proceedings KDD. Philadelphia, USA.
6. Deshpande, M., & Karypis, G. (2004). Selective Markov models for predicting Web page accesses. ACM Trans. Internet
Technol. , 163-184.
7. Langville, A. N., & Meyer, C. D. (2004). Updating PageRank with iterative aggregation. Proceedings of the 13th international
World Wide Web conference on Alternate track papers & posters (pp. 392-393). New York, USA: ACM.
8. Piatetsky-Shapiro, G., & Connell, C. (1984). Accurate estimation of the number of tuples satisfying a condition. SIGMOD Rec.
, 14 (2), 256-276.
9. Sayyadi, H., & Getoor, L. (2009). Future Rank: Ranking Scientific Articles by Predicting their Future PageRank. 2009 SIAM
International Conference on Data Mining (SDM09) (pp. 533-544). Sparks, Nevada: SIAM.
10. Vazirgiannis, M., Drosos, D., Senellart, P., & Vlachou, A. (2008). Web Page Rank Prediction with Markov Models. WWW
poster. Beijing, China.
11. Vlachou, A., Berberich, K., & Vazirgiannis, M. (2006). Representing and quantifying rank-change for the Web graph.
Algorithms and Models for the Web-Graph, Fourth International Workshop, WAW 2006. 4936, pp. 157-165. Banff, Canada:
Springer.
12. Zacharouli, P., Titsias, M., & Vazirgiannis, M. (2009). Web Page Rank Prediction with PCA and EM Clustering. Proceedings of the 6th International Workshop on Algorithms and Models for the Web-Graph (pp. 104-115). Barcelona, Spain: Springer-Verlag.