Web Mining – a predictive approach

Prof. Michalis Vazirgiannis
DB-NET (Research Team on Data & Web Mining), Dept. of Informatics,
Athens University of Economics & Business, GREECE
WWW: http://www.db-net.aueb.gr/
&
Digiteo Chair @ LIX, Ecole Polytechnique, Paris, France
Tutorial Outline
• Introduction, Motivation
 – Web Data nature (graph, time evolving)
 – Examples of data sets (SNAP – our data set)
 – Aspects: content, structure and behavior mining
• Data Preprocessing
 – Content preprocessing (text to matrix)
 – Dimensionality reduction algorithms
 – Feature selection fundamentals
• Ranking Web data
 – Graph based ranking – PageRank, HITS
 – Text based ranking
• Unsupervised mining algorithms and their foundations:
 – Graph based clustering
• Predictive learning
 – Markov Models
 – Regression models, variants of regression with probabilistic clustering and Principal Components Analysis (PCA)
 – EM based approach with Bayesian learning
• Top-k similarity measures
• Predictive learning case studies
 – K-Sim, OSim, NDCG, Spearman correlation
Web Mining - Introduction
• The Web looks like a “bowtie” [Kumar+1999]
• IN, SCC, OUT, ‘tendrils’
• Disconnected components
• Power laws apply for count vs. (in-/out-links, user clicks)
• The plot is linear in log-log scale: freq ∝ degree^(−2.15)
Web Mining - I
The three aspects
• Web Content/ Structure / Usage Mining
Web content Mining
• Metadata: Keywords, Phrases, Filename, URL, Size etc.
• Feature extraction techniques from web contents
• Web document representation models: Vector space
model, ontology based ones
• Web document clustering: Similarity measures, Web
document clustering algorithms
• Web document classification techniques
• Content based ranking web documents – for queries
Web Mining - II
Web Structure Mining
• Link analysis theory (Hubs, Authorities, Co-citations, …)
• Mining web communities
• The role of hyperlinks in web crawling
• Ranking based on link analysis: HITS, PageRank
Web Usage Mining
• Web usage logs – click streams
• Data Preprocessing: cleaning, session identification, User
identification (anonymous user profiles vs. registered
users)
• Pattern discovery: Web usage-mining algorithms:
Classification, Clustering, Association Rules, Sequential
pattern discovery
Web Mining - III
• Graph of web pages (mainly text)
• Huge: Google (~15 bn pages < 2% of the real web
volume)
• Rapidly evolving: each week +8% pages, 25% of the links
change!!
• Web search is a dominant activity! (Google as verb in
dictionaries…)
• Ranking is an essential add-on towards meaningful
results
• Further knowledge extraction and web organization is
necessary
Web Mining – The value of predictions [1]
Web search patterns are good predictors:
(A) for box-office movie revenue
(D) correlation between predicted and actual outcomes when predictions are based on query data t weeks prior to the event
[Figure panels (A)–(D) omitted]
[1] Predicting consumer behavior with Web search. Sharad Goel, Jake M. Hofman, Sébastien Lahaie, David M. Pennock, Duncan J. Watts. Microeconomics and Social Systems, Yahoo! Research. www.pnas.org/cgi/doi/10.1073/pnas.1005962107
Content preprocessing
[Figure: an n × p data matrix, e.g. rows of measurements such as (2.3, −1.5, …, −1.3)]
• Rows = objects
• Columns = measurements on objects
 – Represent each row as a p-dimensional vector, where p is the dimensionality
• In effect, we embed our objects in a p-dimensional vector space
• Often useful, but not always appropriate
• Both n and p can be very large in certain data mining applications
Sparse Matrix (Text) Data
[Figure: sparse document–term matrix, ca. 500 text documents × 200 word IDs; only a small fraction of the entries are non-zero]
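To make the text-to-matrix step concrete, here is a minimal Python sketch; the toy documents and whitespace tokenization are invented for illustration:

    import numpy as np

    docs = ["web text mining",
            "data mining text clustering",
            "frequent itemset mining web"]

    # Build the vocabulary: one column per distinct term.
    vocab = sorted({t for d in docs for t in d.split()})
    col = {t: j for j, t in enumerate(vocab)}

    # n x p matrix: one row per document, one column per term.
    X = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for t in d.split():
            X[i, col[t]] += 1

    print(vocab)
    print(X)   # mostly zeros: document-term matrices are sparse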
Dim. Reduction – Linear Algorithms
• Principal Components Analysis (PCA)
• Singular Value Decomposition (SVD)
• Multidimensional Scaling (MDS)
• Latent Semantic Indexing (LSI)
Linear Algebra
Basic Principles
Dim. Reduction – Eigenvectors
Let A be an n×n (square) matrix:
• eigenvalues λ: |A − λI| = 0
• eigenvectors x: Ax = λx
• Matrix rank: # of linearly independent rows or columns
• A real symmetric matrix A (n×n) can be expressed as A = UΛUᵀ
 – U's columns are A's eigenvectors
 – Λ's diagonal contains A's eigenvalues
• A = UΛUᵀ = λ₁x₁x₁ᵀ + λ₂x₂x₂ᵀ + … + λₙxₙxₙᵀ
 – xᵢxᵢᵀ represents the projection onto xᵢ (λᵢ eigenvalue, xᵢ eigenvector)
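A short NumPy sketch of this decomposition (the symmetric matrix is an arbitrary example) checking A = UΛUᵀ and the rank-one expansion:

    import numpy as np

    A = np.array([[2.0, 1.0], [1.0, 3.0]])   # a real symmetric matrix
    lam, U = np.linalg.eigh(A)               # eigenvalues + orthonormal eigenvectors

    # A = U diag(lam) U^T
    assert np.allclose(A, U @ np.diag(lam) @ U.T)

    # Equivalently, the sum of rank-one projections lam_i * x_i x_i^T
    B = sum(l * np.outer(x, x) for l, x in zip(lam, U.T))
    assert np.allclose(A, B)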
Linear Independence
• Linear independence: given vectors x₁, x₂, …, xₙ and the equation λ₁x₁ + λ₂x₂ + … + λₙxₙ = 0: if the only solution is λᵢ = 0 for all i, the vectors are linearly independent; if there are more solutions, the vectors are linearly dependent.
• Linear dependence: one of the vectors is a linear combination of one or more of the remaining vectors.
• Basis of a vector space V: a set of linearly independent vectors from which all elements of V are produced as linear combinations.
 – e.g. B = {e₁, e₂, …, eₙ} with eᵢ = (0, …, 0, 1, 0, …, 0) is a basis of Rⁿ.
 – The same holds for eᵢ = (1, 1, …, 1, 0, …, 0).
• Dimension of a vector space V: the cardinality of its basis, dim V = n.
 – Note: a space V may have more than one basis; however, all bases have the same cardinality.
Singular Value Decomposition (SVD)
• Decomposition into eigenvalues and eigenvectors applies to square matrices. Data matrices are usually non-square; in this case we apply the Singular Value Decomposition.
• Spectral theorem: a normal matrix M can be unitarily diagonalized using a basis of eigenvectors.
• Singular value: a non-negative real number σ is a singular value for M if there exist unit-length vectors u in Kᵐ and v in Kⁿ such that
 Mv = σu and Mᵀu = σv
 u and v are called left-singular and right-singular vectors for σ, respectively.
• Singular value decomposition: M = UΣVᵀ
 – the diagonal entries of Σ are the singular values of M
 – the columns of U / V are left-/right-singular vectors for the corresponding singular values
• An m × n matrix M has at least one and at most p = min(m, n) distinct singular values.
Singular Value Decomposition (SVD) - I
Relation to eigenvectors/eigenvalues
– Let A be an m×n matrix; it can be expressed as A = ULVᵀ
– U: m×m, its columns are the eigenvectors of AAᵀ
– U, V define orthogonal bases: UUᵀ = VVᵀ = I
– L: m×n, contains A's singular values (the square roots of the AAᵀ eigenvalues)
– V: n×n, its columns are the eigenvectors of AᵀA
Singular Value Decomposition (SVD) - II
Matrix approximation
• The best rank-r approximation M' of a matrix M (minimizing the Frobenius norm) is
 M' = UΣ'Vᵀ (Σ' keeps the r largest singular values from Σ)
• Thus ||M − M'||F is minimal over all rank-r matrices
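A minimal NumPy sketch of the rank-r approximation (the random matrix and r = 2 are arbitrary choices):

    import numpy as np

    M = np.random.rand(6, 4)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    r = 2
    Mr = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # keep the r largest singular values

    # By the Eckart-Young result, no rank-r matrix is closer to M in Frobenius norm.
    print(np.linalg.norm(M - Mr, "fro"))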
Multidimensional Scaling (MDS)
• Initially we place the vectors at random positions
• Iteratively reposition them in order to minimize the Stress:
 – Stress = Σᵢⱼ (f(dᵢⱼ) − d'ᵢⱼ)² / Σᵢⱼ f(dᵢⱼ)² : the total error of the distance mapping
 – Complexity O(N³) (N: number of vectors)
• Result: a mapping of the data into a lower-dimensional space
• Usually implemented by:
 – eigendecomposition of the inner-product matrix and projection on the k eigenvectors that correspond to the k largest eigenvalues
Multidimensional Scaling
• Data is given as rows of X
 – C = XXᵀ (inner products of xᵢ with xⱼ)
 – Eigendecomposition C = ULU⁻¹
 – Eventually X' = U_k L_k^(1/2), where k is the projection dimension
[Worked example omitted: SVD of C = XXᵀ and the 2-D projection X' = U₂L₂^(1/2)]
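The eigendecomposition route to MDS as a NumPy sketch (the data matrix is arbitrary, projected to k = 2; the max(·, 0) guard handles tiny negative eigenvalues from round-off):

    import numpy as np

    X = np.random.rand(10, 5)
    X = X - X.mean(axis=0)              # center the data

    C = X @ X.T                         # inner-product (Gram) matrix
    lam, U = np.linalg.eigh(C)          # eigenvalues in ascending order

    k = 2
    idx = np.argsort(lam)[::-1][:k]     # the k largest eigenvalues
    Xk = U[:, idx] * np.sqrt(np.maximum(lam[idx], 0))   # X' = U_k L_k^(1/2)
    print(Xk.shape)                     # (10, 2): points embedded in 2-D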
Principal Components Analysis
• The main concept behind Principal Components Analysis is dimensionality reduction while preserving as much of the data's variance as possible.
• Variance: V(X) = σ² = E[(X − µ)²]
• For N objects with known mean value m, it is approximated as: σ̂² = (1/N) Σᵢ (xᵢ − m)²
• In a sample of N objects with unknown mean value: s² = (1/(N−1)) Σᵢ (xᵢ − x̄)²
Dimensionality reduction based on variance maintenance
[Figure: 2-D point cloud with the axis maximizing the variance]
Principal Components Analysis
• "A linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on …" (Wikipedia)
• Let the data be n-dimensional, with dimensions x₁, …, xₙ
• The objective is to project the data to k dimensions via some linear decomposition:
 y₁ = a₁x₁ + … + aₙxₙ
 …
 y_k = b₁x₁ + … + bₙxₙ
• The projection should maintain the variance of the original data
Covariance Matrix
• Let X = (X₁, …, Xₙ) be the data matrix, where the Xᵢ are variable vectors
• The covariance matrix Σ is the matrix whose (i, j) entry is the covariance cov(Xᵢ, Xⱼ) = E[(Xᵢ − µᵢ)(Xⱼ − µⱼ)]
• Also: cov(X) = XXᵀ (for mean-centered data)
Principal Components Analysis (PCA)
• The basic idea of PCA is the maximization of the variance.
 – Variance depicts the deviation of a random variable from the mean:
 – σ² = Σᵢ₌₁ⁿ (xᵢ − µ)² / n
• Method:
 – Assumption: the data is described by p variables and contained as rows in a matrix A (n×p)
 – We subtract the column means: A' = (A − M)
 – Calculate the covariance matrix W = A'ᵀA'
Principal Components Analysis (PCA) – (2)
• Calculation of the covariance matrix W:
 – a matrix whose cell (i, j) holds the covariance of Xᵢ, Xⱼ
 – covariance depicts the dependency between variables:
  • if cov(Xᵢ, Xⱼ) = 0, then Xᵢ, Xⱼ are uncorrelated (independent)
  • if cov(Xᵢ, Xⱼ) < 0, then Xᵢ, Xⱼ are negatively dependent (when one increases the other decreases and vice versa)
  • if cov(Xᵢ, Xⱼ) > 0, then Xᵢ, Xⱼ are positively dependent
 – Simply: cov(Xᵢ, Xⱼ) = Σ_{f=1}^n (X_if − X̄ᵢ)(X_jf − X̄ⱼ) / (n − 1)
• Calculate the eigenvalues and eigenvectors of W (X, D)
• Retain the k largest eigenvalues and corresponding eigenvectors
 – k: input parameter, chosen so that the retained variance Σ_{j=1}^k λⱼ / Σ_{j=1}^p λⱼ exceeds a threshold
• Projection: A'X_k
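The PCA recipe above as a short NumPy sketch (random data and k = 2 are illustrative choices):

    import numpy as np

    A = np.random.rand(100, 5)           # n objects described by p = 5 variables
    Ac = A - A.mean(axis=0)              # subtract the column means

    W = np.cov(Ac, rowvar=False)         # p x p covariance matrix
    lam, X = np.linalg.eigh(W)           # eigenvalues in ascending order

    order = np.argsort(lam)[::-1]        # sort descending
    lam, X = lam[order], X[:, order]

    k = 2                                # keep the top-k components
    print("retained variance:", lam[:k].sum() / lam.sum())
    proj = Ac @ X[:, :k]                 # the projection A'X_k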
Principal Components Analysis – worked example
[Worked example: X holds five 4-dimensional points x1…x5 with column means (3, 4, 5, 6). After subtracting the means (X' with entries xᵢⱼ − mⱼ), the covariance matrix Cov = Σ_{f=1}^5 (x_if − mᵢ)(x_jf − mⱼ)/4 is computed and decomposed as Cov = ULUᵀ, with eigenvalues L = diag(4.38, 2.89, 1.02, …). Keeping k = 2 eigenvectors U_k, the data is projected as X'U_k.]
PCA, example
[Figure: data cloud with the axes corresponding to the first and second principal components]
PCA Applications
• It is a dimensionality reduction method
• Nominal complexity O(np² + p³)
 – n: number of data points
 – p: number of initial space dimensions
• The new space sufficiently maintains the data variance
• Preprocessing step preceding the application of data mining algorithms (such as clustering)
• Data visualization
• Noise reduction
Latent Structure in documents
• Documents are represented using the Vector Space Model, built from the keywords they contain.
• In many cases baseline keyword matching performs poorly: it is not able to detect synonyms.
• Therefore document clustering is problematic.
• Example of keyword matching with the query "IDF in computer-based information look-up":
[Table: term–document incidence of the terms access, document, retrieval, theory, database, information, indexing, computer over Doc1–Doc3; exact keyword matching misses relevant documents]
Indexing by Latent Semantic Analysis. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Journal of the American Society for Information Science, 1990.
Latent Semantic Indexing (LSI) - I
• Finding similarity with exact keyword matching is problematic.
• Using SVD we process the initial document-term matrix.
• Then we choose the k largest singular values. The resulting matrix has rank k and is closer to the original one (in the Frobenius norm) than any other rank-k matrix.
Latent Semantic Indexing (LSI) - II
• The initial matrix is SVD-decomposed as A = ULVᵀ
• Choosing the top-k singular values from L we have A_k = U_kL_kV_kᵀ, where:
 – L_k is the square k×k matrix containing the top-k singular values of the diagonal of L,
 – U_k is the m×k matrix containing the first k columns of U (left singular vectors),
 – V_kᵀ is the k×n matrix containing the first k rows of Vᵀ (right singular vectors).
• Typical values: k ≈ 200–300 (empirically chosen, based on experiments in the bibliography)
LSI capabilities
• Term-to-term similarity: A_kA_kᵀ = U_kL_k²U_kᵀ
• Document-to-document similarity: A_kᵀA_k = V_kL_k²V_kᵀ
• Term-to-document similarity (an element of the transformed term–document matrix)
• Extended query capabilities, transforming the initial query q to q_n:
 q_n = qᵀU_kL_k⁻¹
• Thus q_n can be regarded as a row of matrix V_k
LSI – an example [1]
LSI application on a term – document matrix
C1: Human machine Interface for Lab ABC computer application
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurements
M1: The generation of random, binary unordered trees
M2: The intersection graph of path in trees
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
• The dataset consists of 2 classes: the 1st on "human–computer interaction" (c1–c5), the 2nd related to graphs (m1–m4). After feature extraction the titles are represented as shown on the next slide.
[1] Indexing by Latent Semantic Analysis. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Journal of the American Society for Information Science, 1990.
LSI – an example

 term      | C1 C2 C3 C4 C5 M1 M2 M3 M4
 human     |  1  0  0  1  0  0  0  0  0
 interface |  1  0  1  0  0  0  0  0  0
 computer  |  1  1  0  0  0  0  0  0  0
 user      |  0  1  1  0  1  0  0  0  0
 system    |  0  1  1  2  0  0  0  0  0
 response  |  0  1  0  0  1  0  0  0  0
 time      |  0  1  0  0  1  0  0  0  0
 EPS       |  0  0  1  1  0  0  0  0  0
 survey    |  0  1  0  0  0  0  0  0  1
 trees     |  0  0  0  0  0  1  1  1  0
 graph     |  0  0  0  0  0  0  1  1  1
 minors    |  0  0  0  0  0  0  0  1  1
LSI – an example
A = ULVᵀ, where A is the 12×9 term–document matrix above, U is the 12×9 matrix of left singular vectors, L = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36) holds the singular values, and V is the 9×9 matrix of right singular vectors. [Full numeric factors omitted; the rank-2 truncation used in the example follows.]
LSI – an example
Choosing the 2 largest singular values we have:

 Uk =
  0.22 −0.11
  0.20 −0.07
  0.24  0.04
  0.40  0.06
  0.64 −0.17
  0.27  0.11
  0.27  0.11
  0.30 −0.14
  0.21  0.27
  0.01  0.49
  0.04  0.62
  0.03  0.45

 Lk = diag(3.34, 2.54)

 Vkᵀ =
   0.20  0.61  0.46  0.54  0.28  0.00  0.02  0.02  0.08
  −0.06  0.17 −0.13 −0.23  0.11  0.19  0.44  0.62  0.53
LSI (2 singular values)
Ak = UkLkVkᵀ =

 term      |   C1    C2    C3    C4    C5    M1    M2    M3    M4
 human     |  0.16  0.40  0.38  0.47  0.18 −0.05 −0.12 −0.16 −0.09
 interface |  0.14  0.37  0.33  0.40  0.16 −0.03 −0.07 −0.10 −0.04
 computer  |  0.15  0.51  0.36  0.41  0.24  0.02  0.06  0.09  0.12
 user      |  0.26  0.84  0.61  0.70  0.39  0.03  0.08  0.12  0.19
 system    |  0.45  1.23  1.05  1.27  0.56 −0.07 −0.15 −0.21 −0.05
 response  |  0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
 time      |  0.16  0.58  0.38  0.42  0.28  0.06  0.13  0.19  0.22
 EPS       |  0.22  0.55  0.51  0.63  0.24 −0.07 −0.14 −0.20 −0.11
 survey    |  0.10  0.53  0.23  0.21  0.27  0.14  0.31  0.44  0.42
 trees     | −0.06  0.23 −0.14 −0.27  0.14  0.24  0.55  0.77  0.66
 graph     | −0.06  0.34 −0.15 −0.30  0.20  0.31  0.69  0.98  0.85
 minors    | −0.04  0.25 −0.10 −0.21  0.15  0.22  0.50  0.71  0.62
LSI Example
• The query "human computer interaction" retrieves documents c1, c2, c4 but not c3 and c5.
• If we submit the same query (via the transformation shown before) against the transformed matrix, we retrieve (using cosine similarity) all of c1–c5, even though c3 and c5 share no keyword with the query.
• According to the transformation for queries we have:
Query transformation
q = (1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)ᵀ
(1 for the query terms "human" and "computer", 0 for the remaining ten index terms)
Query transformation
qᵀU_k = (0.22, −0.11) + (0.24, 0.04) = (0.46, −0.07)
L_k⁻¹ = diag(1/3.34, 1/2.54) ≈ diag(0.30, 0.39)
q_n = qᵀU_kL_k⁻¹ ≈ (0.138, −0.0273)
Query transformation
Scaling by L_k gives the document coordinates V_kL_k and the query coordinates q_nL_k:

  V_k row          V_kL_k row
 ( 0.20, −0.06) → ( 0.67, −0.15)   C1
 ( 0.61,  0.17) → ( 2.04,  0.43)   C2
 ( 0.46, −0.13) → ( 1.54, −0.33)   C3
 ( 0.54, −0.23) → ( 1.80, −0.58)   C4
 ( 0.28,  0.11) → ( 0.94,  0.28)   C5
 ( 0.00,  0.19) → ( 0.00,  0.48)   M1
 ( 0.01,  0.44) → ( 0.03,  1.12)   M2
 ( 0.02,  0.62) → ( 0.07,  1.57)   M3
 ( 0.08,  0.53) → ( 0.27,  1.35)   M4

q_nL_k = (0.138, −0.0273) · diag(3.34, 2.54) = (0.46, −0.069)
Query document similarity in LSI space
[Figure: query q and documents C1–C5, M1–M4 plotted in the 2-D LSI space]
Cosine similarity of the query to each document:
 C1: 0.99, C2: 0.94, C3: 0.99, C4: 0.99, C5: 0.90, M1: −0.14, M2: −0.13, M3: −0.11, M4: 0.05
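The whole worked example can be reproduced in a few lines of NumPy; a sketch (the matrix is the Deerwester term–document matrix above; signs of the SVD factors may flip, but the cosine values are unaffected):

    import numpy as np

    A = np.array([  # rows: human, interface, computer, user, system, response,
                    #       time, EPS, survey, trees, graph, minors; cols: c1-c5, m1-m4
        [1,0,0,1,0,0,0,0,0], [1,0,1,0,0,0,0,0,0], [1,1,0,0,0,0,0,0,0],
        [0,1,1,0,1,0,0,0,0], [0,1,1,2,0,0,0,0,0], [0,1,0,0,1,0,0,0,0],
        [0,1,0,0,1,0,0,0,0], [0,0,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,1],
        [0,0,0,0,0,1,1,1,0], [0,0,0,0,0,0,1,1,1], [0,0,0,0,0,0,0,1,1]], float)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, Lk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    q = np.zeros(12); q[0] = q[2] = 1      # query "human computer interaction"
    qn = q @ Uk @ np.linalg.inv(Lk)        # fold the query into the LSI space

    docs = Vk @ Lk                         # document coordinates V_k L_k
    qs = qn @ Lk                           # query coordinates q_n L_k
    cos = docs @ qs / (np.linalg.norm(docs, axis=1) * np.linalg.norm(qs))
    print(np.round(cos, 2))                # c1-c5 high, m1-m4 near zero or negative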
Why link analysis?
• The web is not just a collection of documents
– its hyperlinks are important!
• A link from page A to page B may indicate:
– A is related to B, or
– A is recommending, citing or endorsing B
• Links are either
 – referential: click here and get back home, or
 – informational: click here to get more detail
HITS - Kleinberg’s Algorithm
• HITS – Hypertext Induced Topic Selection
• For each vertex v ∈ V in a subgraph of interest:
 – a(v): the authority of v
 – h(v): the hubness of v
• A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less-important sites.
• Hubness shows the importance of a site: a good hub is a site that links to many authoritative sites.
Authority and Hubness
[Figure: node 1 with in-links from nodes 2, 3, 4 and out-links to nodes 5, 6, 7]
a(1) = h(2) + h(3) + h(4)    h(1) = a(5) + a(6) + a(7)
Authority and Hubness Convergence
• Recursive dependency:
 a(v) = Σ_{w ∈ pa[v]} h(w)
 h(v) = Σ_{w ∈ ch[v]} a(w)
• Using linear algebra, we can prove that a(v) and h(v) converge
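A minimal power-iteration sketch of the recursion (Python/NumPy; the adjacency matrix is an arbitrary toy graph):

    import numpy as np

    # adj[i, j] = 1 if page i links to page j (toy graph)
    adj = np.array([[0,1,1,0], [0,0,1,0], [1,0,0,1], [0,0,1,0]], float)

    a, h = np.ones(4), np.ones(4)
    for _ in range(50):
        a = adj.T @ h               # a(v) = sum of h(w) over parents w of v
        h = adj @ a                 # h(v) = sum of a(w) over children w of v
        a /= np.linalg.norm(a)      # normalize so the iteration converges
        h /= np.linalg.norm(h)

    print(np.round(a, 3), np.round(h, 3))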
HITS Example
Find a base subgraph:
• Start with a root set R = {1, 2, 3, 4}: nodes relevant to the topic
• Expand the root set R to include all the children and a fixed number of parents of nodes in R
→ a new set S (base subgraph)
HITS Example Results
[Figure: authority and hubness weights for nodes 1–15]
Markov Chain
A Markov chain has two components:
• A network structure much like a web site, where each node is called a state.
• A transition probability of traversing a link given that the chain is in a state.
 – For each state the sum of outgoing probabilities is one.
A sequence of steps through the chain is called a random walk.
[Figure: example chain with states a1, b1–b4, c1, d1–d2, e1–e2]
PageRank - Motivation
• A link from page A to page B is a vote of the author of A for B, or a recommendation of the page.
• The number of incoming links to a page is a measure of the importance and authority of the page.
• Also take into account the quality of the recommendation: a page is more important if the sources of its incoming links are important.
The Random Surfer
• Assume the web is a Markov chain.
• Surfers randomly click on links, where the probability
of an outlink from page A is 1/m, where m is the
number of outlinks from A.
• The surfer occasionally gets bored and is teleported
to another web page, say B, where B is equally likely
to be any page.
• Using the theory of Markov chains it can be shown
that if the surfer follows links for long enough, the
PageRank of a web page is the probability that the
surfer will visit that page.
Dangling Pages
[Figure: pages A and B without outlinks]
• Problem: A and B have no outlinks.
Solution: Assume A and B have links to all web pages
with equal probability.
Rank Sink
• Problem: Pages in a loop accumulate rank but do not distribute it.
• Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.
PageRank
• Introduced by Page et al (1998)
– The weight is assigned by the rank of parents
• Difference with HITS
– HITS takes Hubness & Authority weights
– The page rank is proportional to its parents’ rank, but
inversely proportional to its parents’ outdegree
PageRank (PR) – Definition
PR(P) = d/N + (1 − d) · ( PR(P₁)/O(P₁) + PR(P₂)/O(P₂) + … + PR(Pₙ)/O(Pₙ) )

• P is a web page
• Pᵢ are the web pages that have a link to P
• O(Pᵢ) is the number of outlinks from Pᵢ
• d is the teleportation probability
• N is the size of the web
Example Web Graph
[Figure: three pages with links A → B, A → C, B → C and C → A, as used in the equations below]
Iteratively Computing PageRank
• Replace d/N in the definition of PR(P) by d, so PR will take values between 1 and N.
• d is normally set to 0.15, but for simplicity let's set it to 0.5.
• Set the initial PR values to 1.
• Solve the following equations iteratively:
 PR(A) = 0.15/3 + 0.85·PR(C)
 PR(B) = 0.15/3 + 0.85·(PR(A)/2)
 PR(C) = 0.15/3 + 0.85·(PR(A)/2 + PR(B))
Example Computation of PR

 Iteration |   PR(A)    |   PR(B)    |   PR(C)
     0     | 1          | 1          | 1
     1     | 0.75       | 1.125      | 1
     2     | 0.765625   | 1.1484375  | 1.0625
     3     | 0.76855469 | 1.15283203 | 1.07421875
     4     | 0.76910400 | 1.15365601 | 1.07641602
     5     | 0.76920700 | 1.15381050 | 1.07682800
     …     | …          | …          | …
    12     | 0.76923077 | 1.15384615 | 1.07692308
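A sketch of the iteration in Python, using the three equations exactly as printed above (note the fixed point reached depends on the chosen teleportation constant):

    # PR equations for the 3-page example (Gauss-Seidel style updates)
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}
    for _ in range(20):
        pr["A"] = 0.15 / 3 + 0.85 * pr["C"]
        pr["B"] = 0.15 / 3 + 0.85 * (pr["A"] / 2)
        pr["C"] = 0.15 / 3 + 0.85 * (pr["A"] / 2 + pr["B"])
    print(pr)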
Matrix Notation
Adjacency matrix A = [matrix figure omitted]
* http://www.kusatro.kyoto-u.com
Matrix Notation
• Matrix notation: r = αBr = Mr
 – α: eigenvalue
 – r: eigenvector of B
 (cf. Ax = λx, |A − λI| = 0)
• B = [matrix figure omitted]
• Finding PageRank = finding the eigenvector of B with the associated eigenvalue α
Matrix Notation
• PageRank: the eigenvector corresponding to the maximum eigenvalue
• B = PDP⁻¹
 – D: diagonal matrix of eigenvalues {λ₁, …, λₙ}
 – P: regular matrix that consists of the eigenvectors
• PageRank = r₁ (the principal eigenvector), normalized
Matrix Notation
• Confirming the result: rank follows the number of inlinks from highly ranked pages (harder to explain for nodes 5 & 2 and 6 & 7)
• Interesting topic: how do you get your homepage highly ranked?
PageRank Algorithm
[Algorithm figure omitted – see Page et al., 1998]
Pagerank computation methods
• G = (V, E); xᵢ: PageRank of page i; dⱼ: out-degree of page j
• In matrix form: x = Ax, where A_ij = 1/dⱼ if page j links to page i, and 0 otherwise
• The power method (recursive) converges to the principal eigenvector (the probability of visiting the page)
• If many pages have no inlinks, the eigenvector is mostly 0's; thus a correction (e.g. teleportation) is needed
____________________________________________________________________________
“PageRank Computation and the Structure of the Web: Experiments and Algorithms”, Arvind Arasu, Jasmine Novak, Andrew Tomkins & John Tomlin
Pagerank computation methods
• Jacobi iteration
• Gauss–Seidel method (improved convergence speed)
[Iteration formulas omitted]
The Largest Matrix Computation
in the World
• Computing PageRank can be done via matrix multiplication, where the matrix has several billion rows and columns.
• The matrix is sparse, as the average number of outlinks is between 7 and 8.
• Setting d = 0.15 or below requires at most 100 iterations to convergence.
• Researchers are still trying to speed up the computation.
Text based ranking - Overview
• (Typical) Search Process
• Boolean Retrieval Model
• Vector Space Model
• BM25 Ranking Function
• Bayesian and Combinatorial Models
• Ranking with Document Fields
• Ranking with Proximity
• Learning to Rank
(Typical) Search process
• A User performs a task and has an information need, from
which a query is formulated.
• A search engine typically
processes textual queries and
retrieves a (ranked) list
of documents
• The user obtains the results
and may reformulate
the query
Boolean Retrieval Model
• Documents are represented as sets of terms
• Queries are Boolean expressions with precise semantics
• Retrieval is precise and the answer is a set of documents
– Documents are not ranked
– Simplest retrieval model
• Example:
D1 = {web, text, mining}
D2 = {data, mining, text, clustering}
D3 = {frequent, itemset, mining, web}
• For query q = web ∧ mining answer is { D1, D3 }
• For query q = text ∧ ( clustering ∨ web) answer is { D1, D2 }
• Extended Boolean Retrieval Model (Salton et al., 1983)
• Ranks documents according to similarity to query
• Matches documents that satisfy the Boolean query partially
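The Boolean model in a few lines of Python (a sketch over the three documents from the slide, using set operations):

    D1 = {"web", "text", "mining"}
    D2 = {"data", "mining", "text", "clustering"}
    D3 = {"frequent", "itemset", "mining", "web"}
    docs = {"D1": D1, "D2": D2, "D3": D3}

    # q = web AND mining
    print([n for n, d in docs.items() if {"web", "mining"} <= d])      # ['D1', 'D3']

    # q = text AND (clustering OR web)
    print([n for n, d in docs.items()
           if "text" in d and ({"clustering", "web"} & d)])            # ['D1', 'D2']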
Vector Space Model
• Document d and query q are represented as k-dimensional vectors d = (w₁,d, …, w_k,d) and q = (w₁,q, …, w_k,q)
 – Each dimension corresponds to a term from the collection vocabulary
 – Independence between terms
 – w_i,q is the weight of the i-th vocabulary word in q
 – (Salton & McGill, 1983; Baeza-Yates & Ribeiro-Neto, 1999)
• The degree of similarity between d and q is the cosine of the angle between the two vectors:
 sim(d, q) = (d · q) / (‖d‖ ‖q‖) = Σ_{t=1}^k w_{t,d} w_{t,q} / ( √(Σ_{t=1}^k w_{t,d}²) · √(Σ_{t=1}^k w_{t,q}²) )
• Generalized Vector Space Model (Wong et al., 1985)
 – Incorporates dependencies between terms
 – Using semantics (Tsatsaronis & Panagiotopoulou, 2009)
Term weighting in VSM
• Term weighting with tf-idf (and variations): w_{t,d} = tf_{t,d} × idf_t
• tf models the importance of a term in a document:
 tf_{t,d} = f_{t,d}  or  tf_{t,d} = f_{t,d} / max_s f_{s,d}
 – f_{t,d} is the frequency of term t in document d
• idf models the importance of a term in the document collection:
 idf_t = log( N / n_t )  or  idf_t = log( (N − n_t + 0.5) / (n_t + 0.5) )
 – N is the total number of documents, n_t is the document frequency of term t
 – The logarithm base is not important
 – Theoretically derived from probabilistic IR and information theory (Robertson, 2004; Papineni, 2001; Metzler, 2008)
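A sketch of tf-idf weighting plus cosine ranking in pure Python (toy documents; as noted above, the log base does not matter):

    import math
    from collections import Counter

    docs = [d.split() for d in
            ["web text mining",
             "data mining text clustering",
             "frequent itemset mining web"]]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))        # document frequency n_t

    def tfidf(doc):
        tf = Counter(doc)
        return {t: f * math.log(N / df[t]) for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = tfidf("text mining".split())
    for d in docs:
        print(d, round(cosine(q, tfidf(d)), 3))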
BM25 Ranking Function
• tf-idf-like ranking function assuming a bag-of-words document representation (Robertson et al., 1994):
 score(d, q) = Σ_{t ∈ d∩q} idf_t · tf_{t,d} · (k₁ + 1) / ( tf_{t,d} + k₁ · (1 − b + b · len_d / avglen) )
 where idf_t = log( (N − n_t + 0.5) / (n_t + 0.5) )
 – len_d is the length of document d
 – avglen is the average document length in the collection
• The values of parameters k₁ and b depend on the collection/task:
 – k₁ controls term frequency saturation
 – b controls length normalization
 – Default values: k₁ = 1.2 and b = 0.75
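The BM25 formula as a direct Python transcription (a sketch on toy token lists, with k₁ and b at their default values; on such a tiny collection the idf term can go negative, which real systems usually clamp):

    import math
    from collections import Counter

    docs = [d.split() for d in
            ["web text mining",
             "data mining text clustering",
             "frequent itemset mining web"]]
    N = len(docs)
    avglen = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))

    def bm25(doc, query, k1=1.2, b=0.75):
        tf = Counter(doc)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avglen)
            score += idf * tf[t] * (k1 + 1) / norm
        return score

    for d in docs:
        print(d, round(bm25(d, ["web", "mining"]), 3))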
Bayesian & Combinatorial Models
• Language Models for IR, based on a Bayesian approach (Ponte & Croft, 1998; Hiemstra, 1998; Lafferty & Zhai, 2001)
 – Generate a language model M_d for every document d
 – Rank documents according to the probability P(q | M_d) that they generate the query q
 – Principled integration of the importance of documents (e.g. PageRank) as a prior probability P(d)
• Divergence From Randomness, based on a combinatorial approach (Amati, 2003)
 – Information-theoretic framework for generating ranking functions:
  w_{d,q} = Σ_{t ∈ q∩d} (−log P₁) · (1 − P₂)
 – Models consist of 3 components:
  • a randomness model computes the probability P₁ that t appears with the given frequency in d
  • a risk/information-gain model computes the probability P₂ that t with the given frequency in d will occur one more time
  • term frequency normalisation adjusts the term frequency with respect to the document length
Ranking with Document fields
• Documents often have content in different fields
– Metadata in PDF or MS Office files
– Title, headings, emphasized text, anchor text of incoming hyperlinks in
Web pages
• Rank by
– Scoring each field separately and combining the scores
• Difficult to interpret: combining non-linear transformations of term
frequencies
– Computing weighted sum of field term-frequencies, then compute
score
• Per-field term frequency normalization: e.g. no need to discount anchor
text frequencies, because each anchor is a vote for the Web page
• Extension of BM25 (Zaragoza et al. 2004) and DFR models
(Macdonald et al., 2005) with per-field normalization
Ranking with Proximity
• Query terms that appear near each other contribute more to the document’s score
 – Requires position information in the search engine index
 – Score both individual query terms and pairs of query terms:
  score(d, q) = Σ_{t ∈ q} score(d, t) + Σ_{t_i, t_j ∈ q, t_i ≠ t_j} score(d, t_i, t_j)
• Count an occurrence of a pair of terms if they occur within a window of w terms
• Markov Random Fields to model dependencies (Metzler & Croft, 2005)
 – Sequential Dependence: only consecutive terms
 – Full Dependence: all possible pairs of terms
• Extensions to BM25 and DFR (Büttcher & Clarke, 2006; Peng et al., 2007)
• Using global counts for pairs of terms is expensive to compute and store
 – Not always effective (Macdonald & Ounis, 2010)
Learning To Rank
• Web search engines use hundreds of features to rank Web pages
– Text-based features (e.g. BM25 scores), term proximity, document structure,
– Link-based features (e.g. PageRank),
– spam metrics, freshness, click-through rates
• Limitations of typical IR models
– Parameter tuning can be difficult
– Cannot combine large number of features to obtain more effective ranking
• Machine Learning from
– Points: relevance judgment for query-document pair (Crammer & Singer,
2001)
• Doc A is relevant, Doc B is not relevant, Doc C is somehow relevant
– Pairs: preferences between pairs of documents (Burges et al., 2005)
• Doc A is better than Doc B
– Lists: maximize average evaluation measure for many queries (Yue et al.,
2007)
• Hard problem, evaluation measures are not always continuous, approximations are required
Graph Clusters & Communities
• There is no widely accepted definition
• In general the graph has to be relatively sparse
• If the graph is too dense, there is no point in searching for clusters using the structural properties of the graph
• Many clustering algorithms, or problems related to clustering, are NP-hard
Basic Methodologies
• Hierarchical Clustering
• Graph Partitioning
• Partitional Clustering
• Spectral Clustering
• Divisive Algorithms
• Modularity Based Methodologies
Hierarchical Clustering
• Either top-down (divisive) or bottom-up (agglomerative)
• In agglomerative methodologies small clusters are merged into bigger ones based on a measure of similarity
• The individual merging steps can be used to create a hierarchy of domains in a dendrogram
• Starting point: each node is a cluster
• Ending point: arbitrary, e.g. determined by the size/number of the clusters or by a quality measure such as modularity
Hierarchical Clustering
• Similarity functions:
 – Jaccard coefficient [1]: if A and B are two groupings (in graphs, the sets of vertices connected to two different nodes), the Jaccard coefficient is J(A, B) = |A ∩ B| / |A ∪ B|
 – In the ROCK algorithm [2], the goodness measure is
  g(Cᵢ, Cⱼ) = link[Cᵢ, Cⱼ] / ( (nᵢ + nⱼ)^(1+2f(θ)) − nᵢ^(1+2f(θ)) − nⱼ^(1+2f(θ)) )
  where:
  • f(θ) is a function that depends on the data set as well as on the kind of clusters we are interested in
  • link[Cᵢ, Cⱼ] is the number of links between the two clusters
  • nᵢ^f(θ) approximates the links each vertex has inside Cᵢ
• By taking into account the linkage both outside and inside each cluster we avoid situations where big clusters “swallow” smaller, but well-defined, ones
Graph Partitioning
• The general problem is to find k groups of vertices such that the number of edges between them is minimal
• k is given as an argument; otherwise we end up with each vertex as its own cluster
• Most common algorithms use the Ford–Fulkerson max-flow min-cut theorem [4][5]
Max-Flow Min-Cut
• By choosing a source and a sink (or by adding them, depending on the implementation and problem) we send the maximum flow between these two nodes
• The edges that belong to the path with the maximum flow and have their respective flow saturated (maximized) are the edges that belong to the minimum cut
• A common implementation that is generally fast on sparse networks is the Edmonds–Karp algorithm [6]
• [Figure: with the source on the left and the sink on the right, the two clusters are easily separated]
Partitional Clustering
• The number of clusters is predefined
• For each pair of vertices we define a distance
• Clusters are formed by maximizing or
minimizing a cost function based on the
distance
• Most common algorithm is k-Means
Partitional Clustering
• Common cost functions:
 – Minimum k-clustering: the cost function is the diameter of a cluster, i.e. the largest distance between two points of a cluster. The points are classified such that the largest of the k cluster diameters is the smallest possible, trying to keep the clusters very “compact”.
 – k-clustering sum: same as minimum k-clustering, but the diameter is replaced by the average distance between all pairs of points of a cluster.
 – k-center: for each cluster i we define a reference point xᵢ (centroid) and compute the maximum dᵢ of the distances of each cluster point from the centroid. The clusters and centroids are chosen to minimize the largest value of dᵢ.
 – k-median: same as k-center, but the maximum distance from the centroid is replaced by the average distance.
Spectral Clustering
• A similarity matrix is needed → the adjacency matrix A
• We use eigenvectors to partition the graph into clusters
• Most common methodologies use the Laplacian matrix L = D − A
 – where D is the diagonal matrix whose element Dᵢᵢ equals the degree of vertex i
• We decompose L and use the k eigenvectors corresponding to the k smallest eigenvalues
• If k = 2 we use the second eigenvector
Spectral Clustering (k = 2)
[Worked example: six points x1…x6 with a block-structured similarity matrix – strong similarities within {x1, x2, x3} and within {x4, x5, x6}, weak cross-block entries (0.2, 0.1). The eigendecomposition yields an eigenvector whose entries are ≈ 0.2 for x1, x2, x3 and negative (−0.4, −0.7, −0.7) for x4, x5, x6; splitting at the threshold 0 separates the two clusters.]
Spectral Clustering (Unnormalized)
• Input: similarity matrix S ∈ Rⁿˣⁿ, number k of clusters to construct
• Construct a similarity graph by one of the ways defined before; let W be its (weighted) adjacency matrix
• Compute the Laplacian L
• Compute the first k eigenvectors u₁, …, u_k of L
• Let U ∈ Rⁿˣᵏ be the matrix containing the vectors u₁, …, u_k as columns
• For i = 1, …, n, let yᵢ ∈ Rᵏ be the vector corresponding to the i-th row of U
• Cluster the points (yᵢ), i = 1, …, n, in Rᵏ with the k-means algorithm into clusters C₁, …, C_k
• Output: clusters A₁, …, A_k with Aᵢ = {j | yⱼ ∈ Cᵢ} [8]
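A sketch of the unnormalized algorithm in NumPy (toy graph of two loosely connected triangles; for k = 2 the split can be read off the sign of the second eigenvector, so no k-means is needed):

    import numpy as np

    # weighted adjacency of two triangles joined by a weak 0.1 edge
    W = np.array([[0,1,1,0,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,0.1,0,0],
                  [0,0,0.1,0,1,1],
                  [0,0,0,1,0,1],
                  [0,0,0,1,1,0]], float)

    D = np.diag(W.sum(axis=1))
    L = D - W                        # unnormalized Laplacian

    lam, U = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = U[:, 1]                # second-smallest eigenvector
    print(fiedler > 0)               # sign split -> the two clusters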
Divisive Algorithms
• The idea seems similar to hierarchical divisive algorithms (HDA)
• In HDA we remove edges based on the similarity between vertices; here, instead, we remove edges based only on properties of the edge itself
• Edge betweenness: the number of shortest paths between all vertex pairs that run along the edge; it is higher for edges that connect communities
Divisive Algorithms
• Girvan and Newman algorithm [9]:
 1. Compute the centrality of all edges
 2. Remove the edge with the largest centrality; in case of ties with other edges, one of them is picked at random
 3. Recalculate the centralities on the running graph
 4. Iterate the cycle from step 2
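A sketch of one Girvan–Newman pass using NetworkX's edge betweenness (the toy graph and the stopping rule – stop when the graph splits – are illustrative choices):

    import networkx as nx

    # two triangles joined by a single bridge edge (2, 3)
    G = nx.Graph([(0,1), (1,2), (0,2), (3,4), (4,5), (3,5), (2,3)])

    while nx.number_connected_components(G) == 1:
        eb = nx.edge_betweenness_centrality(G)    # step 1: centrality of all edges
        G.remove_edge(*max(eb, key=eb.get))       # step 2: remove the largest
        # steps 3/4: recompute and iterate

    print(list(nx.connected_components(G)))       # [{0, 1, 2}, {3, 4, 5}]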
Modularity
• Modularity is a metric that can be used to evaluate the quality of a clustering:
 Q = (1/2m) Σᵢⱼ [ Aᵢⱼ − kᵢkⱼ/(2m) ] δ(cᵢ, cⱼ)
• Where:
 – m is the number of links in the whole graph
 – A the adjacency matrix
 – kᵢ the degree of vertex i
 – cᵢ the cluster vertex i belongs to
 – δ(cᵢ, cⱼ) = 1 if cᵢ == cⱼ, otherwise δ(cᵢ, cⱼ) = 0
• The value of the modularity lies in the range [−1, 1]
• It is positive if the number of edges within groups exceeds the number expected on the basis of chance
Modularity
• Modularity can also be used as an objective function in algorithms in order to extract clusters
• Example: simulated annealing
• In the most common implementation there are two moves:
 1. Moving a vertex at random from one cluster to another
 2. Merging / splitting clusters
World Wide Web (WWW)
• A highly dynamic structure continuously changing: Web pages
and hyperlinks are created, deleted or modified
• Important activity: Searching the content that matches users’
keyword-based queries
• Ranking of the results enables users to effectively retrieve
relevant and important information
• Up-to-date ranking involves a huge infrastructure (intense
crawling, index updates …)
• A robust predictive framework enables:
– Saving infrastructure resources for search engines
– Identifying promising web pages / users in social networks to advertise
on!!
Proposed Framework
• Goal: Predict the future ranking of Web pages
based on previous ranking positions
• Framework:
– Computation of ranking trends
– Definition of the predictive models:
• Linear/2nd order Regression, Principal Components
Regression, EM based clustering.
– Models’ training and predictions
– Evaluation of predictions’ quality using various
similarity measures
Normalized ranking
• Ranking across graph snapshots:
 – As the size of the graph increases, the measure should be comparable across different graph sizes (a page p is more important when it occupies the 5th position out of 1000 pages rather than out of 100 pages)
• Definition of normalized rank (nrank) (Vlachou, Vazirgiannis [WAW2006]):
 nrank(p, tᵢ) = 2 · rank(p, tᵢ) / n_{tᵢ}², where n_{tᵢ} = |G_{tᵢ}|
Rank Change Rate (Racer)
• Intuitively, an appropriate measure is the rank change rate between two snapshots
• It represents the rate and relative importance of a rank change:
 racer(p, tᵢ) = ( nrank(p, tᵢ) − nrank(p, tᵢ₋₁) ) / nrank(p, tᵢ₋₁)
• Changes in top rank positions are more important than those in lower positions
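Both measures in a couple of Python lines (a sketch with made-up ranks and snapshot sizes):

    def nrank(rank, n):
        return 2.0 * rank / n ** 2           # n = |G_t|, pages in the snapshot

    def racer(nr_now, nr_prev):
        return (nr_now - nr_prev) / nr_prev

    # a page moves from rank 50 of 1000 pages to rank 5 of 1200 pages
    prev, now = nrank(50, 1000), nrank(5, 1200)
    print(racer(now, prev))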
Markov Models (MMs)
• Well suited for modeling and predicting values in various applications where frequent state transitions occur
• An m-order MM is defined as:
 – a set of states S = {s₁, …, sₙ} and
 – a matrix of transition probabilities T
• Markov property: future values depend only on the m previous values (m = the MM's order):
 P(x₁x₂…x_k) = P(x₁) · Π_{i=2}^{k} P(xᵢ | xᵢ₋ₘ … xᵢ₋₁)
• Other MM properties: reducibility, periodicity, recurrence, ergodicity
MMs’ Learning
• MMs learn from racer sequences:
 – racer values are real numbers → their high precision makes the recognition of patterns impossible
• Solution: discretize the racer values; the states cardinality matters:
 – large: high computational complexity, low coverage, overfitting of the model → the prediction ability degrades
 – small: fine-grained trends cannot be represented → accuracy falls
• (Jagadish et al. 1998): partition the data space into n non-overlapping equiprobable adjacent partitions

 URL | t1      | t2      | t3
 P1  | 0.83233 | 0.93323 | 0.13233
 P2  | 0.83234 | 0.93434 | 0.23233
 P3  | 0.83235 | 0.93536 | 0.33233
 P4  | 0.83236 | 0.93644 | 0.43233
 …   | …       | …       | …
Definition of MMs’ States
• If R is the set of d distinct values and F the corresponding frequencies:
 – Repeatedly partition R until there are n partitions:
  • choose the partition Rᵢ with the highest sum of frequencies
  • divide Rᵢ into two equi-probable partitions
 – For each partition, calculate the weighted mean:
  sᵢ = Σ_{r ∈ Rᵢ} f_r · r / Σ_{r ∈ Rᵢ} f_r
• The set S of all weighted mean values defines the n states of the MM; n = 20 (after conducting a number of experiments)
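A sketch of the equiprobable-partition state construction in pure Python (the tie-handling details are my own simplifications; the example values are invented):

    from collections import Counter

    def mm_states(values, n):
        """Discretize values into n roughly equiprobable states (sketch)."""
        freq = Counter(values)
        parts = [sorted(freq)]                  # start with one partition of R
        while len(parts) < n:
            splittable = [p for p in parts if len(p) > 1]
            if not splittable:
                break
            # choose the splittable partition with the highest total frequency
            p = max(splittable, key=lambda r: sum(freq[v] for v in r))
            # split it into two (roughly) equi-probable halves
            half, acc, cut = sum(freq[v] for v in p) / 2, 0, 0
            for i, v in enumerate(p[:-1]):      # keep both halves non-empty
                acc += freq[v]
                cut = i
                if acc >= half:
                    break
            parts.remove(p)
            parts += [p[:cut + 1], p[cut + 1:]]
        # each state is the frequency-weighted mean of its partition
        return [sum(freq[v] * v for v in p) / sum(freq[v] for v in p)
                for p in parts]

    print(mm_states([0.1, 0.1, 0.2, 0.4, 0.4, 0.4, 0.9], 3))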
MMs’ Predictions
• Assuming m+1 successive crawls:
 – Construct an m-order MM
 – Compute transition probabilities for every path using the chain rule:
  P(a₁ … aₘ) = P(a₁) · Π_{i=2}^{m} P(aᵢ | aᵢ₋₁, …, aᵢ₋ₘ)
  (where aᵢ = ranking(pⱼ, tᵢ))
 – The predicted value (and therefore the ranking position) is:
  X* = argmax_X P(a₁ … aₘ X)
Predictive Models
• Markov Models
• Linear Regression
• Principal Components Regression
• Bayesian Clustering
Linear Regression
• Predicts future nrank values by fitting a line to the values of the previous times
• A separate linear predictor is defined for each page p
• Given the nrank values for m snapshots, the prediction for a snapshot tᵢ is:
 nrank(p, tᵢ) = a·tᵢ + b + εᵢ
 where εᵢ is Gaussian noise with zero mean and known variance σ²
Linear Regression
• Parameters a, b are found by solving the least squares problem (sums over i = 1, …, m):

 [ Σᵢ tᵢ²  Σᵢ tᵢ ] [a]   [ Σᵢ tᵢxᵢ ]
 [ Σᵢ tᵢ   m     ] [b] = [ Σᵢ xᵢ   ]

• Advantages: simple and flexible
• Disadvantages:
 – does not take into account the correlation among Web pages
 – computationally expensive for high dimensions
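The least-squares fit in NumPy (a sketch with an invented nrank history; np.polyfit solves the same normal equations, and degree 2 gives the second-order variant of the next slide):

    import numpy as np

    t = np.arange(1.0, 7.0)                         # 6 snapshot times
    x = np.array([0.9, 0.8, 0.72, 0.6, 0.55, 0.5])  # nrank history of one page

    a, b = np.polyfit(t, x, deg=1)                  # fit nrank(p, t) = a*t + b
    print("linear prediction for t=7:", a * 7 + b)

    a2, b2, c2 = np.polyfit(t, x, deg=2)            # second order: a*t^2 + b*t + c
    print("2nd order prediction:", a2 * 49 + b2 * 7 + c2)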
Second Order Prediction
• Assumes that Web pages do not follow a linear behavior through time
• Given m available snapshots, future nrank values are estimated as:
 xᵢ = a·tᵢ² + b·tᵢ + c + εᵢ
• Parameters a, b and c are given by solving the system (sums over i = 1, …, m):

 [ Σᵢ tᵢ⁴  Σᵢ tᵢ³  Σᵢ tᵢ² ] [a]   [ Σᵢ tᵢ²xᵢ ]
 [ Σᵢ tᵢ³  Σᵢ tᵢ²  Σᵢ tᵢ  ] [b] = [ Σᵢ tᵢxᵢ  ]
 [ Σᵢ tᵢ²  Σᵢ tᵢ   m      ] [c]   [ Σᵢ xᵢ    ]
Principal Components Regression
(PCR)
• Principal Component Analysis (PCA) is applied prior to estimating the linear coefficients:
 – identifies temporal evolution patterns in the pages' rankings
 – highlights Web pages' similarities and differences
 – reduces the number of dimensions with controlled loss of information
• Input: a data matrix whose columns represent the variables
(time in our case)
• Output: a transformed matrix of lower dimension containing
only uncorrelated variables
PCA: Method
• Subtract the mean from each dimension (DataAdjust): (A − M) → A'
• Calculate the covariance matrix: A'A'ᵀ → W
• Calculate the eigenvectors and eigenvalues of the covariance matrix: W = ULVᵀ
• Choose the most important components and form a feature vector (FeatureVector): W_k = U_kL_kV_kᵀ
• Derive the new data set: FeatureVector' × DataAdjust' (A_k = U_kL_k)
PCA + Regression (I)
X_k = Az_k + µ + ε
• X_k: the rank time series/pattern (an n-dimensional vector); A an n×q matrix; z_k a latent vector; µ the mean of the X_k's. The noise vector ε is drawn from a zero-mean isotropic Gaussian N(0, σ²Iₙ).
• The latent vector z_k is drawn from N(0, I_q).
• By marginalization, the unconditional probability distribution of X_k is:
 p(X_k) = N(µ, AAᵀ + σ²I)
• The probability distribution is Gaussian with full covariance matrix AAᵀ + σ²I; thus the correlation of different Web pages in the nrank vector X_k is represented.
PCA + Regression (II)
Two-stage learning process:
• Apply PCA to X to find the q principal components A
• Form the low-dimensional projection of the data by computing the low-dimensional latent vectors:
 Z_k = Aᵀ(X_k − µ), k = 1, …, m
• Train q independent linear regression models in the low-dimensional space:
 – Z is a q×m matrix that collects together all the latent vectors; a linear regression model is constructed for each row j of Z:
  z_jk = a_j t_k + b_j + ε, k = 1, …, m
 – The parameters (a_j, b_j) for each j are obtained by solving a linear system using least squares
October 2010
M. Vazirgiannis “Web Mining - a predictive approach “- BDA 2010
116
58
10/10/2010
PCA + Regression (III)
• To predict the future nrank values X*:
  – Estimate the latent vector z* at time t*: z*_j = a_j·t* + b_j, j = 1, . . . , q
  – Prediction: X* = A z* + µ
• Choose q out of m eigenvectors such that the variation loss/error satisfies:

$$
\frac{\sum_{j=q+1}^{m} \lambda_j}{\sum_{l=1}^{m} \lambda_l} \le 15\%
$$
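A sketch of the full PCR pipeline under the assumptions above (X is an n×m matrix whose columns are the snapshot vectors X_k; function names are illustrative):

```python
import numpy as np

def pcr_fit(X, times, q):
    """Two-stage PCR sketch. X: n x m matrix whose columns are the
    snapshot vectors X_k; times: the m snapshot time stamps."""
    mu = X.mean(axis=1, keepdims=True)              # mean pattern µ
    Xc = X - mu
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :q]                                    # n x q loading matrix
    Z = A.T @ Xc                                    # q x m latent vectors Z_k
    coeffs = [np.polyfit(times, Z[j], deg=1) for j in range(q)]
    return A, mu, coeffs

def pcr_predict(A, mu, coeffs, t_star):
    """X* = A z* + µ, with z*_j = a_j t* + b_j per latent dimension."""
    z_star = np.array([aj * t_star + bj for aj, bj in coeffs])
    return A @ z_star + mu.ravel()
```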
Predictive Models
• Markov Models
• Linear Regression
• Principal Components Regression
• Bayesian Clustering
Bayesian Clustering
• We assume that Web pages follow one of J different behaviour/evolution types (clusters)
• Clusters correspond to different dominant Web page behaviors
• Clustering is viewed as training a mixture model:
  – A cluster type j is first chosen with probability π_j for each Web page
Bayesian Clustering -I
• Assume each page's n-rank series falls into one of J different clusters
• Clustering can be viewed as training a mixture probability model:

  x_ik = a_j·t_k + b_j + ε_k,  k = 1, . . . , m

  where x_ik is the n-rank value of the i-th Web page at time t_k, cluster type j is chosen with probability π_j, and ε is Gaussian noise N(0, σ_j²)
• n-rank values are drawn from the following product of Gaussians:

$$
p(x_i \mid j) = \prod_{k=1}^{m} N(x_{ik} \mid a_j t_k + b_j, \sigma_j^2)
$$

• By marginalization, the unconditional probability of x_i is:

$$
p(x_i) = \sum_{j=1}^{J} \pi_j \, p(x_i \mid j)
$$
Bayesian Clustering: Training
• Objective: train the model and estimate the parameters
  θ = {π_j, a_j, b_j, σ_j²}, j = 1, . . . , J
• through maximization of the log likelihood of a dataset consisting of N Web pages:

$$
L(\theta) = \sum_{i=1}^{N} \log p(x_i)
$$

• The maximization of L(θ) is achieved via the Expectation–Maximization (EM) algorithm
Bayesian Clustering: Training with
Expectation Maximization
• E-Step computes the posterior probabilities (j: clusters, i: pages, k: time):

$$
R_{ij} = \frac{\pi_j \, p(x_i \mid j)}{\sum_{l=1}^{J} \pi_l \, p(x_i \mid l)}
$$

• M-Step updates the parameters (t: the vector containing all time stamps, 1: the all-ones m-vector, N_j = Σ_{i=1}^{N} R_{ij}):

$$
\pi_j = \frac{1}{N} \sum_{i=1}^{N} R_{ij}, \qquad
\sigma_j^2 = \frac{\sum_{i=1}^{N} R_{ij} \sum_{k=1}^{m} (x_{ik} - a_j t_k - b_j)^2}{m N_j}
$$

$$
\begin{pmatrix} a_j \\ b_j \end{pmatrix} =
\frac{1}{N_j}
\begin{pmatrix} t^T t & t^T \mathbf{1} \\ t^T \mathbf{1} & m \end{pmatrix}^{-1}
\begin{pmatrix} \sum_{i=1}^{N} R_{ij} \, x_i^T t \\ \sum_{i=1}^{N} R_{ij} \, x_i^T \mathbf{1} \end{pmatrix}
$$
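A self-contained EM sketch for this mixture of per-cluster linear trends, following the E- and M-step updates above (a minimal illustration under these assumptions, not the authors' implementation; all names are illustrative):

```python
import numpy as np

def em_rank_clustering(X, t, J, n_iter=100, seed=0):
    """EM sketch for the mixture of per-cluster linear trends.
    X: N x m n-rank values (pages x snapshots); t: the m time stamps.
    Returns mixing weights pi, slopes a, intercepts b, variances s2
    and the posterior responsibilities R (N x J)."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    pi = np.full(J, 1.0 / J)
    a = rng.normal(size=J)
    b = rng.normal(size=J)
    s2 = np.full(J, X.var() + 1e-6)
    one = np.ones(m)
    M = np.array([[t @ t, t @ one], [t @ one, m]])   # 2x2 system matrix
    for _ in range(n_iter):
        # E-step: responsibilities R_ij, computed in the log domain
        log_p = np.empty((N, J))
        for j in range(J):
            resid = X - (a[j] * t + b[j])            # N x m residuals
            log_p[:, j] = (np.log(pi[j])
                           - 0.5 * m * np.log(2 * np.pi * s2[j])
                           - 0.5 * (resid ** 2).sum(axis=1) / s2[j])
        log_p -= log_p.max(axis=1, keepdims=True)
        R = np.exp(log_p)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update pi_j, (a_j, b_j) and sigma_j^2 as on the slide
        Nj = R.sum(axis=0)
        pi = Nj / N
        for j in range(J):
            rhs = np.array([R[:, j] @ (X @ t), R[:, j] @ (X @ one)])
            a[j], b[j] = np.linalg.solve(M, rhs) / Nj[j]
            resid = X - (a[j] * t + b[j])
            s2[j] = (R[:, j] @ (resid ** 2).sum(axis=1)) / (m * Nj[j])
    return pi, a, b, s2, R

def predict_next(R, a, b, t_star):
    """Point prediction: each page uses its max-posterior cluster."""
    j = R.argmax(axis=1)
    return a[j] * t_star + b[j]
```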
Bayesian Clustering: Prediction
• To predict the nrank x_i* of page i at time t*, based on a history of observations x_i, we express the posterior probability p(x_i* | x_i) according to Bayes' rule:

$$
p(x_i^* \mid x_i) = \sum_{j=1}^{J} R_{ij} \, N(x_i^* \mid a_j t^* + b_j, \sigma_j^2)
$$

• Thus x_i* = a_j·t* + b_j, where j = argmax_l R_il is the cluster maximizing the posterior probability.
Evaluating the predictions
• Boils down to top-k lists similarity (predicted vs. actual)
• OSim(A, B): indicates the degree of overlap between the top-k elements of two lists:

  OSim(A, B) = |A ∩ B| / k

• KSim(A, B): based on Kendall's distance; indicates the degree to which the relative orderings of two top-k lists are in concordance:

  KSim(A, B) = |{(u, v) : A′, B′ agree on order}| / (|A ∪ B| · (|A ∪ B| − 1))
Top-k lists similarity measures - OSim
• Let Α, Β be two top-k lists of web pages; then we define:

  OSim(A, B) = |A ∩ B| / k

i.e. the portion of common pages weighted over the size of the lists.
Example: Consider the top-6 lists Α = {5, 1, 2, 4, 10, 8} and Β = {4, 3, 9, 1, 5, 10}, where each number represents a web page id (order represents the ranking). Then Α ∩ Β = {5, 1, 4, 10}, so |Α ∩ Β| = 4 and k = 6:

  OSim(A, B) = 4/6 = 2/3
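A direct transcription of OSim (an illustrative sketch):

```python
def osim(A, B):
    """OSim: overlap of two top-k lists (order-insensitive)."""
    assert len(A) == len(B)
    return len(set(A) & set(B)) / len(A)

# Slide example: osim([5, 1, 2, 4, 10, 8], [4, 3, 9, 1, 5, 10]) == 4/6
```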
Top-k lists similarity measures - KSim
  KSim(A, B) = |{(u, v): A′, B′ agree on order}| / (|A ∪ B| · (|A ∪ B| − 1))

where A′ = A ∪ (B − A) and B′ = B ∪ (A − B), i.e. each list extended with the other's missing elements; the numerator counts the pairs that maintain their relative ordering in both extended lists.
Example: Let the top-6 lists Α = {5, 1, 2, 4, 10, 8} and Β = {4, 3, 9, 1, 5, 10}; then A′ = {5, 1, 2, 4, 10, 8, 3, 9} and B′ = {4, 3, 9, 1, 5, 10, 2, 8}. The pairs maintaining their relative order in A′, B′ are: (5,2), (5,10), (5,8), (1,2), (1,10), (1,8), (2,8), (4,10), (4,8), (4,3), (4,9), (10,8), (3,9), i.e. 13 in all. Hence

  KSim(A, B) = 13/56
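A sketch of KSim following the definition above (the extended lists A′, B′ and the pair counting are as described; names are illustrative):

```python
from itertools import combinations

def ksim(A, B):
    """KSim: Kendall-distance-based similarity of two top-k lists."""
    A_ext = list(A) + [x for x in B if x not in A]   # A' = A ∪ (B − A)
    B_ext = list(B) + [x for x in A if x not in B]   # B' = B ∪ (A − B)
    pos_a = {x: i for i, x in enumerate(A_ext)}
    pos_b = {x: i for i, x in enumerate(B_ext)}
    union = set(A) | set(B)
    # Count unordered pairs whose relative order agrees in A' and B'
    agree = sum(
        1 for u, v in combinations(union, 2)
        if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) > 0
    )
    n = len(union)
    return agree / (n * (n - 1))

# Slide example: ksim([5, 1, 2, 4, 10, 8], [4, 3, 9, 1, 5, 10]) == 13/56
```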
Top-k lists similarity measures - NDCG
NDCG (Normalized Discounted Cumulative Gain):
i)   Gain: if page p has real position i ≤ k and predicted position j ≤ k, its score is G(i) = k − |i − j| ≤ k
ii)  CG(i) = G(i) for i = 1, and CG(i) = CG(i−1) + G(i) for i > 1
iii) DCG(i) = G(i) for i < b, and DCG(i) = DCG(i−1) + G(i) / log_b(i) for i ≥ b
iv)  IDCG (ideal DCG): the DCG_k obtained with the optimal (descending) ordering of the gains (the maximum attainable)

Let DCG_k = Σ_{i=1}^{k} DCG(i); then we define NDCG = DCG_k / IDCG_k.

Example: Let the top-5 lists Α = {4, 2, 3, 1, 5} (real) and B = {2, 1, 5, 4, 3} (predicted); then (k = 5) G = (3, 4, 3, 2, 3), the gains listed by page id. For b = 2: CG = (3, 7, 10, 12, 15), DCG = (3, 5, 8.89, 12, 13.89) and IDCG = (4, 7, 8.89, 11.5, 13.86).
Thus DCG_5 = 42.786 and IDCG_5 = 45.254; hence NDCG ≈ 0.9455.
Observation: to compute IDCG and IDCG_5, we assumed the gains sorted in descending order: G = (4, 3, 3, 3, 2).
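A sketch transcribing the NDCG definitions above literally (gains indexed by page id, b = 2 by default); note that the slide's worked intermediate values appear to use slightly different rounding, so this transcription may not reproduce them digit for digit:

```python
import math

def ndcg_topk(real, predicted, b=2):
    """NDCG over two top-k lists per the definitions above: per-page
    gain G = k - |real pos - predicted pos|, logarithmic discount from
    position b onward, and DCG_k summed over positions 1..k."""
    k = len(real)
    rpos = {p: i + 1 for i, p in enumerate(real)}
    ppos = {p: i + 1 for i, p in enumerate(predicted)}
    gains = [k - abs(rpos[p] - ppos[p]) for p in sorted(rpos)]  # by page id

    def dcg_sum(g):
        dcg, total = 0.0, 0.0
        for i, gi in enumerate(g, start=1):
            dcg = gi if i < b else dcg + gi / math.log(i, b)
            total += dcg
        return total

    # Ideal DCG uses the gains in descending order
    return dcg_sum(gains) / dcg_sum(sorted(gains, reverse=True))
```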
Top-k lists similarity measures – Spearman's rank correlation coefficient (ρ)
i)  Based on the correlation between two ranking variables X and Y
ii) Perfect Spearman correlation: one variable can be represented as a monotone function of the other (ρ = ±1, e.g. x_i = a·y_i + b)
iii) If d_i = x_i − y_i, i = 1, 2, …, k, then we define:

  ρ = 1 − (6 · Σ_{i=1}^{k} d_i²) / (k · (k² − 1))

  (ρ = 1: identical orderings; ρ = −1: exactly reversed orderings)
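A minimal sketch of Spearman's ρ for two top-k lists ranking the same k pages:

```python
def spearman_rho(A, B):
    """Spearman's rho for two top-k lists over the same pages."""
    k = len(A)
    rank_a = {p: i + 1 for i, p in enumerate(A)}
    rank_b = {p: i + 1 for i, p in enumerate(B)}
    d2 = sum((rank_a[p] - rank_b[p]) ** 2 for p in rank_a)
    return 1 - 6 * d2 / (k * (k ** 2 - 1))
```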
Top-k lists similarity measures– Kendall tau
rank correlation
• Let Α = {a1, a2, …, ak} and Β = {b1, b2, …, bk} be two top-k lists, and (x1, y1), (x2, y2), …, (xk, yk) the set of pairs, where xi and yi are unique and represent the rank of page pi in the Α and Β lists respectively.
i.  The pairs (xi, yi) and (xj, yj) agree on the relative order (concordant) iff (xi < xj and yi < yj) or (xi > xj and yi > yj).
ii. (xi, yi) and (xj, yj) have a different relative order (discordant) iff (xi < xj and yi > yj) or (xi > xj and yi < yj).
iii. Then we define the Kendall tau rank correlation (τ):

  τ = (# concordant pairs − # discordant pairs) / (k · (k − 1) / 2)
Top-k lists similarity measures– Kendall tau
rank correlation (2)
Example: Let two top-4 lists Α = {2, 4, 3, 1} and Β = {1, 4, 3, 2}, where xi, yi denote the rank of element i in Α and Β respectively.
Then x1 = 4, x2 = 1, x3 = 3, x4 = 2 and y1 = 1, y2 = 4, y3 = 3, y4 = 2, giving the pairs (xi, yi) = {(4,1), (1,4), (3,3), (2,2)} (k = 4).
Comparing all the possible pairs (xi, yi) we find:
# concordant pairs = 1: {(3,3), (2,2)}
# discordant pairs = 5
Therefore
τ = (1 − 5) / [0.5 · (4 · 3)] = −4/6 = −2/3
This measure represents the correlation between two ranked lists
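A sketch of Kendall's τ as defined above; it reproduces the value −2/3 for this example:

```python
from itertools import combinations

def kendall_tau(A, B):
    """Kendall tau between two top-k lists of the same k pages."""
    rank_a = {p: i + 1 for i, p in enumerate(A)}
    rank_b = {p: i + 1 for i, p in enumerate(B)}
    conc = disc = 0
    for u, v in combinations(list(rank_a), 2):
        s = (rank_a[u] - rank_a[v]) * (rank_b[u] - rank_b[v])
        if s > 0:
            conc += 1      # same relative order in both lists
        elif s < 0:
            disc += 1      # opposite relative order
    k = len(rank_a)
    return (conc - disc) / (k * (k - 1) / 2)

# Slide example: kendall_tau([2, 4, 3, 1], [1, 4, 3, 2]) == -2/3
```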
Top-k lists similarity measures – Kendall tau
rank correlation (3)
Observations:
1) −1 ≤ τ ≤ 1
2) We may instead define it as:

   τ = (# concordant pairs) / (k · (k − 1) / 2)

   in which case 0 ≤ τ ≤ 1.
3) With this definition, the example of the previous slide gives τ = 1/6.
Top-k lists similarity measures – RSim
RSim(A, B) = 1 − CPS(A, B) / CPSmax(A, B), where

  CPS(A, B) = Σ_{i=1}^{k} |A_i − B_i| · (k + 1 − A_i)

and A_i is the real and B_i the predicted ranking of page i.
Observations:
– A_i, B_i ∈ {1, 2, …, k}
– RSim(A, B) = 1 is the best case; RSim(A, B) → 0 the worst
– Assume (w.l.o.g.) A_i = i, i ∈ {1, 2, …, k}; then
  CPS(A, B) ≤ Σ_{i=1}^{k} |i − B_i| · (k + 1 − i) ≤ Σ_{i=1}^{k} (k + 1 − i)² = Σ_{i=1}^{k} i²
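A sketch of RSim under the slide's definitions; the rankings are assumed to be permutations of 1..k, passed as page-to-rank dicts (an illustrative choice of representation):

```python
def rsim(A, B):
    """RSim for two top-k rankings. A[p], B[p]: real and predicted
    rank of page p (both assumed permutations of 1..k)."""
    k = len(A)
    cps = sum(abs(A[p] - B[p]) * (k + 1 - A[p]) for p in A)
    cps_max = sum(i * i for i in range(1, k + 1))  # Σ (k+1−i)² = Σ i²
    return 1 - cps / cps_max

# e.g. rsim({'p1': 1, 'p2': 2}, {'p1': 2, 'p2': 1}) == 0.4 for k = 2
```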
Datasets Description
• Internet Archive (global rankings):
  – Comprises 500,000 pages from weekly collections of UK government websites
  – 24 snapshots obtained between March 2004 and January 2006
• Wikipedia (query-based rankings):
  – A very dynamic, large-scale collection covering Wikipedia's English version (1,467,982 articles in the last snapshot)
  – 55 consecutive monthly snapshots, 1/2002 to 7/2006
  – Query-based top-k results: for each selected query (selection criteria follow), the top-k results are maintained for each of the 55 monthly snapshots
Wikipedia: Query Classification
• Frequency: the proportion of documents containing a given term (TF_{w_i}(t_j), where w_i is a specific snapshot)
  – Wide queries: correspond to frequent terms
• Dynamism: the rate of increase (or decrease) in the number of documents containing the query term as time progresses:

  SD_{w_i}(t_j) = (TF_{w_i}(t_j) − TF_{w_{i−1}}(t_j)) / TF_{w_i}(t_j)

  – Dynamic positive queries: SD(t_j) > 0
  – Dynamic negative queries: SD(t_j) < 0
  – Least dynamic queries: SD(t_j) ≈ 0
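A one-line sketch of the dynamism score (the term-frequency values are assumed precomputed per snapshot):

```python
def dynamism(tf_curr, tf_prev):
    """SD: relative change in a term's document frequency between
    two consecutive snapshots."""
    return (tf_curr - tf_prev) / tf_curr

# SD > 0: dynamic positive, SD < 0: dynamic negative, SD ≈ 0: least dynamic
```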
Ranking Function for Querying Wikipedia
– tf/idf(t, d): the tf/idf value of term t for page d
– title(t, d) = 2 if t appears in the title of d, 1 otherwise
– pr(d): the PageRank score of page d

  score_t(d) = (tf/idf(t, d) · title(t, d))^1.5 · pr(d)
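A direct transcription of this scoring function (a sketch; the tf/idf and PageRank values are assumed precomputed elsewhere):

```python
def score(tfidf, in_title, pagerank):
    """Wikipedia query ranking function from the slide.
    in_title: True if the term appears in the page title."""
    title_boost = 2.0 if in_title else 1.0
    return (tfidf * title_boost) ** 1.5 * pagerank
```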
Yahoo! dataset
• Rankings were collected for a set of 24 queries:
  – Random queries from Wikipedia
  – Popular queries that appeared in Google Trends
  – Popular current queries (such as "euro 2008")
• 37 consecutive daily top-1000 ranked lists, from April 2008 to May 2008, computed with the Yahoo! Search Web Services
Data Sets Preprocessing
• Choice of the appropriate input values for the regression-based models: nrank vs. racer
• Distribution of the Mean Square Error (MSE):
  – Quantifies the divergence of an estimator from the true value being estimated
  – Large values indicate less accurate predictions
  – Defined as:

$$
MSE(\hat{\theta}) = \frac{1}{n} \sum_{j=1}^{n} (\hat{\theta}_j - \theta_j)^2
$$
MSE – Yahoo!
[Figure: MSE distributions on the Yahoo! dataset, for racer values (left) and nrank values (right)]
Data Sets Preprocessing – Bayesian Clustering
• Tested cluster sizes from 2 to 10 for each query
• Each time, chose the number of clusters that maximized the overall clustering quality, defined as bc/wc:
  – bc (between-cluster variation): measured by the distance between cluster centers
  – wc (within-cluster variation): the sum of squared distances from each point to the center of the cluster it belongs to
Experimental Evaluation
• Two baseline schemes are used:
  – Static: simply returns the top-k list of the previous snapshot
  – Random Markov State: assumes that the racer value of a Web page is uniformly chosen among the available states of the learned Markov Models
• All experiments were run using 10-fold cross validation: 10% of the dataset as test and 90% as training
• Results were evaluated using all similarity measures: OSim, KSim and RSim
Internet Archive
[Figure: OSim, KSim and RSim results of the prediction methods on the Internet Archive dataset]
Wikipedia (total queries)
[Figure: OSim, KSim and RSim results over all Wikipedia queries]
Wikipedia – Wide queries
[Figure: OSim, KSim and RSim results for wide queries]
Wikipedia – Dynamic/Positive queries
[Figure: OSim, KSim and RSim results for dynamic positive queries]
Wikipedia – Dynamic/Negative queries
[Figure: OSim, KSim and RSim results for dynamic negative queries]
Wikipedia – Least Dynamic queries
[Figure: OSim, KSim and RSim results for least dynamic queries]
Yahoo!
[Figure: OSim, KSim and RSim results on the Yahoo! dataset]
Conclusions
• Web Mining is a hot research area with prominent industrial applications (DIGITEO CHAIR GRANT, France)
  – Predicting the ranking of web pages and social network users
  – Advertising!
• Promising directions:
  – Graph mining and relevant metrics for social networks and intranet information systems
  – Monitoring and measuring the temporal evolution of Web and social graph dynamics
  – Dynamic personalization algorithms
Thank you!
For more information check:
http://www.db-net.aueb.gr/michalis/
Text Ranking - References
R. Baeza-Yates and B. Ribeiro-Neto (1999). Modern Information Retrieval. Addison Wesley.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender (2005). Learning
to rank using gradient descent. In Proc. 22nd ICML, pp. 89-96.
S. Büttcher, C. Clarke, B. Lushman. Term proximity scoring for ad-hoc retrieval on very large text
collections. In Procs SIGIR 2006, pp 621–622.
K. Crammer, Y. Singer (2001). Pranking with ranking. In Advances in Neural Information Processing
Systems 14, Vol. 14, pp. 641-647.
D. Hiemstra (1998). A linguistically motivated probabilistic model of information retrieval. In Proc.
ECDL'98, pp. 569-584.
J. Lafferty, C. Zhai (2001). Document language models, query models, and risk minimization for
information retrieval. In Proc. SIGIR, pp. 111-119
C. Macdonald, V. Plachouras, B. He, C. Lioma, I. Ounis (2005). University of Glasgow at WebCLEF
2005: Experiments in per-field normalisation and language specific stemming. In Proceedings
of the Cross Language Evaluation Forum (CLEF) 2005.
D. Metzler, W.B. Croft (2005). A Markov Random Field for Term Dependencies. In Proc of ACM
SIGIR’05.
Text Ranking – References I
D. Metzler (2008). Generalized Inverse Document Frequency. In Proceedings of the 17th CIKM’08,
pp 399-408.
K. Papineni (2001). Why inverse document frequency? In Proc 2nd North American Chapter of the
Assn. for Computational Linguistics on Language Technologies, pp. 1–8.
J. Peng, C. Macdonald, B. He, V. Plachouras, I. Ounis. Incorporating term dependency in the DFR
framework. In Proc. SIGIR 2007, pp. 843–844.
J.M. Ponte, W.B. Croft (1998). A language modeling approach to information retrieval. In Proc.
SIGIR, pp. 275-281.
S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford (1994). Okapi at TREC-3. In
Proc of the 3rd Text REtrieval Conference (TREC-1994).
S. Robertson (2004). Understanding Inverse Document Frequency: On theoretical arguments for
IDF. Journal of Documentation, 60(5):503-520.
G. Salton and M.J. McGill (1983). Introduction to Modern Information Retrieval. McGraw-Hill.
G. Salton, E.A. Fox, H. Wu (1983). Extended Boolean information retrieval. CACM 26(11):1022-1036.
Text Ranking – References II
G. Tsatsaronis and V. Panagiotopoulou (2009). A Generalized Vector Space Model for Text
Retrieval Based on Semantic Relatedness. In Proc EACL 2009 Student Research Workshop, pp
70–78.
S.K.M. Wong, W. Ziarko, P.C.N. Wong (1985). Generalized vector space model in information
retrieval. In Proc 8th ACM SIGIR, pp 18-25.
Y. Yue, T. Finley, F. Radlinski, T. Joachims (2007). A support vector method for optimizing average
precision. In Proc. 30th SIGIR, pp. 271-278.
H. Zaragoza, N. Craswell, M.J. Taylor, S. Saria, S. Robertson (2004). Microsoft Cambridge at TREC
13: Web and Hard Tracks. In Proc of TREC 2004.
Graph Clustering - References
[1] http://en.wikipedia.org/wiki/Jaccard_index
[2] ROCK: A Robust Clustering Algorithm for Categorical Attributes Sudipto
Guha,Rajeev Rastogi,Kyuseok Shim
[3] http://en.wikipedia.org/wiki/Gomory%E2%80%93Hu_tree
[4] http://en.wikipedia.org/wiki/Minimum_cut
[5] http://en.wikipedia.org/wiki/Cut_%28graph_theory%29
[6] http://en.wikipedia.org/wiki/Edmonds%E2%80%93Karp_algorithm
[7] http://mathworld.wolfram.com/LaplacianMatrix.html
[8] A Tutorial on Spectral Clustering Ulrike von Luxburg
[9] Community structure in social and biological networks M. Girvan and M. E. J.
Newman
[10] Community detection in graphs Santo Fortunato
[11] Modularity from Fluctuations in Random Graphs and Complex Networks. Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral
Top-k similarity measures – References
1. Chen, Y.-Y., Gan, Q., & Suel, T. (2004). Local Methods for Estimating PageRank Values. Proceedings of CIKM. Washington, USA.
2. Deshpande, M., & Karypis, G. (2004). Selective Markov models for predicting Web page accesses. ACM Transactions on Internet Technology, 163-184.
3. Sayyadi, H., & Getoor, L. (2009). FutureRank: Ranking Scientific Articles by Predicting their Future PageRank. 2009 SIAM International Conference on Data Mining (SDM09) (pp. 533-544). Sparks, Nevada: SIAM.
4. Vazirgiannis, M., Drosos, D., Senellart, P., & Vlachou, A. (2008). Web Page Rank Prediction with Markov Models. WWW poster. Beijing, China.
5. Zacharouli, P., Titsias, M., & Vazirgiannis, M. (2009). Web Page Rank Prediction with PCA and EM Clustering. Proceedings of the 6th International Workshop on Algorithms and Models for the Web-Graph (pp. 104-115). Barcelona, Spain: Springer-Verlag.
6. Järvelin, K., & Kekäläinen, J. (2002). Cumulated Gain-based Evaluation of IR Techniques. ACM Transactions on Information Systems, 20(4), pp. 422-446.
7. Teevan, J., Dumais, S. T., & Horvitz, E. (2007). Characterizing the value of personalizing search. Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam.
References - Pagerank prediction & Web Dynamics
1. Baeza-Yates, R. A., Castillo, C., & Saint-Jean, F. (2004). Web Dynamics, Structure, and Page Quality. In M. Levene & A. Poulovasillis (Eds.), Web Dynamics (pp. 93-112). Springer.
2. Broder, A. Z., Lempel, R., Maghoul, F., & Pedersen, J. (2006). Efficient PageRank approximation via graph aggregation.
Information Retrieval , 9 (2), 123-138.
3. Chen, Y.-Y., Gan, Q., & Suel, T. (2004). Local Methods for Estimating PageRank Values. Proceedings CIKM. Washington, USA.
4. Chien, S., Dwork, C., Kumar, R., Simon, D. R., & Sivakumar, D. (2003). Link Evolution: Analysis and Algorithms. Internet
Mathematics , 1 (3), 277-304.
5. Davis, J. V., & Dhillon, I. S. (2006). Estimating the global PageRank of Web communities. Proceedings KDD. Philadelphia, USA.
6. Deshpande, M., & Karypis, G. (2004). Selective Markov models for predicting Web page accesses. ACM Trans. Internet
Technol. , 163-184.
7. Langville, A. N., & Meyer, C. D. (2004). Updating PageRank with iterative aggregation. Proceedings of the 13th international
World Wide Web conference on Alternate track papers & posters (pp. 392-393). New York, USA: ACM.
8. Piatetsky-Shapiro, G., & Connell, C. (1984). Accurate estimation of the number of tuples satisfying a condition. SIGMOD Rec.
, 14 (2), 256-276.
9. Sayyadi, H., & Getoor, L. (2009). Future Rank: Ranking Scientific Articles by Predicting their Future PageRank. 2009 SIAM
International Conference on Data Mining (SDM09) (pp. 533-544). Sparks, Nevada: SIAM.
10. Vazirgiannis, M., Drosos, D., Senellart, P., & Vlachou, A. (2008). Web Page Rank Prediction with Markov Models. WWW
poster. Beijing, China.
11. Vlachou, A., Berberich, K., & Vazirgiannis, M. (2006). Representing and quantifying rank-change for the Web graph.
Algorithms and Models for the Web-Graph, Fourth International Workshop, WAW 2006. 4936, pp. 157-165. Banff, Canada:
Springer.
12. Zacharouli, P., Titsias, M., & Vazirgiannis, M. (2009). Web Page Rank Prediction with PCA and EM Clustering. Proceedings of the 6th International Workshop on Algorithms and Models for the Web-Graph (pp. 104-115). Barcelona, Spain: Springer-Verlag.