sdbi – winter 2001
Data Mining for Hypertext: A Tutorial Survey
Based on a paper by Soumen Chakrabarti, Indian Institute of Technology Bombay ([email protected])
Lecture by: Noga Kashti, Efrat Daum
11/11/01

Let's start with definitions...
- Hypertext: a collection of documents (or "nodes") containing cross-references or "links" which, with the aid of an interactive browser program, allow the reader to move easily from one document to another.
- Data Mining: analysis of data in a database using tools which look for trends or anomalies without knowledge of the meaning of the data.

Two Ways of Getting Information From the Web
- Clicking on hyperlinks.
- Searching via keyword queries.

Some History
- Before the popular Web, hypertext was studied in communities such as ACM SIGIR, SIGLINK/SIGWEB and Digital Libraries.
- Classical IR (information retrieval) deals with documents, whereas the Web deals with semi-structured data.

Some Numbers
- The Web exceeds 800 million HTML pages on about three million servers.
- Almost a million pages are added daily.
- A typical page changes in a few months.
- Several hundred gigabytes change every month.

Difficulties With Accessing Information on the Web
- The usual problems of text search (synonymy, polysemy, context sensitivity) become much more severe.
- Semi-structured data.
- Sheer size and flux.
- No consistent standard or style.

The Old Search Process Is Often Unsatisfactory!
- Deficiency of scale.
- Poor accuracy (low recall and low precision).

Better Solutions: Data Mining and Machine Learning
- Natural language (NL) techniques.
- Statistical techniques for learning structure in various forms from text, hypertext and semi-structured data.

Issues We'll Discuss
- Models
- Supervised learning
- Unsupervised learning
- Semi-supervised learning
- Social network analysis

Models for Text
Representations for text based on statistical analysis only (bag-of-words):
- The vector space model
- The binary model
- The multinomial model

Models for Text (cont.)
The vector space model:
- Documents -> tokens -> canonical forms.
- Each canonical token is an axis in a Euclidean space.
- The t-th coordinate of d is n(d,t), where t is a term and d is a document.

The Vector Space Model: Normalize the Document Length to 1
- $\|d\|_1 = \sum_t n(d,t)$
- $\|d\|_2 = \sqrt{\sum_t n(d,t)^2}$
- $\|d\|_\infty = \max_t n(d,t)$
- Inverse document frequency: $\mathrm{IDF}(t) = 1 + \log\frac{N}{N_t}$, where N is the number of documents and N_t the number of documents containing term t.
- TF-IDF weighting: the t-th coordinate of d is $\frac{n(d,t)}{\|d\|_1}\,\mathrm{IDF}(t)$ (a small code sketch of this weighting follows at the end of this section).

More Models for Text
- The binary model: a document is a set of terms, a subset of the lexicon. Word counts are not significant.
- The multinomial model: a die with |T| faces; every face t has a probability θ_t of showing up when tossed. Having decided on the total word count, the author repeatedly tosses the die and writes down the term that shows up.

Models for Hypertext
- Hypertext: text with hyperlinks.
- Varying levels of detail.
- Example: a directed graph (D, L), where D is the set of nodes/documents/pages and L is the set of links.

Models for Semi-structured Data
- A point of convergence for the web (documents) and database (data) communities.

Models for Semi-structured Data (cont.)
- For example, topic directories with tree-structured hierarchies. Examples: Open Directory Project, Yahoo!
- Another representation: XML.
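To make the bag-of-words representation above concrete, here is a minimal Python sketch of TF-IDF vectors as defined on the vector space model slide: L1-normalized term counts scaled by IDF(t) = 1 + log(N/N_t). The whitespace tokenizer and the toy corpus are illustrative assumptions, not part of the original lecture.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Coordinate t of document d is n(d,t)/||d||_1 * IDF(t), with IDF(t) = 1 + log(N/N_t)."""
    tokenized = [doc.lower().split() for doc in docs]   # naive whitespace tokenization
    n = len(tokenized)
    df = Counter()                                      # N_t: number of documents containing t
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {t: 1.0 + math.log(n / df[t]) for t in df}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)                        # n(d,t)
        length = sum(counts.values())                   # ||d||_1
        vectors.append({t: (c / length) * idf[t] for t, c in counts.items()})
    return vectors

if __name__ == "__main__":
    corpus = ["the car has a gearbox",
              "auto transmission for the car",
              "data mining for hypertext"]
    for vec in tfidf_vectors(corpus):
        print(vec)
```

Rare terms get large IDF weights and very common terms get weights close to 1, so the resulting coordinates emphasize the terms that best discriminate between documents.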
Supervised Learning (Classification)
- Initialization: training data, where each item is marked with a label or class from a discrete finite set.
- Input: unlabeled data.
- The algorithm's role: guess the labels of the unlabeled data.

Supervised Learning (cont.)
- Example: topic directories.
- Advantages: help structure the data, restrict keyword search, and can enable powerful searches.

Probabilistic Models for Text Learning
- Let c_1,...,c_m be m classes or topics, each with some training documents D_c.
- Prior probability of a class: $\Pr(c) = \frac{|D_c|}{\sum_{c'} |D_{c'}|}$.
- T: the universe of terms in all the training documents.

Probabilistic Models for Text Learning (cont.)
Naive Bayes classification:
- Assumption: for each class c, there is a binary text generator model.
- Model parameters: Φ_{c,t} – the probability that a document in class c will mention term t at least once.

Naive Bayes Classification (cont.)
$\Pr(d \mid c) = \prod_{t \in d} \Phi_{c,t} \prod_{t \in T,\, t \notin d} (1 - \Phi_{c,t})$
Problems:
- Short documents are discouraged.
- The estimate of Pr(d|c) is likely to be greatly distorted.

Naive Bayes Classification (cont.)
With the multinomial model:
$\Pr(d \mid c) = \binom{\|d\|_1}{\{n(d,t)\}} \prod_{t \in d} \theta_{c,t}^{\,n(d,t)}$
where the first factor is the multinomial coefficient over the term counts n(d,t).

Naive Bayes Classification (cont.)
Problems:
- Again, short documents are discouraged.
- Inter-term correlation is ignored.
- Multiplicative Φ_{c,t} "surprise" factor.
Conclusion: both models are effective.

More Probabilistic Models for Text Learning
- Parameter smoothing and feature selection.
- Limited dependence modeling.
- The maximum entropy technique.
- Support vector machines (SVMs).
- Hierarchies over class labels.

Learning Relations
- An extension of classification: a combination of statistical and relational learning.
- Improves accuracy.
- The ability to invent predicates.
- Can represent hyperlink graph structure and word statistics of neighboring documents.
- Learned rules will not depend on specific keywords.

Unsupervised Learning
- Given: hypertext documents without labels.
- Goal: a hierarchy among the documents.
- What is a good clustering?

Basic Clustering Techniques
Techniques for clustering:
- k-means
- hierarchical agglomerative clustering

Basic Clustering Techniques (cont.)
- Documents are represented in an unweighted vector space or in TF-IDF vector space.
- Similarity between two documents: cos(θ), where θ is the angle between their corresponding vectors, or the distance between the (length-normalized) vectors.

k-means Clustering
The k-means algorithm:
- Input: d_1,...,d_n – a set of n documents; k – the number of clusters desired (k ≤ n).
- Output: C_1,...,C_k – k clusters covering the n documents.

k-means Clustering (cont.)
The k-means algorithm (a code sketch follows the example slide below):
- Initialize: guess k initial means m_1,...,m_k.
- Until there are no changes in any mean:
  - For each document d: assign d to C_i if ||d − m_i|| is the minimum of all the k distances.
  - For 1 ≤ i ≤ k: replace m_i with the mean of all the documents assigned to C_i.

k-means Clustering – Example
[Figure: k-means iterations for K=2 and K=3, showing how the means m_1, m_2, m_3 move as documents are reassigned.]
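To complement the k-means procedure above, a minimal Python sketch of the same loop: assign each document to the nearest mean, then recompute the means, until nothing changes. The dense numeric vectors, random initialization and toy points are illustrative assumptions, not from the lecture.

```python
import random

def kmeans(docs, k, max_iter=100):
    """docs: list of equal-length numeric vectors (e.g. TF-IDF coordinates).
    Returns a list of k clusters, each a list of document indices."""
    means = [list(d) for d in random.sample(docs, k)]   # guess k initial means
    assignment = [None] * len(docs)
    for _ in range(max_iter):
        changed = False
        # assignment step: d goes to the cluster i minimizing ||d - m_i||
        for j, d in enumerate(docs):
            i = min(range(k),
                    key=lambda i: sum((x - y) ** 2 for x, y in zip(d, means[i])))
            if assignment[j] != i:
                assignment[j] = i
                changed = True
        if not changed:                                  # means will not move either
            break
        # update step: replace m_i with the mean of the documents assigned to C_i
        for i in range(k):
            members = [docs[j] for j in range(len(docs)) if assignment[j] == i]
            if members:
                means[i] = [sum(coords) / len(members) for coords in zip(*members)]
    return [[j for j in range(len(docs)) if assignment[j] == i] for i in range(k)]

if __name__ == "__main__":
    points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
    print(kmeans(points, k=2))
```

The result depends on the initial guess of the means, which is why k-means is usually run several times with different initializations.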
k-means Clustering (cont.)
- Problem: high dimensionality. E.g., if each of 30,000 dimensions has only two possible values, the vector space contains 2^30000 points.
- Solution: project out some dimensions.

Agglomerative Clustering
- Documents are merged into super-documents or groups until only one group is left.
Some definitions:
- s(d_1, d_2) = the similarity between documents d_1 and d_2.
- The self-similarity of a group A: $s(A) = \frac{1}{|A|\,(|A|-1)} \sum_{d_1, d_2 \in A,\; d_1 \ne d_2} s(d_1, d_2)$.

Agglomerative Clustering (cont.)
The agglomerative clustering algorithm:
- Input: d_1,...,d_n – a set of n documents.
- Output: G – the final group, with a nested hierarchy.

Agglomerative Clustering (cont.)
The agglomerative clustering algorithm:
- Initialize: G := {G_1,...,G_n}, where G_i = {d_i}.
- While |G| > 1:
  - Find A and B in G such that s(A ∪ B) is maximized.
  - G := (G − {A, B}) ∪ {A ∪ B}.
- Time: O(n^2).

Agglomerative Clustering – Example
- Initial: the singleton groups a, b, c, d, e, f, g.
- Step 1: merge b and c (s(b,c) = 0.7).
- Step 2: merge f and g (s(f,g) = 0.6).
- Step 3: merge b-c with d (s(b-c,d) = 0.5).
- Step 4: merge e with f-g (s(e,f-g) = 0.4).
- Step 5: merge a with b-c-d (s(a,b-c-d) = 0.3).
- Step 6: merge a-b-c-d with e-f-g (s(a-b-c-d,e-f-g) = 0.1).
[Figure: the corresponding dendrogram, with merge levels from 0.7 down to 0.1.]

Techniques From Linear Algebra
- Documents and terms are represented by vectors in Euclidean space.
- Applications of linear algebra to text analysis:
  - Latent semantic indexing (LSI)
  - Random projections

Co-occurring Terms
Example: "car" co-occurs with "gearbox" ("Linear potentiometer for a racing car gearbox..."), while "auto" co-occurs with "transmission" ("Auto Transmission interchange W/404 to 504??...").

Latent Semantic Indexing (LSI)
Vector space model of documents:
- Let m = |T|, the lexicon size.
- Let n = the number of documents.
- Define A_{m×n} = the term-by-document matrix, where a_ij = the number of occurrences of term i in document j.

Latent Semantic Indexing (LSI) (cont.)
[Figure: the term-by-document matrix, with one row per term (t_1, t_2, ..., t_m, e.g. "car" and "auto") and one column per document (d_1, d_2, ..., d_n); similar rows indicate related terms.]
How can this matrix be reduced?

Singular Value Decomposition (SVD)
Let A ∈ R^{m×n}, m ≥ n, be a matrix. The singular value decomposition of A is the factorization A = UDV^T, where:
- U and V are orthogonal: U^T U = V^T V = I_n.
- D = diag(σ_1,...,σ_n) with σ_i ≥ 0 for 1 ≤ i ≤ n.
Then:
- U = [u_1,...,u_n]; u_1,...,u_n are the left singular vectors.
- V = [v_1,...,v_n]; v_1,...,v_n are the right singular vectors.
- σ_1,...,σ_n are the singular values of A.

Singular Value Decomposition (SVD) (cont.)
- AA^T = (UDV^T)(VD^TU^T) = UD I DU^T = UD^2U^T.
- AA^T U = UD^2 = [σ_1^2 u_1,...,σ_n^2 u_n], so for 1 ≤ i ≤ n, AA^T u_i = σ_i^2 u_i: the columns of U are the eigenvectors of AA^T.
- Similarly, A^T A = VD^2V^T: the columns of V are the eigenvectors of A^T A.
- The eigenvalues of AA^T (or A^T A) are σ_1^2,...,σ_n^2.

Singular Value Decomposition (SVD) (cont.)
$A = UDV^T = [u_1,\dots,u_n]\,\mathrm{diag}(\sigma_1,\dots,\sigma_n)\,[v_1,\dots,v_n]^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots + \sigma_n u_n v_n^T$
Let $A_k = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots + \sigma_k u_k v_k^T$, k ≤ n, be the k-truncated SVD. Then:
- rank(A_k) = k.
- $\|A - A_k\|_2 \le \|A - M_k\|_2$ for any matrix M_k of rank k.

Singular Value Decomposition (SVD) (cont.)
Note: A, A_k ∈ R^{m×n}.
- Full SVD: A (m×n) = U (m×n) · D (n×n) · V^T (n×n).
- After reduction: A_k (m×n) = U_k (m×k) · D_k (k×k) · V_k^T (k×n).

LSI With SVD
- Define q ∈ R^m – a query vector, with q_i ≠ 0 if term i is part of the query.
- Then A^T q ∈ R^n is the answer vector: (A^T q)_j ≠ 0 if document j contains one or more of the query terms.
- How can we do better?

LSI With SVD (cont.)
- Use A_k instead of A: calculate A_k^T q (a code sketch follows).
- Now, a query on "car" will also return a document containing the word "auto".
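As an illustration of LSI retrieval via the k-truncated SVD just described, here is a minimal numpy sketch. The toy term-by-document matrix and the choice k = 2 are assumptions made for the demonstration; they are not from the original lecture.

```python
import numpy as np

# Toy term-by-document matrix A (rows = terms, columns = documents):
# d1 = "car gearbox", d2 = "auto gearbox", d3 = "auto transmission", d4 = "car"
terms = ["car", "gearbox", "auto", "transmission"]
A = np.array([
    [1.0, 0.0, 0.0, 1.0],   # car
    [1.0, 1.0, 0.0, 0.0],   # gearbox
    [0.0, 1.0, 1.0, 0.0],   # auto
    [0.0, 0.0, 1.0, 0.0],   # transmission
])

# SVD: A = U diag(s) V^T, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# k-truncated SVD: A_k = sigma_1 u_1 v_1^T + ... + sigma_k u_k v_k^T.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Query vector q: q_i != 0 only for the query term "car".
q = np.array([1.0, 0.0, 0.0, 0.0])

print("A^T  q :", np.round(A.T @ q, 2))    # only d1 and d4, which literally contain "car", score > 0
print("A_k^T q:", np.round(A_k.T @ q, 2))  # d2 ("auto gearbox") now also gets a positive score
```

Because "gearbox" co-occurs with both "car" and "auto", the rank-2 approximation links them, so the "auto gearbox" document scores positively for the query "car" even though it never mentions the word.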
Random Projections
Theorem: let v ∈ R^n be a unit vector, let H be a randomly oriented ℓ-dimensional subspace through the origin, and let X be the random variable measuring the square of the length of the projection of v onto H. Then E[X] = ℓ/n, and if ℓ is chosen between log n and O(√n),
$\Pr\!\left(\left|X - \tfrac{\ell}{n}\right| > \epsilon \tfrac{\ell}{n}\right) \le 2\sqrt{\ell}\, e^{-(\ell-1)\epsilon^2/4}$, where $0 < \epsilon < \tfrac{1}{2}$.

Random Projections (cont.)
- Projecting a set of points onto a randomly oriented subspace causes only small distortion in inter-point distances.
- The technique: reduce the dimensionality of the points to speed up distance computations.

Semi-supervised Learning
- Real-life applications have a few labeled documents and many unlabeled documents.
- Semi-supervised learning sits between supervised and unsupervised learning.

Learning From Labeled and Unlabeled Documents
The Expectation Maximization (EM) algorithm:
- Initialize: train a naive Bayes classifier using only the labeled data.
- Repeat the EM iteration until near convergence:
  - E-step: $\theta_{c,t} = \frac{1 + \sum_d n(d,t)\,\Pr(c \mid d)}{|T| + \sum_d \|d\|_1 \Pr(c \mid d)}$
  - M-step: assign class probabilities Pr(c|d) to all documents not labeled, using the θ_{c,t} estimates.
- Classification error is reduced by a third in the best cases.

Relaxation Labeling
- The hypertext model: documents are nodes in a hypertext graph.
- There are other sources of information induced by the links.
[Figure: a hypertext graph in which the classes of some linked nodes are unknown.]

Relaxation Labeling (cont.)
Notation: c = class, t = term, N = neighbors.
- In supervised learning: Pr(t|c).
- In hypertext, using the neighbors' terms: Pr(t(d), t(N(d)) | c).
- A better model, using the neighbors' classes: Pr(t(d), c(N(d)) | c).
- Circularity!

Relaxation Labeling (cont.)
Resolving the circularity:
- Initialize: assign Pr^(0)(c|d) to each document d ∈ N(d_1), where d_1 is a test document (using a text-only classifier).
- Iterate:
$\Pr^{(r+1)}(c \mid d, N(d)) = \sum_{c(N(d))} \Pr^{(r)}\bigl(c(N(d))\bigr)\, \Pr^{(r)}\bigl(c \mid d, c(N(d))\bigr)$

Social Network Analysis
Social networks exist:
- between academics, by coauthoring and advising;
- between movie personnel, by directing and acting;
- between people, by making phone calls;
- between web pages, by hyperlinking to other web pages.
Applications: Google (PageRank), HITS.

PageRank
$\mathrm{PageRank}(v) = \frac{p}{N} + (1-p) \sum_{u \to v} \frac{\mathrm{PageRank}(u)}{\mathrm{OutDegree}(u)}$
where "u → v" means "u links to v" and N = the total number of nodes in the Web graph.
- Simulates a random walk on the web graph.
- Used as a popularity score.
- The popularity score is precomputed, independent of the query (a code sketch is given after the conclusion).

Hyperlink Induced Topic Search (HITS)
- Depends on a search engine.
- For each node u in the graph, compute an authority score a_u and a hub score h_u:
  - Initialize h_u = a_u = 1.
  - Repeat until convergence: $a_v := \sum_{u \to v} h_u$ and $h_u := \sum_{u \to v} a_v$, where the vectors a and h are normalized to length 1.

HITS (cont.)
- Interesting pages include links to other interesting pages.
- The goal: many relevant pages, few irrelevant pages, fast.

Conclusion
- Supervised learning: probabilistic models.
- Unsupervised learning: clustering techniques (k-means – top-down, agglomerative – bottom-up) and reduction techniques (LSI with SVD, random projections).
- Semi-supervised learning: the EM algorithm, relaxation labeling.
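To ground the PageRank formula from the social network analysis slides, here is a minimal Python sketch of the corresponding iteration on a tiny hand-made link graph. The graph, the jump probability p = 0.15 and the fixed iteration count are illustrative assumptions, not part of the original lecture.

```python
def pagerank(links, p=0.15, iterations=50):
    """links: dict mapping each node u to the list of nodes it links to.
    Iterates PageRank(v) = p/N + (1-p) * sum over u->v of PageRank(u)/OutDegree(u)."""
    nodes = list(links)
    N = len(nodes)
    rank = {u: 1.0 / N for u in nodes}            # start from the uniform distribution
    for _ in range(iterations):
        new_rank = {v: p / N for v in nodes}      # random-jump term p/N
        for u in nodes:
            out = links[u]
            for v in out:                         # u -> v contributes PageRank(u)/OutDegree(u)
                new_rank[v] += (1 - p) * rank[u] / len(out)
        rank = new_rank
    return rank

if __name__ == "__main__":
    # A tiny web graph; in this sketch every node must have at least one out-link.
    graph = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }
    for node, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(node, round(score, 3))
```

Because the scores depend only on the link structure, they can be precomputed once and then combined with any query, which is exactly the property the slides highlight.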