* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download nips2000a - Department of Computer Science and Engineering
Numerical weather prediction wikipedia , lookup
Predictive analytics wikipedia , lookup
Pattern recognition wikipedia , lookup
History of numerical weather prediction wikipedia , lookup
Computer simulation wikipedia , lookup
General circulation model wikipedia , lookup
K-nearest neighbors algorithm wikipedia , lookup
Machine learning wikipedia , lookup
Generalized linear model wikipedia , lookup
Machine learning for the Web: Applications and challenges Soumen Chakrabarti Center for Intelligent Internet Research Computer Science and Engineering IIT Bombay www.cse.iitb.ernet.in/~soumen Traditional supervised learning Training instance x1 , x2 ,, xn ; y Learner Test instance x1 , x2 ,, xn Independent variables x mostly continuous, maybe categorical Predicted variable y discrete (classification) or continuous (regression) Statistical models, inference rules, or separators Learner Prediction y 2 Traditional unsupervised learning No training / testing phases Input is a collection of records with independent attributes alone x1 , x2 ,, xn Measure of similarity Partition or cover instances using clusters with large “self-similarity” and small “cross-similarity” Hierarchical partitions Small crosssimilarity Large selfsimilarity 3 Learning hypertext models Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams… occurs(term, page, cnt) cites(page, page) Transformed into simple models and relations • • • • Vector space/bag-of-words Hyperlink graph Topic directories is-a(topic, topic) Discrete time series example(topic, page) 4 Challenges Large feature space in raw data • Structured data sets: 10s to 100s • Text (Web): 50 to 100 thousand Most features not completely useless • Feature elimination / selection not perfect • Beyond linear transformations? Models used today are simplistic • Good accuracy on simple labeling tasks • Lose a lot of detail present in hypertext to fit known learning techniques 5 Challenges Complex, interrelated objects • Not a structured tuple-like entity • Explicit and implicit connections • Document markup sub-structure • Site boundaries and hyperlinks • Placement in popular directories like Yahoo! Traditional distance measures are noisy • How to combine diverse features? (Or, a link is worth a ? words) • Unreliable clustering results 6 This session Semi-supervised clustering (Rich Caruana) • Enhanced clustering via user feedback Kernel methods (Nello Cristianini) • Modular learning systems for text and hypertext Reference matching(Andrew McCallum) • Recovering and cleaning implicit citation graphs from unstructured data 7 This talk: Two examples Learning topics of hypertext documents • Semi-supervised learning scenario • Unified model of text and hyperlinks • Enhanced accuracy of topic labeling Segmenting hierarchical tagged pages • • • • Topic distillation (hubs and authorities) Minimum description length segmentation Better focused topic distillation Extract relevant fragments from pages 8 Classifying interconnected entities Early examples: • Some diseases have complex lineage dependency • Robust edge detection in images ? How are topics interconnected in hypertext? Maximum likelihood graph labeling with many classes Finding edge pixels in a differentiated image ? .3 red ? .7 blue ? ? ? 0.6 0.4 0.3 0.7 9 Naïve Bayes classifiers Decide topic; topic c is picked with prior probability (c); c(c) = 1 Each c has parameters (c,t) for terms t Coin with face probabilities t (c,t) = 1 Fix document length n(d) and toss coin Naïve yet effective; can use other algos Given c, probability of document is n( d ) (c, t ) n ( d ,t ) Pr[ d | n(d ), c] {n(d , t )} td 10 Enhanced models for hypertext c=class, d=text, N=neighbors Text-only model: Pr(d|c) Using neighbors’ text to judge my topic: Pr(d, d(N) | c) Better recursive model: Pr(d, c(N) | c) Relaxation labeling over Markov random fields Or, EM formulation ? 11 Hyperlink modeling boosts accuracy 40 35 30 %Error 9600 patents from 12 classes marked by USPTO Patents have text and prior art links Expand test patent to include neighborhood ‘Forget’ and reestimate fraction of neighbors’ classes (Even better for Yahoo) 25 20 15 10 5 0 0 50 100 %Neighborhood known Text Link Text+Link 12 Hyperlink Induced Topic Search Query h h Keyword Search engine a h a a = ETh h = Ea ‘Hubs’ and ‘authorities’ h a Response a Radius-1 expanded graph 13 “Topic drift” and various fixes Some hubs have ‘Thick’ links ‘mixed’ content Authority ‘leaks’ through mixed hubs Activation from good to bad window Query term pages Clever: match query Vector-space Centroid with anchor text to document Cut-off model favor some edges radius B&H: eliminate outlier documents × 14 Document object model (DOM) Hierarchical graph <html><head> <title>Portals</title> model for semi</head><body><ul> structured data <li><a href=“…”>Yahoo</a></li> <li><a href=“…”>Lycos</a></li> Can extract </ul></body></html> reasonable DOM from HTML html A fine-grained view head body of the Web title ul Valuable because li li page boundaries are a a less meaningful now 15 A model for hub generation Global hub score distribution 0 w.r.t. given query Authors use DOM nodes to specialize 0 into local I At a certain ‘cut’ in the DOM tree, local distribution directly generates hub scores Global distribution Model frontier Progressive ‘distortion’ Other pages 16 Optimizing a cost measure Reference distribution 0 Distribution distortion cost is v 0 KL(v || 0 ) log v v 1 v (for Poisson distribution) Hv Data encoding cost is roughly log Pr(h | ) hH v v 17 Modified topic distillation algorithm Initialize DOM graph Let only root set authority scores be 1 Repeat until reasonable convergence: Authority-to-hub score propagation MDL-based hub score smoothing Hub-to-authority score propagation Normalization of authority scores Segment and rank micro-hubs Present annotated results Will this (non-linear) system converge? Will segmentation help in reducing drift? 18 Convergence 1.00E-01 Mean auth score change 1.00E-02 1.00E-03 1.00E-04 1.00E-05 1.00E-06 1.00E-07 0 2 4 6 Iterations 8 10 28 queries used in Clever and by B&H 366k macro-pages, 10M micro-links Rank converges within 15 iterations 19 ‘Expanded’ implies authority diffusion arrested As nodes outside rootset start participating in the distillation… • #Expanded increases • #Pruned decreases Prevents authority leaks via mixed hubs Smoothing statistics Effect of micro-hub segmentation 2500 2000 1500 1000 500 0 Expanded 1 Pruned 2 3 4 5 6 7 8 9 Iterations 20 Rank correlation with B&H 0.025 0.02 Our authority score Positively correlated Some negative deviations Pseudoauthorities downgraded by our algorithm These were earlier favored by mixed hubs 0.015 0.01 0.005 0 0 0.005 0.01 Authority score B&H 0.015 (Axes not to same scale) 21 Conclusion Hypertext and the Web pose new modeling and algorithmic challenges Locality exists in many guises Diverse sources of information: text, links, markup, usage Unifying models needed Anecdotes suggest that synergy can be exploited 22