Machine Learning Methods for Text / Web Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: [email protected]
This material is available at http://scai.snu.ac.kr/~btzhang/

Overview
• Introduction
  - Web Information Retrieval
  - Machine Learning (ML)
  - ML Methods for Text/Web Data Mining
• Text/Web Data Analysis
  - Text Mining Using Helmholtz Machines
  - Web Mining Using Bayesian Networks
• Summary
  - Current and Future Work

Web Information Retrieval
[Figure: text data flows through preprocessing and indexing into three systems: a text classification system; an information filtering system, which matches a user profile against incoming data and returns filtered data subject to user feedback; and a template filling & information extraction system, which fills DB records (location, date, question, answer).]

Machine Learning
• Supervised Learning
  - Estimate an unknown mapping from known input-output pairs.
  - Learn f_w from a training set D = {(x, y)} such that f_w(x) = y = f(x).
  - Classification: y is discrete.
  - Regression: y is continuous.
• Unsupervised Learning
  - Only input values are provided.
  - Learn f_w from D = {x} such that f_w(x) = x.
  - Density estimation
  - Compression, clustering

Machine Learning Methods
• Neural Networks
  - Multilayer Perceptrons (MLPs)
  - Self-Organizing Maps (SOMs)
  - Support Vector Machines (SVMs)
• Probabilistic Models
  - Bayesian Networks (BNs)
  - Helmholtz Machines (HMs)
  - Latent Variable Models (LVMs)
• Other Machine Learning Methods
  - Evolutionary Algorithms (EAs)
  - Reinforcement Learning (RL)
  - Boosting Algorithms
  - Decision Trees (DTs)

ML for Text/Web Data Mining
• Bayesian Networks for Text Classification
• Helmholtz Machines for Text Clustering/Categorization
• Latent Variable Models for Topic Word Extraction
• Boosted Learning for the TREC Filtering Task
• Evolutionary Learning for Web Document Retrieval
• Reinforcement Learning for Web Filtering Agents
• Bayesian Networks for Web Customer Data Mining

Preprocessing for Text Learning
• An example newsgroup posting:

  From: [email protected]
  Newsgroups: comp.graphics
  Subject: Need specs on Apple QT
  I need the specs, or at least a very verbose interpretation of the specs, for QuickTime. Technical articles from magazines and references to books would be nice, too. I also need the specs in a format usable on a Unix or MS-DOS system. I can't do much with the QuickTime stuff they have on...

• The posting is reduced to a vector of word counts:
  baseball 0, car 0, clinton 0, computer 0, graphics 0, hockey 0, quicktime 2, ..., references 1, space 0, specs 3, unix 1, ...

Text Mining: Data Sets
• Usenet Newsgroup Data
  - 20 categories
  - 1,000 documents per category
  - 20,000 documents in total
• TDT2 Corpus
  - Topic Detection and Tracking (TDT): NIST
  - 6,169 documents used in the experiments

Text Mining: Helmholtz Machine Architecture
• Two layers of binary nodes: latent nodes h_1, ..., h_m above input nodes d_1, ..., d_n, connected by recognition weights (inputs to latents) and generative weights (latents to inputs) [Chang and Zhang, 2000].
• Recognition model:
  P(h_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_{j=1}^{n} w_{ij} d_j\right)}
• Generative model:
  P(d_i = 1) = \frac{1}{1 + \exp\left(-b_i - \sum_{j=1}^{m} w_{ij} h_j\right)}
• Latent nodes
  - Binary values
  - Extract the underlying causal structure in the document set.
  - Capture correlations of the words in documents.
• Input nodes
  - Binary values
  - Represent the presence or absence of words in documents.

Text Mining: Learning Helmholtz Machines
• A recognition network is introduced to estimate the generative network. The log-likelihood is bounded from below:
  \log P(D \mid \theta) = \sum_{t=1}^{T} \log \sum_{\alpha^{(t)}} P(d^{(t)}, \alpha^{(t)} \mid \theta)
    = \sum_{t=1}^{T} \log \sum_{\alpha^{(t)}} Q(\alpha^{(t)}) \frac{P(d^{(t)}, \alpha^{(t)} \mid \theta)}{Q(\alpha^{(t)})}
    \geq \sum_{t=1}^{T} \sum_{\alpha^{(t)}} Q(\alpha^{(t)}) \log \frac{P(d^{(t)}, \alpha^{(t)} \mid \theta)}{Q(\alpha^{(t)})}
• Wake-Sleep Algorithm
  - Train the recognition and generative models alternately.
  - Update the weights iteratively by a simple local delta rule:
    w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}, \quad \Delta w_{ij} = \gamma \, s_i \, (s_j - p(s_j = 1))
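To make the wake-sleep procedure concrete, here is a minimal sketch of one update step for a two-layer binary Helmholtz machine, following the local delta rule above. It is an illustration under simplifying assumptions that are not from the talk (a uniform top-level prior during the sleep phase, one document per step, toy sizes); all names are illustrative.

```python
# One wake-sleep update for a two-layer binary Helmholtz machine,
# following the local delta rule: delta_w = gamma * s_i * (s_j - p(s_j=1)).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_words, n_latent, gamma = 12, 4, 0.05
R = np.zeros((n_latent, n_words))   # recognition weights: words -> latents
G = np.zeros((n_words, n_latent))   # generative weights: latents -> words
b_r = np.zeros(n_latent)            # recognition biases
b_g = np.zeros(n_words)             # generative biases

def wake_sleep_step(d):
    """One wake-sleep update on a binary word-presence vector d."""
    global R, G, b_r, b_g
    # Wake phase: recognize latent causes of the observed document,
    # then nudge the *generative* model toward reconstructing it.
    h = (rng.random(n_latent) < sigmoid(b_r + R @ d)).astype(float)
    p_d = sigmoid(b_g + G @ h)
    G += gamma * np.outer(d - p_d, h)
    b_g += gamma * (d - p_d)
    # Sleep phase: "dream" a fantasy (h', d') from the generative model,
    # then nudge the *recognition* model toward recovering h' from d'.
    h_dream = (rng.random(n_latent) < 0.5).astype(float)  # uniform prior assumed
    d_dream = (rng.random(n_words) < sigmoid(b_g + G @ h_dream)).astype(float)
    q_h = sigmoid(b_r + R @ d_dream)
    R += gamma * np.outer(h_dream - q_h, d_dream)
    b_r += gamma * (h_dream - q_h)

doc = (rng.random(n_words) < 0.3).astype(float)  # toy document
wake_sleep_step(doc)
```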
Text Mining: Methods
• Text Categorization
  - Train a Helmholtz machine for each category: N machines in total for N categories.
  - Once the N machines have been estimated, a test document is classified by estimating its likelihood under each machine:
    \hat{c} = \arg\max_{c \in C} [\log P(d \mid c)]
• Topic Words Extraction
  - Train a single Helmholtz machine on the entire document set.
  - After training, examine the weights of the connections from a latent node to the input nodes, i.e. the words.
  (A code sketch of both uses follows the results below.)

Text Mining: Categorization Results
• Usenet Newsgroup Data
  - 20 categories, 1,000 documents per category, 20,000 documents in total.
[Figure: categorization accuracy chart on the Usenet newsgroup data.]

Text Mining: Topic Words Extraction Results
• TDT2 Corpus, 6,169 documents. Words attached to each latent node:
  1. tabacco, smoking, gingrich, newt, trent, republicans, congressional, republicans, attorney, smokers, lawsuit, senate, cigarette, morris, nicotine
  2. warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds, mordechai, separatist, governor
  3. olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze, skating, lillehammer, downhill
  4. netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza, jerusalem, eu, paris, israel
  5. india, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata, bharatiya, islamabad, bihari
  6. suharto, habibie, demonstrators, riots, indonesians, demonstrations, soeharto, resignation, jakarta, rioting, electoral, rallies, wiranto, unrest, megawati
  7. imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand, inflation, investors, fund, banks, baht
  8. pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output, vatican, freedom
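As promised above, a hedged sketch of the two uses of trained Helmholtz machines: per-category likelihood classification and reading topic words off the generative weights. The likelihood estimator is left abstract (`estimate_log_likelihood` is a hypothetical stand-in; the talk does not specify how log P(d|c) is estimated), and all other names are illustrative.

```python
# Classification picks the category whose machine assigns the document
# the highest estimated log-likelihood; topic words are read from the
# generative weight column of one latent node.
import numpy as np

def classify(d, machines):
    """machines: dict category -> trained model with a hypothetical
    estimate_log_likelihood(d) method. Returns argmax_c log P(d|c)."""
    scores = {c: m.estimate_log_likelihood(d) for c, m in machines.items()}
    return max(scores, key=scores.get)

def topic_words(G, vocab, latent_idx, top_k=15):
    """Examine the generative weights from one latent node to the word
    (input) nodes and return the most strongly connected words."""
    weights = G[:, latent_idx]               # one column: latent -> words
    top = np.argsort(weights)[::-1][:top_k]
    return [vocab[i] for i in top]
```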
Web Mining: Customer Analysis
• KDD-2000 Web Mining Competition
  - Data: 465 features over 1,700 customers
    - Features include friend promotion rate, date visited, weight of items, price of house, discount rate, ...
    - Data was collected during Jan. 30 - Mar. 30, 2000.
    - The friend promotion started on Feb. 29 with a TV advertisement.
  - Aim: description of heavy/low spenders

Web Mining: Feature Selection
• Features selected in various ways [Yang & Zhang, 2000]:
  - Decision Tree + Factor Analysis: V368 (WeightAverage), V243 (OrderLineQuantitySum), V245 (OrderLineQuantityMaximum), and the factors
    F1 = 0.94 V324 + 0.868 V374 + 0.898 V412
    F2 = 0.829 V234 + 0.857 V240
    F3 = -0.795 V237 + 0.778 V304
  - Decision Tree: V13 (SendEmail), V234 (OrderItemQuantitySum% HavingDiscountRange(5.10)), V237 (OrderItemQuantitySum% HavingDiscountRange(10.)), V240 (Friend), V243 (OrderLineQuantitySum), V245 (OrderLineQuantityMaximum), V304 (OrderShippingAmtMin), V324 (NumLegwearProductViews), V368 (WeightAverage), V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews)
  - Discriminant Model: V240 (Friend), V229 (OrderAverage), V304 (OrderShippingAmtMin), V368 (WeightAverage), V43 (HomeMarketValue), V377 (NumAccountTemplateViews), V11 (WhichDoYouWearMostFrequent), V13 (SendEmail), V17 (USState), V45 (VehicleLifeStyle), V68 (RetailActivity), V19 (Date)

Web Mining: Bayesian Nets
• A Bayesian network is
  - a DAG (directed acyclic graph)
  - expressing dependence relations between variables
  - able to use prior knowledge on the data (parameters)
• For example, for a DAG over A, B, C, D, E:
  P(A, B, C, D, E) = P(A) P(B|A) P(C|B) P(D|A, B) P(E|B, C, D)
• Examples of conjugate priors: Dirichlet for multinomial data, Normal-Wishart for normal data.

Web Mining: Results
• A Bayesian net was learned for the KDD web data.
• V229 (Order-Average) and V240 (Friend) directly influence V312 (Target).
• V19 (Date) was influenced by V240 (Friend), reflecting the TV advertisement.

Summary
• We study machine learning methods, such as
  - Probabilistic neural networks
  - Evolutionary algorithms
  - Reinforcement learning
• Application areas include
  - Text mining
  - Web mining
  - Bioinformatics (not addressed in this talk)
• Recent work focuses on probabilistic graphical models for web/text/bio data mining, including
  - Bayesian networks
  - Helmholtz machines
  - Latent variable models

Bayesian Networks: Architecture
• A Bayesian network represents the probabilistic relationships between the variables:
  P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}_i)
  where pa_i is the set of parent nodes of X_i.
• For example, for variables L, B, G, M:
  P(L, B, G, M) = P(L) P(B|L) P(G|L, B) P(M|L, B, G) = P(L) P(B) P(G|B) P(M|B, L)
  where the second equality uses the independences encoded by the graph.

Bayesian Networks: Applications in IR - A Simple BN for Text Classification
• C: document class; t_i: i-th term. The class node C is the parent of the term nodes t_1, ..., t_8754.
• The network structure represents the naive Bayes assumption.
• All nodes are binary. [Hwang & Zhang, 2000]

Bayesian Networks: Experimental Results
• Dataset
  - The acq dataset from Reuters-21578
  - 8,754 terms were selected by TFIDF.
  - Training data: 8,762 documents
  - Test data: 3,009 documents
• Parametric Learning
  - Dirichlet prior assumptions for the network parameter distributions:
    p(\theta_{ij} \mid S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1}, \ldots, \alpha_{ij r_i})
  - Parameter distributions are updated with the training data:
    p(\theta_{ij} \mid D, S^h) = \mathrm{Dir}(\theta_{ij} \mid \alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ij r_i} + N_{ij r_i})
• For training data (accuracy: 94.28%):
                      Recall (%)   Precision (%)
  Positive examples     96.83        75.98
  Negative examples     93.76        99.32
• For test data (accuracy: 96.51%):
                      Recall (%)   Precision (%)
  Positive examples     95.16        89.17
  Negative examples     96.88        98.67
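As an illustration of the naive-Bayes-structured network with Dirichlet parameter priors just described, here is a minimal sketch for binary term nodes and a binary class node. It uses the Dirichlet posterior mean as the point estimate; the symmetric prior strength `alpha` and all names are assumptions made for the example, not details from the talk.

```python
# Naive Bayes with Dirichlet (Beta, since nodes are binary) smoothing:
# point estimates are posterior means (N + alpha) / (N_c + 2*alpha).
import numpy as np

def fit(X, y, alpha=1.0):
    """X: (n_docs, n_terms) binary matrix; y: (n_docs,) labels in {0, 1}."""
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        # Per-term Bernoulli parameters, smoothed by the Dirichlet prior.
        theta = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        prior = (len(Xc) + alpha) / (len(X) + 2 * alpha)
        params[c] = (np.log(prior), np.log(theta), np.log1p(-theta))
    return params

def predict(x, params):
    """argmax_c log P(c) + sum_i log P(t_i = x_i | c) for a binary x."""
    scores = {c: lp + np.sum(x * lt + (1 - x) * lnt)
              for c, (lp, lt, lnt) in params.items()}
    return max(scores, key=scores.get)
```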
Latent Variable Models: Architecture
• Latent variable model for topic words extraction and document clustering [Shin & Zhang, 2000]
• Maximize the log-likelihood
  L = \sum_{n=1}^{N} \sum_{m=1}^{M} n(d_n, w_m) \log P(d_n, w_m)
    = \sum_{n=1}^{N} \sum_{m=1}^{M} n(d_n, w_m) \log \sum_{k=1}^{K} P(z_k) P(w_m \mid z_k) P(d_n \mid z_k)
• Document clustering and topic-words extraction
  - Update P(z_k), P(w_m | z_k), P(d_n | z_k) with the EM algorithm.

Latent Variable Models: Learning
• EM (Expectation-Maximization) Algorithm
  - An algorithm to maximize the pre-defined log-likelihood
  - Iterates the E-step and the M-step
• E-step:
  P(z_k \mid d_n, w_m) = \frac{P(z_k) P(d_n \mid z_k) P(w_m \mid z_k)}{\sum_{k'=1}^{K} P(z_{k'}) P(d_n \mid z_{k'}) P(w_m \mid z_{k'})}
• M-step:
  P(w_m \mid z_k) = \frac{\sum_{n=1}^{N} n(d_n, w_m) P(z_k \mid d_n, w_m)}{\sum_{m'=1}^{M} \sum_{n=1}^{N} n(d_n, w_{m'}) P(z_k \mid d_n, w_{m'})}
  P(d_n \mid z_k) = \frac{\sum_{m=1}^{M} n(d_n, w_m) P(z_k \mid d_n, w_m)}{\sum_{m=1}^{M} \sum_{n'=1}^{N} n(d_{n'}, w_m) P(z_k \mid d_{n'}, w_m)}
  P(z_k) = \frac{1}{R} \sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m) P(z_k \mid d_n, w_m), \quad R \equiv \sum_{m=1}^{M} \sum_{n=1}^{N} n(d_n, w_m)
  (A code sketch of these updates appears after the boosting overview below.)

Latent Variable Models: Applications in IR - Experimental Results
• Topic words extraction and document clustering with a subset of the TREC-8 data
• TREC-8 ad hoc task data
  - Documents: DTDS, FR94, FT, FBIS, LATIMES
  - Topics: 401-450 (401, 434, 439, 450 used)
    - 401: Foreign Minorities, Germany
    - 434: Estonia, Economy
    - 439: Inventions, Scientific Discovery
    - 450: King Hussein, Peace
• Clustering results (a label is assigned to the z_k with maximum P(d_i | z_k)):
  Topic (#Docs)   z2    z4    z3    z1    Precision   Recall
  401 (300)       279     1     0    20     0.902      0.930
  434 (347)        20   238    10    79     0.996      0.686
  439 (219)         7     0   203     9     0.953      0.927
  450 (293)         3     0     0   290     0.729      0.990
• Extracted topic words (top 35 words with highest P(w_j | z_k); lists truncated):
  - Cluster 2 (z2): german, germani, mr, parti, year, foreign, people, countri, govern, asylum, polit, nation, law, minist, europ, state, immigr, democrat, wing, social, turkish, west, east, member, attack, ...
  - Cluster 4 (z4): percent, estonia, bank, state, privat, russian, year, enterprise, trade, million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish, loan, invest, fund, product, ...
  - Cluster 3 (z3): research, technology, develop, mar, materi, system, nuclear, environment, electr, process, product, power, energi, countrol, japan, pollution, structur, chemic, plant, ...
  - Cluster 1 (z1): jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east, washington, econom, sign, relat, jerusalem, rabin, syria, iraq, ...

Boosting: Algorithms
• A general method for converting rough rules into a highly accurate prediction rule
• Learning procedure
  - Examine the training set.
  - Derive a rough rule (weak learner).
  - Re-weight the examples in the training set, concentrating on the hard cases for the previous rules.
  - Repeat T times.
• The final rule combines the weak hypotheses, e.g. f(h_1, h_2, h_3, h_4), where each learner h_t is trained under the importance weights of the training documents.
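Returning to the latent variable model above: the sketch below is a direct transcription of the E-step and M-step formulas on the learning slide, run on a dense document-word count matrix. It materializes the full (K x N x M) responsibility array, which is fine for a toy example but not for real corpora; all names are illustrative.

```python
# EM for the aspect model P(d, w) = sum_k P(z_k) P(w|z_k) P(d|z_k).
# n is an (N, M) document-word count matrix; K is the number of topics.
import numpy as np

def train_lvm(n, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, M = n.shape
    Pz = np.full(K, 1.0 / K)
    Pw_z = rng.random((K, M)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)
    Pd_z = rng.random((K, N)); Pd_z /= Pd_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: responsibilities P(z | d, w), shape (K, N, M).
        joint = Pz[:, None, None] * Pd_z[:, :, None] * Pw_z[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate from expected counts n(d, w) P(z | d, w).
        expected = n[None, :, :] * post
        Pw_z = expected.sum(axis=1)
        Pw_z /= Pw_z.sum(axis=1, keepdims=True)
        Pd_z = expected.sum(axis=2)
        Pd_z /= Pd_z.sum(axis=1, keepdims=True)
        Pz = expected.sum(axis=(1, 2)) / n.sum()
    return Pz, Pw_z, Pd_z

# Topic words for topic k: the words with highest P(w | z_k),
# e.g. np.argsort(Pw_z[k])[::-1][:35].
```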
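And for the boosting loop just outlined, a generic AdaBoost-style sketch. The talk names the general re-weighting scheme but not a specific variant, so the alpha/weight formulas here are the standard AdaBoost choices, and the `WeakLearner` interface (with naive Bayes as the intended weak learner) is a hypothetical stand-in.

```python
# Generic boosting loop: derive a weak rule, re-weight the training
# examples to concentrate on the hard cases, repeat T times, and
# combine the weak hypotheses with importance weights.
import numpy as np

def boost(X, y, WeakLearner, T=10):
    """y in {-1, +1}. WeakLearner is assumed to support
    fit(X, y, sample_weight) returning a fitted model, and predict(X)."""
    w = np.full(len(X), 1.0 / len(X))     # example importance weights
    ensemble = []
    for _ in range(T):
        h = WeakLearner().fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = w[pred != y].sum()
        if err == 0 or err >= 0.5:        # perfect, or not a weak learner
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)    # up-weight the misclassified cases
        w /= w.sum()
        ensemble.append((alpha, h))
    return lambda X_new: np.sign(sum(a * h.predict(X_new) for a, h in ensemble))
```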
Boosting: Applied to Text Filtering
• Naive Bayes
  - The traditional algorithm for text filtering:
    c_{NB} = \arg\max_{c_j \in \{\text{relevant}, \text{irrelevant}\}} P(c_j) P(d_i \mid c_j)
           = \arg\max_{c_j} P(c_j) \prod_{k=1}^{n} P(w_{ik} \mid c_j)   (assume independence among terms)
           = \arg\max_{c_j} P(c_j) P(w_{i1} = \text{"our"} \mid c_j) P(w_{i2} = \text{"approach"} \mid c_j) \cdots P(w_{in} = \text{"trouble"} \mid c_j)
• Boosting naive Bayes
  - Use naive Bayes classifiers as the weak learners [Kim & Zhang, SIGIR-2000].

Boosting: Applied to Text Filtering - Experimental Results
• TREC (Text Retrieval Conference)
  - Sponsored by NIST
• TREC-7 filtering datasets (50 topics)
  - Training documents: AP articles (1988), 237 MB, 79,919 documents
  - Test documents: AP articles (1989-1990), 471 MB, 162,999 documents
• TREC-8 filtering datasets (50 topics)
  - Training documents: Financial Times (1991-1992), 167 MB, 64,139 documents
  - Test documents: Financial Times (1993-1994), 382 MB, 140,651 documents
[Figure: example of a document.]
• Compared with the state-of-the-art text filtering systems:
  TREC-7               Boosting   ATT     NTT     PIRC
  Averaged scaled F1    0.474     0.461   0.452   0.500
  Averaged scaled F3    0.467     0.460   0.505   0.509

  TREC-8               Boosting   PLT1    PLT2    PIRC
  Averaged scaled LF1   0.717     0.712   0.713   0.714

  TREC-8               Boosting   CL      PIRC    Mer
  Averaged scaled LF2   0.722     0.721   0.734   0.720

Evolutionary Learning: Applications in IR - Web-Document Retrieval
• Uses link information and HTML tag information for retrieval [Kim & Zhang, 2000].
• Each chromosome is a vector of tag weights (w_1, w_2, w_3, ..., w_n) for the tags <TITLE>, <H>, <B>, ..., <A>; a chromosome's fitness is measured by retrieval performance.

Evolutionary Learning: Applications in IR - Tag Weighting
• Crossover
  - From chromosomes X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), produce offspring Z with z_i = (x_i + y_i)/2 with probability P_c.
• Mutation
  - Change a value of chromosome X with probability P_m.
• Truncation selection

Evolutionary Learning: Applications in IR - Experimental Results
• Datasets
  - TREC-8 Web Track data
  - 2 GB, 247,491 web documents (WT2g)
  - No. of training topics: 10; no. of test topics: 10
[Figure: retrieval performance results.]

Reinforcement Learning: Basic Concept
• The agent observes state s_t, takes action a_t, receives reward r_{t+1} from the environment, and moves to state s_{t+1}.

Reinforcement Learning: Applications in IR - Information Filtering
• WAIR retrieves documents and calculates their similarity to the user profile [Seo & Zhang, 2000].
  - State_i: the user profile
  - Action_i: modify the profile
  - Reward_{i+1}: relevance feedback from the user on the filtered documents
• Document filtering then delivers the filtered documents to the user.

Reinforcement Learning: Experimental Results
[Figure: filtering performance (%) with explicit feedback.]
[Figure: filtering performance (%) with implicit feedback.]
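Going back to the evolutionary tag-weighting slides above, here is a small sketch of the variation operators they define: averaging crossover z_i = (x_i + y_i)/2 applied with probability P_c, mutation with probability P_m, and truncation selection. The per-gene application of both probabilities and the Gaussian form of the mutation are my assumptions; the slides only say a value is changed with probability P_m, and the fitness function (retrieval performance) is left abstract.

```python
# Variation operators for tag-weight chromosomes (w_1, ..., w_n).
# Assumptions: P_c and P_m are applied per gene, and mutation adds
# Gaussian noise; the slides do not pin these details down.
import numpy as np

rng = np.random.default_rng(0)

def crossover(x, y, pc=0.7):
    """Offspring gene z_i = (x_i + y_i) / 2 with probability pc, else x_i."""
    mask = rng.random(x.size) < pc
    return np.where(mask, (x + y) / 2.0, x)

def mutate(z, pm=0.05, scale=0.1):
    """Perturb each gene with probability pm (Gaussian noise assumed)."""
    mask = rng.random(z.size) < pm
    return np.where(mask, z + rng.normal(0.0, scale, z.size), z)

def truncation_select(pop, fitness, frac=0.5):
    """Keep the top `frac` fraction of chromosomes as parents."""
    order = np.argsort(fitness)[::-1]
    return pop[order[: max(1, int(frac * len(pop)))]]
```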