Web Mining for Unknown Term Translation
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering
[email protected]
http://myweb.ncku.edu.tw/~whlu

Web Mining Research Problems
• Difficulties in automatic construction of multilingual translation lexicons
  – Techniques: parallel/comparable corpora
  – Bottlenecks: lack of diverse, multilingual resources
• Difficulties in query translation for cross-language information retrieval (CLIR)
  – Techniques: bilingual dictionaries / machine translation / parallel corpora
  – Bottlenecks: multiple-sense, short, diverse, and unknown queries
• Challenges: Web queries are often
  – Short: 2-3 words (Silverstein et al. 1998)
  – Diverse: wide-ranging topics
  – Unknown (out of vocabulary): 74% are unavailable in the CEDICT Chinese-English electronic dictionary, which contains 23,948 entries
  – E.g.
    • Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein)
    • New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infections)

Cross-Language Information Retrieval
• Query in a source language and retrieve relevant documents in target languages
• Example source queries: SARS, 愛因斯坦 (Einstein), 老年癡呆症 (senile dementia), National Palace Museum
Source Query → Query Translation → Target Translation → Information Retrieval → Target Documents

Difficulties in Web Query Translation Using Machine Translation
• English source query: National Palace Museum
• Chinese translation produced by MT: 全國宮殿博物館

Research Paradigm
• New approach: a live translation lexicon built by Web mining
  – Anchor-text mining (multilingual anchor texts on the Internet)
  – Search-result mining (language-mixed texts in search-result pages)
  – Term-translation extraction
• Applications: cross-language information retrieval, cross-language Web search

Anchor-Text Mining with Probabilistic Inference Model
• Asymmetric translation models: P(t|s) = P(s∩t)/P(s) and P(s|t) = P(s∩t)/P(t)
• Symmetric model with link information:
  P(s∩t) = Σ_{i=1..n} P(s∩t|u_i) P(u_i)
  – Conventional translation model: P(s∩t|u_i) ≈ P(s|u_i) P(t|u_i)
  – Co-occurrence model: P(s∩t|u_i) = P(s|u_i) + P(t|u_i) − P(s∪t|u_i)
  – where P(u_i) = L(u_i) / Σ_{j=1..n} L(u_j), and L(u_j) is the number of u_j's in-links (page authority)

Transitive Translation Model for Multilingual Translation
(s: source term, t: target translation, m: intermediate translation)
• Direct translation model (probabilistic inference):
  P_direct(s, t) = P(s↔t)
  – e.g. s = 新力 (Traditional Chinese) → t = ソニー (Japanese)
• Indirect translation model:
  P_indirect(s, t) = Σ_m P(s↔m) P(m↔t) P(m),
  where P(m) is the occurrence probability of m in the corpus
  – e.g. s = 新力 → m = Sony (English) → t = ソニー
• Transitive translation model:
  P_trans(s, t) = P_direct(s, t) if P_direct(s, t) > θ; otherwise P_indirect(s, t),
  where θ is a predefined threshold value

Promising Results for Automatic Construction of Multilingual Translation Lexicons

Source term (Traditional Chinese) | English | Simplified Chinese | Japanese
新力 | Sony | 索尼 | ソニー
耐吉 | Nike | 耐克 | ナイキ
史丹佛 | Stanford | 斯坦福 | スタンフォード
雪梨 | Sydney | 悉尼 | シドニー
網際網路 | internet | 互联网 | インターネット
網路 | network | 网络 | ネットワーク
首頁 | homepage | 主页 | ホームページ
電腦 | computer | 计算机 | コンピューター
資料庫 | database | 数据库 | データベース
資訊 | information | 信息 | インフォメーション

Search-Result Mining
• Goal: improve translation coverage for diverse queries
• Idea
  – Chi-square test: co-occurrence relation
  – Context-vector analysis: context information
• Chi-square similarity measure, based on a 2-way contingency table:
  S_χ²(s, t) = N (a·d − b·c)² / [(a+b)(a+c)(b+d)(c+d)]

        t    ~t
   s    a    b
  ~s    c    d

  (a, b, c, d are page counts for the four joint-occurrence cases; N = a+b+c+d)
• Context-vector similarity measure (cosine):
  S_CV(s, t) = Σ_{i=1..m} w_si w_ti / ( √(Σ_{i=1..m} w_si²) · √(Σ_{i=1..m} w_ti²) )
• Weighting scheme: TF*IDF
  w_ti = ( f(t_i, d) / max_j f(t_j, d) ) · log(N/n)
  where f(t_i, d) is the frequency of t_i in search-result page d, N is the total number of Web pages, and n is the number of pages containing t_i.

Workshop on Web Mining Technology and Applications (Dec. 13, 2006)
Panel: Web Mining: Recent Development and Trends
曾新穆 (Prof. Vincent S. Tseng)
Department of Computer Science and Information Engineering, National Cheng Kung University (成功大學資訊工程系)

Main Categories of Web Mining
• Web content mining
• Web usage mining
• Web structure mining

Web Content Mining: Trends
• Deep web mining
• Semantic web mining
• Vertical search
• Web multimedia content mining
  – Web image/video search
  – Web image/video annotation/classification/clustering
  – Web multimedia content filtering (example: YouTube)
• Integration with web log mining

Web Usage Mining
• Developed techniques
  – Mining of frequent usage patterns: association rules, sequential patterns, traversal patterns, etc.
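The direct-then-indirect fallback of the transitive translation model can be sketched as follows. This is a toy illustration, not the mined lexicon: the probability tables, the threshold value, and the function name `transitive_score` are all assumptions of this sketch.

```python
# Toy sketch of the transitive translation model: use the direct score when it
# clears the threshold, otherwise fall back to an indirect route through an
# intermediate (English) translation m. All probabilities are illustrative.

THETA = 0.3  # predefined threshold value (assumed)

# P_direct(s, t): direct translation scores, e.g. mined from anchor texts
P_DIRECT = {("新力", "ソニー"): 0.05}   # too weak to pass the threshold

# P(s <-> m), P(m <-> t), and P(m): the indirect route
P_S_M = {("新力", "Sony"): 0.8}
P_M_T = {("Sony", "ソニー"): 0.7}
P_M = {"Sony": 0.9}  # occurrence probability of the intermediate term

def transitive_score(s, t):
    direct = P_DIRECT.get((s, t), 0.0)
    if direct > THETA:
        return direct
    # indirect: sum over candidate intermediate translations m
    return sum(P_S_M.get((s, m), 0.0) * P_M_T.get((m, t), 0.0) * pm
               for m, pm in P_M.items())

print(transitive_score("新力", "ソニー"))  # about 0.504, via m = "Sony"
```

Because the direct score (0.05) falls below the threshold, the score is computed along the 新力 → Sony → ソニー route.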
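The χ² and context-vector similarity measures from the search-result mining slides above can be sketched directly from their formulas. The function names and the toy inputs are assumptions; only the formulas come from the slides.

```python
import math

def chi_square_sim(a, b, c, d):
    """S_chi2(s,t) = N*(a*d - b*c)^2 / ((a+b)(a+c)(b+d)(c+d)),
    where a..d are the cells of the 2-way contingency table of s and t."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def cosine_sim(ws, wt):
    """Context-vector similarity: cosine between two weight vectors."""
    dot = sum(x * y for x, y in zip(ws, wt))
    norm = math.sqrt(sum(x * x for x in ws)) * math.sqrt(sum(y * y for y in wt))
    return dot / norm if norm else 0.0

def tfidf(freq, max_freq, total_pages, pages_with_term):
    """The TF*IDF weighting from the slides: (f / max_j f) * log(N / n)."""
    return (freq / max_freq) * math.log(total_pages / pages_with_term)

# Toy contingency table: s and t co-occur on 10 pages, rarely apart.
print(chi_square_sim(10, 2, 3, 85))
```

A term pair that co-occurs far more often than chance (first call) scores much higher than an independent pair, which scores 0.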
• Trends
  – Personalization
  – Recommendation (e.g. Web ads)
  – Incorporation of content semantics/ontologies
  – Consideration of temporality
  – Extension to mobile web applications
  – Multidiscipline integration

Problem: Under-Utilization of Clickstream Data
• Shop.org: U.S.-based visits to retail Web sites exceeded 10% of total Internet traffic for the first time ever on Thanksgiving, 2004
• Top sites: eBay, Amazon.com, Dell.com, Walmart.com, BestBuy.com, and Target.com
• Aberdeen Group: 70% of site companies use clickstream data only for basic website management!

Challenges for Clickstream Data Mining (Arun Sen et al., Communications of the ACM, Nov. 2006)
• Problems with the data
  – Data incompleteness
  – Very large data size
  – Messiness in the data
  – Integration problems with enterprise data
• Too many analytical methodologies
  – Web-metric-based methodologies
  – Basic marketing-metric-based methodologies
  – Navigation-based methodologies
  – Traffic-based methodologies
• Data analysis problems
  – Across-dimension analysis problems
  – Timeliness of data mining under very large data sizes
  – Determining useful/actionable analyses among thousands of metrics

Web Information Extraction: The Issues for Unsupervised Approaches
Dr. Chia-Hui Chang (張嘉惠)
Department of Computer Science and Information Engineering, National Central University, Taiwan
(Talk given at the 2006 Workshop on Web Mining Technology and Trends, 網路探勘技術與趨勢研討會)

Outline
• Web information extraction
  – The key to web information integration
• Three dimensions
  – Task definition
  – Automation degree
  – Technology
• Focus on the template-page IE task
  – Issues for record-level IE
  – Techniques for solving these issues

Introduction
• The coverage of Web information is very wide and diverse
  – The Web has changed the way we obtain information.
  – Information search on the Web is no longer enough.
  – The need for Web information integration is stronger than ever, for both businesses and individuals.
• Understanding Web pages and discovering valuable information from them is called Web content mining; information extraction is one of the keys to Web content mining.

Web Information Integration
• From information search, to information extraction, to information mapping
  1. Focused crawling / Web page gathering (information search)
  2. Information (data) extraction: discovering structured information from the input
  3. Schema matching: a unified interface / single ontology

Three Dimensions to See IE
• Task definition
  – Input: unstructured free texts, semi-structured Web pages
  – Output targets: record-level, page-level, site-level
• Automation degree
  – Programmer-involved, annotation-based, or annotation-free approaches
• Techniques
  – Learning algorithm: specific-to-general or general-to-specific
  – Rule type: regular-expression rules vs. logic rules
  – Deterministic finite-state transducers vs. probabilistic hidden Markov models

IE from Nearly-Structured Documents
• Google search results: multiple-record Web pages
• Amazon.com book pages: single-record pages

IE from Semi-Structured Documents
• Ungrammatical snippets, e.g. a publication list of selected articles

Information Extraction from Free Texts
• Filling slots in a database from sub-segments of text (named-entity extraction). Example:

October 14, 2002, 4:00 a.m. PT. For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying...

• IE output:

  NAME              TITLE     ORGANIZATION
  Bill Gates        CEO       Microsoft
  Bill Veghte       VP        Microsoft
  Richard Stallman  founder   Free Software Foundation

[Excerpted from Cohen & McCallum's talk]

Information Extraction from Free Texts, as a family of techniques:
• Information extraction = segmentation + classification + association + clustering
• Applied to the same passage, this yields segmented and associated entities such as:
  * Microsoft Corporation | CEO | Bill Gates
  * Microsoft | VP | Bill Veghte
  * Free Software Foundation | founder | Richard Stallman

[Excerpted from Cohen & McCallum's talk]
Dimension 1: Task Definition - Input
Dimension 1: Task Definition - Output
• Attribute level (single-slot): named-entity extraction, concept annotation
• Record level: relations between slots
• Page level: all data embedded in a dynamic page
• Site level: all information about a web site

Template Page Generation & Extraction
• Generation/encoding: a template T plus database values x are combined by a CGI program (T, x) to produce the output pages
• Extraction/decoding: a reverse-engineering task

Dimension 2: Automation Degree
• Programming-based: for programmers
• Supervised learning: a set of labeled examples
• Semi-supervised learning / active learning: interactive wrapper induction
• Unsupervised learning: mostly for template pages only

Tasks vs. Automation Degree
• High automation degree (unsupervised): template-page IE
• Semi-automatic / interactive: semi-structured document IE
• Low automation degree (supervised): free-text IE

Dimension 3: Technologies
• Learning technology
  – Supervised: rule generalization, hypothesis testing, statistical modeling
  – Unsupervised: pattern mining, clustering
• Features used
  – Plain-text information: tokens, token classes, etc.
  – HTML information: DOM tree paths, siblings, etc.
  – Visual information: font, style, position, etc.
• Rule types (expressiveness of the rules)
  – Regular expressions, first-order logic rules, HMM models

Issues for Unsupervised Approaches
• For record-level extraction
  1. Data-rich section discovery
  2. Record boundary (separator) mining
  3. Schema detection & data annotation
• For page-level extraction
  – Schema detection: differentiate template tokens from data tokens

Some Related Works on Unsupervised Approaches
• Record-level
  – IEPAD [Chang and Liu, WWW2001]
  – DeLa [Wang and Lochovsky, WWW2003]
  – DEPTA [Zhai and Liu, WWW2005]
  – ViPER [Simon and Lausen, CIKM2005]
  – ViNT [Zhao et al., WWW2005]
• Page-level
  – RoadRunner [Crescenzi, VLDB2001]
  – EXALG [Arasu and Garcia-Molina, SIGMOD2003]
  – MSE [Zhao et al., VLDB2006]

Issue 1: Data-Rich Section Discovery
• Comparing a normal page with a no-result page, or comparing two normal pages
  – Locate static text lines, e.g. "Books", "Related Searches", "Narrow or Expand Results", "Showing Results …"
  – ViNT [Zhao et al., WWW2005], MSE [Zhao et al., VLDB2006]
• Similarity between two adjacent leaf nodes; 1-dimension clustering; pitch estimation
  – HL(R) [Papadakis et al., SAINT2005]

Issue 2: Record Boundary Mining
• String pattern mining, e.g.
  <html><body><b>T</b><ol>
  <li><b>T</b>T<b>T</b>T</li>
  <li><b>T</b>T<b>T</b></li>
  </ol></body></html>
  – IEPAD [Chang and Liu, WWW2001], DeLa [Wang and Lochovsky, WWW2003]
• Tree pattern mining, e.g.
  <P><A>T</A><A>T</A>T</P>
  <P><A>T</A>T</P>
  <P><A>T</A>T</P>
  <P><A>T</A>T</P>
  – DEPTA [Zhai and Liu, WWW2005]

Issue 2: Record Boundary Mining (Cont.)
• Finding repeated separators from visually encoded content lines
• Heuristics (visual cues)
  – A line following an HR-LINE
  – A unique line in a block that starts with a number
  – The line in a block with the smallest position code (only one)
  – The line following a BLANK line is the first line
  – ViNT [Zhao et al., WWW2005], ViPER [Simon and Lausen, CIKM2005]

Issue 3: Data Schema Detection
• Alignment of the multiple records found
  – Handling missing attributes and multiple-value attributes
  – String alignment or tree alignment
  – Examining two records at a time
• Differentiating template tokens from data tokens under some assumptions
  – Tag tokens are considered part of the template
  – Text lines are usually data, except for static text lines
• Similar to the problem of page-level IE tasks

Page-Level IE: EXALG [Arasu and Garcia-Molina, SIGMOD2003]
• Identifying static markers (tag and word tokens) from multiple pages
  – An occurrence vector for each token
  – Critical point: tags are not easy to differentiate, compared with the text lines used in [Zhao et al., VLDB2006]
• Differentiating token roles
  – By DOM tree path
  – By position in the EC class
• Equivalence class (EC): group tokens with the same occurrence vector
• LFECs form the template
  – e.g. <1,1,1,1>: {<html>, <body>, <table>, </table>, </body>, </html>}

On the Use of Techniques
• From supervised to unsupervised approaches
• From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
• From two-page summarization (MSE) to multiple-page summarization (EXALG)

Summary
• In this talk: Web information extraction; three dimensions; focus on the template-page IE task (issues for unsupervised approaches and techniques for solving them)
• Not in this talk: probabilistic models for free-text IE tasks

Personal Vision
• From information search to information integration
• Better UI for information integration
  – Information collection: focused crawling
  – Information extraction
  – Schema matching and integration
• Not only for businesses but also for individuals

References – Record Level
• C.-H. Chang, S.-C. Lui: IEPAD: Information Extraction Based on Pattern Discovery. WWW2001
• B. Liu, R. Grossman, Y. Zhai: Mining Data Records in Web Pages. SIGKDD2003
• Y. Zhai, B. Liu:
Web Data Extraction Based on Partial Tree Alignment. WWW2005
• K. Simon, G. Lausen: ViPER: Augmenting Automatic Information Extraction with Visual Perceptions. CIKM2005
• H. Zhao, W. Meng, V. Raghavan, C. Yu: Fully Automatic Wrapper Generation for Search Engines. WWW2005

References – Page Level & Survey
• A. Arasu, H. Garcia-Molina: Extracting Structured Data from Web Pages. SIGMOD2003
• V. Crescenzi, G. Mecca, P. Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB2001
• H. Zhao, W. Meng, C. Yu: Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages. VLDB2006
• A. Laender, B. Ribeiro-Neto, A. da Silva, J. Teixeira: A Brief Survey of Web Data Extraction Tools. ACM SIGMOD Record, 2002
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan: A Survey of Web Information Extraction Systems. IEEE TKDE, 2006

Taxonomic Information Integration: Challenges and Applications
Cheng-Zen Yang (楊正仁)
Department of Computer Science and Engineering, Yuan Ze University
[email protected]

Outline
• Introduction
• Problem statement
• Integration approaches
  – Flattened catalog integration
  – Hierarchical catalog integration
• Applications
• Conclusions and future work

Introduction
• As the Internet develops rapidly, the number of online Web pages has become very large.
  – Many Web portals offer taxonomic information (catalogs) to facilitate information search [AS2001].
• These catalogs may need to be integrated when Web portals merge.
  – B2B electronic marketplaces bring together many online suppliers and buyers.
• An integrated Web catalog service can help users
  – gain more relevant and better-organized information in one catalog, and
  – save much of the time spent browsing different Web catalogs.
• B2C e-commerce example: Amazon

The Taxonomic Information Integration Problem
• Taxonomic information integration is more than a simple classification task.
• When implicit source information is exploited, integration accuracy can be greatly improved.
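Since flattened catalog integration builds on text classification, a minimal naive Bayes sketch shows the base classifier that the enhanced approaches discussed below extend. The training data, whitespace tokenization, and function names are toy assumptions, not real catalogs.

```python
import math
from collections import Counter

# Toy destination-catalog training documents (assumed, not real data).
train = {
    "Books":       ["novel author paperback", "author fiction novel"],
    "Electronics": ["camera battery lens", "battery charger camera"],
}

def train_nb(data):
    """Estimate class priors Pr(C) and per-class word counts for Pr(w|C)."""
    priors, counts_by_cat, vocab = {}, {}, set()
    total_docs = sum(len(docs) for docs in data.values())
    for cat, docs in data.items():
        priors[cat] = len(docs) / total_docs
        counts = Counter(w for doc in docs for w in doc.split())
        counts_by_cat[cat] = counts
        vocab.update(counts)
    return priors, counts_by_cat, vocab

def classify(doc, priors, counts_by_cat, vocab):
    """Return argmax_C Pr(C) * prod_w Pr(w|C), with Laplace smoothing."""
    def log_score(cat):
        counts = counts_by_cat[cat]
        total = sum(counts.values())
        s = math.log(priors[cat])
        for w in doc.split():
            s += math.log((counts[w] + 1) / (total + len(vocab)))
        return s
    return max(priors, key=log_score)

priors, counts_by_cat, vocab = train_nb(train)
print(classify("camera lens", priors, counts_by_cat, vocab))  # -> Electronics
```

The enhanced approaches keep this classifier but sharpen the prior using the document's source-catalog category, i.e. they estimate Pr(Ci | d, S) rather than Pr(Ci | d).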
• Past studies have shown that the naïve Bayes classifier, SVMs, and the maximum entropy model enhance the accuracy of Web catalog integration in a flattened catalog integration structure.

The Problem Statement (1/2)
• Flattened catalog integration
  – The source catalog S, containing categories S1, S2, …, Sm, is to be integrated into the destination catalog D, consisting of categories D1, D2, …, Dn.
  – (Figure: the documents of each source category Si are integrated into the corresponding destination categories D1…Dn.)

The Problem Statement (2/2)
• Hierarchical catalog integration
  – (Figure: the URL hierarchy of catalog S, with categories S1 and S2, is integrated into the category hierarchy D1-D3 of catalog D.)

Integration Approaches for Flattened Catalogs

The Enhanced Naïve Bayes Approach
• The pioneering work [AS2001]
  – exploits the implicit source information and improves the integration accuracy.
• The naïve Bayes approach (d: a test document in the source catalog; Ci: a category in the destination catalog; S: a category in the source catalog):
  Pr(Ci | d) = Pr(Ci) Pr(d | Ci) / Pr(d)
• The enhanced naïve Bayes approach:
  Pr(Ci | d, S) = Pr(Ci | S) Pr(d | Ci) / Pr(d | S)

Probabilistic Enhancement and Topic Restriction
• NB and SVM [TCCL2003]
• Probabilistic enhancement (x: a test document in the source catalog; vt: a class label in the destination catalog; s: the class label of x in the source catalog):
  v_PE(x) = argmax_{vt ∈ H2} Pr(vt | x) Pr(vt | s) / Pr(vt)
• Topic restriction
  – (Figure: source categories are first restricted to plausible destination topics before classification.)

The Pseudo-Relevance-Feedback Approach
• Iterative-Adapting SVM [CHY2005]

An Application Example: Searching for Multilingual News Articles
• Many Web portals provide monolingual news integration services.
• Unfortunately, users cannot effectively find the related news in other languages.

The Basic Idea
• Web portals have already grouped related news articles.
• These articles should be about the same main story.
• Can we discover these cross-language mappings?

Techniques in Our Current Work
• Machine translation
• Taxonomy integration
• Mapping finding

Taxonomy Integration
• The cross-training process [SCG2003]: making better inferences about label assignments in another taxonomy
  – English news features and Chinese news features train a 1st SVM; semantically overlapped features train a 2nd SVM, producing English-Chinese news category mappings.

Mapping Decision
• The SVM-BCT classifiers calculate the positively-mapped ratios as the mapping score (MSi) to predict the semantic overlapping [YCC2006].
• MSi is the mapping score of Si → Dj.
• The mappings can then be ranked according to their scores.
Performance Evaluation
• NLP resources
  – Standard Segmentation Corpus from ACLCLP (42,023 segmented words)
  – Bilingual wordlists (version 2.0) from the Linguistic Data Consortium (LDC)
    • Chinese-to-English version 2 (ldc2ce), with about 120K records
    • English-to-Chinese (ldc2ec), with about 110K records

Experimental Datasets
• Properties
  – News reports in the international news category of Google News, Taiwan and U.S. versions
  – May 10, 2005 - May 23, 2005
  – 20 news event categories per day
  – Chinese-to-English: 46.9 MB; English-to-Chinese: 80.2 MB
  – 29,182 news stories

Conclusions and Future Work
• Conclusions
  – Taxonomic information integration is an emerging issue for Web information mining.
  – New approaches for flattened and hierarchical catalog integration are still needed.
  – Our approaches are a first stage for taxonomic information integration.
• Future work
  – Taxonomy alignment: heterogeneous catalog integration [JUNG2006]
  – Incorporating more conceptual information: WordNet, Sinica BOW, etc.
  – Evaluation with other classifiers: EM, ME, etc.

References
• [AS2001] Agrawal, R., Srikant, R.: On Integrating Catalogs. Proc. the 10th WWW Conf. (WWW10), (May 2001) 603–612
• [BOYAPATI2002] Boyapati, V.: Improving Hierarchical Text Classification Using Unlabeled Data. Proc. the 25th Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR'02), (Aug. 2002) 363–364
• [CHY2005] Chen, I.-X., Ho, J.-C., Yang, C.-Z.: An Iterative Approach for Web Catalog Integration with Support Vector Machines. Proc. Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 703–708
• [DC2000] Dumais, S., Chen, H.: Hierarchical Classification of Web Content. Proc. the 23rd Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR'00), (Jul. 2000) 256–263
• [HCY2006] Ho, J.-C., Chen, I.-X., Yang, C.-Z.: Learning to Integrate Web Catalogs with Conceptual Relationships in Hierarchical Thesaurus. Proc.
the 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006) 217–229
• [JOACHIMS1998] Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc. the 10th European Conf. on Machine Learning (ECML'98), (1998) 137–142
• [JUNG2006] Jung, J. J.: Taxonomy Alignment for Interoperability Between Heterogeneous Digital Libraries. Proc. the 9th Int'l Conf. on Asian Digital Libraries (ICADL 2006), (Nov. 2006) 274–282
• [KELLER1997] Keller, A. M.: Smart Catalogs and Virtual Catalogs. In: Kalakota, R., Whinston, A. (eds.): Readings in Electronic Commerce. Addison-Wesley (1997)
• [KKL2002] Kim, D., Kim, J., Lee, S.: Catalog Integration for Electronic Commerce through Category-Hierarchy Merging Technique. Proc. the 12th Int'l Workshop on Research Issues in Data Engineering: Engineering e-Commerce/e-Business Systems (RIDE'02), (Feb. 2002) 28–33
• [MLW2003] Marron, P. J., Lausen, G., Weber, M.: Catalog Integration Made Easy. Proc. the 19th Int'l Conf. on Data Engineering (ICDE'03), (Mar. 2003) 677–679
• [RR2001] Rennie, J. D. M., Rifkin, R.: Improving Multiclass Text Classification with the Support Vector Machine. Tech. Report AI Memo AIM-2001-026 and CCL Memo 210, MIT (Oct. 2001)
• [SCG2003] Sarawagi, S., Chakrabarti, S., Godbole, S.: Cross-Training: Learning Probabilistic Mappings between Topics. Proc. the 9th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, (Aug. 2003) 177–186
• [SH2001] Stonebraker, M., Hellerstein, J. M.: Content Integration for E-Commerce. Proc. the 2001 ACM SIGMOD Int'l Conf. on Management of Data, (May 2001) 552–560
• [SLN2003] Sun, A., Lim, E.-P., Ng, W.-K.: Performance Measurement Framework for Hierarchical Text Classification. Journal of the American Society for Information Science and Technology (JASIST), Vol. 54, No. 11, (June 2003) 1014–1028
• [TCCL2003] Tsay, J.-J., Chen, H.-Y., Chang, C.-F., Lin, C.-H.: Enhancing Techniques for Efficient Topic Hierarchy Integration. Proc. the 3rd Int'l Conf. on Data Mining (ICDM'03), (Nov. 2003) 657–660
• [WTH2005] Wu, C.-W., Tsai, T.-H., Hsu, W.-L.: Learning to Integrate Web Taxonomies with Fine-Grained Relations: A Case Study Using Maximum Entropy Model. Proc. Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 190–205
• [YCC2006] Yang, C.-Z., Chen, C.-M., Chen, I.-X.: A Cross-Lingual Framework for Web News Taxonomy Integration. Proc. the 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006) 270–283
• [YL1999] Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proc. the 22nd Annual ACM Conference on Research and Development in Information Retrieval, (Aug. 1999) 42–49
• [ZADROZNY2002] Zadrozny, B.: Reducing Multiclass to Binary by Coupling Probability Estimates. In: Dietterich, T. G., Becker, S., Ghahramani, Z. (eds.): Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press (2002)
• [ZL2004WWW] Zhang, D., Lee, W. S.: Web Taxonomy Integration Using Support Vector Machines. Proc. WWW2004, (May 2004) 472–481
• [ZL2004SIGIR] Zhang, D., Lee, W. S.: Web Taxonomy Integration through Co-Bootstrapping. Proc. SIGIR'04, (July 2004) 410–417

Mining in the Middle: From Search to Integration on the Web
Kevin C. Chang
Joint work with the UIUC and Cazoodle teams

To Begin With: What Is "the Web"? (Or: How Do Search Engines View the Web?)
• Version 0.1: "The Web is a SET of PAGES."
• Version 1.1: "The Web is a GRAPH of PAGES."
• But... what have you been searching lately? Structured data: prevalent but ignored!
• Version 2.1, our view: the Web is "distributed bases" of "data entities".
• Challenges on the Web come in a "dual": getting access to the structured information!
Kevin's 4-Quadrants: Access x Structure, Deep Web x Surface Web
• We are inspired: from search to integration, mining in the middle!

Challenge of the Deep Web. Access: How to Get There?
MetaQuerier: Holistic Integration over the Deep Web
• The previous Web: search used to be "crawl and index"
• The current Web: search must eventually resort to integration
• MetaQuerier: exploring and integrating the deep Web (e.g. Cars.com, Amazon.com, Apartments.com, 411localte.com)
  – Explorer (FIND sources; a "db of dbs"): source discovery, source modeling, source indexing
  – Integrator (QUERY sources; a unified query interface): source selection, schema integration, query mediation
• The challenge: how to deal with "deep" semantics across a large scale? "Semantics" is the key in integration!
  – How to understand a query interface? Where is the first condition? What is its attribute?
  – How to match query interfaces? What does "author" on this source match on that one?
  – How to translate queries? How to ask this query on that source?

Survey the Frontier Before Going to the Battle
• The challenge, reassured:
  – 450,000 online databases
  – 1,258,000 query interfaces
  – 307,000 deep web sites
  – a 3-7x increase in 4 years
• The insight, revealed:
  – Web sources are not arbitrarily complex: the "Amazon effect"
  – Convergence and regularity naturally emerge

The "Amazon Effect" in Action
• Attributes converge within a domain!
• Condition patterns converge even across domains!

Search Moves On to Integration
• Don't believe me? See what Google has to say...
• DB people: buckle up! To embrace the burgeoning of structured data on the Web.

Challenge of the Surface Web. Structure: What to Look For?
WISDM: Holistic Search over the Surface Web
• Despite all the glorious search engines, are we searching for what we want?
• What have you been searching lately?
  – What is the email of Marc Snir?
  – What is Marc Snir's research area?
  – Who are Marc Snir's coauthors?
  – What are the phones of the CS database faculty?
  – How much is a "Canon PowerShot A400"?
  – Where is SIGMOD 2006 to be held?
  – When is the due date of SIGMOD 2006?
  – Find PDF files of "SIGMOD 2006"?
• NO! Regardless of what you want, you are searching for pages...

Your Creativity Is Amazing: A Few Examples
• WSQ/DSQ at Stanford: uses page counts to rank term associations
• QXtract at Columbia: generates keywords to retrieve documents useful for extraction
• KnowItAll at Washington: both ideas in one framework
• And there must be many I don't know yet...

Time to Distill, to Build a Better "Mining" Engine?
• What is an "entity"? Your target of information, or anything:
  – phone number, email address, PDF, image, person name, book title, author, price (of something)
• We take an entity view of the Web.

How Different Is "Entity Search"?
• How to define such searches? Let's motivate by contrasting the two processes.
• Page retrieval:
  1. Input: pages.
  2. Criteria: content keywords.
  3. Scope: each page itself.
  4. Output: one page per result (e.g. a query for "Marc Snir").
• Entity search:
  1. Input: probabilistic entities.
  2. Criteria: contextual patterns.
  3. Scope: holistic aggregates.
  4. Output: associative results.

What Are the Technical Challenges? (Or: How to Write Reviewer-Friendly Papers?)
• More issues:
  – Tagging/merging of basic entities? Application-driven tagging; the Web's redundancy will alleviate the accuracy demand.
  – A powerful pattern language: linguistic; visual.
  – Advanced statistical analysis: correlation; sampling.
  – Scalable query processing: do the new components scale?
Promises of the Concepts
• From page-at-a-time to entity-tuple-at-a-time: getting directly to the target information and evidence
• From IR to a mining engine: not only page retrieval but also construction
• From offline to online Web mining and integration: enabling large-scale ad-hoc mining over the Web
• From the Web to a controlled corpus: enhancing not only efficiency but also effectiveness
• From passive to active application-driven indexing: enabling mining applications

Conclusion: Mining in Just the Middle!
• Dual challenges:
  – Getting access to the deep Web.
  – Getting structure from the surface Web.
• Central techniques:
  – Holistic mining for both search and integration (Search → Mining → Integration).
• What will such a mining engine be? You tell me! Students' imagination knows no bounds.