Slide 1 - Home, WAMDM, Database Group at Renmin

Deep Web Integration: Querying Structured Data on the Deep Web
Fangjiao Jiang

Outline
- Background
- Access Deep Web
- MetaQuerier
- Metasearch engine vs. MetaQuerier
- Related research groups
- Conclusion
- …
- Some suggestions

Part 1: Background

The previous Web: things are just on the surface

The current Web: getting "deeper"
- A great deal of data is hidden behind query forms

The problem of accessing data from the Deep Web
- Deep = not accessible through traditional search engines

Why is it important?
- More than 10 million distinct forms

Why is it important?
- Up to 5,000 billion dynamic result pages

Why is it important? Google's recent survey [CIDR 2007]
- If there are 1 billion web pages: 25 million potential Deep Web sources

Challenge: How to enable effective access to the Deep Web?
(Example shown: Cars.com)

Part 2: Access the Deep Web

Three different approaches
- Warehouse-like approach
  [Figure labels: Web Database; Repository]
- MetaQuerier
  [Figure labels: QUERY; Integrated query interface; Web Database, …, Web Database; Web databases]
- Surfacing the Deep Web
  1) Pre-compute appropriate queries over the forms
  2) Insert the resulting pages into a web-search index

(1) Warehouse-like approach
[Figure labels: Web Database (several); Journal; Conf.; Author Homepage; Homepage; PDF, PS, DOC; Chinese Journal Full-text Database; National Natural Science Foundation Information Database]

(2) MetaQuerier
[Figure labels: Front-end: Query Execution; Source Selection; Query Translation; Result processing; Schema matching; Find Web databases; Query Web databases; The Deep Web; Deep Web Repository (Query Interfaces, Query Capabilities, Subject Domains, Unified Interfaces); Back-end: Semantics Discovery; Database Crawler; Interface Extraction; Source Clustering; interface integration]
MetaQuerier is what we focus on.

(3) Surfacing the Deep Web [VLDB'08]
- Viewpoint
  - Many domains and many languages
  - No human in the loop, no site-specific scripts
- Main idea
  - Predicting input values for text boxes
  - Predicting input combinations
- Google's Deep-Web crawling system
  - Affects more than 1000 queries per second
  - Enables access to more than a million Deep-Web sites
  - Spans 50+ languages and 100+ domains

Part 3: MetaQuerier

A Survey on Deep Web [SIGMOD 2006]
- How many Deep-Web sources are out there?
  - 307,000 sites, 450,000 DBs, 1,258,000 interfaces
- How structured is the Deep Web?
  - 348,000 (structured) : 102,000 (text), roughly 3 : 1
- How do search engines cover them?
  - They cover 10% of the sources
- What is the subject distribution of Web databases?
  - Across all areas
- How complex are they?
  - "Amazon effects"

Reported the "Amazon effect"…
- Attributes converge in a domain!
- Condition patterns converge even across domains!

Technical Challenges
- How to discover the query interface?
  - Which form is the query interface of a Web database?
- How to understand a query interface?
  - Where is the first condition? What is its attribute?
- How to match query interfaces?
  - What does "author" on this source match on that one?
- How to translate queries?
  - How to ask this query on that source?

Technical Challenges
- How to extract the query results?
  - According to visual information?
- How to identify the same entity?
  - Especially large-scale entity identification
- How to annotate the query results?
  - How to specify the semantics of the data?

Part 4: Metasearch Engine vs. MetaQuerier

Preliminary
[Figure: online data divides into the Surface Web and the Deep Web. Surface Web: Search Engine 1, Search Engine 2, …, Search Engine n, combined by a Metasearch Engine (example: mamma.com). Deep Web: Web database 1, Web database 2, …, Web database n, combined by a MetaQuerier (example: Addall.com).]

Search Engine vs. Web Database
- Search Engine: document search engine
- Web Database: database search engine (marked "OK" on the slide)
- Key technology
  - Crawling the Web
    - Re-crawl: changed, added
  - Indexing Web pages
    - Index terms
    - Stop words
    - Stemming
    - Inverted file structure: Term (p, w)

Search Engine vs. Web Database
- Search Engine: document search engine
- Web Database: database search engine (marked "OK" on the slide)
- Key technology
  - Ranking pages
    - Similarity (Query, Page)
    - Linkage information (PageRank)
  - Result organization
    - Matching score (descending)
    - Clustering/categorizing
    - Large
    - "apple"
  - Effective and efficient retrieval
    - Recall-precision curve

Metasearch Engine vs. MetaQuerier
[Figure: the same diagram as on the "Preliminary" slide.]

Metasearch Engine vs. MetaQuerier
- Metasearch Engine
  - Search engine selection
  - Search result extraction
  - Result merging
- MetaQuerier
  - Query interface integration
  - Database selection
  - Query translation
  - Result extraction, entity identification, and annotation

Part 5: Main research groups

Main research groups
- Weiyi Meng, Professor, Binghamton University
  - Also listed: Yiyao Lu, Eduard Dragut, Hai He
  - Topics: interface extraction, interface integration, query translation, result annotation
- Kevin Chen-Chuan Chang, Associate Professor, University of Illinois at Urbana-Champaign
  - Also listed: Bin He, Zhen Zhang
  - Topics: interface extraction, interface integration, query translation

Main research groups
- Jayant Madhavan, Google, Inc.: Google Base
- Zaiqing Nie, Microsoft, Inc.: vertical search
- Luis Gravano, Columbia University, and Panagiotis G. Ipeirotis, New York University: top-k query, classification
- Others …

Conclusion: Our work toward large-scale integration
- Completed several key subtasks:
  - Deep Web data extraction [TKDE 2009, WEBDB 2006, WISE 2005, WAIM 2005]
  - Query translation [DASFAA 2009, DASFAA 2007, SKG 2008]
  - Deep Web survey [VLDB Workshop 2006, Chinese Journal of Computers 2007]
  - Schema matching [Chinese Journal of Computers 2008]
  - Database selection [Journal of Software 2008]
- Moving forward to exciting system issues:
  - System integration for building an integration system
  - Web data integration in mobile environments

Part 6: Some suggestions

Four years ago…
- How to find a paper? Is Google enough?
- Which theories should we be familiar with first?

Find the papers…
- Google
- Google Scholar
- DBLP Bibliography
- C-DBLP
- Libra Academic Search
- ACM Digital Library
- Citeseer
- Authors' homepages
- Email the author

Find the papers…
- Conferences/workshops: SIGMOD/WebDB, VLDB, ICDE, EDBT, WWW, SIGIR, CIKM/WIDM, WISE, DASFAA
- Journals: TOIS, TODS, VLDB J., TKDE

Read the books…
- Information Retrieval
- Data Mining
- Machine Learning
- Statistics
- Theory of probability
- …

Three years ago…
- How to find a problem?
- Which problem is significant?

Two years ago…
- How to write a paper?

Helpful points…
- Right subject
- Well-defined problem
- Clear contribution
- Good structure and logical flow
- Proper use of words
- Pay attention to format, equations, references…
- Ask others to read your paper
- Record your own mistakes
- Do not leave out important related work

Take some time to learn…
- LaTeX
- Matlab or Gnuplot (for charts, if necessary)

Thanks for your attention (Q&A)
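
Supplementary sketch for the "Search Engine vs. Web Database" slide: the indexing steps listed there (index terms, stop words, stemming, inverted file with Term (p, w) entries) could look roughly like the Python below. This is only an illustration, not the method of any particular search engine. The stop-word list, the crude suffix-stripping `stem` function, and the choice of raw term frequency as the weight w are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "is", "and", "to"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_inverted_file(pages):
    """Map each term to a list of (p, w) pairs: page id p and weight w.

    Here w is the raw term frequency in page p (an assumed weighting).
    """
    inverted = defaultdict(list)
    for page_id in sorted(pages):
        counts = Counter(index_terms(pages[page_id]))
        for term, freq in counts.items():
            inverted[term].append((page_id, freq))
    return dict(inverted)


pages = {
    1: "Querying structured data on the Deep Web",
    2: "Crawling the Web and indexing Web pages",
}
inverted_file = build_inverted_file(pages)
print(inverted_file["web"])  # [(1, 1), (2, 2)]
```

Looking up a query term then amounts to reading its (p, w) list; a ranking step, such as the matching score mentioned on the next slide, would combine these weights across the query's terms.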
