Slide 1 - Home, WAMDM, Database Group at Renmin

Deep Web Integration: Querying Structured Data on the Deep Web
Fangjiao Jiang
1
Outline
 Background
 Access Deep Web
 MetaQuerier
 Metasearch engine vs. MetaQuerier
 Related research groups
 Conclusion
 …
 Some suggestions
2
Part 1
Background
3
The previous Web: things are just on the surface
4
The current Web: Getting “deeper”
 A great deal of data is hidden behind query forms
5
The problem of accessing data from the Deep Web
 Deep = not accessible through traditional search engines
6
Why is it important?
 More than 10 million distinct forms
7
Why is it important?
 Up to 5,000 billion dynamic result pages
8
Why is it important?
Google’s recent survey [CIDR 2007]:
 If there are 1 billion web pages
25 million potential Deep Web sources
9
Challenge: How to enable effective access to the Deep Web?
Cars.com
10
Part 2
Access the Deep Web
11
Three different approaches
 Warehouse-like approach
Web databases → Repository
 MetaQuerier
Integrated query interface → QUERY Web databases (Web Database, …, Web Database)
 Surfacing the Deep Web
1) Pre-compute appropriate queries over the forms
2) Insert the resulting pages into a web-search index
12
(1) Warehouse-like approach
[Figure: Web databases feeding a warehouse. Labels: Journal, Conf., Author Homepage; PDF, PS, DOC; Chinese Journal Full-text Database (中文期刊全文数据库); National Natural Science Foundation Information Database (国家自然基金信息库); …]
13
(2) MetaQuerier
Front-end: Query Execution
Schema matching
Result processing
Query translation
Query Web databases
Source selection
Find Web databases
MetaQuerier is what we focus on.
Deep Web Repository
Query Interfaces
The Deep Web
Query Capabilities
Subject Domains
Unified Interfaces
Source clustering
Interface integration
Back-end: Semantics Discovery
Database crawler
Interface extraction
14
(3) Surfacing the Deep Web [VLDB’08]
 Viewpoint
 Many domains and many languages
 No human in the loop, no site-specific scripts
 Main idea
 predicting input values for text boxes
 predicting input combinations
 Google’s Deep-Web crawling system
 Affects more than 1000 queries per second
 Enables access to more than a million Deep-Web sites
 Spans 50+ languages and 100+ domains
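The main idea above (pick candidate values for each input, then decide which input combinations to submit) can be sketched as follows. This is only an illustration, not Google's crawling system: the form description, the candidate values, the `example.com` URL, and the `max_urls` cap are all hypothetical, and a real system would have to predict the values rather than take them as given.

```python
from itertools import product
from urllib.parse import urlencode


def surfacing_urls(action_url, candidate_values, max_urls=100):
    """Enumerate GET URLs for a form from per-input candidate values.

    candidate_values maps an input name to the values guessed for it
    (hypothetical guesses; predicting them is the hard part).
    """
    names = sorted(candidate_values)  # fixed order keeps the output deterministic
    urls = []
    for combo in product(*(candidate_values[n] for n in names)):
        if len(urls) >= max_urls:  # cap the number of pre-computed queries
            break
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action_url}?{query}")
    return urls


# Hypothetical used-car search form with two inputs.
form_inputs = {
    "make": ["honda", "ford"],
    "zip": ["61801", "10001"],
}
urls = surfacing_urls("https://example.com/search", form_inputs)
```

Each generated URL stands for one result page that could be fetched and inserted into a web-search index. The cap matters because the number of combinations grows multiplicatively with the number of inputs.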
15
Part 3
MetaQuerier
16
A Survey on Deep Web [SIGMOD 2006]
 How many deep-Web sources are out there?
 307,000 sites, 450,000 DBs, 1,258,000 interfaces.
 How structured is the Deep Web?
 348,000 (structured) : 102,000 (text) ≈ 3 : 1
 How do search engines cover them?
 They cover 10% of the sources.
 What’s the subject distribution of Web databases?
 Across all areas
 How complex are they?
 “Amazon effect”
17
Reported the “Amazon effect”…
Attributes converge in a domain!
Condition patterns converge even across domains!
18
Technical Challenges
 How to discover the query interface?
 Which form is the query interface of a Web database?
 How to understand a query interface?
 Where is the first condition? What’s its attribute?
 How to match query interfaces?
 What does “author” on this source match on that source?
 How to translate queries?
 How to ask this query on that source?
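The matching and translation questions can be made concrete with a small sketch. It assumes a hand-written synonym table and a string-similarity fallback from Python's standard `difflib`; the attribute names, the `SYNONYMS` groups, and the 0.6 threshold are hypothetical, and this is not the matching method of any particular system.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups; a real system would learn these from many interfaces.
SYNONYMS = [{"author", "writer", "by"}, {"title", "book title", "name"}]


def label_similarity(a, b):
    """Return 1.0 for equal or synonymous labels, else a string-similarity ratio."""
    a, b = a.lower().strip(), b.lower().strip()
    if a == b or any(a in group and b in group for group in SYNONYMS):
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def match_attributes(source_attrs, target_attrs, threshold=0.6):
    """Map each source attribute to its most similar target attribute."""
    mapping = {}
    for s in source_attrs:
        best = max(target_attrs, key=lambda t: label_similarity(s, t))
        if label_similarity(s, best) >= threshold:
            mapping[s] = best
    return mapping


def translate_query(query, mapping):
    """Rewrite an {attribute: value} query; conditions without a match are dropped."""
    return {mapping[a]: v for a, v in query.items() if a in mapping}


mapping = match_attributes(["author", "title"], ["writer", "book title", "isbn"])
translated = translate_query({"author": "Tom Clancy", "title": "Red Storm"}, mapping)
```

Here “author” on one source is matched to “writer” on the other, and the query is re-expressed with the target's attribute names. Dropping unmatched conditions is a simplification; a real translator would also have to respect each source's query capabilities.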
19
Technical Challenges
 How to extract the query results?
 Based on visual information?
 How to identify the same entity?
 Especially large-scale entity identification.
 How to annotate the query results?
 How to specify the semantics of the data?
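As a rough illustration of entity identification, the sketch below groups records from different Web databases whose normalized keys agree. The records, field names, and the choice of (title, year) as the key are hypothetical; exact-key grouping is a deliberately simple stand-in for the similarity-based methods that large-scale identification would need.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def identify_entities(records):
    """Group record sources by their normalized (title, year) key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["title"]), rec["year"])
        groups[key].append(rec["source"])
    return dict(groups)


# Hypothetical records extracted from two Web databases.
records = [
    {"source": "db1", "title": "Deep Web: A Survey", "year": 2006},
    {"source": "db2", "title": "deep web - a survey", "year": 2006},
    {"source": "db2", "title": "Query Translation", "year": 2007},
]
entities = identify_entities(records)
```

The first two records differ only in case and punctuation, so they fall into the same group and are treated as one entity.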
20
Part 4
Metasearch Engine VS. Metaquerier
21
Preliminary
Online data: Surface Web / Deep Web
Data search engine: Metasearch Engine (Surface Web) / Metaquerier (Deep Web)
Metasearch Engine (example: mamma.com): Search Engine 1, Search Engine 2, ……, Search Engine n
Metaquerier (example: Addall.com): Web database 1, Web database 2, ……, Web database n
22
Search Engine VS. Web Database
 Search Engine
 Document search engine
 Web Database
 Database search engine
OK
 Key technology
 Crawling the Web
 Re-crawl
 Changed
 Added
 Indexing Web Pages
 Index terms
 Stop words
 Stemming
 Inverted file structure
 Term (p,w)
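The indexing steps listed above (index terms, stop words, stemming, and an inverted file mapping each term to (p, w) pairs) can be sketched as follows. The stop-word list and the suffix-stripping `stem` function are crude placeholders rather than a real stemmer, and using the term frequency as the weight w is an assumption made for illustration.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"a", "an", "the", "of", "on", "is", "and", "to"}  # tiny sample list


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, and stem the remaining words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_inverted_file(pages):
    """Map each term to a list of (page_id, weight); weight is the term frequency."""
    inverted = defaultdict(list)
    for page_id in sorted(pages):
        counts = Counter(index_terms(pages[page_id]))
        for term, tf in sorted(counts.items()):
            inverted[term].append((page_id, tf))
    return dict(inverted)


# Hypothetical page contents.
pages = {
    "p1": "Searching the deep web databases",
    "p2": "The web of databases and searching",
}
inverted = build_inverted_file(pages)
```

Looking up a query term in `inverted` returns the pages that contain it together with their weights, which is what a document search engine needs at query time.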
23
Search Engine VS. Web Database
 Search Engine
 Document search engine
 Web Database
 Database search engine
OK
 Key technology
 Ranking pages
 Similarity (Query, Page)
 Linkage information (PageRank)
 Result Organization
 Matching score (descending)
 Clustering/categorizing
 Large
 “apple”
 Effective and Efficient Retrieval
 Recall-precision curve
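To illustrate ranking by linkage information, here is a minimal PageRank sketch using power iteration. The three-page link graph is hypothetical; the damping factor 0.85 is the commonly quoted default, and the fixed iteration count is an arbitrary choice for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical link graph: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
ordered = sorted(ranks, key=ranks.get, reverse=True)
```

In this graph, page c receives links from both a and b, so it ends up ranked first. A search engine would combine such a link-based score with the query-page similarity.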
24
Metasearch Engine VS. MetaQuerier
25
Metasearch Engine VS. Metaquerier
Metasearch Engine:
 Search Engine Selection
 Search Result Extraction
 Result Merging
Metaquerier:
 Query interface integration
 Database selection
 Query translation
 Result Extraction, Entity Identification and Annotation
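Result merging in a metasearch engine can be sketched as below. This is one simple strategy chosen for illustration: min-max normalization per engine, keeping the best score for a URL returned by several engines. The engines, URLs, and scores are hypothetical.

```python
def merge_results(result_lists):
    """Merge per-engine [(url, score)] lists into one ranking.

    Scores are min-max normalized per engine so they are comparable;
    a URL returned by several engines keeps its best normalized score.
    """
    best = {}
    for results in result_lists:
        if not results:
            continue
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        for url, score in results:
            norm = 1.0 if hi == lo else (score - lo) / (hi - lo)
            best[url] = max(best.get(url, 0.0), norm)
    # Highest score first; ties are broken by URL for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from two component search engines with different score scales.
engine_1 = [("u1", 10.0), ("u2", 5.0), ("u3", 0.0)]
engine_2 = [("u2", 8.0), ("u4", 6.0), ("u5", 4.0)]
merged = merge_results([engine_1, engine_2])
```

Normalizing first matters because each engine scores on its own scale, so raw scores from different engines cannot be compared directly.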
26
Part 5
Main research groups
27
Main research groups
Weiyi Meng, Professor, Binghamton University
Yiyao Lu, Eduard Dragut, Hai He
Interface extraction, interface integration, query translation, result annotation
Kevin Chen-Chuan Chang, Associate Professor, University of Illinois at Urbana-Champaign
Bin He, Zhen Zhang
Interface extraction, interface integration, query translation
28
Main research groups
Jayant Madhavan, Google, Inc.
Zaiqing Nie, Microsoft, Inc.
Google Base
Vertical search
Luis Gravano, Columbia University
Panagiotis G. Ipeirotis, New York University
Top-k query
Classification
 Others …
29
Conclusion: Our work toward large-scale integration
 Completed several key subtasks:
 Deep Web Data Extraction [TKDE 2009, WebDB 2006, WISE 2005, WAIM 2005]
 Query translation [DASFAA 2009, DASFAA 2007, SKG 2008]
 Deep Web survey [VLDB Workshop 2006, Chinese Journal of Computers (计算机学报) 2007]
 Schema matching [Chinese Journal of Computers (计算机学报) 2008]
 Database selection [Journal of Software (软件学报) 2008]
 Moving forward to exciting system issues:
 System integration for building an integration system
 Web data integration in a mobile environment
30
Part 6
Some suggestions
31
Four years ago…
 How to find a paper? Is Google enough?
 Which theories should we be familiar with first?
32
Find the papers …
 Google
 Google Scholar
 DBLP Bibliography
 C-DBLP
 Libra Academic Search
 ACM Digital Library
 CiteSeer
 Authors’ homepages
 Send an email to the author
33
Find the papers …
Conferences/Workshops:
 SIGMOD/WebDB
 VLDB
 ICDE
 EDBT
 WWW
 SIGIR
 CIKM/WIDM
 WISE
 DASFAA
Journals:
 TOIS
 TODS
 VLDB J.
 TKDE
34
Read the books …
 Information Retrieval
 Data Mining
 Machine Learning
 Statistics
 Probability theory
…
35
Three years ago…
 How to find a problem?
 Which problem is significant?
36
Two years ago…
 How to write a paper?
37
Helpful points…
 Right subject
 Well-defined problem
 Clear contribution
 Good structure and logical flow
 Proper use of words
 Pay attention to format, equations, references…
 Ask others to read your paper
 Record your own mistakes
 Do not leave out important related work
38
Take some time to learn…
 LaTeX
 MATLAB or Gnuplot (for charts, if necessary)
39
Thanks for Your Attention
(Q&A)
40