Tutorial at SIGMOD'06
Large-Scale Deep Web Integration:
Exploring and Querying
Structured Data on the Deep Web
Kevin C. Chang
Are there still challenges
on the Web?
Google is only the start of search
(and MSN will not be the end of it).
2
Structured Data: Prevalent but Ignored!
3
Challenges on the Web come in "duals":
Getting access to the structured information!
Kevin's 4 quadrants (figure): the Access and Structure dimensions, each split between the Surface Web and the Deep Web.
4
Tutorial Focus: Large-Scale Integration of
structured data over the Deep Web
That is: search-flavored integration.
Disclaimer: what it is not:
- Small-scale, pre-configured, mediated-querying settings
- Text databases (or, meta-search)
  - Several related but "text-oriented" issues in meta-search: e.g., Stanford, Columbia, UIC
  - More in the IR community (distributed IR)
And, never a "complete" bibliography!
- many related techniques → some we will relate today
- see the "Web Integration" bibliography at http://metaquerier.cs.uiuc.edu/
Finally, no intention to "finish" this tutorial.
5
Evidence, in beta: Google Base.
6
When Google speaks up…
"What is an 'Attribute'," says Google!
7
And things are indeed happening!
8
9
10
The Deep Web:
Databases on the Web
11
The previous Web:
Search used to be “crawl and index”
12
The current Web:
Search must eventually resort to integration
13
How to enable effective access to the deep Web?
Cars.com
Apartments.com
411localte.com
Amazon.com
Biography.com
401carfinder.com
14
Survey the frontier:
BrightPlanet.com, March 2000 [Bergman00]
- Overlap analysis of search engines: if two engines find na and nb deep-Web sites, with n0 sites in the overlap, then the total population is estimated (capture-recapture style) as N ≈ (na × nb) / n0.
- "Search sites" not clearly defined.
- Estimated 43,000 – 96,000 deep Web sites.
- Content size 500 times that of the surface Web.
15
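(To make the overlap estimate above concrete, a worked example with invented numbers: if engine a finds na = 1,000 deep-Web sites, engine b finds nb = 1,200, and n0 = 300 sites appear in both lists, then N ≈ (1,000 × 1,200) / 300 = 4,000 sites in total. The smaller the overlap between the two engines, the larger the estimated population.)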
Survey the frontier
UIUC MetaQuerier, April 2004 [ChangHL+04]
- Macro: the Deep Web at large
  - Data: automatically sampled 1 million IPs
- Micro: per-source characteristics
  - Data: manually collected sources
  - 8 representative domains, 494 sources: Airfare (53), Autos (102), Books (69), CarRentals (24), Hotels (38), Jobs (55), Movies (78), MusicRecords (75)
- Available at http://metaquerier.cs.uiuc.edu/repository
16
They wanted to observe… (each question paired with a common perception to test):
- How many deep-Web sources are out there? ("The dot-com bust has brought down DBs on the Web.")
- How many structured databases? ("There are just (or, much more) text databases.")
- How hidden are they? ("It is the hidden Web.")
- How do search engines cover them? ("Google does it all." Or, "InvisibleWeb.com does it all.")
- How complex are they? ("Queries on the Web are much simpler, even trivial." "Coping with semantics is hopeless: let's just wait till the Semantic Web.")
17
And their results are…
- How many deep-Web sources are out there? 307,000 sites, 450,000 DBs, 1,258,000 query interfaces.
- How many structured databases? 348,000 (structured) : 102,000 (text) == 3 : 1.
- How hidden are they? It varies widely by domain: CarRental (0%) > Airfares (~4%) > … > MusicRec > Books > Movies (80%+).
- How do search engines cover them? Google covered 5% fresh and 21% stale objects; InvisibleWeb.com covered 7.8% of sources.
- How complex are they? "Amazon effects" (next slide).
18
Reported the "Amazon effect"…
- Attributes converge in a domain!
- Condition patterns converge even across domains!
19
Google’s Recent Survey
[courtesy Jayant Madhavan]
20
Driving Force:
The Large Scale
21
Circa 2000: Example System:
Information Agents
[MichalowskiAKMTT04, Knoblock03]
22
Circa 2000: Example System:
Comparison Shopping Engines [GuptaHR97]
Virtual Database
23
System:
Example Applications
24
Vertical Search Engines: the "Warehousing" Approach
e.g., Libra Academic Search [NieZW+05] (courtesy MSRA)
- Integrating information from multiple types of sources
- Ranking papers, conferences, and authors for a given query
- Handling structured queries
(Figure: many Web databases, plus journal homepages, conference homepages, author homepages, and paper files in PDF, PS, DOC, … feed the warehouse.)
25
On-the-fly Meta-querying Systems
e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]
MetaQuerier@UIUC:
- FIND sources: build a "db of dbs" over sources such as Cars.com, Amazon.com, Apartments.com, 411localte.com
- QUERY sources: through a unified query interface
26
What needs to be done? Technical Challenges:
1. Source Modeling & Selection
2. Schema Matching
3. Source Querying, Crawling, and Object Ranking
4. Data Extraction
5. System Integration
27
The Problems:
Technical Challenges
28
Technical Challenges
1. Source Modeling & Selection
How to describe a source and find the right sources
for query answering?
29
Source Modeling: Circa 2000
- Focus: Design of expressive model mechanisms.
- Techniques:
  - View-based mechanisms: answering queries using views, LAV, GAV (see [Halevy01] for a survey).
  - Hierarchical or layered representations for modeling in-site navigation ([KnoblockMA+98], [DavulcuFK+99]).
30
Source Modeling & Selection: for Large-Scale Integration
- Focus: Discovery of sources.
  - Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].
- Focus: Extraction of source models.
  - Hidden grammar-based parsing [ZhangHC04].
  - Proximity-based extraction [HeMY+04].
  - Classification to align with a given taxonomy [HessK03, Kushmerick03].
- Focus: Organization of sources and query routing.
  - Offline clustering [HeTC04, PengMH+04].
  - Online search for query routing [KabraLC05].
31
Form Extraction: the Problem
- Output all the query conditions; for each:
  - Group elements (into query conditions)
  - Tag elements with their "semantic roles": attribute, operator, value
(A toy sketch of this grouping and tagging follows the slide.)
32
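As a toy illustration of the grouping-and-tagging problem above (this is not the approach of [ZhangHC04]; the token format, the OPERATOR_WORDS list, and the heuristics are all invented for this sketch):

  OPERATOR_WORDS = {"contains", "starts with", "equals", "between"}

  def extract_conditions(tokens):
      """tokens: (kind, payload) pairs in page order; kind is 'text' (a label),
      'select' (payload = option list), or 'input' (payload = None)."""
      conditions, current = [], None
      for kind, payload in tokens:
          if kind == "text":  # a label starts a new condition: the attribute role
              if current:
                  conditions.append(current)
              current = {"attribute": payload, "operator": "=", "values": []}
          elif current is not None:
              opts = payload if kind == "select" else []
              if opts and all(o.lower() in OPERATOR_WORDS for o in opts):
                  current["operator"] = opts  # comparison words play the operator role
              else:
                  current["values"].append(payload or "<input>")  # value role
      if current:
          conditions.append(current)
      return conditions

  # e.g., the condition "Title [contains | starts with] [text box]":
  form = [("text", "Title"), ("select", ["contains", "starts with"]), ("input", None)]
  print(extract_conditions(form))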
Form Extraction: Parsing Approach [ZhangHC04]
Does a hidden syntactic model exist?
- Observation: Interfaces share "patterns" of presentation.
- Hypothesis: interface creation encodes query capabilities through a hidden grammar.
- Now, the problem: given an interface, how to find the query capabilities behind it?
33
Best-Effort Visual Language Parsing Framework
(Pipeline figure) Input: an HTML query form. A 2P grammar (productions + preferences) drives the parse: a tokenizer and a layout engine feed the best-effort parser (BE-Parser), followed by ambiguity resolution and error handling. Output: the form's semantic structure.
34
Form Extraction: Clustering Approach
[HessK03, Kushmerick03]
- Concept: A form as a Bayesian network.
- Training: Estimate the Bayesian probabilities.
- Classification: Maximum-likelihood predictions given the terms.
35
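In that spirit, a minimal naive-Bayes sketch for labeling form fields by their terms (my simplification, not the actual model of [HessK03]; the classes and training pairs are invented):

  import math
  from collections import Counter, defaultdict

  def train(examples):
      """examples: (class_label, [terms]) pairs, e.g. ('Author', ['author'])."""
      term_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
      for cls, terms in examples:
          class_counts[cls] += 1
          term_counts[cls].update(terms)
          vocab.update(terms)
      return term_counts, class_counts, vocab

  def classify(terms, model):
      """Pick the field class with maximum (smoothed) likelihood."""
      term_counts, class_counts, vocab = model
      total = sum(class_counts.values())
      best, best_lp = None, -math.inf
      for cls in class_counts:
          n = sum(term_counts[cls].values())
          lp = math.log(class_counts[cls] / total)  # class prior
          for t in terms:  # Laplace-smoothed term likelihoods
              lp += math.log((term_counts[cls][t] + 1) / (n + len(vocab) + 1))
          if lp > best_lp:
              best, best_lp = cls, lp
      return best

  model = train([("Author", ["author", "name"]), ("Title", ["title", "book"])])
  print(classify(["book", "title"], model))  # -> 'Title'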
Technical Challenges
2. Schema Matching
How to match the schematic structures
between sources?
36
Schema Matching: Circa 2000
- Focus: Generic matching, without assuming Web sources.
- Techniques: see the survey [RahmB01].
37
Schema Matching: for Large-Scale Integration
- Focus: Matching a large number of interface schemas, often in a holistic way.
  - Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05].
  - Query probing [WangWL+04].
  - Clustering [HeMY+03, WuYD+04].
  - Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06].
- Focus: Constructing unified interfaces.
  - As a global generative model [HeC03].
  - Cluster-merge-select [HeMY+03].
38
WISE-Integrator:
Cluster-Merge-Represent
[HeMY+03]
39
WISE-Integrator:
Cluster-Merge-Represent [HeMY+03]
- Matching attributes:
  - Synonymous labels: WordNet, string similarity
  - Compatible value domains (enumerated values or types)
- Constructing the integrated interface (a sketch follows the slide):
  form = initially empty
  until all attributes are covered:
    take one attribute
    select a representative label and merge the values
40
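A hedged Python sketch of the cluster-merge-select idea above (not the actual WISE-Integrator code; the similarity test and the example forms are invented stand-ins):

  def similar(a, b):
      """Toy stand-in for WordNet/string similarity: case-insensitive match
      after stripping a few decorations."""
      norm = lambda s: s.lower().replace("_", " ").strip(" :")
      return norm(a) == norm(b)

  def build_unified_interface(forms):
      """forms: list of dicts {label: [values]} -- one dict per source form."""
      clusters = []  # each cluster: {"labels": [...], "values": set}
      for form in forms:
          for label, values in form.items():
              for c in clusters:
                  if any(similar(label, l) for l in c["labels"]):
                      c["labels"].append(label)
                      c["values"].update(values)
                      break
              else:  # no matching cluster: start a new one
                  clusters.append({"labels": [label], "values": set(values)})
      # Represent each cluster by its most common label, with merged values.
      return {max(set(c["labels"]), key=c["labels"].count): sorted(c["values"])
              for c in clusters}

  forms = [{"Title": ["any"], "Format": ["Hardcover"]},
           {"title:": ["any"], "Format": ["Paperback"]}]
  print(build_unified_interface(forms))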
Statistical Schema Matching: MGS
Does a hidden statistical model exist? [HeC03, HeCH04, HeC05]
- Observation: Schemas share "tendencies" of attribute usage.
- Hypothesis: schema generation draws from a hidden statistical model; e.g., observed schemas such as {α, β}, {α, η}, {β, γ, δ, η} are all instances of one underlying model that encodes the attribute matchings.
- Now, the problem: given the observed schemas, how to find that hidden model (and thus the matchings)?
41
Statistical Hypothesis Discovery
- Statistical formulation:
  - Given as observations: the query interfaces (QIs), i.e., the schemas actually seen.
  - Find the underlying hypothesis: the model M that maximizes Prob(QIs | M).
- "Global" approach: hidden model discovery [HeC03]: find the entire global model at once.
- "Local" approach: correlation mining [HeCH04, HeC05]: find local fragments of matchings, one at a time.
42
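A hedged sketch of the "local" correlation-mining intuition (not the actual algorithm of [HeCH04]): synonym attributes, being alternatives, rarely co-occur within one schema, so strong negative correlation between two frequent attributes hints at a matching. The scoring and example schemas here are invented for illustration:

  from itertools import combinations

  def negative_correlations(schemas, min_freq=2):
      """schemas: list of attribute sets. Returns (a, b, score) triples where
      score = observed co-occurrence / expected co-occurrence (low = likely synonyms)."""
      n = len(schemas)
      freq = {}
      for s in schemas:
          for a in s:
              freq[a] = freq.get(a, 0) + 1
      frequent = [a for a, f in freq.items() if f >= min_freq]
      results = []
      for a, b in combinations(sorted(frequent), 2):
          both = sum(1 for s in schemas if a in s and b in s)
          expected = freq[a] * freq[b] / n  # if a and b were independent
          results.append((a, b, both / expected))
      return sorted(results, key=lambda t: t[2])  # most negatively correlated first

  schemas = [{"author", "title"}, {"writer", "title"},
             {"author", "subject"}, {"writer", "category"}]
  print(negative_correlations(schemas))  # ('author', 'writer') scores 0.0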
Technical Challenges
3. Source Querying,
Crawling & Search
How to query a source? How to crawl all its
objects and search them?
43
Source Querying: Circa 2000
- Focus: Mediation of cross-source, join-able queries.
  - Query rewriting and planning: extensively studied, e.g., [LevyRO96, AmbiteKMP01, Halevy01].
- Focus: Execution & optimization of queries.
  - Adaptive, speculative query optimization, e.g., [NaughtonDM+01, BarishK03, IvesHW04].
44
Source Querying: for Large-Scale Integration
1. Metaquerying model. Focus: on-the-fly querying.
   - MetaQuerier Query Assistant [ZhangHC05].
2. Vertical-search-engine model. Focus: source crawling to collect objects.
   - Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06].
   - Focus: object search and ranking [NieZW+05].
45
On-the-fly Querying:
Type-Locality-Based Predicate Translation [ZhangHC05]
(Pipeline figure) A source predicate s and a target template enter a type recognizer, which dispatches to a type-specific handler (text, numeric, datetime, or a domain-specific handler); a predicate mapper then produces the target predicate t*.
- Correspondences occur within localities.
- Translation is done by type handlers.
46
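A hedged sketch of the type-based dispatch idea (not [ZhangHC05]'s actual translation rules; the handlers, templates, and predicates here are invented):

  def numeric_handler(attr, op, value, target):
      # e.g., map "price < 20000" onto a target range template {min, max}
      lo = target.get("min", "0")
      hi = value if op == "<" else target.get("max", "")
      return {attr: {"min": lo, "max": hi}}

  def text_handler(attr, op, value, target):
      return {attr: value}  # keyword fields usually pass through

  HANDLERS = {"numeric": numeric_handler, "text": text_handler}

  def recognize_type(value):
      return "numeric" if str(value).replace(".", "", 1).isdigit() else "text"

  def translate(predicate, target_template):
      attr, op, value = predicate
      handler = HANDLERS[recognize_type(value)]  # dispatch by recognized type
      return handler(attr, op, value, target_template.get(attr, {}))

  print(translate(("price", "<", "20000"), {"price": {"min": "0"}}))
  print(translate(("title", "contains", "database"), {}))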
Source Crawling by Query Selection [WuWL+06]
(Figure: a small book database with attribute values Author {Ullman, Han}, Title {Compiler, Data Mining, Automata}, Category {System, Application, Theory}, drawn as a graph over the values.)
- Conceptually, the DB as a graph:
  - Nodes: attribute values
  - Edges: occurrence relationships
- Crawling is transformed into a graph-traversal problem: find a set of nodes N in the graph G such that for every node i in G there is some node j in N with an edge j→i, and the total cost of the nodes in N is minimized. (A greedy sketch follows the slide.)
47
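A hedged greedy sketch of the query-selection problem above (the optimization in [WuWL+06] differs; the graph and costs below are invented, loosely echoing the slide's example):

  def select_queries(graph, cost):
      """graph: {node: set of nodes it reaches (including itself)};
      cost: {node: positive query cost}. Returns a covering set of nodes."""
      uncovered = set(graph)
      chosen = []
      while uncovered:
          # greedy choice: most newly covered nodes per unit cost
          best = max(graph, key=lambda n: len(graph[n] & uncovered) / cost[n])
          if not graph[best] & uncovered:
              break  # remaining nodes unreachable
          chosen.append(best)
          uncovered -= graph[best]
      return chosen

  graph = {"Ullman": {"Ullman", "Compiler", "Data Mining", "Automata"},
           "Han": {"Han", "Data Mining"},
           "Compiler": {"Compiler", "Ullman"},
           "Data Mining": {"Data Mining", "Ullman", "Han"},
           "Automata": {"Automata", "Ullman"}}
  cost = {n: 1 for n in graph}
  print(select_queries(graph, cost))  # ['Ullman', 'Han'] covers everything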
Object Ranking: the Object Relationship Graph
[NieZW+05]
- A popularity propagation factor (PPF) for each type of relationship link.
- The popularity of an object is also affected by the popularity of the Web pages containing the object.
48
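(In the spirit of that model; this is my paraphrase rather than the exact formulation of [NieZW+05]: the popularity R(o) of object o combines page-derived popularity with propagation over typed links, roughly
R(o) = ε · E(o) + (1 − ε) · Σ over links o'→o of γ_t(o',o) · R(o') / outdeg(o'),
where E(o) comes from the popularity of the pages containing o, γ_t is the PPF of link type t, and ε balances the two parts.)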
Object Ranking: the Training Process [NieZW+05]
(Flowchart) Learn the PPFs against a partial expert ranking: start from an initial combination of PPFs; a PopRank calculator computes rankings over the link graph; a ranking-distance estimator compares them with the expert ranking; if the candidate is better than the best so far, it is chosen as the best; otherwise it may still be accepted ("accept the worse one?"); a new combination is then generated from the neighbors of the current one, and the loop repeats.
- Subgraph selection approximates the rank calculation, for speed.
49
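A compact sketch of the flowchart's accept/reject loop (a plain simulated-annealing-style tuner; the actual training in [NieZW+05] involves more machinery, and ranking_distance/neighbors here are invented placeholders):

  import random

  def tune_ppfs(initial, ranking_distance, neighbors, steps=200):
      """Minimize ranking_distance(ppfs) against a partial expert ranking."""
      current, best = initial, initial
      best_d = ranking_distance(initial)
      for step in range(1, steps + 1):
          cand = random.choice(neighbors(current))
          d = ranking_distance(cand)
          if d < best_d:  # better than the best: chosen as the best
              best, best_d, current = cand, d, cand
          elif random.random() < 1.0 / step:  # occasionally accept the worse one
              current = cand
      return best

  # Toy usage (all numbers invented): tune one PPF toward an optimum of 0.3.
  dist = lambda p: abs(p - 0.3)
  nbrs = lambda p: [max(0.0, p - 0.05), min(1.0, p + 0.05)]
  print(round(tune_ppfs(0.9, dist, nbrs), 2))  # typically prints 0.3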
Technical Challenges
4. Data Extraction
How to extract result pages into relations?
50
Data Extraction: Circa 2000
The need for rapid wrapper construction was well recognized.
- Focus: Semi-automatic wrapper construction.
- Techniques:
  - Wrapper-mediator architecture [Wiederhold92]: a mediator over per-source wrappers (figure).
  - Manual construction.
  - Semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], SoftMealy [HsuD98].
51
Data Extraction: for Large Scale
- Focus: Even more automatic approaches.
- Techniques:
  - Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06].
  - Automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].
52
HLRT Wrapper: the first "Wrapper Induction"
[KushmerickWD97]
A manual wrapper:

  ExtractCCs(page P)
    skip past the first occurrence of <B> in P
    while the next <B> is before the next <HR> in P
      for each <l_k, r_k> in {<<B>, </B>>, <<I>, </I>>}
        skip past the next occurrence of l_k in P
        extract an attribute from P up to the next occurrence of r_k
    return the extracted tuples

A generalized wrapper: labeled data → induction algorithm → wrapper rules (the delimiters <h, t, l_1, r_1, …, l_K, r_K>):

  ExecuteHLRT(<h, t, l_1, r_1, …, l_K, r_K>, page P)
    skip past the first occurrence of h in P
    while the next l_1 is before the next t in P
      for each <l_k, r_k> in {<l_1, r_1>, …, <l_K, r_K>}
        skip past the next occurrence of l_k in P
        extract an attribute from P up to the next occurrence of r_k
    return the extracted tuples
53
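A runnable Python rendering of ExecuteHLRT above (a faithful sketch of the HLRT semantics, simplified to plain string search; the example page is invented):

  def execute_hlrt(h, t, delimiters, page):
      """delimiters: list of (l_k, r_k) pairs, one per attribute column."""
      tuples = []
      pos = page.index(h) + len(h)  # skip past the head delimiter h
      while True:
          next_l1 = page.find(delimiters[0][0], pos)
          next_t = page.find(t, pos)
          if next_l1 == -1 or (next_t != -1 and next_t < next_l1):
              break  # the tail t comes first: no more tuples
          row = []
          for l, r in delimiters:
              start = page.index(l, pos) + len(l)
              end = page.index(r, start)
              row.append(page[start:end])
              pos = end + len(r)
          tuples.append(tuple(row))
      return tuples

  page = ("<HTML>Countries:<P><B>Congo</B> <I>242</I><BR>"
          "<B>Egypt</B> <I>20</I><BR><HR></HTML>")
  print(execute_hlrt("<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")], page))
  # -> [('Congo', '242'), ('Egypt', '20')]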
RoadRunner [MeccaCM01]
- Basic idea:
  - Page generation: filling (encoding) data into a template.
  - Data extraction: the reverse, decoding the template.
- Algorithm:
  - Compare two HTML pages at a time, one as the wrapper and the other as a sample.
  - Solve the mismatches:
    - string mismatch → a content slot
    - tag mismatch → a structure variance
54
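A toy sketch of the RoadRunner intuition (far simpler than the real algorithm, which also infers optionals and iterators from tag mismatches): align two pages token by token; where the text differs, generalize to a data slot. The token lists are invented:

  def infer_template(page_a, page_b):
      """page_a, page_b: token lists of equal length (a simplification --
      RoadRunner handles length mismatches via structure variances)."""
      template = []
      for x, y in zip(page_a, page_b):
          template.append(x if x == y else "#PCDATA")  # differing text = data slot
      return template

  a = ["<B>", "Congo", "</B>", "<I>", "242", "</I>"]
  b = ["<B>", "Egypt", "</B>", "<I>", "20", "</I>"]
  print(infer_template(a, b))
  # -> ['<B>', '#PCDATA', '</B>', '<I>', '#PCDATA', '</I>']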
RoadRunner (matching example, figure)
55
RoadRunner: the template (figure)
56
Technical Challenges
5. System Integration
How to put things together?
57
Our “system” research often ends up with
“components in isolation”
[ChangHZ05]
58
System Integration: Sample Issues
(Example figure: the result of extracting the AA.com query form.)
- New challenges:
  - How will errors in automatic form extraction impact the subsequent schema matching?
- New opportunities:
  - Can the result of schema matching help to correct such errors?
  - e.g., if (adults, children) together form a matching, what does that suggest about the extracted grouping?
59
Current agenda: a "science" of system integration
- New challenge: error cascading. Errors cascade downstream from component Si to Sj to Sk.
- New opportunity: result feedback. Downstream results feed back to correct upstream components.
60
Finally, observations
Large scale is not only a
challenge, but also an
opportunity!
61
Observation #1: Large scale introduces
New Problems!
- Several issues arise only in this context. Evidence of new problems:
  - Source modeling & selection
  - Source querying, crawling, ranking: on-the-fly query translation; object crawling and ranking
  - System integration
62
Observation #2: Large scale introduces
New Semantics!
- Relaxed metrics become possible, even for the same problems. Evidence of new metrics:
  - Search-flavored integration: large scale but simplistic
    - Function: simple queries
    - Source: transparency is no longer the fundamental doctrine
    - User: in the loop of querying
    - Techniques: automatic but error-prone
    - Results: fuzzy, ranked
      - meta-querying: ranking of matching sources
      - vertical search engines: ranking of objects
63
Observation #3: Large scale introduces
New Insights!
- The multitude of sources gives a holistic context for study. Evidence of new insights:
  - Schema matching: many holistic approaches
  - Source modeling: "Lego"-based extraction
  - System integration: holistic error correction/feedback
64
The Web “Trio” (My three circles...)
Search
Integration
Mining
65
Looking Forward
Recall the first time I heard about Google Base.
DB People: Buckle Up!
Our time has finally come…
66
Thank You!
For more information:
http://metaquerier.cs.uiuc.edu
[email protected]
67