Tutorial, SIGMOD'06
Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web
Kevin C. Chang

Still challenges on the Web?
- Google is only the start of search (and MSN will not be the end of it).

Structured data: prevalent but ignored!

Challenges on the Web come in a "dual":
- Getting access to the structured information!
- Kevin's 4 quadrants: access vs. structure; surface Web vs. deep Web.

Tutorial focus: large-scale integration of structured data over the deep Web
- That is: search-flavored integration.
- Disclaimer, what it is not:
  - Small-scale, pre-configured, mediated-querying settings.
  - Text databases (or meta-search): several related but "text-oriented" issues in meta-search (e.g., Stanford, Columbia, UIC); more in the IR community (distributed IR).
- And never a "complete" bibliography! Many related techniques, some of which we will relate today; see the "Web Integration" bibliography at http://metaquerier.cs.uiuc.edu/.
- Finally, no intention to "finish" this tutorial.

Evidence in beta: Google Base.

When Google speaks up... "What is an 'attribute'?" says Google!

And things are indeed happening!

The Deep Web: databases on the Web

The previous Web: search used to be "crawl and index."

The current Web: search must eventually resort to integration.

How to enable effective access to the deep Web?
- e.g., Cars.com, Apartments.com, 411localte.com, Amazon.com, Biography.com, 401carfinder.com

Survey the frontier: BrightPlanet.com, March 2000 [Bergman00]
- Overlap analysis of search engines; "search sites" not clearly defined.
- Estimated 43,000-96,000 deep Web sites.
- Content size 500 times that of the surface Web.
Survey the frontier: UIUC MetaQuerier, April 2004 [ChangHL+04]
- Macro: the deep Web at large. Data: automatically sampled 1 million IPs.
- Micro: per-source characteristics. Data: manually collected sources in 8 representative domains, 494 sources: Airfare (53), Autos (102), Books (69), CarRentals (24), Hotels (38), Jobs (55), Movies (78), MusicRecords (75).
- Available at http://metaquerier.cs.uiuc.edu/repository

They wanted to observe...
- How many deep-Web sources are out there? How many structured databases? ("Google does it all." Or, "InvisibleWeb.com does it all.")
- How hidden are they? ("There are just (or, much more) text databases." "It is the hidden Web.")
- How do search engines cover them? ("The dot-com bust has brought down DBs on the Web.")
- How complex are they? ("Queries on the Web are much simpler, even trivial." "Coping with semantics is hopeless; let's just wait till the semantic Web.")

And their results are...
- How many deep-Web sources? 307,000 sites, 450,000 databases, 1,258,000 query interfaces.
- How many structured databases? Structured : text = 348,000 : 102,000, about 3 : 1.
- How hidden are they? It varies by domain: CarRental (0%) > Airfares (~4%) > ... > MusicRec > Books > Movies (80%+).
- How do search engines cover them? Google covered 5% of fresh and 21% of stale objects; InvisibleWeb.com covered 7.8% of the sources.
- How complex are they? "Amazon effects."

Reported the "Amazon effect"...
- Attributes converge in a domain!
- Condition patterns converge even across domains!
Google's recent survey [courtesy Jayant Madhavan]

Driving force: the large scale

Circa 2000 example system: information agents [MichalowskiAKMTT04, Knoblock03]

Circa 2000 example system: comparison shopping engines [GuptaHR97]
- A "virtual database."

System: example applications

Vertical search engines: the "warehousing" approach
- e.g., Libra Academic Search [NieZW+05] (courtesy MSRA).
- Integrates information from multiple types of sources: Web databases, journal homepages, conference homepages, author homepages, and PDF/PS/DOC files.
- Ranks papers, conferences, and authors for a given query; handles structured queries.

On-the-fly meta-querying systems
- e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05].
- MetaQuerier@UIUC: FIND sources (a "db of dbs" over Cars.com, Amazon.com, Apartments.com, 411localte.com, ...) and QUERY sources (through a unified query interface).

What needs to be done? Technical challenges:
- Source modeling & selection
- Schema matching
- Source querying, crawling, and object ranking
- Data extraction
- System integration

The problems: technical challenges

Technical challenge 1: source modeling & selection
- How to describe a source and find the right sources for query answering?

Source modeling, circa 2000
- Focus: design of expressive model mechanisms.
- Techniques: view-based mechanisms (answering queries using views, LAV, GAV; see [Halevy01] for a survey); hierarchical or layered representations for modeling in-site navigation ([KnoblockMA+98], [DavulcuFK+99]).

Source modeling & selection for large-scale integration
- Focus: discovery of sources, and extraction of source models.
  - Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].
  - Hidden-grammar-based parsing [ZhangHC04].
  - Proximity-based extraction [HeMY+04].
  - Classification to align with a given taxonomy [HessK03, Kushmerick03].
- Focus: organization of sources and query routing.
  - Offline clustering [HeTC04, PengMH+04].
  - Online search for query routing [KabraLC05].
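The classification step above, aligning a discovered source with a given taxonomy of domains [HessK03, Kushmerick03], can be sketched as a naive-Bayes classifier over the terms of a source's query form. This is a minimal sketch, not the cited systems' actual models; the training forms, labels, and domains below are made up for illustration:

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: each query form is a bag of field-label
# terms, labeled with its source domain.
TRAINING = [
    (["title", "author", "isbn"], "Books"),
    (["title", "author", "publisher"], "Books"),
    (["make", "model", "year", "price"], "Autos"),
    (["make", "model", "mileage"], "Autos"),
]

def train(data):
    """Estimate P(domain) and P(term | domain) with add-one smoothing."""
    domain_counts = Counter(d for _, d in data)
    term_counts = defaultdict(Counter)
    vocab = set()
    for terms, d in data:
        term_counts[d].update(terms)
        vocab.update(terms)
    return domain_counts, term_counts, vocab

def classify(terms, domain_counts, term_counts, vocab):
    """Max-likelihood domain prediction given the form's terms."""
    total = sum(domain_counts.values())
    best, best_lp = None, float("-inf")
    for d, dc in domain_counts.items():
        lp = math.log(dc / total)
        denom = sum(term_counts[d].values()) + len(vocab)
        for t in terms:
            lp += math.log((term_counts[d][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = d, lp
    return best

model = train(TRAINING)
print(classify(["title", "author"], *model))  # → Books
```

With realistic data the bag of terms would come from the form-extraction step, so classification quality depends directly on extraction quality, a point the system-integration discussion returns to later.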
Form extraction: the problem
- Output all the query conditions; for each: group elements (into query conditions) and tag elements with their "semantic roles": attribute, operator, value.

Form extraction: the parsing approach [ZhangHC04]
- Does a hidden syntactic model exist?
- Observation: interfaces share "patterns" of presentation.
- Hypothesis: interface creation is governed by a grammar over query capabilities. Now, the problem: given an interface, how to recover its query capabilities?

Best-effort visual language parsing framework
- Input: an HTML query form, run through a tokenizer and layout engine.
- A 2P grammar (productions plus preferences) drives the best-effort parser, with ambiguity resolution and error handling.
- Output: the form's semantic structure.

Form extraction: the clustering approach [HessK03, Kushmerick03]
- Concept: a form as a Bayesian network.
- Training: estimate the Bayesian probabilities.
- Classification: maximum-likelihood predictions given the terms.

Technical challenge 2: schema matching
- How to match the schematic structures between sources?

Schema matching, circa 2000
- Focus: generic matching without assuming Web sources.
- Techniques: see the survey [RahmB01].

Schema matching for large-scale integration
- Focus: matching a large number of interface schemas, often in a holistic way.
  - Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05].
  - Query probing [WangWL+04].
  - Clustering [HeMY+03, WuYD+04].
  - Corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06].
- Focus: constructing unified interfaces.
  - As a global generative model [HeC03].
  - Cluster-merge-select [HeMY+03].

WISE-Integrator: cluster-merge-represent [HeMY+03]
- Matching attributes: synonymous labels (WordNet, string similarity); compatible value domains (enumerated values or types).
- Constructing the integrated interface: start from an empty form; until all attributes are covered, take one attribute, select a representative, and merge its values.

Statistical schema matching: MGS [HeC03, HeCH04, HeC05]
- Does a hidden statistical model exist?
- Observation: schemas share "tendencies" of attribute usage.
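One way to exploit these shared tendencies, in the spirit of the correlation-mining approach [HeCH04, HeC05] though not its actual algorithm, is to look for negatively correlated attributes: synonyms are each frequent in a domain yet rarely co-occur in the same interface. A minimal sketch; all schemas, attribute names, and the scoring function are made up for illustration:

```python
from itertools import combinations

# Hypothetical query-interface schemas from one domain, as attribute sets.
schemas = [
    {"author", "title", "subject"},
    {"author", "title", "category"},
    {"writer", "title", "subject"},
    {"writer", "title", "category"},
    {"author", "title"},
    {"writer", "title"},
]

def synonym_candidates(schemas, min_support=2):
    """Rank attribute pairs by negative correlation: attributes that are
    each frequent but rarely co-occur are likely synonyms."""
    n = len(schemas)
    attrs = {a for s in schemas for a in s}
    counts = {a: sum(a in s for s in schemas) for a in attrs}
    scores = []
    for a, b in combinations(sorted(attrs), 2):
        if counts[a] < min_support or counts[b] < min_support:
            continue
        both = sum(a in s and b in s for s in schemas)
        # Observed co-occurrence vs. expected under independence;
        # a lower ratio means more negative correlation.
        expected = counts[a] * counts[b] / n
        scores.append(((both + 1) / (expected + 1), (a, b)))
    scores.sort()
    return [pair for _, pair in scores]

print(synonym_candidates(schemas)[0])  # → ('author', 'writer')
```

Here "author" and "writer" each appear in half the schemas but never together, so they surface as the top candidate matching; grouping attributes would show the opposite, positively correlated, signature.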
- Hypothesis: a hidden statistical model of schema generation produces the observed schemas, and its structure encodes the attribute matchings. Now, the problem: given the observed schemas, how to recover the hidden model? (Figure: interface schemas as bags of attribute symbols, generated from one statistical model whose groupings give the matchings.)

Statistical hypothesis discovery
- Statistical formulation: given the query interfaces as observations, find the underlying hypothesis, i.e., the attribute matchings, that best explains them.
- "Global" approach, hidden-model discovery [HeC03]: find the entire global model at once.
- "Local" approach, correlation mining [HeCH04, HeC05]: find local fragments of matchings one at a time.

Technical challenge 3: source querying, crawling & search
- How to query a source? How to crawl all objects and to search them?

Source querying, circa 2000
- Focus: mediation of cross-source, joinable queries. Query rewriting and planning are extensively studied (e.g., [LevyRO96, AmbiteKMP01, Halevy01]).
- Focus: execution and optimization of queries. Adaptive, speculative query optimization (e.g., [NaughtonDM+01, BarishK03, IvesHW04]).

Source querying for large-scale integration
- The meta-querying model. Focus: on-the-fly querying, e.g., the MetaQuerier query assistant [ZhangHC05].
- The vertical-search-engine model. Focus: source crawling to collect objects, via form submission by query generation/selection (e.g., [RaghavanG01, WuWLM06]).
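Query selection for source crawling can be cast as a weighted set-cover problem: pick queries whose results together reach every record, at minimum total cost. A minimal greedy sketch; the reachability sets, costs, and node names below are hypothetical, not any cited system's data:

```python
# Hypothetical occurrence graph for query-based crawling: submitting a
# node (an attribute value) as a query retrieves records that expose
# the listed nodes.
reaches = {
    "Ullman":      {"Ullman", "Compiler", "Automata", "Data Mining",
                    "Theory", "System"},
    "Han":         {"Han", "Data Mining", "Application"},
    "Data Mining": {"Data Mining", "Ullman", "Han", "Application"},
    "Compiler":    {"Compiler", "Ullman", "System"},
}
cost = {"Ullman": 1, "Han": 1, "Data Mining": 1, "Compiler": 1}
universe = set().union(*reaches.values())

def greedy_query_selection(reaches, cost, universe):
    """Greedy weighted set cover: repeatedly pick the query with the best
    (newly covered nodes / cost) ratio until every node is reachable."""
    covered, chosen = set(), []
    while covered != universe:
        best = max(reaches, key=lambda q: len(reaches[q] - covered) / cost[q])
        if not reaches[best] - covered:
            break  # remaining nodes unreachable by any query
        chosen.append(best)
        covered |= reaches[best]
    return chosen

print(greedy_query_selection(reaches, cost, universe))
# → ['Ullman', 'Han']
```

Minimum-cost set cover is NP-hard, so a greedy ratio heuristic like this is the standard practical choice; it achieves a logarithmic approximation of the optimal cost.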
- Also in the vertical-search-engine model. Focus: object search and ranking [NieZW+05].

On-the-fly querying [ZhangHC05]
- Type-locality-based predicate translation: a type recognizer routes the source predicate s and target template to domain-specific handlers (text, numeric, datetime), and a predicate mapper produces the target predicate t*.
- Correspondences occur within localities; translation is done by the per-type handler.

Source crawling by query selection [WuWL+06]
- Conceptually, view the database as a graph: nodes are attribute values; edges are occurrence relationships. Example database:
  Author: Ullman, Title: Compiler, Category: System
  Author: Ullman, Title: Data Mining, Category: Application
  Author: Han, Title: Data Mining, Category: Application
  Author: Ullman, Title: Automata, Category: Theory
- Crawling is transformed into a graph-traversal problem: find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with j -> i, and the total cost of the nodes in N is minimum.

Object ranking: the object relationship graph [NieZW+05]
- A popularity propagation factor (PPF) for each type of relationship link.
- The popularity of an object is also affected by the popularity of the Web pages containing it.

Object ranking: the training process [NieZW+05]
- Starting from an initial combination of PPFs, the PopRank calculator scores objects on the link graph and a ranking-distance estimator compares the result to an expert ranking; new combinations from neighbors are tried, one better than the best so far becomes the new best, and a worse one may still be accepted.
- Subgraph selection approximates the rank calculation for speed.

Technical challenge 4: data extraction
- How to extract result pages into relations?

Data extraction, circa 2000
- The need for rapid wrapper construction was well recognized.
- Focus: semi-automatic wrapper construction.
- Techniques: the wrapper-mediator architecture [Wiederhold92]; manual construction; semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].

Data extraction for large scale
- Focus: even more automatic approaches.
- Techniques:
  - Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06].
  - Automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].

HLRT wrapper: the first "wrapper induction" [KushmerickWD97]
- A manual wrapper:
  ExtractCCs(page P)
    skip past first occurrence of <B> in P
    while next <B> is before next <HR> in P
      for each <li, ri> in {<<B>, </B>>, <<I>, </I>>}
        skip past next occurrence of li in P
        extract attribute from P to next occurrence of ri
    return extracted tuples
- A generalized wrapper: an induction algorithm learns the delimiter rules <h, t, l1, r1, ..., lk, rk> from labeled data:
  ExecuteHLRT(<h, t, l1, r1, ..., lk, rk>, page P)
    skip past first occurrence of h in P
    while next l1 is before next t in P
      for each <li, ri> in {<l1, r1>, ..., <lk, rk>}
        skip past next occurrence of li in P
        extract attribute from P to next occurrence of ri
    return extracted tuples

RoadRunner [MeccaCM01]
- Basic idea: page generation fills (encodes) data into a template; data extraction is the reverse, decoding the template.
- Algorithm: compare two HTML pages at a time, one as the wrapper and the other as the sample, and resolve the mismatches: a string mismatch indicates a content slot; a tag mismatch indicates structural variance.

Technical challenge 5: system integration
- Putting things together?

Our "system" research often ends up with "components in isolation" [ChangHZ05]

System integration: sample issues (e.g., AA.com, the result of extraction)
- New challenges: how will errors in automatic form extraction impact the subsequent schema matching?
- New opportunities: can the result of schema matching help to correct such errors? E.g., if (adults, children) together form a matching, then?

Current agenda: a "science" of system integration
- New challenge: error cascading (errors cascade across components Si, Sj, Sk).
- New opportunity: result feedback.

Finally, observations
- Large scale is not only a challenge, but also an opportunity!
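The ExecuteHLRT pseudocode above can be made concrete; here is a minimal Python sketch, where the sample page and delimiters are invented (in the style of the classic country-code example) rather than taken from the original paper:

```python
def execute_hlrt(h, t, pairs, page):
    """Run an HLRT wrapper: skip past the head delimiter h, then, while
    the next left delimiter appears before the tail t, extract one
    attribute per (l, r) delimiter pair to form each tuple."""
    tuples = []
    pos = page.index(h) + len(h)
    while True:
        l1 = page.find(pairs[0][0], pos)
        tail = page.find(t, pos)
        if l1 == -1 or (tail != -1 and tail < l1):
            break  # no more rows before the tail delimiter
        row = []
        for l, r in pairs:
            start = page.find(l, pos) + len(l)
            end = page.find(r, start)
            row.append(page[start:end])
            pos = end + len(r)
        tuples.append(tuple(row))
    return tuples

# Invented sample page: country names in <B>, phone codes in <I>.
page = ("<P>Codes<P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR>footer")
print(execute_hlrt("<P>Codes<P>", "<HR>",
                   [("<B>", "</B>"), ("<I>", "</I>")], page))
# → [('Congo', '242'), ('Egypt', '20')]
```

The induction step then amounts to searching labeled pages for delimiter strings <h, t, l1, r1, ..., lk, rk> under which this executor reproduces the labels exactly.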
Observation #1: large scale introduces new problems!
- Several issues arise in this context. Evidence of new problems:
  - Source modeling & selection.
  - Source querying, crawling, ranking: on-the-fly query translation; object crawling and ranking.
  - System integration.

Observation #2: large scale introduces new semantics!
- Relaxed metrics become possible, even for the same problems. Evidence of new metrics: search-flavored integration, large-scale but simplistic.
  - Function: simple queries.
  - Source: transparency is no longer the fundamental doctrine.
  - User: in the loop of querying.
  - Techniques: automatic but error-prone.
  - Results: fuzzy and ranked; meta-querying ranks matching sources, vertical search engines rank objects.

Observation #3: large scale introduces new insights!
- The multitude of sources gives a holistic context for study. Evidence of new insights:
  - Schema matching: many holistic approaches.
  - Source modeling: "Lego"-based extraction.
  - System integration: holistic error correction/feedback.

The Web "trio" (my three circles): search, integration, mining.

Looking forward
- Recall the first time I heard about Google Base.
- DB people: buckle up! Our time has finally come...

Thank You!
For more information: http://metaquerier.cs.uiuc.edu
[email protected]