Part II
Large-Scale Web Database Integration Systems

Definitions
• Web database (database search engine): Web-accessible database (WDB). Characteristics:
  • Data are structured and are stored in database systems.
  • Data are accessible through a Web search interface.
  • Result pages are dynamically generated by wrapping data in HTML files.
• Web database integration: the process of enabling unified access to multiple Web databases in the same application domain.

An Example Web Database
[Figure]

More Examples
[Figure]

WDB Integration System vs. MSE
• Major differences between Web databases and regular document search engines (DSE):
  • DSE searches Web pages while WDB searches database entities.
  • WDB usually has a complex interface while DSE usually has a simple interface.
  • DSE ranks results by similarity while WDB usually ranks results by some attribute values.

WDB Integration System Architecture
[Figure: architecture diagram connecting the Web user (query in, result out) to WDB 1 … WDB m on the World Wide Web. The diagram is divided into an Integrated Interface Generation Module and a Query Processing Module. Labeled components: WDB List, Web Database Discovery, WDB Interface Schema Extraction, WDB Clustering By Domain, WDB Cluster 1 … WDB Cluster n, Interface Integration, Integrated Interface 1 … Integrated Interface n, Domain Mapping, Integrated Interface, Database Selection, Query Translation and Dispatch, Result Extraction, Result Annotation, Entity Identification, Result Merging.]

Main Technical Problems
• WDB Search Interface Modeling
• WDB Search Interface Extraction
• WDB Search Interface Clustering
• WDB Search Interface Integration
• Global Query Mapping and Optimization
• Search Result Extraction and Annotation
• Online Entity Identification
• Remaining Research Challenges

A Related Book
• Eduard Dragut, Weiyi Meng, Clement Yu. Deep Web Query Interface Understanding and Integration. Morgan & Claypool Publishers, June 2012.
Table of Contents
• Introduction
• Query Interface Representation and Extraction
• Query Interface Clustering and Categorization
• Query Interface Matching
• Query Interface Attribute Integration
• Query Interface Integration
• Summary and Future Research

WDB Query Interface Modeling
Problem: Represent the information on each interface in a format that is suitable for integration and query submission.

An Example WDB Interface
[Figure: an example WDB interface with one attribute highlighted.]

WDB Interface Modeling
Different models have been proposed:
• WISE Three-Level Model: site-level, attribute-level, and element-level.
• Hierarchical Model: A search interface is modeled as an ordered tree of elements.
  • The hierarchical model is designed to capture the order semantics and the nested grouping of the attributes in an interface.
• Querying Capability Model: Formally characterize what kinds of queries are valid for a search interface.

Hierarchical Model: An Example
[Figure: the aa.com search interface as an ordered tree.]
aa.com
  1. Where Do You Want to Go?
     origin (From: City or Airport Code)
     destination (To: City or Airport Code)
  2. When Do You Want to Go?
     Departure Date: depMonth, depDay, depTime
     Return Date: retMonth, retDay, retTime
  3. Number of Passengers
     numAdult (Adults)
     numChild (Children)
  4. What are Your Service Preferences?
     cabinClass (Class of Service)
     maxiumStops (Number of Connections)
  5. Choose a Carrier
     carrier

Query Interface Extraction
• Automatic interface extraction: Automatically extract the information described in an interface representation model from any given WDB interface.
• Primarily two tasks:
  • Attribute extraction
    • Extract elements and labels from the interface.
    • Group elements and labels into logical attributes.
  • Attribute analysis
    • Extract and derive meta-information about each attribute based on the interface representation model.

WDB Query Interface Clustering
Objective: Group WDBs into different clusters such that all WDBs in the same cluster are related to the same domain (e.g., sell the same type of products).
Techniques:
1. First, construct a concept hierarchy.
2. Then apply one of the following techniques:
  • Supervised clustering (training required)
  • Unsupervised clustering (no training required)

Query Interface Integration
• It is related to database schema integration.
• Schema integration has been studied since the 1980s.
  • Based on different data models: ER model, relational model, object-oriented model, etc.
  • In different contexts: a single database during database design, or multiple databases in multidatabase/data warehouse systems.
  • Key issues: resolve name conflicts, data type conflicts, structural conflicts, data inconsistency, etc.
  • Manual approach: Integration rules are manually written.

Schema Integration vs. Interface Integration
Comparing WDB interface integration and database schema integration:
• A WDB interface schema is simpler (one table/view versus the multiple tables of a database schema).
• Attributes in a WDB interface are more complex as they may consist of multiple elements.
• A WDB interface mixes attributes and query conditions while a database schema doesn't.
• Metadata need to be extracted from a WDB interface while they are readily available in a database schema.
• WDB interface integration needs to integrate element format, attribute layout and external values while database schema integration doesn't.

Attribute Matching
A key problem in schema/interface integration is to match attributes from different schemas/interfaces.
A general framework for attribute matching [Rahm and Bernstein, VLDB Journal 2001]:
• Develop a number of matchers based on different information.
  • Dictionary-level information: attribute names
  • Schema-level information: data type, key, foreign key, …
  • Instance-level data: values of attributes
• Utilize auxiliary information: special dictionaries, thesauri, user input, …

Attribute Integration
• After attribute matching, attributes are divided into clusters such that each cluster corresponds to a global attribute in the integrated interface.
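
As a rough illustration of the matcher framework and the clustering step above, the sketch below combines three simple matchers (name, data type, instance values) into one score and groups attributes from different interfaces whose score reaches a threshold. The `Attribute` structure, the weights, the threshold and the sample attributes are all assumptions made for this example; they are not taken from Rahm and Bernstein or from any system discussed in these slides.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Attribute:
    """One attribute of one query interface (hypothetical structure)."""
    interface: str
    name: str
    data_type: str
    values: frozenset = field(default_factory=frozenset)


def name_matcher(a, b):
    """Dictionary-level matcher: string similarity of the attribute names."""
    return SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()


def type_matcher(a, b):
    """Schema-level matcher: 1.0 if the data types agree, else 0.0."""
    return 1.0 if a.data_type == b.data_type else 0.0


def value_matcher(a, b):
    """Instance-level matcher: Jaccard overlap of the known values."""
    if not a.values or not b.values:
        return 0.0
    return len(a.values & b.values) / len(a.values | b.values)


# Assumed weights; a real system would tune or learn them.
MATCHERS = [(name_matcher, 0.5), (type_matcher, 0.2), (value_matcher, 0.3)]


def combined_score(a, b):
    """Weighted sum of all matcher scores, in [0, 1]."""
    return sum(weight * matcher(a, b) for matcher, weight in MATCHERS)


def cluster_attributes(attrs, threshold=0.6):
    """Union-find grouping of attributes from different interfaces that match."""
    parent = list(range(len(attrs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(attrs)):
        for j in range(i + 1, len(attrs)):
            different_sites = attrs[i].interface != attrs[j].interface
            if different_sites and combined_score(attrs[i], attrs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, attr in enumerate(attrs):
        clusters.setdefault(find(i), []).append(attr)
    return list(clusters.values())


# Made-up attributes from two hypothetical airline interfaces.
attrs = [
    Attribute("site_a", "Class of Service", "enum",
              frozenset({"economy", "business", "first"})),
    Attribute("site_b", "Class", "enum", frozenset({"economy", "business"})),
    Attribute("site_a", "Adults", "int", frozenset({"1", "2", "3"})),
    Attribute("site_b", "Number of adults", "int",
              frozenset({"1", "2", "3", "4"})),
]
clusters = cluster_attributes(attrs)
for cluster in clusters:
    print(sorted(a.name for a in cluster))
```

Each resulting cluster would then correspond to one candidate global attribute. Linking every pair above the threshold is the simplest possible grouping rule and can chain unrelated attributes together, so it should be read as a placeholder for a more careful clustering step.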
Remaining issues:
1. Determine the name of the global attribute for each cluster.
2. Determine the domain type of each global attribute. The domain type will determine the format.
3. Determine the external values of each global attribute.

Hierarchical Interface Integration (1)
An example of hierarchical schema representation
[Figure: a flight-search interface shown beside its tree representation.]
Interface:
  1. Where Do You Want to Go? (From: City, To: City)
  2. When Do You Want to Go? (Departure Date, Return Date)
  3. Number of Passengers? (Adults, Children)
  4. Class of Service (Economy, Business, First Class)
Tree:
  Root
    Where: From, To
    When
      Departure: Dmonth, Dday, Dtime
      Return: Rmonth, Rday, Rtime
    Number: Adult, ……
    Class
Siblings are ordered!

Hierarchical Interface Integration (2)
Simple mapping versus complex mapping
• Simple mapping: 1-to-1 mapping between two fields
• Complex mapping: 1-to-m mapping between one field in one interface and multiple fields in another interface
Examples of 1-to-m mappings:
• from date / departure date → month, day, year
• No. of passengers / passengers → adults, children

Hierarchical Interface Integration (3)
Tree Merging
[Figure: interface trees from American Express and Chase. Labels in the figure: "Please tell us about yourself", "Please tell us about your employment", Occupation, State, Phone, Years there, Address, Company address, Country, City, Street.]
How to merge?

Hierarchical Interface Integration (4)
• Grouping Constraint: Given subgroups in different user interfaces, is it possible to find a group (an ordering of all the elements) such that all elements in each subgroup are in adjacent locations?
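
For small groups, the grouping constraint can be checked by brute force: try every ordering of the elements and accept the first one in which each subgroup occupies consecutive positions. The sketch below is an illustration only, not the method used by any system in these slides; the function names are made up, and the subgroups are the ones from the example on this slide. It assumes every subgroup is non-empty.

```python
from itertools import permutations


def is_adjacent(order, subgroup):
    """True if the members of `subgroup` occupy consecutive positions in `order`."""
    positions = sorted(order.index(x) for x in subgroup)
    return positions[-1] - positions[0] + 1 == len(positions)


def find_grouping(subgroups):
    """Return an ordering in which every subgroup is contiguous, or None."""
    elements = sorted(set().union(*subgroups))
    for order in permutations(elements):
        if all(is_adjacent(order, group) for group in subgroups):
            return list(order)
    return None


# Subgroups from the slide's example.
subgroups = [
    {"state", "city", "street"},
    {"country", "state", "city", "street"},
    {"country", "state"},
]
grouping = find_grouping(subgroups)
print(grouping)

# Three pairwise subgroups over three elements cannot all be contiguous.
print(find_grouping([{"a", "b"}, {"b", "c"}, {"a", "c"}]))
```

The search is factorial in the number of elements, so it only suits groups of the size found on a single form. The same question is usually studied as the consecutive-ones problem, for which much more efficient algorithms are known.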
Example: The following subgroups satisfy this requirement:
{state, city, street}
{country, state, city, street}
{country, state}

Hierarchical Interface Integration (5)
Preserving ancestor-descendant relationships
[Figure: the American Express and Chase interface trees and the integrated tree. Labels in the figure: "Please tell us about yourself", "Please tell us about your employment", Occupation, Phone, Years there, Address, Company address, Country, State, City, Street.]

Hierarchical Interface Integration (6)
Naming attributes
• Group Naming Compatibility: Names of attributes within a group in a user interface should be compatible.
Example:
Compatible naming:
{adults, children}
{adults, children, infants}
{adults, infants}
Incompatible naming:
{adults, children}
{adults, children, #infants}
{#children, #infants}

Search Result Annotation
Goal: Identify the semantic meaning of each piece of information within each search result record (SRR).
• Before result annotation, SRRs on the result pages returned from search engines need to be extracted first.
• Some approaches combine result extraction and result annotation in one step.
Data annotation is needed for
• Comparison-shopping applications: entity identification, result merging, …
• Deep Web crawling and data collection

Result Annotation: Problem Description
[Figure: a search result record with its data units annotated, e.g., title and authors.]

Entity Identification
• Problem: Automatically derive rules to determine if two search result records from different WDBs are in fact the same entity (product).
• Entity identification is closely related to entity matching, entity resolution, duplicate detection, and record linkage.
• It is a classical problem in federated systems that deal with data from multiple sources.

Remaining Research Challenges (1)
1. Automatic WDB discovery
Goal: Discover Web database interfaces from the Web automatically.
Some issues to consider:
• How to identify Web pages that have a search interface?
  • There is already some existing work on this.
• How to differentiate search interfaces for Web databases from those for text search engines?
  • Is the information from the search interface sufficient?
  • Do we need information from search results?
  • How to learn a classifier?

Remaining Research Challenges (2)
2. Extraction and understanding of dynamic query interfaces
• An increasing number of query interfaces are dynamic in the sense that the query interface may change after certain fields are selected. Two types of dynamic changes have been observed:
  • Changes to the values of some fields (e.g., values under a selection list).
  • Changes to the structure of the query interface (e.g., some fields are added, deleted or modified).
• Current query interface models do not consider dynamic query interfaces.

Remaining Research Challenges (3)
3. Handling boundary query interfaces in Web-scale clustering
• There are two challenges in Web-scale clustering of query interfaces [Madhavan et al., 2007; Mahmoud and Aboulnaga, 2010]:
  • The number of domains is unknown in advance, which means that the number of clusters is unknown in advance.
  • There are likely many query interfaces with unclear domains, i.e., they lie on the boundaries between multiple domains.
• However, the current solutions are not sufficiently accurate and have significant room for improvement.

Remaining Research Challenges (4)
4. Web database selection
Goal: For any given user query, identify the Web databases that are most likely to return good results.
Some issues to consider:
• How to summarize the content of a Web database?
  • Numerical attributes
  • Categorical attributes
  • Textual attributes
  • Relationships among the attributes

Remaining Research Challenges (5)
Web database selection (continued)
• How to obtain the summaries automatically?
  • How to design sample queries for each type of attribute?
• How to use the summaries to do Web database selection?
  • How to measure “usefulness” based on different types of attributes?
  • How to combine “usefulness” across different attributes?

Remaining Research Challenges (6)
5. Automatic SRR extraction from complex result pages
Goal: Automatically identify the rules to extract search result records from complex result pages.
Some characteristics of complex result pages:
• Records contain both text and images.
• SRRs may be organized into multiple columns/multiple sections.
• SRRs have a variety of formats.
• Result pages have no fixed sections (i.e., some sections only appear in some result pages).
• Some SRRs are divided into multiple blocks.

Remaining Research Challenges (7)
6. Global query processing and optimization
Goal: Evaluate global queries efficiently and correctly.
Some issues to consider:
• Global query processing consists of many steps:
  • Identify relevant Web databases (global cost)
  • Translate/map global queries to local queries (global cost)
  • Submit queries and receive results (communication cost)
  • Evaluate translated queries by local Web databases (local cost)
  • Extract search results from result pages (global cost)
  • Filter out unqualified results (global cost)
• How to optimize the above process?
• What are the differences between Web integration systems and multidatabase/federated database systems?

The End!
