REPORT: WebTables: Exploring the Power of Tables on the Web

Udit Joshi, Swati Verma
November 6, 2011

Abstract

The Web has traditionally been viewed as a corpus of unstructured documents, but it also contains structured data such as HTML tables. Such tables have been extracted from the Web using a generic crawl on the <table> tag. The crawl resulted in some 14.1 billion tables. Statistical classification filters reduced this to just 154M tables containing high-quality relational data. These 154M filtered tables, together with their extracted schemas, form a huge corpus of relational databases. The paper looks at techniques for keyword search and ranking over this exclusive corpus of relational databases. The paper also shows interesting applications proposed by leveraging the statistical information gleaned from this corpus.

1 Motivation

Structured data on the Web exists in several forms, including HTML tables, HTML lists, and back-end Deep Web databases (such as the books sold on Amazon.com). This paper focuses only on HTML tables. WebTables is a system designed to extract relational-style data from the Web expressed using the HTML <table> tag. However, only about 1% of the content embedded in HTML <table> tags represents good tables. WebTables focuses on two main problems surrounding these databases. The first is how to extract them from the Web, as 99% of tables carry no relational data. The second is how to harness the resulting huge collection of databases. The extraction of these tables has been described in [1]; this paper looks at the second aspect. A simple collection of statistics on the schemas in the corpus has been collated, called the Attribute Correlation Statistics Database (ACSDb). These statistics are used in relation ranking during search as well as in several novel applications proposed in the paper.

2 Attribute Correlation Statistics Database (ACSDb)

Analysis of the extracted schemas shows a power-law distribution with respect to schema frequency and the number of attributes.
The ACSDb contains a frequency count of how often each unique schema occurs, as well as counts of individual attribute occurrences. For example, combo_make_model_year = 13 indicates that this particular schema occurs 13 times in the corpus, while single_make = 3068 indicates that the attribute make has 3068 occurrences. These statistics are used for computing probabilities like p(name) or conditional probabilities like p(name|make). In all, the ACSDb corpus contains 5.4M unique attribute labels in 2.6M unique schemas.

3 Proposed Solutions

3.1 Ranking Algorithms

The paper proposes four algorithms for ranking relations. The first, naiveRank, simply sends the user's query to a search engine and fetches the top-k pages. It returns the extracted relations in the URL order returned by the search engine. If fewer than k tables are found, naiveRank does not address the shortfall. The next algorithm, filterRank, addresses this issue by going a bit deeper into the search results to ensure that the top-k tables are returned. The third algorithm, featureRank, does not rely on an existing search engine. It uses certain relation-specific features to score each extracted relation in the corpus. The two most heavily weighted features are the number of hits in each relation's schema and the number of hits in each relation's leftmost column. It sorts by this score and returns the top-k results. The final algorithm, schemaRank, is the same as featureRank, except that it also includes the ACSDb-based schema coherency score. A coherent schema is one where the attributes are all tightly related to one another in the ACSDb schema corpus. The coherency score is based on Pointwise Mutual Information (PMI), which is designed to give a sense of how strongly two items are related. The coherency score for a schema is the average of the PMI scores over all attribute pairs in the schema.
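To make the ACSDb statistics concrete, the sketch below computes attribute probabilities, pairwise PMI, and the schema coherency score over a tiny hand-made table of schema frequency counts. The counts and attribute names are invented for illustration only; the real ACSDb holds 2.6M unique schemas.

```python
from itertools import combinations
from math import log

# Hypothetical miniature ACSDb: frequency counts of unique schemas,
# in the spirit of combo_make_model_year = 13 from Section 2.
schema_counts = {
    ("make", "model", "year"): 13,
    ("make", "model"): 40,
    ("make", "year", "price"): 7,
    ("name", "size", "last-modified"): 20,
}

def p(*attrs):
    """Fraction of schema occurrences containing all given attributes."""
    total = sum(schema_counts.values())
    hits = sum(c for s, c in schema_counts.items()
               if all(a in s for a in attrs))
    return hits / total

def pmi(a, b):
    """Pointwise mutual information between two attribute labels."""
    return log(p(a, b) / (p(a) * p(b)))

def coherency(schema):
    """Average pairwise PMI over all attribute pairs in the schema."""
    pairs = list(combinations(schema, 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)
```

Under schemaRank, a relation whose schema has a higher coherency score (its attributes frequently co-occur in the corpus) is favored over one whose attributes are rarely seen together.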
The PMI score for two attributes a and b is defined as:

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) )

The values of p(a), p(b) and p(a, b) can be obtained from the ACSDb corpus. For indexing purposes, the conventional linear-text inverted index structure of traditional IR does not suffice in this setting. Therefore, instead of the linear model with a single offset value for each element, a two-dimensional model with (x, y) offsets is implemented. The ranking function thus uses both the horizontal and vertical offsets to compute the input scoring features.

3.2 ACSDb Applications

This paper describes three novel applications. The first application, schema auto-complete, is designed to assist novice database designers when creating a relational schema. The user enters one or more attributes, and the auto-completer guesses the rest of the attribute labels, which should be appropriate to the target domain. The heuristic used is that for an input I, the best schema S of a given size is the one that maximizes p(S − I | I).

The next application is attribute synonym finding. The synonym-finder takes a set of context attributes, C, as input. It then computes a list of attribute pairs P that are likely to be synonymous in schemas that contain C. For example, in the context of the attributes album and artist, the ACSDb synonym-finder outputs song/track. The algorithm is based on the following observations:

• Synonymous attributes a and b will never appear together in the same schema, i.e. p(a, b) = 0.
• The odds of synonymy are higher if p(a, b) = 0 despite large values of p(a) and p(b).
• Two synonyms will appear in similar contexts, i.e. for a third attribute z ∉ C, p(z|a, C) ≈ p(z|b, C).

These observations lead to the synonymy score:

    syn(a, b) = p(a)p(b) / (ε + Σ_{z∈A} (p(z|a, C) − p(z|b, C))²)

where A is the set of attributes and ε avoids division by zero.

The third and final application is Join Graph Traversal. The goal here is to provide a useful way of navigating this huge graph of 2.6M unique schemas.
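The synonym-finder's scoring can be sketched in the same style. The miniature music-domain schema counts, the candidate attribute pool, and the eps smoothing value below are all assumptions for illustration, not the paper's actual data:

```python
# Hypothetical schema frequency counts over a music domain,
# invented to mirror the album/artist -> song/track example.
schema_counts = {
    ("album", "artist", "song", "genre"): 10,
    ("album", "artist", "track", "genre"): 8,
    ("album", "artist", "song"): 5,
    ("album", "artist", "track"): 4,
}

def p(attrs):
    """Fraction of schema occurrences containing all given attributes."""
    total = sum(schema_counts.values())
    return sum(c for s, c in schema_counts.items()
               if set(attrs) <= set(s)) / total

def p_cond(z, a, context):
    """p(z | a, C): chance z co-occurs with a given the context C."""
    given = p([a] + context)
    return p([z, a] + context) / given if given else 0.0

def syn(a, b, context, attrs, eps=0.01):
    """Synonymy score: high when a and b never co-occur (p(a,b)=0)
    yet predict the same co-occurring attributes z."""
    if p([a, b] + context) > 0:   # true synonyms never co-occur
        return 0.0
    d = sum((p_cond(z, a, context) - p_cond(z, b, context)) ** 2
            for z in attrs if z not in (a, b) and z not in context)
    return p([a]) * p([b]) / (eps + d)
```

With these counts, syn ranks song/track far above pairs that do co-occur, matching the album/artist example above.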
The basic join graph (N, L) has a node for each unique schema and an undirected join link between any two schemas sharing a label. The key here is to reduce join clutter. This is done by formulating a distance metric called join neighbor similarity. This metric measures whether a shared attribute D plays a similar role in its schemas X and Y. If D serves the same role in each of its schemas, then those schemas can be clustered together during join graph traversal:

    neighborSim(X, Y, D) = (1 / |X||Y|) Σ_{a∈X, b∈Y} log( p(a, b|D) / (p(a|D) p(b|D)) )

4 Experimental Highlights

• Not surprisingly, Rank-ACSDb beats Naive by 78-100%.
• The fraction of top-k tables which are relevant showed better performance at higher grain.
• For schema auto-completion, the output schemas were always coherent. Schemas were recalled even for a non-intuitive entity like ab, which pertains to baseball. Giving WebTables multiple tries allowed it to succeed even for an ambiguous input like name.
• For synonym finding, the fraction of correct synonyms in the top-5 was 80% on average. This average decreases as k increases because the results become more generic.
• For join graph traversal, the schemas were invariably correctly clustered in the join graph; only a single exception was indicated.

5 Discussion

• The authors claim that one of the major challenges in relation search is that it lacks the incoming hyperlink anchor text used in traditional IR, and that PageRank-like metrics therefore cannot be used. But the first two algorithms proposed for ranking, i.e. naiveRank and filterRank, simply use the top-ranked pages of traditional IR and output the tables associated with the top-k keyword search results. This is a major contradiction with their claim. Moreover, these two algorithms do not propose any new idea.
In fact, in the Future Work section, the authors tacitly acknowledge that PageRank is being indirectly used via the document search results, and they wish to fully integrate PageRank-like metrics with relation search in the future.
• The authors have used heuristics to filter out relational data from the HTML tables. Similar heuristics could also have been used to propose relationships among the attributes in the derived schemas.
• In comparing the fraction of high-scoring relevant tables in the top-k, the naive algorithm has been used as the basis for comparison. Having such a weak baseline is bound to make the other algorithms look better, in one case by 100%.
• The idea of schema auto-complete is novel but seems to have limited application. Any RDBMS would employ multiple tables with keys, relations and constraints; schema auto-complete does not help here in any way.
• Only two human judges have been used for relation ranking. They ranked more than a thousand (query, relation) pairs divided over 30 queries, but we do not know the nature of these queries.
• The dataset used is huge, i.e. 14.1 billion HTML tables, of which 154M contain high-quality relational data. Such a big dataset is accessible only to large corporations like Google or Yahoo. The results cannot be challenged by the academic community, as there is no way to access such a large dataset.
• HTML tables do not have semantics associated with them. The trend (Web 3.0, Semantic Web) is therefore to supply machine-readable data (e.g. JSON) to web pages and have JavaScript code render it as a table in the client browser. In such a setting, the crawl technique suggested in the paper will fail, as it relies on the presence of the actual HTML <table> tag.

References

[1] M. Cafarella, A. Halevy, Z. Wang, E. Wu, and Y. Zhang, "Uncovering the Relational Web", Eleventh International Workshop on the Web and Databases (WebDB), June 2008, Vancouver, Canada.
[2] M. Cafarella, A. Halevy, and J. Madhavan, "Structured Data on the Web", Communications of the ACM 54(2): 72-79, 2011.