REPORT: WebTables: Exploring the Power of Tables on the Web
Udit Joshi, Swati Verma
November 6, 2011
Abstract
The Web has traditionally been viewed as a corpus of unstructured documents. But the web also contains
structured data like HTML tables. Such tables have been extracted from the web using a generic crawl on the
<table> tag. The crawl resulted in some 14.1 billion tables. Statistical classification filters reduced this to 154M tables containing high-quality relational data. These 154M filtered tables, together with their extracted schemas, form a huge corpus of relational databases. The paper looks at techniques for keyword search and ranking over this corpus, and shows interesting applications that leverage the statistical information gleaned from it.
1 Motivation
Structured data on the Web exists in several forms, including HTML tables, HTML lists, and back-end Deep Web
databases (such as the books sold on Amazon.com). This paper only focuses on HTML tables. WebTables is a
system designed to extract relational-style data from the Web expressed using the HTML <table> tag. However,
only about 1% of the content embedded in the HTML <table> tags represents good tables. WebTables focuses on
two main problems surrounding these databases. The first is how to extract them from the Web, given that 99% of tables carry no relational data. The second deals with harnessing the resulting huge collection of databases. The extraction of these tables has been described in [1]; this paper looks at the second aspect. A simple collection of statistics on the schemas in the corpus has been collated, called the Attribute Correlation Statistics Database, or ACSDb. These statistics are used in relation ranking during search as well as in several proposed novel applications.
2 Attribute Correlation Statistics Database (ACSDb)
Analysis of the extracted schemas shows a power-law distribution with respect to schema frequency and the number of attributes. The ACSDb contains a frequency count of how often each unique schema occurs, as well as counts of individual attribute occurrences. For example, combo_make_model_year = 13 indicates that this particular schema occurs 13 times in the corpus, while single_make = 3068 indicates that the attribute make has 3068 occurrences. These statistics are used for computing probabilities like p(name) and conditional probabilities like p(name|make). In all, the ACSDb contains 5.4M unique attribute labels in 2.6M unique schemas.
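To make the counting concrete, here is a minimal sketch of how such per-schema and per-attribute counts could yield these probabilities. The three-schema corpus and the helper names are purely illustrative, not taken from the paper.

```python
from collections import Counter

# Illustrative mini-corpus of extracted schemas; the real ACSDb holds
# counts for 2.6M unique schemas. These three schemas are made up.
schemas = [
    frozenset({"make", "model", "year"}),
    frozenset({"make", "model", "price"}),
    frozenset({"name", "size", "last-modified"}),
]

schema_counts = Counter(schemas)                      # the combo_* counts
attr_counts = Counter(a for s in schemas for a in s)  # the single_* counts
total = len(schemas)

def p(a):
    """p(a): fraction of schemas that contain attribute a."""
    return attr_counts[a] / total

def p_joint(a, b):
    """p(a, b): fraction of schemas that contain both a and b."""
    return sum(1 for s in schemas if a in s and b in s) / total

def p_cond(a, given):
    """p(a | given) = p(a, given) / p(given)."""
    return p_joint(a, given) / p(given)

print(p("make"))                # 0.666... in this toy corpus
print(p_cond("model", "make"))  # 1.0: every schema with make also has model
```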
3 Proposed Solutions
3.1 Ranking Algorithms
The paper proposes four algorithms for ranking relations. The first, naiveRank, simply sends the user's query to a search engine and fetches the top-k pages, returning the extracted relations in the URL order given by the search engine. If fewer than k tables are returned, naiveRank does not address this issue. The next algorithm, filterRank, addresses it by going a bit deeper into the search results to ensure that the top-k tables are returned. The third algorithm, featureRank, does not rely on an existing search engine. It uses certain relation-specific features to score each extracted relation in the corpus. The two most heavily weighted features are the number of hits in each relation's schema and the number of hits in each relation's leftmost column. It sorts by this score and returns the top-k results.
The final algorithm, schemaRank, is the same as featureRank, except that it also includes the ACSDb-based schema coherency score. A coherent schema is one whose attributes are all tightly related to one another in the ACSDb schema corpus. The coherency score is based on Pointwise Mutual Information (PMI), which is designed to give a sense of how strongly two items are related; the coherency score for a schema is the average of the PMI scores over all attribute pairs in the schema. The PMI score for two attributes a and b is defined as

    pmi(a, b) = log( p(a, b) / (p(a) p(b)) )

The values of p(a), p(b), and p(a, b) can be obtained from the ACSDb corpus.
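The coherency computation can be sketched directly from this definition. The following assumes the p and p_joint helpers from the ACSDb sketch above; pairs with p(a, b) = 0 make the log undefined and would need smoothing or skipping in practice, which is glossed over here.

```python
import math
from itertools import combinations

def pmi(a, b):
    """PMI of two attributes; assumes p and p_joint from the sketch above.

    Undefined when p_joint(a, b) == 0; real use would smooth such pairs.
    """
    return math.log(p_joint(a, b) / (p(a) * p(b)))

def coherency(schema):
    """Schema coherency: average PMI over all attribute pairs in the schema."""
    pairs = list(combinations(sorted(schema), 2))
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)

print(coherency({"make", "model", "year"}))  # positive: these co-occur in the toy corpus
```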
For indexing purposes, the conventional inverted index structure of traditional IR does not suffice in this setting. Therefore, instead of the linear text model with a single offset value for each element, a two-dimensional model with (x, y) offsets is implemented. The ranking function thus uses both the horizontal and vertical offsets to compute the input scoring features.
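A rough sketch of what such a two-dimensional posting could look like. The paper does not give its index layout, so the structure and field names below are assumptions chosen only to show how cell coordinates let the ranker distinguish schema-row hits from leftmost-column hits.

```python
from dataclasses import dataclass

@dataclass
class Posting:
    """One occurrence of a term inside an extracted table.

    Instead of a single linear offset, the posting records cell
    coordinates, so schema-row hits (y == 0) and leftmost-column hits
    (x == 0) can be told apart. Field names are illustrative.
    """
    table_id: int
    x: int  # column index of the cell containing the term
    y: int  # row index (0 = header/schema row)

# index: term -> list of postings
index: dict[str, list[Posting]] = {}

def add_occurrence(term: str, table_id: int, x: int, y: int) -> None:
    index.setdefault(term, []).append(Posting(table_id, x, y))

def schema_hits(term: str, table_id: int) -> int:
    """Count hits in the schema row of one table (a featureRank-style feature)."""
    return sum(1 for p in index.get(term, []) if p.table_id == table_id and p.y == 0)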
3.2 ACSDb Applications
This paper describes three novel applications. The first, called schema auto-complete, is designed to assist novice database designers when creating a relational schema. The user enters one or more attributes, and the auto-completer guesses the rest of the attribute labels, which should be appropriate to the target domain. The heuristic used is that for an input I, the best schema S of a given size is the one that maximizes p(S − I | I).
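A greedy sketch of this heuristic: rather than searching over whole schemas S, it adds one attribute at a time, each step picking the attribute with the highest conditional probability given the schema so far. The p_cond_set callable is assumed to estimate p(a | schema) from ACSDb counts; threshold and max_size are made-up stopping parameters, not the paper's.

```python
def auto_complete(input_attrs, all_attrs, p_cond_set, threshold=0.01, max_size=10):
    """Greedy schema auto-complete.

    p_cond_set(a, schema) is assumed to estimate p(a | schema) from the
    ACSDb; threshold and max_size are illustrative stopping parameters.
    """
    schema = set(input_attrs)
    while len(schema) < max_size:
        candidates = [(p_cond_set(a, schema), a) for a in all_attrs if a not in schema]
        if not candidates:
            break
        best_p, best_attr = max(candidates)
        if best_p < threshold:
            break
        schema.add(best_attr)
    return schema
```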
The next application is Attribute Synonym Finding. The synonym-finder takes a set of context attributes, C, as input. It then computes a list of attribute pairs P that are likely to be synonymous in schemas that contain C. For example, in the context of attributes album and artist, the ACSDb synonym-finder outputs song/track. The algorithm is based on the following observations (scored by the formula and sketch after this list):
• Synonymous attributes a and b will never appear together in the same schema, i.e. p(a, b) = 0.
• The odds of synonymy are higher if p(a, b) = 0 despite a large value for p(a)p(b).
• Two synonyms will appear in similar contexts, i.e. for a and b and a third attribute z ∉ C, p(z|a, C) ≈ p(z|b, C).

These observations combine into the synonymy score

    syn(a, b) = p(a)p(b) / (ε + Σ_{z∈A} (p(z|a, C) − p(z|b, C))²)

where ε is a small constant preventing division by zero, and higher scores indicate likelier synonym pairs.
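A sketch of this score, with the context C folded into the probability estimates for brevity; p and p_cond are assumed to be the ACSDb-backed helpers from the earlier sketch, and eps plays the role of ε in the formula above.

```python
def syn(a, b, all_attrs, eps=0.01):
    """Synonymy score for attributes a and b (context C folded into the stats).

    The numerator rewards pairs that are individually frequent; the
    denominator grows when a and b behave differently with respect to the
    other attributes z. Assumes p and p_cond from the ACSDb sketch above.
    """
    disagreement = sum(
        (p_cond(z, a) - p_cond(z, b)) ** 2
        for z in all_attrs
        if z not in (a, b)
    )
    return p(a) * p(b) / (eps + disagreement)
```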
The third and final application is called Join Graph Traversal. The goal here is to provide a useful way of navigating this huge graph of 2.6M unique schemas. The basic join graph (N, L) has a node for each unique schema and an undirected join link between any two schemas sharing a label. The key here is to reduce join clutter. This is done by formulating a distance metric called join neighbor similarity. This metric measures whether a shared attribute D plays a similar role in its schemas X and Y. If D serves the same role in each of its schemas, then those schemas can be clustered together during join graph traversal.
    neighborSim(X, Y, D) = (1 / (|X||Y|)) Σ_{a∈X, b∈Y} log( p(a, b|D) / (p(a|D) p(b|D)) )
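A direct transcription of this metric as a sketch. The p_cond_on and p_joint_on callables are assumed to supply the D-conditioned probabilities p(a|D) and p(a, b|D) from the ACSDb; as with PMI, zero-probability pairs would need smoothing before the log.

```python
import math

def neighbor_sim(X, Y, D, p_cond_on, p_joint_on):
    """Join neighbor similarity of schemas X and Y around shared attribute D.

    Averages a PMI-style score over all attribute pairs (a, b) drawn from
    the two schemas, conditioned on D. Probability helpers are assumed to
    be backed by ACSDb statistics.
    """
    total = sum(
        math.log(p_joint_on(a, b, D) / (p_cond_on(a, D) * p_cond_on(b, D)))
        for a in X for b in Y
    )
    return total / (len(X) * len(Y))
```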
4 Experimental Highlights
• Not surprisingly, Rank-ACSDb beats Naive by 78-100%.
• The fraction of top-k tables which are relevant showed better performance at higher grain.
• For Schema Auto-Completion, the output schemas were always coherent. Schemas were recalled even for a non-intuitive entity like ab, which pertains to baseball. Giving WebTables multiple tries allowed it to succeed even for an ambiguous input like name.
• For Synonym Finding, the fraction of correct synonyms in the top-5 was 80% on average. This average decreases as k increases because the results become more generic.
• For Join Graph Traversal, the schemas were invariably correctly clustered in the join graph. Only a single exception was indicated.
5 Discussion
• The authors claim that one of the major challenges in relation search is that it lacks the incoming hyperlink anchor text used in traditional IR, so PageRank-like metrics cannot be used. But the first two algorithms proposed for ranking, i.e. naiveRank and filterRank, simply use the top-ranked pages of traditional IR and output the tables associated with the top-k keyword search results. This is a major contradiction with their claim. Moreover, these two algorithms do not propose any new idea. In fact, in the Future Work section, the authors tacitly acknowledge that PageRank is being used indirectly via the document search results, and they wish to fully integrate PageRank-like metrics with relation search in the future.
• The authors have used heuristics to filter relational data out of the HTML tables. Similar heuristics could also have been used to propose relationships amongst the attributes in the derived schemas.
• In comparing the fraction of high-scoring relevant tables in the top-k, the naive algorithm has been used as the basis for comparison. Having such a weak baseline is bound to make the other algorithms look better, in one case by 100%.
• The idea of schema auto-complete is novel but seems to have limited application. Any RDBMS deployment would employ multiple tables with keys, relationships, and constraints; schema auto-complete does not help here in any way.
• Only two human judges have been used for relation ranking. They ranked more than a thousand (query, relation) pairs divided over 30 queries, but we do not know the nature of these queries.
• The dataset used is huge, i.e. 14.1 billion HTML tables, of which 154M contain high-quality relational data. Such a big dataset is accessible only to large corporations like Google or Yahoo. The results cannot be challenged by the academic community, as there is no way to access such a large dataset.
• HTML tables do not have semantics associated with them, so the trend (Web 3.0, the Semantic Web) is to supply machine-readable data (e.g. JSON) to web pages and have JavaScript code render it as a table in the client browser. In such a setting, the crawl technique suggested in the paper will fail, as it relies on the presence of an actual HTML <table> tag.
References
[1] M. Cafarella, A. Halevy, Z. Wang, E. Wu, and Y. Zhang. Uncovering the relational web. Eleventh International Workshop on the Web and Databases (WebDB), Vancouver, Canada, June 2008.
[2] M. Cafarella, A. Halevy, and J. Madhavan. Structured Data on the Web. Communications of the ACM 54(2): 72-79, 2011.