BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carlson and Charles Schafer

Abstract
• A system that requires no human supervision
• Previous work required significant human effort
• Their solution:
  • Train the model on 2-5 annotated pages from 4-6 web sites
  • No human supervision for the target web site
• Result: 83.8% and 91.1% accuracy for different sites

Introduction
• Extracting structured records from detail pages of semi-structured web pages

Introduction
• Why the semi-structured web?
  • Great source of information
  • Attribute/value structure: useful for downstream learning or querying systems

Related Work
• Problems with previous work
  • No labeled example pages needed, but the output must be labeled manually
  • Irrelevant fields (20 data fields and 7 schema columns)
  • DeLa system (automatically labels extracted data)
• Problems with labeling detected data fields
  • A data field may have no label
  • Multiple fields may have the same data type

Methods
• Terms:
  • Domain schema: a set of attributes
  • Schema column: a single attribute
  • Detail page: a page that corresponds to a single data record
  • Data field: a location within a template for that site
  • Data value: an instance of that data field

Methods
• Detecting Data Fields
  • Partial Tree Alignment algorithm

Methods
• Classifying Data Fields
  • Assign a score to each schema column
    • c: data values => data for the training schema column
    • f: data fields => contexts from the training data
  • Compute the score:
  • Use a classifier to map data fields to schema columns
  • Use a model
    • K different feature types

Methods
• Feature Types
  • Precontext character 3-grams
  • Lowercase value tokens
  • Lowercase value character 3-grams
  • Value token types

Methods
• Comparing Distributions of Feature Values
  • Advantages
    • Similar data values
    • Avoids over-fitting when there are
      • high-dimensional feature spaces
      • a small number of training examples

Methods
• KL-Divergence
  • Smoothed version
  • Skew Similarity Score

Methods
• Combining Skew Similarity Scores
  • Combine the skew similarity scores for the different feature types using a linear regression model
  • Stacked classifier model
• Labeling the Target Site
  • Highest score for each schema column c

Evaluation
• Accuracy of automatically labeling new sites
• How well it makes recommendations to human annotators
• Input: a collection of annotated sites for a domain
• Method: cross-validation

Results by Site

Results by Schema Column

Identifying Missing Schema Columns
• Vacation rentals: 80.0%
• Job sites: 49.3%

Conclusion
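
Supplementary sketch for the Methods slides on KL-divergence and the skew similarity score. The slides name these scores but do not give the formula, so the code below is an assumption: it uses the common skew divergence, KL(p || alpha*q + (1 - alpha)*p), and takes its negative as the similarity. The value alpha = 0.5, the function names, and the sample strings are all hypothetical and chosen only for illustration. It covers a single feature type (lowercase value character 3-grams) and leaves out the linear regression step that combines several feature types.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return the lowercase character 3-grams of a string."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def distribution(values):
    """Relative frequency of each character 3-gram over a list of strings."""
    counts = Counter(g for v in values for g in char_trigrams(v))
    total = sum(counts.values())
    if total == 0:
        return {}  # no string was long enough to yield a 3-gram
    return {g: n / total for g, n in counts.items()}


def skew_divergence(p, q, alpha=0.5):
    """KL(p || alpha*q + (1 - alpha)*p).

    Mixing p into the second argument keeps the result finite even when
    q assigns zero probability to a feature that p has seen.
    """
    div = 0.0
    for g, pg in p.items():
        mix = alpha * q.get(g, 0.0) + (1 - alpha) * pg
        div += pg * math.log(pg / mix)
    return div


def skew_similarity(p, q, alpha=0.5):
    """Assumed similarity score: the negated skew divergence (higher is closer)."""
    return -skew_divergence(p, q, alpha)


if __name__ == "__main__":
    # Hypothetical values of one data field on the target site.
    field = distribution(["$120 per night", "$95 per night"])
    # Hypothetical training values for two schema columns.
    columns = {
        "price": distribution(["$150 per night", "$80 per night"]),
        "location": distribution(["Lake Tahoe, CA", "Austin, TX"]),
    }
    scores = {c: skew_similarity(field, dist) for c, dist in columns.items()}
    print(scores)
    print("best column:", max(scores, key=scores.get))
```

Because only distributions of feature values are compared, each feature type contributes one number per (data field, schema column) pair, which is what keeps the later combining model small when few training sites are available.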