BOOTSTRAPPING INFORMATION EXTRACTION FROM SEMI-STRUCTURED WEB PAGES
Andrew Carlson and Charles Schafer
Abstract
• A system that requires no human supervision for the target site
• Previous work:
• Required significant human effort
• Their solution:
• Requires only 2-5 annotated pages from 4-6 web sites to train the model
• No human supervision for the target web site
• Result:
• 83.8% and 91.1% accuracy for different sites
Introduction
• Extracting structured records from detail pages of semi-structured web sites
Introduction
• Why the semi-structured web?
• Great source of information
• Attribute/value structure is useful for downstream learning or querying systems
Related Work
• Problem of Previous Work
• No labeling of example pages, but manual labeling of the output
• Irrelevant fields (20 data fields and 7 schema columns)
• DeLa system (automatically labels extracted data)
• Problem of labeling detected data fields
• A data field does not have a label
• Multiple fields of the same data type
Methods
• Terms:
• Domain schema: a set of attributes
• Schema column: a single attribute
• Detail page: a page that corresponds to a single data record
• Data field: a location within the template for a site
• Data value: an instance of a data field on one detail page
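The terms above can be pictured as a small data model. This is only an illustrative sketch: the class name `DataField`, the use of a DOM-style path as the location, and the example column names are assumptions, not something the slides specify.

```python
from dataclasses import dataclass, field


@dataclass
class DataField:
    """A location within one site's page template (hypothetical representation)."""
    site: str
    location: str  # assumed to be a DOM-style path; the slides do not say
    values: list[str] = field(default_factory=list)  # one data value per detail page


# A domain schema is a set of schema columns (attribute names).
# These column names are made up for illustration.
job_schema = {"title", "company", "location"}

# One data field on a hypothetical site, with the data values
# observed on two of its detail pages.
title_field = DataField(site="jobs.example", location="/html/body/div[2]/span")
title_field.values.append("Software Engineer")
title_field.values.append("Data Analyst")
```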
Methods
• Detecting Data Fields
• Partial Tree Alignment Algorithm
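The goal of this step can be illustrated with a much simpler stand-in. The sketch below is not the Partial Tree Alignment Algorithm: it assumes the pages of a site are already aligned, so that each page is a mapping from a location path to its text, and it treats any location whose text varies across pages as a data field. The function name and the page encoding are hypothetical.

```python
from collections import defaultdict


def detect_data_fields(pages):
    """Return {location: [values...]} for locations whose text varies across pages.

    Simplified stand-in: assumes pages are already aligned by location path.
    """
    values_by_location = defaultdict(list)
    for page in pages:
        for location, text in page.items():
            values_by_location[location].append(text)
    return {
        loc: vals
        for loc, vals in values_by_location.items()
        if len(set(vals)) > 1  # constant text is treated as template boilerplate
    }


pages = [
    {"/div[1]/h1": "Job details", "/div[2]/span": "Software Engineer", "/div[3]/span": "Pittsburgh"},
    {"/div[1]/h1": "Job details", "/div[2]/span": "Data Analyst", "/div[3]/span": "Boston"},
]
fields = detect_data_fields(pages)
```

Here the heading is identical on both pages and is dropped, while the two `span` locations are kept as data fields with their per-page values.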
Methods
• Classifying Data Fields
• Assign a score to each pair of data field f and schema column c
• c: schema column, represented by the data values in its training data
• f: data field, represented by its contexts from the training data
• Compute the score:
• Use a classifier to map data fields to schema columns
• Use a model with K different feature types
Methods
• Feature Types
• Precontext character 3-grams
• Lowercase value tokens
• Lowercase value character 3-grams
• Value token types
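A rough sketch of how the four feature types might be extracted for one data value. The tokenizer, the token-type categories, and the function names are assumptions made for illustration; the slides name the feature types but not how they are computed.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Coarse token class; these categories are illustrative assumptions."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        if token.isupper():
            return "ALLCAPS"
        if token[0].isupper():
            return "CAPITALIZED"
        return "LOWERCASE"
    return "OTHER"


def extract_features(precontext, value):
    """Return one bag of feature values (a Counter) per feature type."""
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_char_3grams": Counter(char_ngrams(precontext)),
        "value_tokens": Counter(t.lower() for t in tokens),
        "value_char_3grams": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }


feats = extract_features("Salary: ", "$50,000 per year")
```

Summing these counters over all values of a data field gives one distribution per feature type, which is what the next slides compare.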
Methods
• Comparing Distributions of Feature Values
• Advantages
• Matches fields with similar data values
• Avoids over-fitting when there are
• High-dimensional feature spaces
• A small number of training examples
Methods
• KL-Divergence
• Smoothed version
• Skew Similarity Score
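The three ideas on this slide can be sketched as follows. The slides do not show the exact formulas, so this uses the standard skew divergence, KL(p || alpha*q + (1-alpha)*p), as the smoothed version, and negates it to get a similarity. Both the negation and the default `alpha=0.99` are assumptions.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn a bag of counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p, q):
    """KL(p || q); infinite if q gives zero mass to something p can produce."""
    total = 0.0
    for k, pk in p.items():
        if pk == 0:
            continue
        qk = q.get(k, 0.0)
        if qk == 0:
            return math.inf
        total += pk * math.log(pk / qk)
    return total


def skew_divergence(p, q, alpha=0.99):
    """Smoothed KL: q is mixed with a little of p, so the result stays finite."""
    keys = set(p) | set(q)
    mixed = {k: alpha * q.get(k, 0.0) + (1 - alpha) * p.get(k, 0.0) for k in keys}
    return kl_divergence(p, mixed)


def skew_similarity(p, q, alpha=0.99):
    """Assumed similarity: negated skew divergence (higher means more alike)."""
    return -skew_divergence(p, q, alpha)


p = normalize(Counter({"a": 1, "b": 1}))
q = normalize(Counter({"a": 1, "c": 1}))
```

Plain KL divergence is infinite for `p` and `q` here because `q` never produces `"b"`; the smoothed version stays finite, which matters when the distributions are estimated from only a few pages.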
Methods
• Combining Skew Similarity Scores
• Combine skew similarity scores for the different feature types using a linear regression model
• Stacked classifier model
• Labeling the Target Site
• Pick the data field with the highest score for each schema column c
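A minimal sketch of the combination and labeling steps, assuming the regression weights have already been fit on the training sites. The weight values, field names, and similarity numbers below are invented for illustration, and the function names are hypothetical.

```python
def combined_score(similarities, weights, bias=0.0):
    """Linear combination of the per-feature-type similarity scores."""
    return bias + sum(weights[k] * s for k, s in similarities.items())


def label_target_site(sims, weights, bias=0.0):
    """Map each schema column to its highest-scoring data field.

    sims[(field, column)] is a dict {feature_type: similarity}.
    """
    best = {}
    for (field, column), per_type in sims.items():
        score = combined_score(per_type, weights, bias)
        if column not in best or score > best[column][1]:
            best[column] = (field, score)
    return {column: field for column, (field, _) in best.items()}


# Made-up weights and similarities for two fields and two schema columns.
weights = {"value_tokens": 1.0, "precontext_char_3grams": 0.5}
sims = {
    ("f1", "title"): {"value_tokens": -0.2, "precontext_char_3grams": -0.1},
    ("f2", "title"): {"value_tokens": -1.5, "precontext_char_3grams": -2.0},
    ("f1", "location"): {"value_tokens": -1.8, "precontext_char_3grams": -1.0},
    ("f2", "location"): {"value_tokens": -0.3, "precontext_char_3grams": -0.4},
}
labels = label_target_site(sims, weights)
```

With these numbers, field `f1` is chosen for `title` and `f2` for `location`.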
Evaluation
• Accuracy of automatically labeling new sites
• How well it makes recommendations to human annotators
• Input: a collection of annotated sites for a domain
• Method: cross-validation
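One way to read this setup is leave-one-site-out cross-validation: each annotated site takes a turn as the unlabeled target while the others are used for training. The slides only say "cross-validation", so the per-site folds and the accuracy definition below are assumptions.

```python
def leave_one_site_out(sites):
    """Yield (training_sites, target_site) pairs, holding out each site once."""
    for i, target in enumerate(sites):
        yield sites[:i] + sites[i + 1:], target


def accuracy(predicted, gold):
    """Fraction of gold schema columns whose predicted data field is correct."""
    if not gold:
        return 0.0
    return sum(predicted.get(c) == f for c, f in gold.items()) / len(gold)


folds = list(leave_one_site_out(["site_a", "site_b", "site_c"]))
acc = accuracy({"title": "f1", "location": "f3"}, {"title": "f1", "location": "f2"})
```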
Results by Site
Results by Schema Column
Identifying Missing Schema Columns
• Vacation rentals: 80.0%
• Job sites: 49.3%
Conclusion