HTML Table Interpretation by Sibling Page Comparison in the Molecular Biology Domain
Cui Tao and David W. Embley
Data Extraction Research Group
Department of Computer Science, Brigham Young University, Provo, UT, 84602
1. Introduction
3. Results
PROBLEMS:
Huge, evolving number of bio-databases
Molecular biology database collection
2004: total 548, 162 more than 2003
2005: total 719, 171 more than 2004
Different access capabilities
Syntactic heterogeneity
Semantic heterogeneity
Updated at any time by independent authorities
GOALS:
To help biologists cross-search various resources
Examples:
“Find genes which are longer than 5kbp, whose products
have at least two helices, and participate in glycolysis” –
GenBank, PDB, KEGG
“Find genes newly annotated after Jan. 2003 in the fly and
worm genomes” – FlyBase, WormBase
SOLUTION:
Source page understanding – Table interpretation (focus of this poster)
Sibling pages & sibling tables
Table recognition
Table pattern generalization
Pattern adjustment
Information extraction & semantic annotation
Source location through semantic indexing
Cross-database query processing
2.2. Sibling Page Comparison
Table-Interpretation Steps
HTML table → DOM tree
Tree matching → Find sibling tables
Variable fields ~ values & Fixed fields ~ labels
Infer pattern
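The table-interpretation steps (HTML table → DOM tree, tree matching, fixed fields as labels and variable fields as values) can be illustrated with a minimal Python sketch. It is not the poster's implementation: the positional `match_score` is a simplified stand-in for real tree matching, the 0.3 and 0.9 thresholds are assumed values, and all class and function names are hypothetical. Only the rule "match percentage = max match score / tree size, compared against a high and a low threshold" comes from the poster.

```python
from html.parser import HTMLParser

class Node:
    """One node of a simplified DOM tree."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.text, self.children = "", []

class TableTreeBuilder(HTMLParser):
    """Keep only table, tr, th and td elements; ignore all other tags."""
    KEEP = {"table", "tr", "th", "td"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.current.tag == tag:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text = (self.current.text + " " + data).strip()

def parse(html):
    """Step 1: HTML table -> DOM tree."""
    builder = TableTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def tree_size(node):
    return 1 + sum(tree_size(c) for c in node.children)

def match_score(a, b):
    """Step 2 (simplified): count nodes that agree in position, tag and text."""
    if a.tag != b.tag:
        return 0
    own = 1 if a.text == b.text else 0
    return own + sum(match_score(x, y) for x, y in zip(a.children, b.children))

def classify(pct, low=0.3, high=0.9):  # threshold values are assumptions
    if pct > high:
        return "exact or near-exact match"
    if pct < low:
        return "false match"
    return "sibling tables"

def cells(node):
    """Text of every th/td cell in document order."""
    own = [node.text] if node.tag in ("th", "td") else []
    return own + [t for c in node.children for t in cells(c)]

def split_fields(a, b):
    """Step 3: fixed cells are taken as labels, variable cells as values."""
    labels, values = [], []
    for x, y in zip(cells(a), cells(b)):
        if x == y:
            labels.append(x)
        else:
            values.append((x, y))
    return labels, values

# Two small tables built from cell texts shown on the poster.
HEADER = "<tr><td>Gene Model</td><td>Status</td><td>Amino Acids</td></tr>"
page_a = parse("<table>" + HEADER + "<tr><td>F47G6.1 1, 2</td>"
               "<td>confirmed by cDNA(s)</td><td>590 aa</td></tr></table>")
page_b = parse("<table>" + HEADER + "<tr><td>F18H3.5b 1, 2, 3</td>"
               "<td>partially confirmed by cDNA(s)</td><td>406 aa</td></tr></table>")

pct = match_score(page_a, page_b) / max(tree_size(page_a), tree_size(page_b))
verdict = classify(pct)
labels, values = split_fields(page_a, page_b)
```

With only two pages, a value that happens to be identical on both (for example the same Status text) would be mistaken for a label, which is one reason to compare more than two sibling pages.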
2. Table Interpretation
Input: an HTML table
Output: a formal table notation (Wang notation)
[Figure: DOM trees (table, tr, td nodes) of two example tables; cell contents follow]
Table 1
Gene Model | Status | Nucleotides (coding/transcript) | Protein | Swissprot | Amino Acids
F47G6.1 1, 2 | confirmed by cDNA(s) | 1773/7391 bp | WP:CE26812 | DTN1_CAEEL | 590 aa
Table 2
Gene Model | Status | Nucleotides (coding/transcript) | Protein | Amino Acids
F18H3.5a 1, 2 | confirmed by cDNA(s) | 1029/3051 bp | WP:CE18608 | 342 aa
F18H3.5b 1, 2, 3 | partially confirmed by cDNA(s) | 1221/1704 bp | WP:CE28918 | 406 aa
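As a rough illustration of the output side, the sketch below turns an interpreted table into a structure loosely modeled on Wang's abstract table idea: a set of label categories plus a mapping from one label per category to a data value. The poster does not show its actual encoding, so the function name `to_abstract_table`, the category name "Attribute", and the choice of Gene Model as the row key are all assumptions made for this example.

```python
def to_abstract_table(header, rows, key="Gene Model"):
    """Return (categories, delta) for a table with one header row.

    categories: category name -> list of labels
    delta: frozenset of (category, label) pairs -> cell value
    """
    key_idx = header.index(key)
    categories = {
        key: [row[key_idx] for row in rows],
        "Attribute": [h for h in header if h != key],  # hypothetical category name
    }
    delta = {}
    for row in rows:
        for label, value in zip(header, row):
            if label != key:
                path = frozenset({(key, row[key_idx]), ("Attribute", label)})
                delta[path] = value
    return categories, delta

# Cell texts taken from the second example table on the poster.
header = ["Gene Model", "Status", "Nucleotides (coding/transcript)",
          "Protein", "Amino Acids"]
rows = [
    ["F18H3.5a 1, 2", "confirmed by cDNA(s)", "1029/3051 bp",
     "WP:CE18608", "342 aa"],
    ["F18H3.5b 1, 2, 3", "partially confirmed by cDNA(s)", "1221/1704 bp",
     "WP:CE28918", "406 aa"],
]
categories, delta = to_abstract_table(header, rows)

# Look up one value by its pair of labels.
amino = delta[frozenset({("Gene Model", "F18H3.5b 1, 2, 3"),
                         ("Attribute", "Amino Acids")})]
```

Keying each value by an unordered set of labels keeps the lookup independent of whether labels sit in rows or columns, which is the property that makes a layout-neutral notation useful for querying.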
2.1. Table Recognition
Consider all tagged tables
Unnest
Filter out tables containing no data
Sibling table match percentage: max match score / tree size
> the high threshold: exact match or near exact match
< the low threshold: false match
In between: sibling tables
2.3. Structure Pattern Generation
Pattern Combinations
Matches any pre-defined pattern template?
Generates a specific structure pattern for the table
Pre-defined structure templates
Dynamically adjust the structure pattern
[Figure labels: location, structure, label, value, an optional label]
EXPERIMENTAL RESULTS:
Test Set: 10 web sites; 100 sibling pages; 862 HTML tables
Table Recognition: correctly eliminated all but 3 non-data tables
Pattern Generation: successfully recognized 28 of 29 patterns
Dynamic Adjustment: 5 location adjustments; 12 structure adjustments, all correct
4. Conclusions
We can:
Recognize data tables
Find labels and values
Infer table patterns
Dynamically adjust table patterns
Domain Generality: work for other domains
Contact Information
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Provo, UT 84602
Cui Tao, [email protected]
David W. Embley, [email protected]
http://www.deg.byu.edu/
[Figure label: an optional value]