HTML Table Interpretation by Sibling Page Comparison in the Molecular Biology Domain

Cui Tao and David W. Embley
Data Extraction Research Group
Department of Computer Science, Brigham Young University, Provo, UT, 84602

1. Introduction

PROBLEMS:
- Huge, evolving number of bio-databases
  - Molecular biology database collection
  - 2004: total 548, 162 more than 2003
  - 2005: total 719, 171 more than 2004
- Different access capabilities
- Syntactic heterogeneity
- Semantic heterogeneity
- Updated at any time by independent authorities

GOALS:
To help biologists cross-search various resources. Examples:
- “Find genes which are longer than 5kbp, whose products have at least two helices, and participate in glycolysis” (GenBank, PDB, KEGG)
- “Find genes newly annotated after Jan. 2003 in the fly and worm genomes” (FlyBase, WormBase)

SOLUTION:
- Source page understanding: table interpretation (focus of this poster)
  - Table recognition
  - Table pattern generalization
  - Pattern adjustment
- Information extraction & semantic annotation
- Source location through semantic indexing
- Cross-database query processing

2. Table Interpretation

Input: an HTML table
Output: a formal table notation (Wang notation)

Table-Interpretation Steps:
- HTML table → DOM tree
- Tree matching → Find sibling tables
- Variable fields ~ values & Fixed fields ~ labels
- Infer pattern

[Figure: two example sibling tables shown as DOM trees (table, tr, td), with callouts marking labels, values, non-data tables, an optional label, and an optional value. The cell contents of the two tables are listed below.]

Sibling table 1:
  Gene Model   | Status               | Nucleotides (coding/transcript) | Protein    | Swissprot  | Amino Acids
  F47G6.1 1, 2 | confirmed by cDNA(s) | 1773/7391 bp                    | WP:CE26812 | DTN1_CAEEL | 590 aa

Sibling table 2:
  Gene Model       | Status                         | Nucleotides (coding/transcript) | Protein    | Amino Acids
  F18H3.5a 1, 2    | confirmed by cDNA(s)           | 1029/3051 bp                    | WP:CE18608 | 342 aa
  F18H3.5b 1, 2, 3 | partially confirmed by cDNA(s) | 1221/1704 bp                    | WP:CE28918 | 406 aa

2.1. Table Recognition

- Consider all tagged tables
- Unnest
- Filter out tables containing no data

2.2. Sibling Page Comparison

- Sibling pages & sibling tables
- Sibling table match percentage: max match score / tree size
  - > the high threshold: exact match or near exact match
  - < the low threshold: false match
  - In between: sibling tables

2.3. Structure Pattern Generation

- Pre-defined structure templates
- Pattern combinations
- Matches any pre-defined pattern template?
- Generates a specific structure pattern for the table
- Dynamically adjust the structure pattern (location, structure)

3. Results

EXPERIMENTAL RESULTS:
- Test Set: 10 web sites; 100 sibling pages; 862 HTML tables
- Table Recognition: correctly eliminated all but 3 non-data tables
- Pattern Generation: successfully recognized 28 of 29 patterns
- Dynamic Adjustment: 5 location adjustments; 12 structure adjustments; all correct

4. Conclusions

We can:
- Recognize data tables
- Find labels and values
- Infer table patterns
- Dynamically adjust table patterns

Domain Generality: work for other domains

Contact Information

Data Extraction Research Group
Department of Computer Science
Brigham Young University
Provo, UT 84602
Cui Tao, [email protected]
David W. Embley, [email protected]
http://www.deg.byu.edu/
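
The sibling table match percentage and the "variable fields ~ values, fixed fields ~ labels" rule described above can be illustrated with a short Python sketch. This is not the authors' implementation. The threshold values (0.9 and 0.3), every function name, and the position-wise cell comparison (used here in place of the poster's DOM tree matching) are assumptions made for illustration. The sample rows are abridged from the example tables: the optional Swissprot column and the footnote markers are dropped so that the two tables align cell by cell.

```python
from typing import List

Table = List[List[str]]  # a table as rows of cell texts

# Assumed values; the poster names a high and a low threshold
# but does not state their values.
HIGH_THRESHOLD = 0.9
LOW_THRESHOLD = 0.3


def match_percentage(max_match_score: float, tree_size: int) -> float:
    """Match percentage as stated on the poster: max match score / tree size."""
    if tree_size <= 0:
        raise ValueError("tree_size must be positive")
    return max_match_score / tree_size


def classify_match(pct: float,
                   high: float = HIGH_THRESHOLD,
                   low: float = LOW_THRESHOLD) -> str:
    """Map a match percentage onto the three cases listed on the poster."""
    if pct > high:
        return "exact or near-exact match"
    if pct < low:
        return "false match"
    return "sibling tables"


def classify_cells(table_a: Table, table_b: Table) -> List[List[str]]:
    """Compare two aligned sibling tables position by position.

    A cell with the same text in both tables is a fixed field ("label");
    a cell whose text differs is a variable field ("value").
    """
    return [
        ["label" if a == b else "value" for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(table_a, table_b)
    ]


HEADER = ["Gene Model", "Status", "Nucleotides (coding/transcript)",
          "Protein", "Amino Acids"]

# Abridged rows from the two example sibling tables.
TABLE_A: Table = [HEADER,
                  ["F47G6.1", "confirmed by cDNA(s)", "1773/7391 bp",
                   "WP:CE26812", "590 aa"]]
TABLE_B: Table = [HEADER,
                  ["F18H3.5b", "partially confirmed by cDNA(s)",
                   "1221/1704 bp", "WP:CE28918", "406 aa"]]

if __name__ == "__main__":
    for roles in classify_cells(TABLE_A, TABLE_B):
        print(roles)
    print(classify_match(match_percentage(60, 100)))
```

With only two sibling tables, a value that happens to coincide on both pages (for example, "confirmed by cDNA(s)" in the rows for F47G6.1 and F18H3.5a) would be misread as a label by this naive comparison, so a practical version would likely compare more than two sibling pages.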