Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Growing Parallel Paths for Entity-Page Retrieval Tim Weninger, Cindy Xide Lin, and Jiawei Han Department of Computer Science University of Illinois Urbana-Champaign, Urbana, IL Work Submitted to VLDB'10 Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Problem: Entity Page Retrieval Given: Reference page Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Problem: Entity Page Retrieval …Can We find Entity Pages of the same Type? Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Problem: Entity Page Retrieval …Can We find Entity Pages of the same Type? Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Definitions: Defn 1: Root to link path: ◊ - hrefX contains HTML-TABLE-TR1—TD-hrefX Defn 2: Parallel Links: Share a root to link path. i.e., lists of links Defn 3: Intra-page parallel paths: ◊ - hrefC ǁ ◊ - hrefB ◊ - hrefC ǁ ◊ - hrefX Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Definitions: Defn 4: Inter-page parallel ◊ - hrefC in Page A ǁ ◊ - hrefW in Page B Advanced Data Mining May 4, 2010 Defn 5: Parallel Web site paths Share intra or inter-page parallel paths across multiple pages Data and Information Systems Laboratory University of Illinois Urbana-Champaign Properties of Parallel Paths Prop. 1: Equal Path Length Property: Parallel paths must contain the same number of pages. Prop. 2: Parallel Page Property: The test of two paths being in parallel is equivalent to the result of tests of respective pages. Prop. 3: Equal Page Length Property: Parallel paths must have the same number of nodes across pages. Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Properties of Parallel Paths Prop. 4: Divergent Path Property: Parallel Paths can extend through separate pages Prop. 5: Early Termination Property: The test of two paths can be terminated at the first occurrence of a dissimilar node Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Finding Paths Naive Method Can be very costly Growing Parallel Paths First find example path Then grow paths which are in parallel to the example Repeat with alternate paths This makes magic happen Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Repeating with alternate paths k-shortest paths Do k-shortest path search. Explore all of these paths Removing links After exploring a path remove the edges from the graph Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Interpreting the Output Side Effect of Repeating with Alternate paths Given: Jiawei Han Result: Jiawei Han Cheng Zhai Kevin Chang Dan Roth Vikram Adve Roy Campbell 40 38 38 32 4 3 … Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Interpreting the Output Side Effect of Path Finding What does the link labels on the path tell us about the entity First path People Faculty Jiawei Han Personal Site Second path Research Data Mining Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Experiments Top 25 CS Departments in US (according to US News) Find all professors United States Congress Find all senators, representatives, and committees UIUC only Find all courses Final all research groups Baseline Google’s find similar search (essentially TFIDF-type ranking) Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Results Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Results Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Results Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Conclusions and Future Work Given a reference page and an example entity type we can retrieve all entity pages of the same type Implications: We can use this for information integration Search, retrieval can be enhanced Shortcomings: Most errors due to incorrect list finding Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign Questions? Advanced Data Mining May 4, 2010 Data and Information Systems Laboratory University of Illinois Urbana-Champaign