
Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with Open Archives Initiative

Panagiotis G. Ipeirotis, Tom Barry, Luis Gravano
Computer Science Dept., Columbia University

Metasearching? Why? "Surface" Web vs. "Hidden" Web

- "Surface" Web
  - Link structure
  - Crawlable
- "Hidden" Web
  - Documents "hidden" in databases
  - No link structure
  - Search engines do not index them
  - Need to query each collection individually

[Figure: a search form with a Keywords field and SUBMIT and CLEAR buttons.]

5/22/2017 Columbia University Computer Science Dept.

Metasearching Challenges

- Select good databases for a given query
- Evaluate the query at these databases
- Merge the results from these databases

Callouts: "Content summaries" of databases (frequencies of words); uniform interfaces.

[Figure: a metasearcher over three Hidden Web sources (non-indexed documents; a relational database, library, etc.; an existing web database), each with a content summary. The example summaries shown are "wireless: 2,000, network: 8,000", "wireless: 0, network: 10", and "wireless: 5, network: 40".]

Outline

- Background: SDARTS, SDLIP, STARTS
- Extracting content summaries from remote web databases
- Interfacing with Open Archives Initiative

SDARTS: SDLIP + STARTS

- NOT yet another protocol
- SDLIP interfaces
- STARTS metadata

[Figure: a metasearcher connected to several wrappers, each exposing S (Search) and M (Metadata), over sources such as grep, cat, select, http://…., and <% %>.]

STARTS: A Metasearching Protocol

- Defines:
  - Query language
  - Results format
  - Metadata for the collection
- Complements SDLIP for metasearching purposes
- Provides metadata for individual documents
- Provides content summaries for databases

PubMed content summary
  number of documents = 3,868,552
  …
  cancer      1,398,178
  heart         281,506
  hepatitis      23,481
  basketball        907
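To make the role of content summaries concrete, here is a minimal sketch of database selection driven by word frequencies. It is not the selection algorithm used with SDARTS, which the slides do not specify; the scoring rule (sum of document frequencies of the query words), the function name rank_databases, and the database names db_a, db_b, and db_c are assumptions made for illustration. Only the word counts are taken from the metasearcher figure.

```python
# Content summaries: for each database, word -> number of documents containing it.
# The counts come from the slide; the database names are hypothetical.
SUMMARIES = {
    "db_a": {"wireless": 2000, "network": 8000},
    "db_b": {"wireless": 0, "network": 10},
    "db_c": {"wireless": 5, "network": 40},
}


def rank_databases(query, summaries):
    """Rank databases by the summed document frequency of the query words.

    This is a deliberately naive score, used only to show how a metasearcher
    can choose databases without contacting them at query time.
    """
    words = query.lower().split()
    scores = {
        name: sum(summary.get(word, 0) for word in words)
        for name, summary in summaries.items()
    }
    # Highest score first.
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    print(rank_databases("wireless network", SUMMARIES))
```

A real selection algorithm would normally also account for database size and for words missing from the summary; this sketch ignores both.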
SDARTS: The Toolkit

- SDARTS architecture makes new-wrapper implementation easy
- SDARTS toolkit includes reference implementations for common types of text databases:
  - Local text databases
  - Local XML databases
  - Remote web databases
- Customization requires just editing configuration files, no programming

SDARTS Content Summaries

- Detailed content summaries easily extracted from locally available (plain-text or XML) databases
- Detailed content summaries so far not available for remote web databases
  - No access to full contents

Extracting Content Summaries from Remote Web Databases

- No direct access to remote documents
- Resort to document sampling (VLDB 2002):
  - Send queries to the database
  - Retrieve a representative document sample
  - Use the sample to create an approximation of the content summary
- Database selection algorithms work well even with approximate content summaries

Topic-based Sampling: Training

- Start with a predefined hierarchy and associated, pre-classified documents
- Train rule-based document classifiers for each node
- The output is a set of rules like:
  - Root node:
    - ibm AND computers → Computers
    - lung AND cancer → Health
    - …
  - Health node:
    - hepatitis AND liver → Hepatitis
    - angina → Heart
    - …

[Figure: topic hierarchy with Root at the top; Computers, Health, ... below it; Heart, Hepatitis, ... below Health.]
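The rule format above can be sketched as a small data structure. This is an illustrative assumption, not the toolkit's classifier representation: the Rule class, the RULES_BY_NODE table, and the functions classify and to_query are hypothetical names, and only the four example rules come from the slide.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    terms: tuple      # all terms must appear (conjunction)
    category: str     # child category the rule points to


# Example rules from the slide, grouped by the node whose classifier owns them.
RULES_BY_NODE = {
    "Root": [
        Rule(("ibm", "computers"), "Computers"),
        Rule(("lung", "cancer"), "Health"),
    ],
    "Health": [
        Rule(("hepatitis", "liver"), "Hepatitis"),
        Rule(("angina",), "Heart"),
    ],
}


def classify(text, node):
    """Return the sorted child categories of `node` whose rules match `text`."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    matched = {
        rule.category
        for rule in RULES_BY_NODE.get(node, [])
        if all(term in words for term in rule.terms)
    }
    return sorted(matched)


def to_query(rule):
    """Render a rule's antecedent as a boolean query string."""
    return " AND ".join(rule.terms)


if __name__ == "__main__":
    print(classify("IBM sells computers", "Root"))
    print([to_query(rule) for rule in RULES_BY_NODE["Health"]])
```

Keeping the rule antecedent as a tuple of terms means the same object can classify a local document and be rendered as a query string for a remote database.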
Topic-based Sampling: Probing

- Transform each rule into a query
- Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries to the database
- For each query:
  - Send query to database
  - Record number of matches
  - Retrieve top-k documents for query
- At the end of the round:
  - Analyze matches for each category
  - Choose category to focus on
- The result is a representative document sample

[Figure: probing example over a topic hierarchy (Root; Sports, Health, Computers, Science; Heart, Cancer, Hepatitis, AIDS under Health). Probe queries such as metallurgy, aids, polo, oncology, football, liver, angina, keyboard, cancer, chf, dna, psa, ram, hiv, and "safe AND sex" are each annotated with their number of matches.]

Sample Contains "Relative" Word Frequencies

- "Liver" appears in 200 out of 300 documents in sample
- "Kidney" appears in 100 out of 300 documents in sample
- "Hepatitis" appears in 30 out of 300 documents in sample

Document frequencies in actual database?

- Query "liver" returned 140,000 matches
- Query "hepatitis" returned 20,000 matches
- "kidney" was not a query probe…

Can exploit number of matches from one-word queries

Adjusting Document Frequencies

- We know ranking r of words according to document frequency in sample
- Mandelbrot's formula connects word frequency f and ranking r: f = P (r + p)^{-B}
- We know absolute document frequency f of words from one-word queries
- We use curve-fitting to estimate the absolute frequency of all words in sample

[Figure: plot of frequency against rank for words such as cancer, liver, and kidneys. Frequency in sample is always known; some words have a known frequency (140,000 matches, 60,000 matches, 20,000 matches) and the others have an unknown frequency, marked "?".]
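The curve-fitting step can be sketched as follows. The slides do not say which fitting procedure is used, so this is one simple possibility under stated assumptions: for each candidate offset p on a grid, fit log f = log P - B log(r + p) by ordinary least squares and keep the p with the smallest squared error. The function names fit_mandelbrot and estimate_frequency, the grid of p values, and the ranks in the usage example are all assumptions for illustration.

```python
import math


def fit_mandelbrot(points, p_grid=None):
    """Fit f = P * (r + p) ** (-B) to (rank, frequency) pairs.

    `points` holds words whose absolute frequency is known from one-word
    queries. At least three points with distinct ranks are needed for the
    choice of p to be meaningful. Returns (P, p, B).
    """
    if p_grid is None:
        p_grid = [i * 0.25 for i in range(81)]  # candidate offsets 0.0 .. 20.0
    ys = [math.log(freq) for _, freq in points]
    n = len(points)
    mean_y = sum(ys) / n
    best = None
    for p in p_grid:
        xs = [math.log(rank + p) for rank, _ in points]
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue  # all ranks identical: slope undefined
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        intercept = mean_y - slope * mean_x
        error = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or error < best[0]:
            best = (error, math.exp(intercept), p, -slope)
    if best is None:
        raise ValueError("need at least two distinct ranks")
    _, P, p, B = best
    return P, p, B


def estimate_frequency(rank, P, p, B):
    """Estimated absolute document frequency of the word at `rank`."""
    return P * (rank + p) ** (-B)


if __name__ == "__main__":
    # Match counts from the slides; the ranks are invented for illustration.
    known = [(2, 140000), (5, 60000), (12, 20000)]
    P, p, B = fit_mandelbrot(known)
    # Estimate a word that was never sent as a probe, at a hypothetical rank.
    print(round(estimate_frequency(4, P, p, B)))
```

Fitting in log space keeps the inner step a closed-form linear regression, so the only search is over p.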
Implementing Content-Summary Extraction in SDARTS Toolkit

- Implemented content-summary extraction module as J2EE-compliant servlet
  - First, build SDARTS wrapper for remote web database
  - Then, trigger extraction process to generate content summary automatically
- Module customizable with any classification scheme
  - Toolkit provides 72-node hierarchical scheme and associated classifiers
  - To add a new scheme, define the hierarchy and provide classifiers for the internal nodes

Fraction of PubMed Content Summary

PubMed content summary
  number of documents = 3,868,552
  …
  cancer      1,398,178
  aids          106,512
  heart         281,506
  angina         26,775
  hepatitis      23,481
  …
  basketball        907
  cpu               487

- Extracted automatically
- ~27,500 words in the extracted content summary
- Fewer than 200 queries sent
- Retrieved 4 documents per query

The extracted content summary accurately represents size and contents of the database

Topic-based Sampling: Conclusions

- SDARTS now supports extraction of detailed content summaries from any database, local or remote
- Sophisticated database selection algorithms can now be implemented on top of SDARTS

Implemented and available for download:

- Database Selection Module
- SDARTS Client with Database Selection

Interfacing with Open Archives Initiative (OAI)

"No man is an island, entire of itself; every man is a piece of the continent, a part of the main..." (John Donne)

- Export SDARTS metadata under OAI
- Access transparently any OAI collection through SDARTS

[Figure: an OAI Service Provider connected to an SDARTS/SDLIP Server; an OAI Data Provider connected to an SDARTS Client.]
Exporting SDARTS Metadata under OAI

- SDARTS supports detailed, record-level metadata for each document, for XML and plaintext collections
- Easy mapping to Dublin Core
- SDARTS also exports content summaries under OAI
  - Each SDARTS collection is mapped to an OAI set
  - We export the content summaries under OAI, as metadata about the set

Collections on the COLUMBIA SDARTS Server:

- PubMed Publications
- Aides Medical Collection
- NOAH: New York Online Access to Health
- Cardiovascular Institute of the South
- Columbia's DLI2 Medical Corpus
- Harrisons Online

Sample record:

  <PAPER>
    <TITLE>The threat of vancomycin resistance</TITLE>
    <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
    <FILENO>ajm_106_05_0489</FILENO>
    <APPEARED>
      <JRNL>American Journal of Medicine</JRNL>
      <VOL>106</VOL><ISS>5</ISS>
      <DATE>3 May </DATE>
      <YEAR>1999</YEAR>
    </APPEARED>
    <ABSTRACT> … </ABSTRACT>
    <BODY> … </BODY>
  </PAPER>

SDARTS OAI Server: Details

- Uses OCLC OAI Server
- Uses MySQL, via JDBC, to store OAI records
- Records materialized after first request for space efficiency
- Distributed as WAR file
- Simple configuration: specify SDARTS/MySQL address

[Figure: an OAI Service Provider talks to the SDARTS OAI Interface, which connects to the SDARTS Server and, through JDBC, to a MySQL RDBMS.]

Searching OAI Collections

- OAI is not designed for searching
  - Possible to restrict only "Date" and "Set"
- Need to search OAI collections
  - Users want to specify "Title", "Author", etc.

[Figure: a user issues the query Author = "F. Douglass" to an OAI Service Provider, which sits in front of an OAI Data Provider (e.g., Library of Congress).]
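The "Date" and "Set" restriction corresponds to the selective-harvesting arguments of an OAI-PMH ListRecords request (from, until, and set, alongside the required metadataPrefix). The sketch below only builds such a request URL. The helper name list_records_url, the base URL http://example.org/oai, and the set name physics are hypothetical, and nothing here is taken from the SDARTS code.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords URL.

    Harvesting can be narrowed only by date range and set. There is no
    argument for fields such as title or author, which is why searching
    needs a separate index.
    """
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if from_date:
        params.append(("from", from_date))
    if until_date:
        params.append(("until", until_date))
    if set_spec:
        params.append(("set", set_spec))
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical data provider and set name.
    print(list_records_url("http://example.org/oai",
                           from_date="2002-01-01", set_spec="physics"))
```

The oai_dc prefix requests unqualified Dublin Core, the metadata format every OAI-PMH data provider is required to support.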
Harvesting and Searching OAI within SDARTS

- OAI exports metadata records in XML
- SDARTS can index and search XML collections
- Solution:
  - Harvest OAI records (by "Date", "Set")
  - Store records locally as XML documents
  - Use SDARTS XML wrapper to index them

The OAI collection is searchable as an SDARTS XML database

[Figure: OAI/XML records are harvested from an OAI Data Provider (e.g., Library of Congress), then indexed by the SDARTS/SDLIP Server.]

Adding an OAI Collection in SDARTS

[Figure: configuration form filled in with the values http://memory.loc.gov/cgi-bin/oai, loc, and 2002-01-01.]

Distributed Search over OAI

- SDARTS treats OAI collections as simple, local XML databases
- Exact content summaries are exported for OAI collections
- Possible to build sophisticated distributed search over OAI using SDARTS

SDARTS Content Summary for an OAI collection:

VT Electronic Thesis & Dissertation
  number of documents = 2,948
  …
  study       1,479
  thesis        493
  …
  cancer         13
  basketball      2
  …

Conclusions

- SDARTS can now extract rich content summaries from:
  - Local text and XML databases
  - Remote web databases
  - OAI-compliant collections
- SDARTS is now OAI-compliant
- SDARTS allows easy integration of any OAI collection into SDARTS
- SDARTS supports searching transparently over a wide range of heterogeneous collections

No programming required for any of the tasks

We are on the Web :-)

- SDARTS executables and documentation
- SDARTS source code with documentation
- SDARTS web client
- SDARTS database selection module
- SDARTS-OAI interface tools
- Sample SDARTS-compliant databases

http://sdarts.cs.columbia.edu/
