Extending SDARTS: Extracting Metadata from Web Databases and Interfacing with Open Archives Initiative

Panagiotis G. Ipeirotis, Tom Barry, Luis Gravano
Computer Science Dept., Columbia University

Metasearching? Why? “Surface” Web vs. “Hidden” Web

[Figure: search form with a "Keywords" field and SUBMIT and CLEAR buttons]

- “Surface” Web
  - Link structure
  - Crawlable
- “Hidden” Web
  - Documents “hidden” in databases
  - No link structure
  - Search engines do not index them
  - Need to query each collection individually

5/22/2017 Columbia University Computer Science Dept.

Metasearching Challenges

- Select good databases for a given query
- Evaluate the query at these databases
- Merge the results from these databases
- “Content summaries” of databases (frequencies of words)
- Uniform interfaces

[Figure: a Hidden Web metasearcher over three kinds of sources (non-indexed documents; relational database / library / etc.; existing web database), with example content summaries "wireless: 2,000, network: 8,000, ...", "wireless: 0, network: 10, ..." and "wireless: 5, network: 40, ..."]

Outline

- Background: SDARTS, SDLIP, STARTS
- Extracting content summaries from remote web databases
- Interfacing with Open Archives Initiative

SDARTS: SDLIP + STARTS

- NOT yet another protocol
- SDLIP interfaces
- STARTS metadata

[Figure: a metasearcher on top of wrappers that each expose S (S = Search) and M (M = Metadata); the wrapped sources are labeled "grep cat", "select", and "http://…. <% %>"]

STARTS: A Metasearching Protocol

- Defines:
  - Query language
  - Results format
  - Metadata for the collection
- Complements SDLIP for metasearching purposes
- Provides metadata for individual documents
- Provides content summaries for databases

PubMed content summary
  number of documents = 3,868,552
  …
  cancer       1,398,178
  heart          281,506
  hepatitis       23,481
  basketball         907
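To make the role of content summaries in database selection concrete, here is a minimal sketch. It is not the SDARTS or STARTS selection algorithm: the `ContentSummary` class, the independence-based estimate, and the second database are assumptions made only for illustration. The PubMed numbers are the ones shown in the content summary above.

```python
from dataclasses import dataclass, field


@dataclass
class ContentSummary:
    """Document count plus per-word document frequencies for one database."""
    name: str
    num_documents: int
    doc_freq: dict = field(default_factory=dict)


def estimated_matches(summary, query_words):
    """Estimate how many documents contain all query words.

    Assumption (not from the slides): words occur independently, so the
    estimate is num_documents * product(df(word) / num_documents).
    """
    if summary.num_documents == 0:
        return 0.0
    estimate = float(summary.num_documents)
    for word in query_words:
        estimate *= summary.doc_freq.get(word, 0) / summary.num_documents
    return estimate


def rank_databases(summaries, query_words):
    """Order databases from most to least promising for the query."""
    return sorted(
        summaries,
        key=lambda s: estimated_matches(s, query_words),
        reverse=True,
    )


# Frequencies taken from the PubMed content summary on the slide.
pubmed = ContentSummary(
    "PubMed",
    3_868_552,
    {"cancer": 1_398_178, "heart": 281_506, "hepatitis": 23_481, "basketball": 907},
)

# Hypothetical second database, invented only so there is something to compare.
sports = ContentSummary(
    "SportsNews (hypothetical)",
    10_000,
    {"basketball": 2_500, "heart": 300},
)

best_for_hepatitis = rank_databases([sports, pubmed], ["hepatitis"])[0]
best_for_basketball = rank_databases([sports, pubmed], ["basketball"])[0]
```

A metasearcher would then send the query only to the top-ranked databases, which is why the quality of the content summaries matters.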
SDARTS: The Toolkit

- SDARTS architecture makes new-wrapper implementation easy
- SDARTS toolkit includes reference implementations for common types of text databases:
  - Local text databases
  - Local XML databases
  - Remote web databases
- Customization requires just editing configuration files, no programming

SDARTS Content Summaries

- Detailed content summaries easily extracted from locally available (plain-text or XML) databases
- Detailed content summaries so far not available for remote web databases
  - No access to full contents

Extracting Content Summaries from Remote Web Databases

- No direct access to remote documents
- Resort to document sampling (VLDB 2002):
  - Send queries to the database
  - Retrieve a representative document sample
  - Use the sample to create an approximation of the content summary
- Database selection algorithms work well even with approximate content summaries

Topic-based Sampling: Training

- Start with a predefined hierarchy and associated, pre-classified documents
- Train rule-based document classifiers for each node
- The output is a set of rules like:
    ibm AND computers → Computers     (Root)
    lung AND cancer → Health          (Root)
    …
    hepatitis AND liver → Hepatitis   (Health)
    angina → Heart                    (Health)
    …

[Figure: topic hierarchy with Root at the top; Computers, Health, ... below Root; Heart, Hepatitis, ... below Health]
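The document-sampling steps and the rules above can be sketched as a single, flat probing pass. This is only an illustration under stated assumptions: `ToyDatabase` is a hypothetical in-memory stand-in for a remote search interface, the documents are invented, and classifier training and the hierarchical choice of which category to focus on are left out.

```python
from collections import Counter

# Rules as shown on the slide: a conjunction of words implies a category.
RULES = [
    (("ibm", "computers"), "Computers"),
    (("lung", "cancer"), "Health"),
    (("hepatitis", "liver"), "Hepatitis"),
    (("angina",), "Heart"),
]


def rule_to_query(words):
    """Turn the left-hand side of a rule into a boolean query string."""
    return " AND ".join(words)


class ToyDatabase:
    """Hypothetical stand-in for a remote, search-only database."""

    def __init__(self, documents):
        self.documents = documents

    def search(self, words, top_k):
        """Return (number of matches, first top_k matching documents)."""
        matches = [
            doc for doc in self.documents
            if set(words) <= set(doc.lower().split())
        ]
        return len(matches), matches[:top_k]


def sample_database(db, rules, top_k=4):
    """Send one query per rule; record match counts and collect a sample."""
    match_counts = Counter()  # matches accumulated per category
    sample, seen = [], set()
    for words, category in rules:
        num_matches, docs = db.search(words, top_k)
        match_counts[category] += num_matches
        for doc in docs:
            if doc not in seen:  # keep each sampled document once
                seen.add(doc)
                sample.append(doc)
    return match_counts, sample


def approximate_summary(sample):
    """Document frequency of each word within the sample only."""
    doc_freq = Counter()
    for doc in sample:
        doc_freq.update(set(doc.lower().split()))
    return doc_freq


# Invented documents, used only to exercise the sketch.
toy_db = ToyDatabase([
    "lung cancer treatment options",
    "hepatitis and liver disease",
    "angina chest pain",
    "ibm sells computers",
    "football scores",
])

queries = [rule_to_query(words) for words, _ in RULES]
counts, sample = sample_database(toy_db, RULES)
summary = approximate_summary(sample)
```

The frequencies in `summary` are relative to the sample, not to the full database, which is the gap the frequency-adjustment step is meant to close.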
Topic-based Sampling: Probing

- Transform each rule into a query
- For each query:
  - Send query to database
  - Record number of matches
  - Retrieve top-k documents for query
- At the end of the round:
  - Analyze matches for each category
  - Choose category to focus on
- Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries to the database
- The result is a representative document sample

[Figure: probing example over the category nodes Root, Sports, Health, Computers, Science, Heart, Cancer, Hepatitis, and AIDS, with query probes such as metallurgy, aids, polo, oncology, football, liver, angina, keyboard, cancer, chf, dna, psa, ram, hiv, and "safe AND sex", each annotated with its number of matches]

Sample Contains “Relative” Word Frequencies

- “Liver” appears in 200 out of 300 documents in sample
- “Kidney” appears in 100 out of 300 documents in sample
- “Hepatitis” appears in 30 out of 300 documents in sample
- Document frequencies in actual database?
  - Query “liver” returned 140,000 matches
  - Query “hepatitis” returned 20,000 matches
  - “kidney” was not a query probe…
- Can exploit number of matches from one-word queries

Adjusting Document Frequencies

- We know ranking r of words according to document frequency in sample
- Mandelbrot’s formula connects word frequency f and ranking r: $f = P (r + p)^{-B}$
- We know absolute document frequency f of words from one-word queries
- We use curve-fitting to estimate the absolute frequency of all words in sample

[Figure: frequency versus rank curve. Frequency in sample is always known; absolute frequency is known for some words (140,000 matches, 60,000 matches, 20,000 matches) and unknown ("?") for the others. Words along the rank axis include cancer, liver, kidneys, ...]
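One way to carry out the curve-fitting step is sketched below. It assumes a simple approach that the slides do not specify: for each candidate p on a small grid, the model f = P (r + p)^(-B) is linear in log space, so an ordinary least-squares line gives P and B, and the p with the smallest squared error is kept. The function names and the (rank, frequency) pairs in the usage lines are hypothetical.

```python
import math


def fit_mandelbrot(known, p_grid=None):
    """Fit f = P * (r + p) ** (-B) to (rank, frequency) pairs.

    For a fixed p:  log f = log P - B * log(r + p),
    which is a straight line in x = log(r + p), y = log f.
    Ranks are assumed to start at 1 and frequencies to be positive.
    """
    if p_grid is None:
        p_grid = [i / 10 for i in range(0, 101)]  # p = 0.0, 0.1, ..., 10.0
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r, _ in known]
        ys = [math.log(f) for _, f in known]
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        if sxx == 0:
            continue  # all ranks identical: no line can be fitted
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        intercept = mean_y - slope * mean_x
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), p, -slope)
    if best is None:
        raise ValueError("need at least two points with distinct ranks")
    _, P, p, B = best
    return P, p, B


def estimate_frequency(rank, P, p, B):
    """Estimated absolute document frequency of the word at this rank."""
    return P * (rank + p) ** (-B)


# Hypothetical (rank in sample, matches from a one-word query) pairs.
known = [(1, 140_000), (4, 60_000), (9, 20_000)]
P, p, B = fit_mandelbrot(known)

# Estimate for a word that was never sent as a query, at hypothetical rank 5.
unknown_estimate = estimate_frequency(5, P, p, B)
```

Words that were used as one-word query probes supply the known points; every other word in the sample gets its absolute frequency from the fitted curve at its sample rank.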
Implementing Content-Summary Extraction in SDARTS Toolkit

- Implemented content-summary extraction module as J2EE-compliant servlet
  - First, build SDARTS wrapper for remote web database
  - Then, trigger extraction process to generate content summary automatically
- Module customizable with any classification scheme
  - Toolkit provides 72-node hierarchical scheme and associated classifiers
  - To add a new scheme, define the hierarchy and provide classifiers for the internal nodes

Fraction of PubMed Content Summary

PubMed content summary
  number of documents = 3,868,552
  …
  cancer       1,398,178
  aids           106,512
  heart          281,506
  angina          26,775
  hepatitis       23,481
  …
  basketball         907
  cpu                487

- Extracted automatically
- ~27,500 words in the extracted content summary
- Fewer than 200 queries sent
- Retrieved 4 documents per query
- The extracted content summary accurately represents size and contents of the database

Topic-based Sampling: Conclusions

- SDARTS now supports extraction of detailed content summaries from any database, local or remote
- Sophisticated database selection algorithms can now be implemented on top of SDARTS
- Implemented and available for download:
  - Database Selection Module
  - SDARTS Client with Database Selection

Interfacing with Open Archives Initiative (OAI)

“No man is an island, entire of itself; every man is a piece of the continent, a part of the main...” (John Donne)

- Export SDARTS metadata under OAI
- Access transparently any OAI collection through SDARTS

[Figure: SDARTS/SDLIP Server, OAI Service Provider, OAI Data Provider, and SDARTS Client]
Exporting SDARTS Metadata under OAI

- SDARTS supports detailed, record-level metadata for each document, for XML and plaintext collections
  - Easy mapping to Dublin Core
- SDARTS also exports content summaries under OAI
  - Each SDARTS collection is mapped to an OAI set
  - We export the content summaries under OAI, as metadata about the set

[Figure: Columbia SDARTS Server with its collections: PubMed Publications; Aides Medical Collection; NOAH: New York Online Access to Health; Cardiovascular Institute of the South; Columbia's DLI2 Medical Corpus; Harrisons Online]

Sample record:

    <PAPER>
      <TITLE>The threat of vancomycin resistance</TITLE>
      <AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
      <FILENO>ajm_106_05_0489</FILENO>
      <APPEARED>
        <JRNL>American Journal of Medicine</JRNL>
        <VOL>106</VOL><ISS>5</ISS>
        <DATE>3 May </DATE>
        <YEAR>1999</YEAR>
      </APPEARED>
      <ABSTRACT> … </ABSTRACT>
      <BODY> … </BODY>
    </PAPER>

SDARTS OAI Server: Details

- Uses OCLC OAI Server
- Uses MySQL (via JDBC) to store OAI records
- Records materialized after first request for space efficiency
- Distributed as WAR file
- Simple configuration: specify SDARTS/MySQL address

[Figure: OAI Service Provider, SDARTS OAI Interface, SDARTS Server, and MySQL RDBMS, with the interface connected to MySQL through JDBC]

Searching OAI Collections

- OAI is not designed for searching
  - Possible to restrict only “Date” and “Set”
- Need to search OAI collections
  - Users want to specify “Title”, “Author”, etc.

[Figure: a user issues the query Author = “F. Douglass” to an OAI Service Provider, which sits in front of an OAI Data Provider (e.g., Library of Congress)]
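The restriction to “Date” and “Set” is visible in the shape of an OAI harvesting request. The sketch below builds an OAI-PMH `ListRecords` URL with only the selective-harvesting arguments the protocol defines (`from`, `until`, `set`); the helper name, the base URL, and the set name are placeholders chosen for illustration, and no network request is made.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    Only date-range and set restrictions are available; there is no
    argument for filtering by title or author.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder endpoint and set name, not a real repository.
url = list_records_url(
    "http://example.org/oai",
    from_date="2002-01-01",
    set_spec="demo",
)
```

A query such as Author = “F. Douglass” therefore has to be answered by something that has already harvested and indexed the records.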
Harvesting and Searching OAI within SDARTS

- OAI exports metadata records in XML
- SDARTS can index and search XML collections
- Solution:
  - Harvest OAI records (by “Date”, “Set”)
  - Store records locally as XML documents
  - Use SDARTS XML wrapper to index them
- The OAI collection is searchable as an SDARTS XML database

[Figure: OAI Data Provider (e.g., Library of Congress) → harvest OAI/XML records → index OAI/XML records → SDARTS/SDLIP Server]

Adding an OAI Collection in SDARTS

[Figure: configuration example with the values http://memory.loc.gov/cgi-bin/oai, loc, and 2002-01-01]

Distributed Search over OAI

- SDARTS treats OAI collections as simple, local XML databases
- Exact content summaries are exported for OAI collections
- Possible to build sophisticated distributed search over OAI using SDARTS

SDARTS Content Summary for an OAI collection
  VT Electronic Thesis & Dissertation
  number of documents = 2,948
  …
  study        1,479
  thesis         493
  …
  cancer          13
  basketball       2
  …

Conclusions

- SDARTS can now extract rich content summaries from:
  - Local text and XML databases
  - Remote web databases
  - OAI-compliant collections
- SDARTS is now OAI-compliant
- SDARTS allows easy integration of any OAI collection into SDARTS
- SDARTS supports searching transparently over a wide range of heterogeneous collections
- No programming required for any of the tasks

We are on the Web :-)

- SDARTS executables and documentation
- SDARTS source code with documentation
- SDARTS web client
- SDARTS database selection module
- SDARTS-OAI interface tools
- Sample SDARTS-compliant databases

http://sdarts.cs.columbia.edu/
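Because harvested OAI records are stored locally as XML, their content summary can be computed exactly rather than estimated from a sample. The following sketch shows the idea with Python's standard XML parser; it is not the SDARTS XML wrapper, and the two records and their element names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def record_words(xml_text):
    """Distinct lowercase words appearing in the text of one XML record."""
    root = ET.fromstring(xml_text)
    text = " ".join(root.itertext())
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def exact_content_summary(xml_records):
    """Return (number of documents, document frequency of every word)."""
    doc_freq = Counter()
    for xml_text in xml_records:
        doc_freq.update(record_words(xml_text))
    return len(xml_records), doc_freq


# Hypothetical harvested records, stored as XML strings.
harvested = [
    "<record><title>A study of cancer treatment</title>"
    "<creator>A. Author</creator></record>",
    "<record><title>Thesis: a study of wireless networks</title>"
    "<creator>B. Author</creator></record>",
]

num_documents, doc_freq = exact_content_summary(harvested)
```

Each word is counted at most once per record, so the resulting numbers are document frequencies of the same kind as those in the content summaries shown earlier.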