Extending SDARTS:
Extracting Metadata from Web Databases
and Interfacing with Open Archives Initiative
Panagiotis G. Ipeirotis
Tom Barry
Luis Gravano
Computer Science Dept., Columbia University
Metasearching? Why?
“Surface” Web vs. “Hidden” Web
• “Surface” Web
  – Link structure
  – Crawlable
• “Hidden” Web
  – Documents “hidden” in databases
  – No link structure
  – Search engines do not index them
  – Need to query each collection individually
[Figure: a web search form with a “Keywords” field and SUBMIT and CLEAR buttons]
5/22/2017, Columbia University, Computer Science Dept.

Metasearching Challenges
• Select good databases for a given query
• Evaluate the query at these databases
• Merge the results from these databases
Requirements: “content summaries” of databases (frequencies of words) and uniform interfaces
[Figure: a metasearcher in front of three Hidden Web sources (non-indexed documents; a relational database, library, etc.; an existing web database), each with a content summary of word frequencies: “wireless: 2,000, network: 8,000, ...”, “wireless: 0, network: 10, ...”, and “wireless: 5, network: 40, ...”]

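A minimal sketch of how content summaries can drive the database selection step. The scoring function (a product of per-word document fractions, assuming words occur independently) is an illustrative assumption, not the algorithm used by SDARTS. The word counts echo the figure; the database names and sizes are hypothetical.

```python
# Hypothetical content summaries: name -> (number of documents, word -> document frequency).
# The word counts echo the figure; names and sizes are made up for illustration.
SUMMARIES = {
    "db_a": (100_000, {"wireless": 2_000, "network": 8_000}),
    "db_b": (50_000, {"wireless": 0, "network": 10}),
    "db_c": (1_000, {"wireless": 5, "network": 40}),
}


def score(query_words, size, freqs):
    """Product of the fractions of documents containing each query word.

    Assumes words occur independently, a common simplification.
    """
    result = 1.0
    for word in query_words:
        result *= freqs.get(word, 0) / size
    return result


def select_databases(query_words, summaries, k=2):
    """Return the names of the k databases with the highest non-zero score."""
    scored = [
        (score(query_words, size, freqs), name)
        for name, (size, freqs) in summaries.items()
    ]
    ranked = sorted((pair for pair in scored if pair[0] > 0), reverse=True)
    return [name for _, name in ranked[:k]]
```

For the query "wireless network", this sketch skips db_b, which reports no documents containing "wireless", so the metasearcher never has to contact it.
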
Outline
• Background: SDARTS, SDLIP, STARTS
• Extracting content summaries from remote web databases
• Interfacing with Open Archives Initiative

SDARTS: SDLIP + STARTS
NOT yet another protocol
• SDLIP: interfaces
• STARTS: metadata
[Figure: a metasearcher connected to three wrapped sources, each exposing S (Search) and M (Metadata) interfaces; the sources are labeled “grep cat”, “select”, and “http://…. <% %>”]

STARTS: A Metasearching Protocol
• Defines:
  – Query language
  – Results format
  – Metadata for the collection
• Complements SDLIP for metasearching purposes
• Provides metadata for individual documents
• Provides content summaries for databases

PubMed content summary
number of documents = 3,868,552
…
cancer      1,398,178
heart       281,506
hepatitis   23,481
basketball  907

SDARTS: The Toolkit
• SDARTS architecture makes new-wrapper implementation easy
• SDARTS toolkit includes reference implementations for common types of text databases:
  – Local text databases
  – Local XML databases
  – Remote web databases
Customization requires just editing configuration files, no programming

SDARTS Content Summaries
• Detailed content summaries easily extracted from locally available (plain-text or XML) databases
• Detailed content summaries so far not available for remote web databases
  – No access to full contents

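When the full contents are available locally, an exact content summary is just a count of how many documents contain each word. A minimal sketch, assuming a simple lowercase alphanumeric tokenizer (the tokenization SDARTS actually uses is not specified here):

```python
import re
from collections import Counter


def content_summary(documents):
    """Return (number of documents, Counter of word -> document frequency)."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so each word counts at most once per document.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(words)
    return len(documents), doc_freq
```

The result has the same shape as the PubMed example: a document count plus one frequency per word.
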
Extracting Content Summaries from Remote Web Databases
• No direct access to remote documents
• Resort to document sampling (VLDB 2002):
  – Send queries to the database
  – Retrieve a representative document sample
  – Use the sample to create an approximation of the content summary
• Database selection algorithms work well even with approximate content summaries

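The sampling steps above can be sketched against a stand-in database. `ToyDatabase` is a hypothetical in-memory substitute for a remote search interface, and the probe words are supplied by the caller; how real probes are chosen is the subject of the following slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens of a document, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class ToyDatabase:
    """Stand-in for a remote database that only exposes a search interface."""

    def __init__(self, documents):
        self._docs = list(documents)

    def search(self, word, k):
        """Return (number of matches, top-k matching documents)."""
        matches = [d for d in self._docs if word in tokenize(d)]
        return len(matches), matches[:k]


def sample_summary(database, probes, k=4):
    """Probe the database and build an approximate summary from the sample."""
    sample = []
    match_counts = {}
    for word in probes:
        count, docs = database.search(word, k)
        match_counts[word] = count
        for doc in docs:
            if doc not in sample:  # keep each retrieved document once
                sample.append(doc)
    sample_freq = Counter()
    for doc in sample:
        sample_freq.update(tokenize(doc))
    return sample, sample_freq, match_counts
```

The sketch returns both the word frequencies within the sample and the match counts reported for the probes; the sample frequencies alone are only relative, which is why the match counts are kept.
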
Topic-based Sampling: Training
• Start with a predefined hierarchy and associated, pre-classified documents
• Train rule-based document classifiers for each node
• The output is a set of rules like:
  – Root: ibm AND computers → Computers
  – Root: lung AND cancer → Health
  – …
  – Health: hepatitis AND liver → Hepatitis
  – Health: angina → Heart
  – …
[Figure: topic hierarchy with Root at the top; Computers, ..., Health below Root; Heart, ..., Hepatitis below Health]

Topic-based Sampling: Probing
• Transform each rule into a query
• Sampling proceeds in rounds: in each round, the rules associated with each node are turned into queries to the database
• For each query:
  – Send query to database
  – Record number of matches
  – Retrieve top-k documents for query
• At the end of the round:
  – Analyze matches for each category
  – Choose category to focus on
The result is a representative document sample
[Figure: category hierarchy (Root; Sports, Health, Computers, Science; under Health: Heart, Cancer, Hepatitis, AIDS) with query probes such as metallurgy, aids, polo, oncology, football, liver, angina, keyboard, cancer, chf, dna, psa, ram, hiv, and “safe AND sex”, each annotated with its number of matches]

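A rough sketch of the round structure described above. The rules are the ones listed on the training slide; the criterion for choosing the focus category (most total matches) is an assumption, since the slide only says that matches are analyzed per category. Retrieving the top-k documents for each query is omitted to keep the example short.

```python
from collections import defaultdict

# Rules from the training slide, grouped by the node whose classifier produced them.
RULES = {
    "Root": [(("ibm", "computers"), "Computers"), (("lung", "cancer"), "Health")],
    "Health": [(("hepatitis", "liver"), "Hepatitis"), (("angina",), "Heart")],
}


def run_round(node, count_matches, rules=RULES):
    """Send one query per rule at `node`; return total matches per category.

    `count_matches` is a caller-supplied function that takes a tuple of words
    (an AND query) and returns the number of matches the database reports.
    """
    totals = defaultdict(int)
    for words, category in rules.get(node, []):
        totals[category] += count_matches(words)
    return dict(totals)


def choose_focus(totals):
    """Pick the category with the most matches (an assumed criterion)."""
    if not totals or max(totals.values()) == 0:
        return None
    return max(totals, key=totals.get)


def probe(count_matches, rules=RULES, start="Root"):
    """Descend the hierarchy round by round; return the chosen categories."""
    path, node = [], start
    while node in rules:
        focus = choose_focus(run_round(node, count_matches, rules))
        if focus is None:
            break
        path.append(focus)
        node = focus
    return path
```

With a database that reports many matches for "lung AND cancer" and then for "angina", the probing path would be Root, Health, Heart.
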
Sample Contains “Relative” Word Frequencies
• “Liver” appears in 200 out of 300 documents in sample
• “Kidney” appears in 100 out of 300 documents in sample
• “Hepatitis” appears in 30 out of 300 documents in sample
Document frequencies in actual database?
• Query “liver” returned 140,000 matches
• Query “hepatitis” returned 20,000 matches
• “kidney” was not a query probe…
Can exploit number of matches from one-word queries

Adjusting Document Frequencies
• We know absolute document frequency f of words from one-word queries
• We know ranking r of words according to document frequency in sample
• Mandelbrot’s formula connects word frequency f and ranking r:
  f = P (r + p)^{-B}
  where P, p, and B are parameters of the curve
• We use curve-fitting to estimate the absolute frequency of all words in sample
[Figure: words (among them cancer, liver, kidneys, stomach, hepatitis) plotted by frequency in sample, which is always known; a few have a known frequency (140,000, 60,000, and 20,000 matches), the rest have an unknown frequency (“?”) to be read off the fitted curve]

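A minimal sketch of the curve-fitting step, under stated assumptions: for a fixed p the formula is linear in log space (log f = log P - B log(r + p)), so the sketch tries a grid of p values and solves an ordinary least-squares line for each. The grid search is an illustrative choice, not necessarily the fitting procedure used in SDARTS. The frequencies 140,000, 60,000, and 20,000 come from the figure; the ranks 1, 3, and 10 are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, squared error)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    err = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
    return a, b, err


def fit_mandelbrot(known, p_grid=None):
    """Fit f = P * (r + p) ** (-B) to {rank: frequency}; returns (P, p, B)."""
    if p_grid is None:
        p_grid = [i / 10 for i in range(51)]  # 0.0, 0.1, ..., 5.0
    ranks = sorted(known)
    ys = [math.log(known[r]) for r in ranks]
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r in ranks]
        a, b, err = fit_line(xs, ys)
        if best is None or err < best[0]:
            best = (err, math.exp(a), p, -b)
    _, big_p, p, big_b = best
    return big_p, p, big_b


def estimate(rank, big_p, p, big_b):
    """Estimated absolute document frequency of the word at `rank`."""
    return big_p * (rank + p) ** (-big_b)


# Known frequencies from one-word probes; the ranks are hypothetical.
KNOWN = {1: 140_000, 3: 60_000, 10: 20_000}
```

Calling `estimate(2, *fit_mandelbrot(KNOWN))` then gives an estimate for a word that was never sent as a probe, such as "kidney" in the earlier example; with these made-up ranks it comes out at roughly 84,000.
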
Implementing Content-Summary Extraction in SDARTS Toolkit
• Implemented content-summary extraction module as J2EE-compliant servlet
  – First, build SDARTS wrapper for remote web database
  – Then, trigger extraction process to generate content summary automatically
• Module customizable with any classification scheme
  – Toolkit provides 72-node hierarchical scheme and associated classifiers
  – To add a new scheme, define the hierarchy and provide classifiers for the internal nodes

Fraction of PubMed Content Summary
PubMed content summary
number of documents = 3,868,552
…
cancer      1,398,178
aids        106,512
heart       281,506
angina      26,775
hepatitis   23,481
…
basketball  907
cpu         487
• Extracted automatically
• ~27,500 words in the extracted content summary
• Fewer than 200 queries sent
• Retrieved 4 documents per query
The extracted content summary accurately represents size and contents of the database

Topic-based Sampling: Conclusions
• SDARTS now supports extraction of detailed content summaries from any database, local or remote
• Sophisticated database selection algorithms can now be implemented on top of SDARTS
Implemented and available for download:
  – Database Selection Module
  – SDARTS Client with Database Selection

Interfacing with Open Archives Initiative (OAI)
“No man is an island, entire of itself;
every man is a piece of the continent,
a part of the main...” (John Donne)
• Export SDARTS metadata under OAI
• Access transparently any OAI collection through SDARTS
[Figure: an OAI Service Provider with an SDARTS/SDLIP Server (export), and an OAI Data Provider with an SDARTS Client (access)]

Exporting SDARTS Metadata under OAI
• SDARTS supports detailed, record-level metadata for each document, for XML and plain-text collections
  – Easy mapping to Dublin Core
• SDARTS also exports content summaries under OAI
  – Each SDARTS collection is mapped to an OAI set
  – We export the content summaries under OAI, as metadata about the set
Example record:
<PAPER>
<TITLE>The threat of vancomycin resistance</TITLE>
<AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
<FILENO>ajm_106_05_0489</FILENO>
<APPEARED>
<JRNL>American Journal of Medicine</JRNL>
<VOL>106</VOL><ISS>5</ISS>
<DATE>3 May </DATE> <YEAR>1999</YEAR>
</APPEARED>
<ABSTRACT> … </ABSTRACT>
<BODY> … </BODY>
</PAPER>
Collections on the COLUMBIA SDARTS Server:
• PubMed Publications
• Aides Medical Collection
• NOAH: New York Online Access to Health
• Cardiovascular Institute of the South
• Columbia's DLI2 Medical Corpus
• Harrisons Online

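A small sketch of what the mapping to Dublin Core could look like for the example record above. The field mapping in `FIELD_MAP` is an assumption made for illustration; the slides do not spell out which record fields SDARTS maps to which Dublin Core elements. The elided ABSTRACT and BODY fields are left out.

```python
import xml.etree.ElementTree as ET

RECORD = """<PAPER>
<TITLE>The threat of vancomycin resistance</TITLE>
<AUTHORS>Trish M. Perl MD, MSc</AUTHORS>
<FILENO>ajm_106_05_0489</FILENO>
<APPEARED>
<JRNL>American Journal of Medicine</JRNL>
<VOL>106</VOL><ISS>5</ISS>
<DATE>3 May </DATE> <YEAR>1999</YEAR>
</APPEARED>
</PAPER>"""

# Assumed mapping from record paths to Dublin Core element names.
FIELD_MAP = {
    "TITLE": "title",
    "AUTHORS": "creator",
    "FILENO": "identifier",
    "APPEARED/JRNL": "source",
    "APPEARED/YEAR": "date",
}


def to_dublin_core(xml_text, field_map=FIELD_MAP):
    """Return a dict of Dublin Core element name -> value for one record."""
    root = ET.fromstring(xml_text)
    record = {}
    for path, dc_name in field_map.items():
        value = root.findtext(path)
        if value is not None:
            record[dc_name] = value.strip()
    return record
```

Because the mapping is a plain table of paths, a different collection schema would only need a different `FIELD_MAP`, in the spirit of the toolkit's configuration-only customization.
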
SDARTS OAI Server: Details
• Uses OCLC OAI Server
• Uses MySQL (via JDBC) to store OAI records
  – Records materialized after first request for space efficiency
• Distributed as WAR file
  – Simple configuration: specify SDARTS/MySQL address
[Figure: OAI Service Provider, SDARTS OAI Interface, and SDARTS Server, with the interface connected via JDBC to a MySQL RDBMS]

Searching OAI Collections
• OAI is not designed for searching
  – Possible to restrict only “Date” and “Set”
• Need to search OAI collections
  – Users want to specify “Title”, “Author”, etc.
[Figure: User, OAI Service Provider, and OAI Data Provider (e.g., Library of Congress), with the query Author = “F. Douglass”]

Harvesting and Searching OAI within SDARTS
• OAI exports metadata records in XML
• SDARTS can index and search XML collections
Solution:
  – Harvest OAI records (by “Date”, “Set”)
  – Store records locally as XML documents
  – Use SDARTS XML wrapper to index them
The OAI collection is searchable as an SDARTS XML database
[Figure: OAI Data Provider (e.g., Library of Congress), harvest OAI/XML records, index OAI/XML records, SDARTS/SDLIP Server]

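A minimal sketch of the harvesting step, assuming the OAI-PMH 2.0 request parameters (`verb`, `metadataPrefix`, `from`, `set`) and response namespace. The base URL and set name in the usage note are placeholders, and writing the records to disk, following resumption tokens, and the network call itself are omitted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 responses (assumed protocol version).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, from_date=None, set_spec=None, prefix="oai_dc"):
    """Build a ListRecords request restricted by date and set."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    if from_date:
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def split_records(response_xml):
    """Return each <record> of a ListRecords response as its own XML string."""
    root = ET.fromstring(response_xml)
    return [
        ET.tostring(rec, encoding="unicode")
        for rec in root.iter(OAI_NS + "record")
    ]
```

For example, `list_records_url("http://example.org/oai", "2002-01-01", "some_set")` builds the request for one harvest; each string returned by `split_records` could then be stored as a local XML document for the XML wrapper to index.
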
Adding an OAI Collection in SDARTS
[Screenshot: form for adding an OAI collection, filled in with the values “http://memory.loc.gov/cgi-bin/oai”, “loc”, and “2002-01-01”]

Distributed Search over OAI
• SDARTS treats OAI collections as simple, local XML databases
• Exact content summaries are exported for OAI collections
• Possible to build sophisticated distributed search over OAI using SDARTS

SDARTS Content Summary for an OAI collection:
VT Electronic Thesis & Dissertation
number of documents = 2,948
…
study       1,479
thesis      493
…
cancer      13
basketball  2
…

Conclusions
• SDARTS can now extract rich content summaries from:
  – Local text and XML databases
  – Remote web databases
  – OAI-compliant collections
• SDARTS is now OAI-compliant
• SDARTS allows easy integration of any OAI collection
• SDARTS supports searching transparently over a wide range of heterogeneous collections
No programming required for any of the tasks

We are on the Web :-)
• SDARTS executables and documentation
• SDARTS source code with documentation
• SDARTS web client
• SDARTS database selection module
• SDARTS-OAI interface tools
• Sample SDARTS-compliant databases
http://sdarts.cs.columbia.edu/