Other Thoughts - Rakesh Agrawal

Integrating Structured
& Unstructured Data
Goals
Identify some applications that have a crucial requirement for integration of unstructured and structured data
Identify key technical issues in integrating unstructured and structured data
Identify potential approaches
Definitions (simplified)

Structured object:
– <oid, {<name, value>}>
Unstructured object:
– <oid, {word}>
– <oid, unknown/complex structure>
Semi-structured object:
– <oid, {<name, value>}, {word}>
– <name, value> pairs may be
  • Given (e.g. author, title, etc.)
  • Extracted (e.g. Date, Zipcode, etc.)
  • Inferred (e.g. Topic)
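The three definitions above can be sketched as small Python data classes. This is only an illustrative reading of the slide: the class names, the `origin` map, and the `set_attribute` helper are assumptions, not part of the original, and the "unknown/complex structure" variant of an unstructured object is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StructuredObject:
    """<oid, {<name, value>}>"""
    oid: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class UnstructuredObject:
    """<oid, {word}>"""
    oid: str
    words: List[str] = field(default_factory=list)


@dataclass
class SemiStructuredObject:
    """<oid, {<name, value>}, {word}>"""
    oid: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    words: List[str] = field(default_factory=list)
    # Hypothetical bookkeeping: how each <name, value> pair was obtained.
    origin: Dict[str, str] = field(default_factory=dict)

    def set_attribute(self, name: str, value: Any, origin: str) -> None:
        """Store a pair and record whether it was given, extracted, or inferred."""
        if origin not in ("given", "extracted", "inferred"):
            raise ValueError("origin must be given, extracted, or inferred")
        self.attributes[name] = value
        self.origin[name] = origin


# Made-up example document.
paper = SemiStructuredObject(oid="doc-1", words="privacy preserving data mining".split())
paper.set_attribute("title", "An Example Title", "given")
paper.set_attribute("topic", "privacy", "inferred")
```

Keeping the origin next to each pair lets a query treat a given title differently from an inferred topic, for instance by trusting it more.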
Representative Applications
BPI: messages (unstructured)
Web applications: unstructured pages
Corporate portals
DSS involving a combination of simulation with a database system
News syndication: author etc. + story
Call centers: customer interaction + structured component of complaint
Mail systems/document systems
Tourist information systems
Product catalogs/engineering spec sheets
Patents/chemistry documents
Matching legal documents (with cross citations) with building codes --representative
Key Technical Issues
Query language & data model
– Sharp vs fuzzy / complete vs best-effort
– Boolean vs similarity queries (relationship to “value”)
Integration strategies
– Loose vs. tight coupling architectures (many possibilities)
– Search engine into DBMS or DBMS into search engine
– Early & late binding (warehousing vs virtual)
– Integration vs articulation (union vs intersection)
Feature extraction from unstructured data
Role of metadata & integrity constraints
Inconsistency of data sources
– Priority rules for mediation
Management & data organization issues
– Version management, freshness, security
Continuous queries over streams
Structured: People(firstname, lastname, company, location)
Semi-structured: Papers(title, {authors}, text)
Unstructured: Reviews
Q1: Reviews of papers by Almaden authors on II
– Search reviews using Join(People.<fn,ln>, Papers.authors).keywords
Q2: Folks in Almaden and Watson working on same topic
– Join on Papers.text, followed by a join with names in People
Q3: Papers on privacy & data mining by Agarwal in Watson
– Combine ranks of results from People and Papers
Q4: Almaden authors whose papers had negative reviews
– Infer sentiment of a review and interesting joins
Q5: Current research topics in Almaden
– Join People and Papers followed by clustering
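As a rough illustration of Q1, the sketch below joins People with Papers.authors, keeps the papers on a topic, and then uses their titles as keywords to search the reviews. All data is invented, the function names are hypothetical, and plain substring matching stands in for a real search engine.

```python
# Toy data, invented for illustration only.
people = [
    {"firstname": "Ann", "lastname": "Lee", "company": "ExampleCo", "location": "Almaden"},
    {"firstname": "Bo", "lastname": "Kim", "company": "ExampleCo", "location": "Watson"},
]
papers = [
    {"title": "Linking Text and Tables", "authors": [("Ann", "Lee")],
     "text": "We study information integration across text and tables."},
    {"title": "Fast Indexing", "authors": [("Bo", "Kim")],
     "text": "We study index structures."},
]
reviews = [
    "Linking Text and Tables is clearly written.",
    "Fast Indexing lacks experiments.",
]


def papers_by_location(people, papers, location):
    """Join People.<firstname, lastname> with Papers.authors for one location."""
    names = {(p["firstname"], p["lastname"]) for p in people if p["location"] == location}
    return [paper for paper in papers if names.intersection(paper["authors"])]


def search_reviews(reviews, candidate_papers, topic):
    """Keep on-topic papers, then return reviews that mention one of their titles."""
    on_topic = [p for p in candidate_papers if topic.lower() in p["text"].lower()]
    titles = [p["title"].lower() for p in on_topic]
    return [r for r in reviews if any(t in r.lower() for t in titles)]


almaden_papers = papers_by_location(people, papers, "Almaden")
q1_hits = search_reviews(reviews, almaden_papers, "information integration")
```

The first step is an exact (database-style) join, while the second is a best-effort text match, which is the mix of sharp and fuzzy evaluation the queries are meant to expose.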
Combining Scores
[Figure: Query → Chopper → DB and IR → Combiner → Result]
Query: Papers on privacy & data mining by Agarwal in Watson
DB results:
– Aggarwal, Watson, s1
– Agarwal, Almaden, s2
– Agrawal, Almaden, s3
IR results:
– Sigmod 00 paper, r2
– PODS 01 papers, r1
– KDD00 paper, r3
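The slide does not say how the DB scores (s) and IR scores (r) are merged. One possible combiner, shown here only as a sketch, is a weighted sum. The numeric scores, the assignment of authors to papers, the weight `alpha`, and the function name `combine` are all assumptions made for the example.

```python
def combine(db_scores, ir_hits, alpha=0.5):
    """Rank IR hits by alpha * DB score of the author + (1 - alpha) * IR score.

    db_scores: {author_name: s}, match quality of the person record.
    ir_hits:   list of (paper, author_name, r), relevance of the paper text.
    """
    ranked = []
    for paper, author, r in ir_hits:
        s = db_scores.get(author, 0.0)  # unknown authors contribute no DB score
        ranked.append((paper, author, alpha * s + (1 - alpha) * r))
    return sorted(ranked, key=lambda hit: hit[2], reverse=True)


# Made-up scores; the slide only names them s1..s3 and r1..r3.
db_scores = {"Aggarwal": 0.9, "Agarwal": 0.6, "Agrawal": 0.5}
# Made-up authorship; the slide does not link people to papers.
ir_hits = [
    ("Sigmod 00 paper", "Aggarwal", 0.7),
    ("PODS 01 papers", "Agarwal", 0.8),
    ("KDD00 paper", "Agrawal", 0.4),
]

ranking = combine(db_scores, ir_hits)
```

Changing `alpha` shifts the balance between trusting the structured match on the person and trusting the text relevance of the paper. Other combiners, such as rank-based merging, would fit the same slot.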
Query Processing
[Figure: two architecture diagrams, each with Query, Chopper & Router, DB, IR, and Result boxes]
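A minimal sketch of what a chopper and router could do: split a query into a structured part for the DB and a keyword part for IR, then send each part to its engine. The `name:value` query syntax and the functions `chop` and `route` are hypothetical and are not taken from the slides.

```python
def chop(query):
    """Split a query into structured predicates and free-text keywords.

    Tokens of the form name:value become predicates; everything else is a keyword.
    """
    predicates, keywords = {}, []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            predicates[name] = value
        else:
            keywords.append(token)
    return predicates, keywords


def route(query, db_search, ir_search):
    """Send predicates to the DB and keywords to the IR engine; skip empty parts."""
    predicates, keywords = chop(query)
    db_hits = db_search(predicates) if predicates else []
    ir_hits = ir_search(keywords) if keywords else []
    return db_hits, ir_hits


# Stand-in engines that just echo what they were asked.
def fake_db(predicates):
    return sorted(predicates.items())


def fake_ir(keywords):
    return list(keywords)


db_hits, ir_hits = route("lastname:Agarwal location:Watson privacy data mining", fake_db, fake_ir)
```

The two result lists would then go to a combiner that merges them into a single ranked result.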
Approaches (1)
Query Languages
– XML-based extensions for queries
  • W3C working group on XQuery considering extension for full text
  • XXL (Weikum), XIRQL (Fuhr)
– Specialized languages for highly structured data (e.g. chemical molecules)?
– Graph-based models & languages (RDF, Protégé – Stanford)
– Extended relational (e.g. SQL/MM)
– Inverse queries on business events
– Reasoning systems
– Statistical approaches (approximate/data mining)
Approaches (2)
Pluses of tight coupling
– Enforcement of ontologies, schemas
– Security, management, query optimization, integrity constraints
Negatives of tight coupling
– Does not address federation issues/autonomy
Pluses of loose coupling
– Flexibility
Negatives of loose coupling
And the dinner bell rings …
Concluding Remarks
We need further discussion on issues and approaches during the rest of the workshop