Integrating Structured & Unstructured Data
Goals
▪ Identify applications with a crucial requirement for integrating unstructured and structured data
▪ Identify key technical issues in integrating unstructured and structured data
▪ Identify potential approaches
Definitions (simplified)
Structured object:
– <oid, {<name, value>}>
Unstructured object:
– <oid, {word}>
– <oid, unknown/complex structure>
Semi-structured object:
– <oid, {<name, value>}, {word}>
– <name, value> pairs may be
• Given (e.g. author, title, etc.)
• Extracted (e.g. Date, Zipcode, etc.)
• Inferred (e.g. Topic)
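The three definitions above can be sketched as plain Python data classes. This is only an illustration: the class names, the `provenance` field, and the `add_pair` helper are hypothetical and do not come from the slides. The {word} part is modelled as a list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StructuredObject:
    oid: str
    attrs: Dict[str, Any] = field(default_factory=dict)   # {<name, value>}


@dataclass
class UnstructuredObject:
    oid: str
    words: List[str] = field(default_factory=list)        # {word}


@dataclass
class SemiStructuredObject:
    oid: str
    attrs: Dict[str, Any] = field(default_factory=dict)   # {<name, value>}
    words: List[str] = field(default_factory=list)        # {word}
    # Hypothetical bookkeeping: how each <name, value> pair was obtained.
    provenance: Dict[str, str] = field(default_factory=dict)


def add_pair(obj: SemiStructuredObject, name: str, value: Any, source: str) -> None:
    """Attach a <name, value> pair and record whether it was given, extracted or inferred."""
    if source not in ("given", "extracted", "inferred"):
        raise ValueError("unknown source: " + source)
    obj.attrs[name] = value
    obj.provenance[name] = source


# Toy object, invented for illustration only.
paper = SemiStructuredObject("p1", words="privacy preserving data mining".split())
add_pair(paper, "title", "Privacy Preserving Data Mining", "given")
add_pair(paper, "Date", "2000", "extracted")
add_pair(paper, "Topic", "privacy", "inferred")
```

Keeping the provenance next to each pair lets a query decide later how much to trust an extracted or inferred value compared with a given one.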
Representative Applications
BPI: messages are unstructured
Web Applications: unstructured pages
Corporate Portals:
DSS combining simulation with a database system
News syndication: author etc. + story
Call centers: customer interaction + structured component of complaint
Mail systems/document systems
Tourist information system
Product catalogs/engineering spec sheets
Patents/chemistry documents
Matching legal documents (with cross-citations) with building codes (representative)
Key Technical Issues
Query language & data model
– Sharp vs fuzzy / complete vs best-effort
– Boolean vs similarity queries (relationship to “value”)
Integration strategies
– Loose vs. tight coupling architectures (many possibilities)
– Search engine into DBMS or DBMS into search engine
– Late & early binding (warehousing vs virtual)
– Integration vs articulation (union vs intersection)
Feature extraction from unstructured data
Role of meta data & integrity constraints
Inconsistency of data sources
– Priority rules for mediation
Management & data organization issues
– Version management, freshness, security
Continuous queries over streams
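To make the "sharp vs fuzzy" and "Boolean vs similarity" distinction in the list above concrete, here is a minimal sketch. The toy documents are invented, and Jaccard overlap is used only as a stand-in similarity measure; the slides do not name one.

```python
def boolean_match(doc_words, required):
    """Sharp, complete answer: the document qualifies only if it has every query word."""
    return set(required) <= set(doc_words)


def jaccard(doc_words, query_words):
    """Fuzzy, best-effort score in [0, 1]: overlap between the two word sets."""
    a, b = set(doc_words), set(query_words)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy documents, invented for illustration only.
docs = {
    "d1": ["privacy", "data", "mining"],
    "d2": ["privacy", "policy"],
    "d3": ["stream", "queries"],
}
query = ["privacy", "mining"]

# Boolean query: an unordered set of exact answers.
sharp = [d for d, words in docs.items() if boolean_match(words, query)]

# Similarity query: every document gets a score, and the result is a ranking.
ranked = sorted(docs, key=lambda d: jaccard(docs[d], query), reverse=True)
```

The Boolean form returns only the exact matches, while the similarity form returns all documents in ranked order. That difference is what makes combining the two kinds of result non-trivial.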
Structured: People(firstname, lastname, company, location)
Semi-structured: Papers(title, {authors}, text)
Unstructured: Reviews
Q1: Reviews of papers by Almaden authors on II
– Search reviews using Join(People.<fn,ln>, Papers.authors).keywords
Q2: Folks in Almaden and Watson working on the same topic
– Join on Papers.text, followed by a join with names in People
Q3: Papers on privacy & data mining by Agarwal in Watson
– Combine ranks of results from People and Papers
Q4: Almaden authors whose papers had negative reviews
– Infer sentiment of a review; interesting joins
Q5: Current research topics in Almaden
– Join People and Papers followed by clustering
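A minimal sketch of the Q1 strategy, assuming in-memory lists for People, Papers and Reviews. All records below are invented toy data. Taking the lower-cased title words as the "keywords" of a paper is an assumption, and the "on II" topic filter is left out to keep the example short.

```python
# Toy tables, invented for illustration only.
people = [
    {"firstname": "Ann", "lastname": "Lee", "company": "ExampleCo", "location": "Almaden"},
    {"firstname": "Bob", "lastname": "Ray", "company": "ExampleCo", "location": "Watson"},
]
papers = [
    {"title": "Federated Query Mediation", "authors": ["Ann Lee"], "text": "..."},
    {"title": "Stream Sampling", "authors": ["Bob Ray"], "text": "..."},
]
reviews = [
    "The federated query mediation paper is thorough.",
    "Stream sampling results are hard to reproduce.",
]


def q1(location):
    """Reviews of papers written by people at the given location."""
    # Structured side: select People by location and build <fn, ln> names.
    names = {p["firstname"] + " " + p["lastname"] for p in people if p["location"] == location}
    # Join(People.<fn,ln>, Papers.authors)
    joined = [pa for pa in papers if names & set(pa["authors"])]
    hits = []
    for pa in joined:
        # .keywords: here simply the lower-cased title words (an assumption).
        keywords = set(pa["title"].lower().split())
        for rv in reviews:
            rv_words = set(rv.lower().replace(".", "").split())
            if keywords <= rv_words:   # unstructured side: keyword search over reviews
                hits.append((pa["title"], rv))
    return hits
```

For example, `q1("Almaden")` returns the first review paired with the "Federated Query Mediation" title.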
Combining Scores
Query: Papers on privacy & data mining by Agarwal in Watson
[Figure: Query → Chopper → DB / IR → Combiner → Result]
▪ DB:
– Aggarwal, Watson, s1
– Agarwal, Almaden, s2
– Agrawal, Almaden, s3
▪ IR:
– Sigmod 00 paper, r2
– PODS 01 papers, r1
– KDD00 paper, r3
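One way the DB and IR scores listed above could be merged is a weighted sum. The slides give only the symbols s1..s3 and r1..r3, so everything concrete below is an assumption made for illustration: the numeric scores, the `wrote` table linking papers to authors, and the weighted-sum rule itself.

```python
# Placeholder numeric scores: the slides give only the symbols s1..s3 and r1..r3.
db_hits = {"Aggarwal": 0.9, "Agarwal": 0.6, "Agrawal": 0.5}                     # s1, s2, s3
ir_hits = {"PODS 01 papers": 0.9, "Sigmod 00 paper": 0.7, "KDD00 paper": 0.4}   # r1, r2, r3

# HYPOTHETICAL link between IR and DB answers; not taken from the slides.
wrote = {
    "PODS 01 papers": "Aggarwal",
    "Sigmod 00 paper": "Agrawal",
    "KDD00 paper": "Agarwal",
}


def combine(db_scores, ir_scores, links, w=0.5):
    """Weighted-sum combiner: score = w * s + (1 - w) * r for each linked pair."""
    merged = []
    for paper, r in ir_scores.items():
        person = links.get(paper)
        if person in db_scores:
            merged.append((paper, person, w * db_scores[person] + (1 - w) * r))
    # Best combined score first.
    return sorted(merged, key=lambda t: t[2], reverse=True)


combined = combine(db_hits, ir_hits, wrote)
```

With `w=1.0` the ranking follows the DB scores alone, and with `w=0.0` it follows the IR scores alone. Choosing `w`, or replacing the sum with another rule, is exactly the open question this slide raises.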
Query Processing
[Figure: Query → Chopper & Router → DB / IR → Result]
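A rough sketch of what a "Chopper & Router" step might do: split a query into a structured part for the DB and a keyword part for the IR engine, then send each part to the matching backend. The `name=value` term syntax and the `chop` and `route` functions are hypothetical; the slides show only the boxes.

```python
def chop(query_terms):
    """Split terms into structured predicates (name=value) and free-text keywords."""
    predicates, keywords = {}, []
    for term in query_terms:
        if "=" in term:
            name, value = term.split("=", 1)
            predicates[name.strip()] = value.strip()
        else:
            keywords.append(term)
    return predicates, keywords


def route(query_terms, db_search, ir_search):
    """Send each part of the query to the matching backend; skip a backend with no work."""
    predicates, keywords = chop(query_terms)
    db_result = db_search(predicates) if predicates else []
    ir_result = ir_search(keywords) if keywords else []
    return db_result, ir_result


# Stub backends that just echo what they were asked (stand-ins for a real DBMS / search engine).
def fake_db(predicates):
    return [("DB", sorted(predicates.items()))]


def fake_ir(keywords):
    return [("IR", keywords)]


db_part, ir_part = route(
    ["lastname=Agarwal", "location=Watson", "privacy", "data mining"], fake_db, fake_ir
)
```

In this sketch the two result lists come back separately; merging them is the Combiner's job.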
Approaches (1)
▪ Query Languages
– XML-based extensions for queries
• W3C working group on XQuery is considering an extension for full text
• XXL (Weikum), XIRQL (Fuhr)
– Specialized languages for highly structured data (e.g. chemical molecules)?
– Graph-based models & languages (RDF, Protégé – Stanford)
– Extended relational (e.g. SQL/MM)
– Inverse queries on business events
– Reasoning systems
– Statistical approaches (approximate/ data mining)
Approaches (2)
▪ Pluses of tight coupling
– Enforcement of ontologies, schemas
– Security, management, query optimization, integrity constraints
▪ Negatives of tight coupling
– Does not address federation issues/autonomy
▪ Pluses of loose coupling
– Flexibility
▪ Negatives of loose coupling
And the dinner bell rings …
Concluding Remarks
▪ We need further discussion on issues and approaches during the rest of the workshop