Lec. 9
May 13, 2010
ISM 158
Information Integration
Instructor: Pankaj Mehra
Teaching Assistant: Raghav Gautam
Enterprise Information
[Figure: Enterprise Information. Sources: inter-application messages, instant messages, e-mail, application, file system, Web content server, database, Web service. Each source is a co- or sub-repository with separate data, metadata & index (1st-level index). Integration Hub: enterprise information schema, distributed query optimizer, central archives, 2nd-level cache, 2nd-level index, 2nd-level metadata. Other labels: Tag, CQL, SQL.]
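The figure labels a 1st-level index at each repository and a 2nd-level index at the Integration Hub. A minimal sketch of how such a two-level lookup could work is given here, assuming a simple term index; the class names `Repository` and `Hub` and the sample documents are hypothetical and not taken from the lecture.

```python
from collections import defaultdict


class Repository:
    """A source repository with its own data and a 1st-level index."""

    def __init__(self, name):
        self.name = name
        self.docs = {}                  # doc_id -> text
        self.index = defaultdict(set)   # term -> doc_ids (1st-level index)

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))


class Hub:
    """Hub that holds only a 2nd-level index: term -> repository names."""

    def __init__(self):
        self.repos = {}
        self.index = defaultdict(set)

    def register(self, repo):
        # Snapshot of the repository's terms at registration time (assumption:
        # a real hub would need to refresh this as the repository changes).
        self.repos[repo.name] = repo
        for term in repo.index:
            self.index[term].add(repo.name)

    def search(self, term):
        # The 2nd-level index selects repositories; each repository then
        # answers from its own 1st-level index.
        hits = {}
        for name in sorted(self.index.get(term.lower(), set())):
            hits[name] = self.repos[name].search(term)
        return hits


mail = Repository("e-mail")
mail.add("m1", "quarterly budget review")
files = Repository("file system")
files.add("f1", "budget spreadsheet")
files.add("f2", "holiday photos")

hub = Hub()
hub.register(mail)
hub.register(files)
budget_hits = hub.search("budget")
photo_hits = hub.search("photos")
```

In this sketch the hub never stores documents, so a query for "photos" contacts only the file-system repository.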
Centralized versus Distributed?
• Distributed systems occur naturally
• State of the art does not allow complex queries or deep analysis
against distributed information
• Centralization may also be favored due to lower costs of infrastructure,
license and labor, as well as due to its ability to enforce tighter
integrity constraints and other information management policies
• Ultimately, the decision needs to take into account issues of ownership
and control
– Technology considerations are often secondary; even so, rational
rules for resolving these considerations exist, as described in
the Distributed Computing Economics paper
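As an illustration of what a rational rule for the centralized-versus-distributed choice might look like, the sketch below compares only the network traffic of the two options. It is a deliberately simplified assumption, not a rule taken from the paper: the function names and all numbers are hypothetical, and it ignores compute, storage, license and labor costs.

```python
def network_bytes_centralized(dataset_bytes, refreshes):
    """Bytes moved if the full dataset is copied to a central site on each refresh."""
    return dataset_bytes * refreshes


def network_bytes_distributed(queries, avg_result_bytes):
    """Bytes moved if each query runs at the source and only results travel."""
    return queries * avg_result_bytes


def prefer_centralized(dataset_bytes, refreshes, queries, avg_result_bytes):
    """True when centralizing moves fewer bytes than querying in place."""
    central = network_bytes_centralized(dataset_bytes, refreshes)
    distributed = network_bytes_distributed(queries, avg_result_bytes)
    return central < distributed


# Hypothetical workload: a 10 GB dataset refreshed 30 times, 1 MB per query result.
few_queries = prefer_centralized(10 * 10**9, 30, 1_000, 10**6)
many_queries = prefer_centralized(10 * 10**9, 30, 1_000_000, 10**6)
```

Under these made-up numbers, a light query load favors leaving the data where it is, while a heavy one favors centralizing it.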
Contrasting Business & Technical Information
[Figure: characteristics of information arranged between the business domain and the technical domain. Labels: SQL schema & query; ad hoc query; steering; dashboards; metadata scaling; real-time information; unstructured sources; central control; central archive; inconsistent information; search federation; schema evolution; complex metadata; simpler data fusion; data mining; pivoting; XML or WS schema & query; ETL; centralized metadata; heavy data processing; simple metadata fusion; stable schemata; visualization; data bandwidth scaling; distributed complex controls; distributed archives; structured sources; file schema & query; deep linguistics; streaming A/V.]
The Guiding Principles
• It is a bad idea to address the following as afterthoughts
– Privacy and security
– Business value
– Scale
– Compliance / auditability
– Information quality
– Availability
– Retention requirements
– Integrity
• The ability to embed function close to data is fundamental to scalable
information processing
• In order to deliver the best performance/$, systems tend to scale out from
the technology sweet spot of the day
• Configure redundancy in from the start, along with mechanisms for early
detection and isolation of faults
• Optimize availability by optimizing recovery
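The last principle can be made concrete with the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below uses hypothetical failure and repair times (they are not from the lecture) to show that halving recovery time improves availability exactly as much as doubling the time between failures.

```python
import math


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours, takes 10 hours to recover.
baseline = availability(1000.0, 10.0)
faster_recovery = availability(1000.0, 5.0)    # halve the recovery time
fewer_failures = availability(2000.0, 10.0)    # double the time between failures

same_gain = math.isclose(faster_recovery, fewer_failures)
```

Only the ratio of MTTR to MTTF matters in this formula, which is why improving recovery is a legitimate route to higher availability.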
Scalable Content Processing
• Enterprise information is complex
• Diversity of information sources and formats
– Entail complex integration and processing flows
– Metadata generation and indexing
– Content indexing
• Protection and security
[Figure labels: connectors; content (e.g. JCR API); data storage; scalable processing; scalable repository.]
Scale out architecture used under cloud information services
Smart Cells
• Scalable distributed system of self-contained, all-inclusive data repositories
Principles
• Scale-out
• Federation
• Intelligence close to data
• Pluggable platforms supporting proprietary and 3rd-party storage services
Example
• Platforms for Information Lifecycle Management services
[Figure: a grid of Smart Cells with a Smart Query Fabric. Labels: Storage: Block, File, Object & Fragment; attribute indexing; content indexing; supported protocols and APIs.]
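A minimal sketch of the scale-out and "intelligence close to data" ideas follows, assuming each cell can evaluate a filter locally so that only matching records leave it. The `SmartCell` and `QueryFabric` classes, their methods and the sample records are hypothetical illustrations, not the API of any product described in the lecture.

```python
class SmartCell:
    """A self-contained repository that runs queries next to its own data."""

    def __init__(self, records):
        self.records = list(records)

    def query(self, predicate):
        # The filter runs inside the cell; only matches are returned.
        return [r for r in self.records if predicate(r)]


class QueryFabric:
    """Fans a query out to every cell and merges the partial results."""

    def __init__(self, cells):
        self.cells = list(cells)

    def add_cell(self, cell):
        # Scale-out: capacity grows by adding cells, not by enlarging one.
        self.cells.append(cell)

    def query(self, predicate):
        results = []
        for cell in self.cells:  # sequential here; a real fabric could run in parallel
            results.extend(cell.query(predicate))
        return results


cell_a = SmartCell([{"id": 1, "size": 50}, {"id": 2, "size": 500}])
cell_b = SmartCell([{"id": 3, "size": 900}])
fabric = QueryFabric([cell_a, cell_b])

large_before = [r["id"] for r in fabric.query(lambda r: r["size"] > 100)]
fabric.add_cell(SmartCell([{"id": 4, "size": 120}]))
large_after = [r["id"] for r in fabric.query(lambda r: r["size"] > 100)]
```

The caller's code does not change when a cell is added, which is the property the scale-out principle is after.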
Considerations in Distributed Information Management
• Information is distributed across heterogeneous sources
and has varied provenance
– Integration
• Information management requires information about
information
– Metadata
• Useful information is timely and findable
– Real-time integration and caching
– Indexing
– Semantic analysis
– Context
Dimensions of Integration
[Figure: Information Integration Methodologies, a diagram of the dimensions along which integration approaches vary.]
• Metadata architecture: centralized, one-level; distributed, one-level; distributed, two-level
• Schema definition language: SQL DDL; XML Schema; WSDL; GGF DAIS; navigable filesystem metadata; navigable repository metadata
• Statefulness: stateless; stateful (local queries on cached data); stateful (distributed query; DQO & intermediate result caching)
• Query processing technique: centralized; distributed, two-pass; distributed, 1-pass, forwarded; distributed, 1-pass, flooded
• Indexing technique: centralized; distributed, one-level; distributed, two-level
• Access mechanism: tap message flow; tap change log; tap streaming data; tap update operations; subscribe to metadata; subscribe to data; triggered crawl; scheduled crawl; proprietary protocol
• Query language: SPARQL; XQuery; SQL DML; proprietary API; search terms
• Options: results caching for multi-step queries; DQO (chaining, referral, recruiting, virtual stored procedures)
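"Tap change log" is one of the access mechanisms named on this slide. The sketch below shows one way such a tap might keep a hub-side copy current, assuming the source exposes an append-only list of (operation, key, value) entries. The `ChangeLogTap` class, the log format and the sample records are assumptions made for illustration.

```python
class ChangeLogTap:
    """Applies a source's change log to a hub-side copy, remembering its position."""

    def __init__(self):
        self.copy = {}     # hub-side copy of the source records
        self.offset = 0    # number of log entries already applied

    def poll(self, log):
        """Apply entries added since the last poll; return how many were applied."""
        new_entries = log[self.offset:]
        for op, key, value in new_entries:
            if op == "delete":
                self.copy.pop(key, None)
            else:  # "insert" or "update"
                self.copy[key] = value
        self.offset = len(log)
        return len(new_entries)


# Hypothetical source log.
log = [
    ("insert", "p1", {"name": "Ann"}),
    ("update", "p1", {"name": "Ann Lee"}),
    ("insert", "p2", {"name": "Bo"}),
]
tap = ChangeLogTap()
first_batch = tap.poll(log)

log.append(("delete", "p2", None))
second_batch = tap.poll(log)
```

Compared with a scheduled crawl, this approach touches only what changed, at the price of requiring the source to expose its log.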
Ecosystem of integration products
• Metadata
– Determines information richness
• Service orientation
– Determines protocol richness
• Future
– Integration as syndication
– Integration aaS
[Figure: products plotted along two axes, metadata and service-orientedness. Labels: JSR 170 ECI (Day); uniform access (MOSS, Attivio); XML-based EII (BEA LiquidData, Mark Logic); SQL-based EII (SAP, Oracle, Composite); RSS-based (NewsGator); pure EAI (Tibco, SAG); WS-based SOA (Microsoft, IBM).]
Points for Discussion in class
• Consider a healthcare patient information scenario.
– Is it mainly transactional or mainly analytic?
– Would you lean toward a distributed (EAI) approach or a centralized one (warehouse)?
• Consider a scenario in which a company wants to drill down into the root causes of customer complaints.
– Again, centralized or distributed?
  • Identifying the root cause
  • Tracking the problem
– Would real-time integration become a requirement?
Points to ponder at home
• Pros of integration
– Connecting the dots
– Single view of …
– Quality control over
• Inconsistency
• Staleness
• Gaps
• Cons of integration
– Loss of context
– Often, read only
– Cost
– Duplication
– Scale
– Losing battle?
– Risk
Where to learn more
• Data Integration: The Relational Logic Approach by Michael Genesereth, Morgan & Claypool Publishers, 2010
Upcoming guest lectures in May
• Dr. V. Galotra, Oracle
– SOA Deep Dive
• Rahul Nim, Efficient Frontier
– Online marketing
Questions?
• NEWS PRESENTATION