Lec. 9, May 13, 2010
ISM 158
Information Integration
Instructor: Pankaj Mehra
Teaching Assistant: Raghav Gautam

Enterprise Information
[Figure: enterprise information integration architecture. Sources: inter-application messages, instant messages, e-mail, application, file system, Web content server, database, Web service. Each source is a co- or sub-repository with separate data, metadata and index (1st-level index). Integration Hub: Enterprise Information Schema, Distributed Query Optimizer, central archives, 2nd-level cache, 2nd-level index, 2nd-level metadata. Other labels: Tag, CQL, SQL.]

Centralized versus Distributed?
• Distributed systems occur naturally
• State of the art does not allow complex queries or deep analysis against distributed information
• Centralization may also be favored due to lower costs of infrastructure, license and labor, as well as the ability of centralized systems to better enforce tighter integrity constraints and other information management policies
• Ultimately, the decision needs to take into account issues of ownership and control
  – Technology considerations are often secondary; even so, rational rules for resolving these considerations exist, as described in the Distributed Computing Economics paper

Contrasting Business & Technical Information
[Figure: chart contrasting Business domain with Technical domain and Structured sources with Unstructured sources. Labels: SQL schema & query; XML or WS schema & query; File schema & query; Ad hoc query; Steering; Dashboards; Metadata scaling; Data bandwidth scaling; Real-time information; Central control; Distributed complex controls; Central archive; Distributed archives; Inconsistent information; Search federation; Schema evolution; Stable schemata; Complex metadata; Centralized metadata; Simpler data fusion; Simple metadata fusion; Heavy data processing; Data mining; Pivoting; ETL; Visualization; Deep linguistics; Streaming A/V.]

The Guiding Principles
• It is a bad idea to address the following as afterthoughts
  – Privacy and security
  – Business value
  – Scale
  – Compliance / auditability
  – Information quality
  – Availability
  – Retention requirements
  – Integrity
• The ability to embed function close to data is fundamental to scalable information processing
• In order to deliver the best performance/$, systems tend to scale out from the technology sweet spot of the day
• Redundancy should be configured in from the start, as well as mechanisms for early detection and isolation of faults
• Optimize availability by optimizing recovery

Scalable Content Processing
• Enterprise information is complex
• Diversity of information sources and formats
  – Entail complex integration and processing flows
  – Metadata generation and indexing
  – Content indexing
• Protection and security
[Figure labels: connectors; content (e.g. JCR API); data storage; scalable processing; scalable repository.]

Scale-out architecture used under cloud information services
[Figure: grid of Smart Cells under a Smart Query Fabric. Labels: Attribute indexing; Content indexing; Supported protocols and APIs; Storage: Block, File, Object & Fragment.]
• Smart Cells: scalable distributed system of self-contained, all-inclusive data repositories
• Principles
  – Scale-out
  – Federation
  – Intelligence close to data
• Pluggable platforms supporting proprietary and 3rd-party storage services
• Example: platforms for Information Lifecycle Management services

Considerations in Distributed Information Management
• Information is distributed across heterogeneous sources and has varied provenance
  – Integration
• Information management requires information about information
  – Metadata
• Useful information is timely and findable
  – Real-time integration and caching
  – Indexing
  – Semantic analysis
  – Context

Dimensions of Integration
[Figure: Information Integration Methodologies, drawn as a set of dimensions around a central hub. The options shown for each dimension, as far as the flattened layout can be recovered:]
• Metadata architecture
  – Centralized, one-level
  – Distributed, one-level
  – Distributed, two-level
• Query processing technique
  – Centralized
  – Distributed, two-pass
  – Distributed, 1-pass, forwarded
  – Distributed, 1-pass, flooded
• Indexing technique
  – Centralized
  – Distributed, one-level
  – Distributed, two-level
• Schema definition language
  – SQL DDL
  – XML Schema
  – WSDL
  – GGF DAIS
  – Navigable filesystem metadata
  – Navigable repository metadata
• Statefulness
  – Stateless
  – Stateful: local queries on cached data
  – Stateful: distributed query; DQO & intermediate result caching
• Access mechanism
  – Tap message flow
  – Tap change log
  – Tap streaming data
  – Tap update operations
  – Subscribe to metadata
  – Subscribe to data
  – Triggered crawl
  – Scheduled crawl
• Query language
  – SPARQL
  – XQuery
  – SQL DML
  – Proprietary API
  – Proprietary protocol
  – Search terms
• Optional results caching for multi-step queries
• Optional DQO (chaining, referral, recruiting, virtual stored procedures)

Ecosystem of integration products
• Metadata
  – Determines information richness
• Service orientation
  – Determines protocol richness
• Future
  – Integration as syndication
  – Integration aaS
[Figure: products plotted by Metadata against Service-orientedness.]
• JSR 170 ECI: Day
• Uniform access: MOSS, Attivio
• XML-based EII: BEA LiquidData, Mark Logic
• SQL-based EII: SAP, Oracle, Composite
• RSS-based: NewsGator
• Pure EAI: Tibco, SAG
• WS-based SOA: Microsoft, IBM

Points for Discussion in class
• Consider a healthcare patient information scenario.
  – Is it mainly transactional or mainly analytic?
  – Would you lean toward a distributed (EAI) approach or a centralized one (warehouse)?
• Consider a scenario in which a company wants to drill down into the root causes of customer complaints.
  – Again, centralized or distributed?
    • Identifying the root cause
    • Tracking the problem
  – Would real-time integration become a requirement?
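
The "distributed, two-level" indexing option with "optional results caching" from the Dimensions of Integration slide can be illustrated with a small sketch. This is not taken from the lecture: the class names `Repository` and `IntegrationHub`, the term-based matching, and the sample documents are all assumptions made for illustration. Each repository keeps its own 1st-level index; the hub keeps only a 2nd-level index that records which repositories mention a term, so a query is forwarded to those repositories instead of being flooded to all of them.

```python
from collections import defaultdict


class Repository:
    """Hypothetical source holding its own data and a 1st-level index."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.index = defaultdict(set)  # 1st-level index: term -> doc ids
        self.queries_served = 0

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, term):
        self.queries_served += 1
        return sorted(self.index.get(term.lower(), set()))


class IntegrationHub:
    """Hypothetical hub with a 2nd-level index and a results cache."""

    def __init__(self, repositories):
        self.repos = {r.name: r for r in repositories}
        self.second_level = defaultdict(set)  # term -> repository names
        self.cache = {}                       # term -> cached result list
        for repo in repositories:
            for term in repo.index:
                self.second_level[term].add(repo.name)

    def query(self, term):
        term = term.lower()
        if term in self.cache:                # optional results caching
            return self.cache[term]
        hits = []
        # Forward only to repositories the 2nd-level index points at.
        for name in sorted(self.second_level.get(term, set())):
            for doc_id in self.repos[name].search(term):
                hits.append((name, doc_id))
        self.cache[term] = hits
        return hits


mail = Repository("email")
mail.add("m1", "Invoice overdue for customer Acme")
files = Repository("files")
files.add("f1", "Acme contract renewal")
files.add("f2", "Quarterly budget")

hub = IntegrationHub([mail, files])
print(hub.query("acme"))    # both repositories are contacted
print(hub.query("budget"))  # only the "files" repository is contacted
print(hub.query("acme"))    # answered from the hub's cache
```

In this sketch the hub's index is built once at construction, so it goes stale when a repository changes; a real design would refresh it through one of the access mechanisms listed on the slide (for example, subscribing to metadata or tapping a change log).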

Points to ponder at home
• Pros of integration
  – Connecting the dots
  – Single view of …
  – Quality control over
    • Inconsistency
    • Staleness
    • Gaps
• Cons of integration
  – Loss of context
  – Often, read only
  – Cost
  – Duplication
  – Scale
  – Losing battle?
  – Risk

Where to learn more
• Data Integration: The Relational Logic Approach by Michael Genesereth, Morgan & Claypool Publishers, 2010

Upcoming guest lectures in May
• Dr. V. Galotra, Oracle – SOA Deep Dive
• Rahul Nim, Efficient Frontier – Online marketing

Questions?
• NEWS PRESENTATION