IBM Research

Impliance: an Information Management Appliance

Bishwaranjan Bhattacharjee
IBM Watson Research Center

Vuk Ercegovac, Joseph Glider, Richard Golding, Guy Lohman, Volker Markl, Hamid Pirahesh, Jun Rao, Robert Rees, Frederick Reiss, Eugene Shekita, Garret Swart
Almaden Research Center

Impliance -- Information Management Appliance © 2002 IBM Corporation

Agenda
  Motivation: Observations / Requirements
  What is Impliance?
  How is Impliance different from…?
  Research opportunities
  Conclusions

Impliance -- Information Management Appliance © 2007 IBM Corporation

After all our successes (and last night’s revelry), it’s easy to become self-congratulatory. Sorry, time for…

Some embarrassing questions:
  Why is most (>80%) of the world’s data still not in databases?
    Didn’t we “solve” this problem in the 1980s with object-relational systems?
  Do you use a database to store your data on your laptop?
    Why not? (You are a database bigot, aren’t you?)
  Have you ever tried to query (with SQL) a database that:
    You didn’t create, and…
    Had more than 500 tables?
  Just how easy is it to incrementally add DB capacity beyond 1 machine? 100 machines?
  Have “self-managing” databases significantly simplified administration?

Observation / Requirements (1 of 5)
  Observation #1: Information converging
    Many types of data in today’s enterprise
      Structured (traditional Data Base)
      Semi-structured (traditional Content Management, XML)
      Unstructured (text, multimedia)
    Each needs a different search interface, today
      SQL
      JSR-170
      Keyword search / Information Retrieval
  Requirement #1: Store / Search / Analyze all data
    Need to rapidly relate information of different types
    With one unified interface!
    Real use cases in paper

Observation / Requirements (2 of 5)
  Observation #2: Awash in data, but not information
    Typical complaint: “I can’t find what I’m looking for!”
    But just finding data isn’t enough!
    Today’s Business Intelligence is too human-intensive
  Requirement #2: Pro-actively derive useful information
    Need to glean more business value from enterprise data
    What sort of analytics exploit unstructured data?
    Need to automatically extract the semantics of text
    A rebirth of data mining?

Observation / Requirements (3 of 5)
  Obs. #3: Total Cost of Ownership (TCO) is paramount
    People costs dominate TCO
      – Hardware often less than 50% of TCO
    Minimize Time To Value
      – Databases take too long to set up!
    Wizards & Advisors simply mask complexity, add brittleness
  Reqmt. #3: System must be simple, robust, & secure
    Sacrifice resource utilization for radical simplification of:
      – Setup / Configuration / Deployment (e.g., Self-Organizing)
      – Operation
    + KISS (you know this one)
    KIWI – Kill It With Iron [Weikum]!
      Example: “Good enough” plans exploiting massive parallelism

Observation / Requirements (4 of 5)
  Observation #4: Data volumes growing fast
    Data is kept longer
    Lots of new kinds of data: RFID, email, photos, videos
    Disk densities improving, but not seek times!
      – 1 TB disk for $399 (Hitachi)
  Requirement #4: Simple & massive scale-out
    1000s of nodes
    With low management overhead
    No single point of failure

Observation / Requirements (5 of 5)
  Obs. #5: Today’s Info. Mgmt. software based upon hardware of 30 yrs. ago
    Example: Update-in-place databases due to expensive disk
    Today: Cheap CPUs, large storage, fast networks
  Requirement #5: Need new (software) architecture
    Opportunity to radically rethink Info. Mgmt. software architecture (Stonebraker: “refactor”), based upon:
      – Hardware economics
        • e.g., cheap (multi-core) CPUs, storage, memory, network
      – Software:
        • Formats (e.g., XML, semi-structured data)
        • Functionality required (e.g., unstructured search, analytics)
      – Specified in the right order:
        • Service requirements → Software → Hardware

What is Impliance?
  Administrator-less:
    Low Time to Value by Self-Organizing
    Low Total Cost of Ownership
  Scalable:
    Massively parallel scale-out…
    …to Petabytes!
  Bundled:
    HW & SW
    Pre-configured
    Pre-tuned
    Limited APIs
  Manage and Search All Data:
    Structured, Semi-Structured, …
    …Even Unstructured Text!
    (Structured Data (Tables), XML, Text)
  Pro-actively Mine Information:
    Glean business insight from data

What Does Impliance Actually Do?
  All enterprise information:
    Stores & Retrieves (Search / Query)
    Composes / Integrates / Mashups
    Finds trends & exceptions (Business Intelligence)

Think of Impliance as…
  Content Management on steroids (beyond JSR-170)
  File System with all content searchable
  Data Warehouse with all your enterprise’s data
    Not just structured information
    Excluding high-rate OLTP (web site)
  A Jambalaya

Where does Impliance fit?
  [Figure: chart with axes “Types of Data” and “Lifetime of Data”. Labels: Structured, Semi-Structured, XML, Unstructured; Transaction, Ingestion, OLTP, Warehousing/OLAP, Archiving; DBMS, Content Management, Archiving Products, Impliance]

How is Impliance related to…
  Google Base?
    Primary data store
    Appliance (product, i.e., sits in customer site), not a Service
    Enterprise, not “the masses”
  DataSpaces / Google “Pay as you go”?
    Primary data store (vs. lazy federation of existing data sources)
    Enterprise, not “the web”
  Database “Appliances” (Netezza, DATAllegro, Greenplum, etc.)?
    Not just structured (relational) data
    Discovery of semantics
    More pro-active

Research Opportunities
  Reducing TCO – Make categories of administration just GO AWAY
    Self-Organizing to obviate database design
    Exploit appliance’s limited externalized interfaces
  New HW & SW architectures using off-the-shelf components
    Achieving fine-grained scale-out
    Targeting robust, “good enough” designs
    Exploiting integration of components
  Data and query models that
    Unify all data, yet are simple
    Tolerate “schema chaos”
    Combine best features of keyword search & SQL
  Automated discovery of
    Data & query semantics for
      Improving precision of queries
      Organizing data adaptively
    Trends, exceptions, etc. (pro-active Business Intelligence)

Conclusions
  We’ve come a long way towards the autonomic dream incorporating all data
    But we can do much more!
  Impliance provides exciting opportunity for DB research
    To lower TCO for information management
    To exploit today’s hardware and software advances
    To rethink information management in a fundamentally new way
  Join us!

Thank You
  [Slide: “Thank You” in Hindi, Thai, Traditional Chinese, Russian, Spanish (Gracias), English (Thank You), Brazilian Portuguese (Obrigado), Arabic, Italian (Grazie), German (Danke), French (Merci), Simplified Chinese, Tamil, Japanese, and Korean]

Appendix

Redefining Information Systems -- Players
  Web 2.0 oriented next generation systems (delivered through services or appliances):
    Google, Yahoo, MSN, (IBM)
      Google Base (a semi-structured/un-structured information base)
      Google OneBox
    NextGen systems built by integration of successful open source (Greenplum)
      Data models: RSS/ATOM/Wiki/…
      Architecture: DB+Search+Content systems (e.g., MySQL+Lucene+Jackrabbit)
  Entrenched HW/Storage/middleware companies
    Storage-driven:
      EMC -- Moving up the value chain, brought in a classic Content system
      IBM – IDS: synergy between classic CM (JCR) and storage
    Server-driven:
      Netezza, DATAllegro (for BI)
      Zantaz (for email compliance)
      DataPower (XSLT filtering)
    Middleware-driven (IBM, Oracle, Microsoft)
      Oracle Secure Enterprise Search

Research Focus 1: Reducing TCO
  Make entire categories of administration JUST GO AWAY
  Reducing time-to-value through new design principles
    Self-organization of “schema chaos” obviates lengthy logical & physical design, REORG
    Fine-grained scale-out (instead of scale-up) obviates need for load balancing, etc.
  New software architecture
    Target robust, highly-predictable, “good enough” utilization (KIWI = Kill It With Iron)
    Componentization
      Each component simple, robust, and adaptive
    Virtual service model
      Service Broker optimizes resources and assigns the workload
  Exploit integrated hardware and storage systems to provide
    Built-in redundancy and availability
    Automated backup and archiving (ILM)
    Easy cluster management
    Schema chaos support at storage level (semantic storage)
    Ability to use new types of grid elements (cell blade server) seamlessly

Research Focus 2: Scalability
  True Grid Model
    Off-the-shelf, commodity hardware
    Dedicate blades to different tasks
  Supports Mixed Workloads
    Analytics, Search, Content, …
  Fine-grained scale-out
    Different blade types scale independently
      Data: storage and simple filtering
      Analytical: aggregation & mining
      Transaction: search, transactional get/put
    From SMB to largest enterprises
  Integrating modern HW & storage, e.g. BC3, intelligent bricks
    Logic pushdown into storage
      Predicate application
      Aggregation
      Redundancy management
  Data+Content+Search+Digital Media
  [Figure: grid architecture diagram. Labels: Transactional Cluster, Analytic Grid, Transaction Blade, Analytic Blade, Commodity Interconnect, Data Array, Data Blade, Data proc, RAID; Xaction Stream, Data Stream, Content Stream, Archive/ILM Stream]

Parallel Run-time: Comparison of Plumbing
  Columns: Platform; Application; Querying model; Parallelism; Fault tolerance; Resource Scheduling
  WS XD; Transactional (composition; no search, no BI); limited; moderate; yes; yes
  DataStage (E2); ETL (streaming) (cleansing, transformation, composition); rich; high; yes; yes
  GPFS; Storage; extremely limited; extremely high; yes; limited
  DB2 ESE with DPF; Analytics for relational; rich; high; yes; yes
  Google Map/Reduce; Analytics for anything (search, transformation, simplistic composition); limited; extremely high; yes; yes
  Impliance; Analytics for anything, Search, Composition; rich; extremely high; yes; yes

[Figure: Impliance software stack. Labels: Applications; content, Relational data, XML, Web page, Video, Archive ILM, …, Objects; JCR, SQL, XSLT, HTTP; Data/Query Modeler (Data Analyzer, Discovery, Query): large-scale computation; Data Modeler; Resource Modeler; SRRS (Scalable Reliable Runtime Support): simple, generic, fault tolerant; DDS (Distributed Data Store): provide reliability; VSCR (Virtual Storage and Computing Resource): commodity HW; Security Control]

Research Focus 3: Information Modeling and Querying
  Simple, rich, unified information model & associated query languages, e.g.
    Google Base approach promising
      Defined typed attributes for navigation
      Defined label for keyword search
    Infosphere, MUSIC
    Open community (RSS / Atom / wiki)
  Automatic schema discovery and integration – self-organizing!
    Integrating solutions from Infosphere, CLIO
  Intelligence discovery
    Automatic discovery of semantics (UIMA, Web Fountain, Avatar)
    Pro-active, continuous mining (vs. passive BI model)
    Contextual information supply
  Including reporting and advanced analytics

Eliminate Admin Tasks… …Rather than adding layers (1 of 3):
  Special-purpose, turn-key appliances for basic services
    vs. today’s general-purpose SW (but still uses off-the-shelf hardware!)
    Bundled, Pre-installed, Pre-configured, Pre-tuned software!
    Examples:
      Information Management appliance
      Web Server appliance
    Minimizes interfaces user has to worry about
      No need to externalize underlying operating system, storage details
    Eliminates need to install, configure, and tune
  Self-organizing data systems
    Automatic discovery of data structure
    Obviates need to
      Define logical and physical schema a priori, reducing time to value
      Migrate schema when organization changes

Eliminate Admin Tasks (2 of 3):
  Universal Data Management
    Today: Plethora of special-purpose data managers:
      Databases for structured data
      Content managers for semi-structured data
      File systems for unstructured data
      For each, very different
        User interfaces (SQL, JSR 170, file interface)
        Degrees of semantic knowledge about the data’s contents
        Degrees of searchability
        Consistency semantics (e.g., transactions) when updated
        Management capabilities and interfaces
    Tomorrow: Single mechanism for managing all data
      Uniform interfaces for all types of data, for
        Searching
        Updating
        Management
      Universal indexing (“Google model”) of all data – default search mechanism
        Plus more precise searching for auto-discovered (above) structured information
      Obviates need to
        Impose naming conventions to find desired data

Eliminate Admin Tasks (3 of 3):
  Robust storage mechanisms to eliminate need for backups
    Never throw out data – keep versions!
      Update-in-place
        Is an anachronism from days of expensive disk
        Increases complexity of transactions
        Jeopardizes compliance requirements (Sarbanes-Oxley)
      Versions permit queries “as of” some time
      Exploits storage density increases (relative to number of disk arms)
    RAID provides local reliability
      Widely accepted and deployed
      Weaver Codes extend to multiple simultaneous failures
    How to provide universal reliability (i.e., against site disasters)?
      Selective, automated replication of new versions?
      Cross-site RAID?
  Universal “Call Home” technology for remote management of
    Monitoring
    Problem determination
    Software maintenance & upgrades

Observation / Requirements
  Information converging: Store / Search / Analyze ALL data
    Structured (traditional Data Base)
    Semi-structured (traditional Content Management, XML, multi-media, call center records)
    Unstructured (text)
    Same advanced functionality required
  Data volume growing fast: On Demand strategy requires massive scale-out
    Lots of new data: RFID, email, photos, videos (Deep Internet-scale systems being built)
    Data is kept longer, due to compliance requirements
  Total Cost of Ownership (TCO) is paramount: System simple & robust (not smart & fragile)
    People costs dominate TCO: Hardware often less than 50% of TCO
    Hence, sacrifice resource utilization for radical simplification
    Delivered in services or appliances
  Today’s IM software based upon hardware of 30 yrs ago: Need new software architecture
    Cheap CPUs, large storage, fast network in hardware
    Opportunity to radically rethink IM software architecture, based upon:
      Hardware economics (e.g., cheap CPUs, storage, memory, & network)
      Data: Formats (e.g., XML, semi-structured data)
      Functionality required (e.g., unstructured search, analytics)

Total Cost of Ownership is the Driver
  Cost of management and administration is outpacing spending on new systems
  [Chart: Spending (US$B, $0 to $160) and Installed base (M Units, up to 35) by year, 1996 to ’08. Series: New server spending (US$M), 3% CAGR; Cost of management and administration, 10% CAGR]
  Source: IDC, On-Demand Enterprises and Utility Computing: A Current Market Assessment and Outlook, IDC #31513, July 2004

Changing Characteristics of Data
  [Figure: three panels, each with axes Actionability, Heterogeneity, and Scale. Panel titles: Transactions and structured data; Text and other human data; Machine-generated and unstructured data]
  Examples shown:
    Seat on an airplane: Easy to find, structured
    LifeScience data - protein folding, gene expression: Difficult to analyze but we know where to look
    Satellite and surveillance data: An infinite space of "patterns"

Impliance: A Highly-Scalable, Rich-Functional Information Management Appliance
  A box with software pre-installed
  How delivered to enterprise: appliance or service
  What functions?
    Store and manage all information
      accept all types of enterprise data
    Deliver all intelligence
      Integrate cross silo information
      Advanced analytics with richer semantics
  What properties?
    Low TCO
      easy to deploy (“plug & play”)
      simple and stable
    Scalability
      From SMB to Very Large (PetaBytes)
      (Not for high-end OLTP!)
  Data+Content+Digital Media
  [Figure: appliance box. Labels: Native content retrieval interface (JCR, SQL, XSLT, HTTP); Native Impliance update/load interface; Relational data, XML, Web page, Video, Archive ILM, …]
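
The “Never throw out data – keep versions!” point on the Eliminate Admin Tasks (3 of 3) slide, together with “Versions permit queries ‘as of’ some time”, can be illustrated with a small sketch. This is only an illustration under stated assumptions, not Impliance’s design: the class name VersionedStore, its put and get_as_of methods, and the use of plain integers as timestamps are all hypothetical. The store never updates in place; each write appends a new version, and a read at time ts returns the latest version written at or before ts.

```python
import bisect
from collections import defaultdict


class VersionedStore:
    """Hypothetical append-only store: writes add versions, nothing is overwritten."""

    def __init__(self):
        # For each key, two parallel lists: ascending timestamps and the values written.
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, key, value, ts):
        """Append a new version of key; earlier versions are kept."""
        times = self._times[key]
        if times and ts < times[-1]:
            raise ValueError("timestamps must be non-decreasing per key")
        times.append(ts)
        self._values[key].append(value)

    def get_as_of(self, key, ts):
        """Return the value visible at time ts, or None if key had no version yet."""
        times = self._times.get(key, [])
        # Index of the first version strictly newer than ts.
        i = bisect.bisect_right(times, ts)
        return self._values[key][i - 1] if i else None


store = VersionedStore()
store.put("seat-12A", "free", ts=1)
store.put("seat-12A", "booked", ts=5)
print(store.get_as_of("seat-12A", 3))  # value as of time 3, before the booking
```

Because old versions are never discarded, an audit-style question such as “what did this record say at time 3?” needs no backup restore, which is the compliance argument the slide makes against update-in-place.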