The Data Lake: A New Solution to Business Intelligence

Agenda
• Cas Apanowicz – An Introduction
• A Little History
• Traditional DW/BI
• What is a Data Lake?
• Why is it better?
• Architectural Reference
• New Paradigm and Architectural Reference
• Future of the Data Lake
• Q&A
• Appendix A

Cas Apanowicz
• Cas is the founder and was the first CEO of Infobright, the first open-source data warehouse company, co-owned by Sun Microsystems and RBC Royal Bank of Canada.
• He is an accomplished IT consultant and entrepreneur, and an internationally recognized IT practitioner who has served as a co-chair and speaker at international conferences.
• Prior to Infobright, Mr. Apanowicz founded Cognitron Technology, which specialized in developing data mining tools, many of which were used in the health care field to assist in customer care and treatment.
• Before Cognitron Technology, Mr. Apanowicz worked in the Research Centre at BCTel, where he developed an algorithm that measured customer satisfaction. At the same time, he worked in the Brain Center at UBC in Vancouver, applying ground-breaking algorithms to the interpretation of brain readings, and offered his expertise to Vancouver General Hospital in applying new technology for the recognition of different types of epilepsy.
• Cas has been designing and delivering BI/DW technology solutions for over 18 years. He has created a BI/DW open-source software company and holds North American patents in this field.
• Throughout his career, Cas has held consulting roles with Fortune 500 companies across North America, including Royal Bank of Canada, the New York Stock Exchange, the Federal Government of Canada, Honda, and many others.
• Cas holds a Master's degree in Mathematics from the University of Krakow.
• He is the author of North American patents and of several publications with renowned publishers such as Springer and Sherbrooke Hospital.
• He is also regularly invited by Springer to peer-review many IT-related publications.

A Little History
• Big Data has received much attention over the past two years; some call it "Ugly Data."
• The challenge is dealing with the "mountains of sand" – hundreds, thousands, and in some cases millions of small, medium, and large data sets which are related, but unintegrated.
• IT is overtaxed and unable to integrate the vast majority of this data.
• A new class of software is needed to discover relationships between related yet unintegrated data sets.

Current BI – BI and Hadoop
Extensive processes and costs:
• Data Analysis
• Data Cleansing
• Entity-Relationship Modeling
• Dimensional Modeling
• Database Design & Implementation
• Database Population through ETL/ELT
• Downstream Application Linkage – Metadata
• Maintaining the processes
[Diagram: source data flowing from the cloud into several analytical databases and data marts]

BI Reference Architecture
[Diagram: the traditional reference architecture – Data Sources (Customer, Product, Location, Promotions, Orders, Supplier, Invoice, ePOS, other structured, unstructured, informational, and external data) feed Data Integration (Extraction, Transformation, Load/Apply, Synchronization, Transport/Messaging, Information Integrity) into Data Repositories (Operational Data Stores, Staging Areas, Data Warehouse, Data Marts), which are consumed by Analytics (Query & Reporting, Data Mining, Modeling, Scorecard, Visualization, Embedded Analytics), Collaboration, and Business Applications through Access channels (Web Browser, Portals, Devices such as mobile, Web Services); Metadata, Data Flow and Workflow, Metadata Management, Security and Data Privacy, System Management and Administration, Network Connectivity, Protocols & Access, Middleware, and Hardware & Software Platforms run across all layers]

BI Reference Architecture – Data Lake
[Diagram: the same reference architecture with Hadoop components substituted – Sqoop for extraction, MapReduce/Pig for transformation, HDFS as the staging area and single source, and HCatalog & Pig for information integrity and metadata management; it can work with most ETL tools on the market]

Key components:
• Extraction – An application used to transfer data, usually from relational databases, to a flat file, which can then be transported to a landing area of a Data Warehouse and ingested into the BI/DW environment.
• Sqoop – A command-line interface application for transferring data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run multiple times to import updates made to a database since the last import. Exports can be used to put data from Hadoop into a relational database.
• MapReduce – A framework for writing applications that process large amounts of structured and unstructured data in parallel across large clusters of machines in a very reliable and fault-tolerant manner.
• Pig – A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for processing these programs.
• HCatalog – A Hadoop metadata repository and management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
• Synchronization – The ETL process takes source data from staging, transforms it using business rules, and loads it into a central repository, the DW. In this scenario, in order to retain information integrity, one has to put in place a synchronization, checks, and correction mechanism.

Data quality
• Current – There is currently no special approach to data quality other than what is embedded in the ETL processes and logic. There are tools and approaches to implement QA & QC.
• Proposed – A more focused approach: while we use HDFS as one big "Data Lake," QA and QC will be applied at the Data Mart level, where the actual transformations occur, hence reducing the overall effort. QA & QC will be an integral part of Data Governance, augmented by the use of HCatalog.

Current BI vs. Proposed BI
[Diagram: Current BI – Source → extract → SFTP → Staging → complex ETL → DW → complex ETL → DM. Proposed BI – Source → Sqoop → HDFS → MapReduce/Pig → DM]
• HDFS as a Single Source – In the proposed solution, HDFS acts as the single source of data, so there is no danger of desynchronization. Inconsistencies resulting from duplicated or inconsistent data will be reconciled with the assistance of HCatalog and proper data governance.
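The map-and-reduce model behind MapReduce and Pig can be illustrated with a small, self-contained sketch in plain Python. This is not Hadoop code; the word-count job and all function names are invented for illustration of the map → shuffle → reduce phases that the framework runs in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs for one input record.
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group all values by key, as the framework does between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

def run_job(records):
    pairs = [p for r in records for p in map_phase(r)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = run_job(["the data lake", "the data warehouse"])
# counts == {"the": 2, "data": 2, "lake": 1, "warehouse": 1}
```

In Hadoop, the map and reduce functions run on many machines at once and the shuffle is performed by the framework; Pig Latin scripts compile down to exactly this kind of job.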
BI Reference Architecture – Data Lake (continued)
[Diagram: the Data Lake reference architecture, with HCatalog providing metadata management and data flow/workflow tracking across the stack]
• HDFS (Hadoop Distributed File System) – A reliable, distributed, Java-based file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
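The role HCatalog plays can be sketched as a shared metadata registry. The following is a hypothetical stand-in in plain Python, not the HCatalog API; the dataset name, HDFS path, and schema are invented for illustration of the idea that any processing tool can look up where a data set lives and what structure its records have.

```python
# Hypothetical metadata registry, standing in for HCatalog:
# one shared place recording location and structure of each data set.
CATALOG = {
    "orders": {
        "location": "/lake/orders",                     # illustrative HDFS path
        "schema": ["order_id", "customer_id", "amount"],  # illustrative fields
    },
}

def read_dataset(name, raw_lines):
    """Apply the registered schema to raw delimited lines at read time."""
    schema = CATALOG[name]["schema"]
    return [dict(zip(schema, line.split(","))) for line in raw_lines]

rows = read_dataset("orders", ["1,42,19.99"])
# rows == [{"order_id": "1", "customer_id": "42", "amount": "19.99"}]
```

Because every consumer reads structure from the same registry, downstream jobs do not hard-code file layouts, which is how HCatalog lets Pig, MapReduce, and other tools share one view of the lake.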
BI Reference Architecture – Comparison (1 of 3)

Capability                           | Current BI                  | Proposed BI                                              | Expected Change
Data Sources                         | Source applications         | Source applications                                      | No change
Extraction from Source               | DB export                   | Sqoop                                                    | One-to-one change
Transport/Messaging                  | SFTP                        | SFTP                                                     | No change
Staging Area Transformations/Load    | Complex ETL code            | None required                                            | Eliminated
Extract from Staging                 | Complex ETL code            | None required                                            | Eliminated
Transformation for DW                | Complex ETL code            | None required                                            | Eliminated
Load to DW                           | Complex ETL, RDBMS          | None required                                            | Eliminated
Extract from DW, transform, load DM  | Complex ETL code & process  | MapReduce/Pig transformations from HDFS to DM            | Simplified
Data Quality, Balance & Controls     | Embedded ETL code           | MapReduce/Pig with HCatalog; can coexist with Informatica | –

BI Reference Architecture – Comparison (2 of 3)

Capability              | Current BI                                                                      | Proposed BI                                                                                             | Expected Change
Operational Data Stores | Additional data store (currently sharing resources with BI/DW)                  | No additional repository; BI consumption implemented through the appropriate DMs                        | Additional data store eliminated
Staging Areas           | Complex schema on an expensive platform; complex design for any new data element | Eliminated; all data is collected in HDFS and available to feed all required Data Marts (DM)            | Eliminated
Data Warehouse          | Complex schema on an expensive platform; complex modeling and design for any new data element | Eliminated; no schema on write – all data is collected in HDFS and available for creation of Data Marts | Eliminated
Data Marts              | Dimensional schema                                                              | Dimensional schema                                                                                      | No change

BI Reference Architecture – Comparison (3 of 3)

Capability | Current BI                                   | Proposed BI                                  | Expected Change
Metadata   | Not implemented                              | HCatalog                                     | Simplified, due to simplified processing and a native metadata management system
Security   | Mature enterprise                            | Mature enterprise                            | Less maintenance, guaranteed by the Cloud provider
Analytics  | WebFocus, MicroStrategy, Pentaho, SSRS, etc. | WebFocus, MicroStrategy, Pentaho, SSRS, etc. | No change
Access     | Web, mobile, other                           | Web, mobile, other                           | No change
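The "no schema on write" point above can be made concrete: in a data lake, raw records land in HDFS unchanged, and structure is imposed only when a data mart is derived. A minimal Python sketch, where the record layout and field names are assumed purely for illustration:

```python
# Schema-on-read: the lake keeps raw lines untouched; each data mart
# applies its own parsing and typing only when it is built.

lake = [
    "2024-01-05,STORE-7,129.50",
    "2024-01-06,STORE-7,80.25",
]  # raw ePOS-style records as landed, with no up-front modeling

def build_sales_mart(raw):
    # Parse and type only the fields this particular mart needs.
    mart = []
    for line in raw:
        date, store, amount = line.split(",")
        mart.append({"date": date, "store": store, "amount": float(amount)})
    return mart

mart = build_sales_mart(lake)
total = sum(row["amount"] for row in mart)
# total == 209.75
```

A new data element simply means new raw lines in the lake; only the marts that care about it need a code change, which is why the table above marks staging and DW schema work as eliminated.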
Business Case
• The client has an internally developed BI component strategically positioned in its BI ecosystem. Cas Apanowicz of IT Horizon Corp. was retained to evaluate the solution.
• The Data Lake approach was recommended, resulting in a total saving of $778,000 and shortening the implementation time from 6 months to 2 months:

Solution Component                             | Traditional/Original | Proposed DW Discovery
Implementation Time                            | 6 months             | 2 months
Cost of Implementation                         | $975,000             | $197,000
Number of Resources Involved in Implementation | 17                   | 4
Estimated Maintenance Cost                     | $195,000             | $25,000

Thank You
• Contact information:
• Cas Apanowicz
• [email protected]
• 416-882-5464
• Questions?