DMBOK2 DWBI New Content
Martin Sykora
2015 April IRMAC Toronto

Agenda
• Introduction
• In the beginning
• Reports
• Data Warehouse – 2 schools
• BI Tools v1
• DMBOK v1
• DMBOK v2 New Content
• Data Vault
• In a Minute
• Drama: Exa Forklift
• Yellow Elephants
• Visualizing Content
• Data Sciences
• Virtualization
• DMBOK v2 Conceptual Architecture

5/22/2017

Introduction
Martin Sykora
• 25 years in data management
• Oracle, SAP, BusinessObjects
• Currently Director, Analytics at NEXJ SYSTEMS
• DMBOK2 DWBI and Big Data Sciences
• Queens Masters of Analytics 2016

In the beginning
• Google – started ten years later
• Apple – first PC with GUI+mouse
• Internet – without www

Reports
Simple process, complex report development
• Sources were read, integrated, translated and aggregated
• Aggregate results were stored in a table
• Table contents were read, subtotalled and sent to the report file or printer

Data Warehouse – 2 schools

Inmon
• Corporate Information Factory
• Normalized tables

Kimball
• Marts that satisfy business process
• Dimensional Data Model
• Conformed Dimensions & Facts

[Diagram labels, Corporate Information Factory: Applications (App), Extract, Integration & Transformation, Raw Detailed Data, Reference Data, Historical Reference Data, ODS, DW, Enterprise Data Model, Data Marts (DM), Op DM, DM Analysis*, Exploratory Analysis*, Operational Analysis, Operational Reports (per App), Operational Reports (integrated)]

[Diagram labels, dimensional architecture: Operational Source Systems, Extract, Data Staging Area, Load, Data Presentation Area, Access, Data Access Tools]
Data Staging Area
• SERVICES: Clean, Combine, Standardize, Conform Dimensions; NO QUERIES
• DATA STORE: Flat Files, Relational Tables, XML Datasets
• PROCESSING: Sorting, Sequencing
Data Presentation Area
• Data Mart #1, Data Mart #2, ... Data Mart #N
• DW BUS, Conformed Dimensions
Data Access Tools
• AD-HOC QUERIES
• REPORT WRITERS
• ANALYTIC APPLICATIONS
• MODELS: Forecasting, Scoring, Data Mining

BI Tools V1
Focus on tool alignment and usage complexity

DMBOKv1
• Data Warehouse loading process (Kimball or Inmon)
• Operational Reporting & Analysis (OLAP)
• Performance Management (Dashboarding & Scorecarding)

Internet Traffic
What Happens in an Internet Minute?
• 5 Exabytes of data transferred monthly
• An Exabyte is a unit of information equal to one quintillion (10^18) bytes
• 5,000,000,000,000,000,000 bytes or 5 x 10^18
• In non-math speak
• A dime is 1.22 mm thick
• 5 Exa of dimes stacked would reach from the Earth to the Sun 40,775 times

Drama: Forklift Load
This slide is only for dramatic effect – no data was harmed during the assembly or provisioning process

Hadoop - The Yellow Elephant
Scalable, Durable, Commodity Hardware

Example
• Branch customer churn e-mail query

Query in place
Hadoop can read many different file types without transformation
• No need to transport the files to a processing or database server
• Hadoop requests are processed with MapReduce jobs

Apache Hive QL Interface
Invoke a SQL-like script
• Creates MapReduce jobs
• Compiled results are returned

MapReduce
Sends the algorithm to the data
• Applied on the best available node-file pair
• Results are then compiled

Viewing Hadoop Content
• Naturally we want to view the data
• Very difficult to infer any relationships
• Graphing requires understanding of the data elements
• But we don't really know what's there
• How can we pick a chart without comprehending the data elements?

Tree Map – Churn: Age by Wealth
• We can immediately see that customers aged 40 or below are leaving
• But why is this happening? Is this a regional issue?

Geographic – Churn by Wealth
• Churn customers are scattered across multiple regions
• Appears age related; could it be the product/service offering?

Box Plot – Churn: Product by Age
• The greatest churn impact is on investment
• Actionable insights from data visualizations

Data Sciences
Visualization is a manual process; data science applies mathematical methods to analyze, process and manage big data results.
Decomposed into the following sections:
• Data Mining: unsupervised learning (Profiling, Data Reduction, Association, Clustering)
• Predictive Analytics: supervised learning (Classification, Decision Trees, CHAID, Regression)
• Advanced Supervised Learning: Ensemble, Neural Networks, Support Vector Machines
Data scientists typically develop, train and manage many algorithmic programs, referred to as models.
The inputs (variables/parameters) vary from model to model, and the computed outcomes have many business consumers.
How do we bring these pieces together?

Virtualization
Virtualization refers to technologies designed to provide a layer of abstraction between computer hardware systems and the software running on them.
Common virtualization technologies
• Server virtualization – a single physical server supplies multiple user environments; ideal for resource optimization
• Database virtualization – multiple copies of a single database image; ideal for testing activities
• Data virtualization – integration of any data from disparate data sources into coherent data services

Data Virtualization

BIW Release Virtualization
Use virtualization to
• Foster agile delivery
• Prove concepts with business
• Materialize only necessary components

DMBOKv2 Conceptual Architecture

Thank You
• DMBOK2: http://www.dama.org/content/body-knowledge
• IRMAC Data Management Education: http://www.irmac.ca/
• NEXJ Customer Data Management: http://www.nexj.com/products/financial-services/enterprise-customer-view/

Questions?
• E-mail: [email protected]
• LinkedIn: ca.linkedin.com/in/martingsykora

Reference
• http://en.wikipedia.org/ Dime_(Canadian_coin)
  Canada. Value, 0.10 CAD. Mass, 1.75 g. Diameter, 18.03 mm. Thickness, 1.22 mm
• Sun distance to Earth: 149,600,000 km
• http://www.masswerk.at/googleBBS/
• http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html
  Annual global IP traffic will surpass the zettabyte (1000 exabytes) threshold in 2016. Global IP traffic will reach 1.1 zettabytes per year or 91.3 exabytes (one billion gigabytes) per month in 2016. By 2018, global IP traffic will reach 1.6 zettabytes per year, or 131.6 exabytes per month.
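
The dime comparison on the Internet Traffic slide can be checked from the reference figures (dime thickness 1.22 mm, Earth to Sun 149,600,000 km). The following is a minimal sketch of that arithmetic, not part of the original deck; the function and constant names are hypothetical, and it assumes "5 Exa of dimes" means 5 x 10^18 coins stacked flat.

```python
# Figures come from the slide and its Reference section; names are illustrative.
DIME_THICKNESS_UM = 1_220        # 1.22 mm expressed in micrometres (keeps integers exact)
EARTH_SUN_KM = 149_600_000       # Earth-to-Sun distance quoted in the references
DIME_COUNT = 5 * 10**18          # assumption: "5 Exa" of dimes = 5 x 10^18 coins

UM_PER_KM = 10**9                # 1 km = 10^9 micrometres


def stack_height_km(count: int, thickness_um: int) -> float:
    """Height of a stack of `count` coins, in kilometres."""
    return count * thickness_um / UM_PER_KM


def earth_sun_multiples(count: int, thickness_um: int, distance_km: int) -> int:
    """Whole number of Earth-to-Sun distances covered by the stack."""
    return (count * thickness_um) // (distance_km * UM_PER_KM)


height_km = stack_height_km(DIME_COUNT, DIME_THICKNESS_UM)
multiples = earth_sun_multiples(DIME_COUNT, DIME_THICKNESS_UM, EARTH_SUN_KM)

print(f"Stack height: {height_km:.3e} km")
print(f"Earth-to-Sun distances: {multiples}")
```

Under these assumptions the stack is about 6.1 x 10^12 km tall, and the whole-number result matches the 40,775 quoted on the slide.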