Handout 12 CS-605 – Spring’17

Handout 12
Data Warehousing and Analytics

• Operational (aka transactional) system – a system that is used to run a business in real time, based on current data; also called a system of record.
• Informational (analytical) system – a system designed to support decision making based on historical point-in-time and prediction data for complex queries or data-mining applications.
  o Collect business operational data.
  o Reduce it to a form that can be used to analyze the behavior of the business.
  o Not limited to databases, but often uses database technology.

Data warehouse (simple definition) – an archival database for decision support.

Operational Databases
• Support day-to-day business operations
• Read/writeable: records may be inserted, updated, deleted
• Not as big as ones used for decision support

Decision Support Databases
• Hold historical information integrated from multiple sources
• Primarily read-only
• Updating limited to
  o Load
  o Refresh
  o (i.e. Inserts, some Deletes, almost never Updates)
• Include a temporal component
• Tend to be very large (especially when storing transaction data)
• Integrity not a big concern
• Usually designed in ad hoc manner
• Queries
  o Often involve complex logical expressions in WHERE
  o Require access to many kinds of facts/business objects, i.e. may require many joins
  o Functionally complex: may involve complex statistical computations
  o Analytically complex: rarely answered in one query

Data Warehouse: A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
  – Subject-oriented: e.g. customers, patients, students, products
  – Integrated: consistent naming conventions, formats, encoding structures; from multiple and heterogeneous organizational data sources
  – Time-variant: can study trends and changes
  – Nonupdatable (nonvolatile): read-only, periodically refreshed

Data Mart:
  – A data warehouse that is limited in scope. Intended for use by a smaller, more specialized group of people.

Creating a Data Warehouse - ETL (Extract, Transform, Load)
• Need to integrate uncoordinated and inconsistent multiple databases in organizations.
• Need to separate operational and informational systems and data to improve performance of data management.

Extract
• Static extract = capturing a snapshot of the source data at a point in time
• Incremental extract = capturing changes that have occurred since the last static extract

Scrub/Cleanse
• Uses pattern recognition and AI techniques to upgrade data quality
• Problems: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• [Figure 9-1 from MDM: Examples of heterogeneous data]
• Establishing standard abbreviations and identifiers, replacing synonyms

Transform and consolidate
• Convert data from format of operational system to format of data warehouse
• Split/combine source records
• Synchronize time information, e.g.:
  o customer - revenue data stored by fiscal quarter
  o customer - salesperson data stored by calendar quarter
  o can’t tell which salesperson is responsible for what part of the customer revenue

Load/Index
• Place transformed data into the warehouse and create indexes
• Move the data
• Initial / Refresh mode: bulk rewriting of target data at periodic intervals
• Check uniqueness constraints
• CPU intensive process, especially if many indices are present – dropping and re-creating indices could help
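
The ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production pipeline: the source rows, the synonym table, the customer_dim table and the function names are all hypothetical, and dropping rows with missing data is just one possible cleansing policy.

```python
import sqlite3

# Hypothetical source rows: (customer_id, name, state, updated_at as ISO date).
SOURCE_ROWS = [
    (1, "Ann Lee", "Calif.", "2017-01-10"),
    (2, "Bob Ray", "CA", "2017-02-03"),
    (2, "Bob Ray", "CA", "2017-02-03"),   # duplicate record
    (3, "Cy Dunn", None, "2017-02-20"),   # missing state
    (4, "Di Fox", "N.Y.", "2017-03-01"),
]

# Assumed standard abbreviations used by the warehouse.
STATE_SYNONYMS = {"Calif.": "CA", "N.Y.": "NY"}


def extract(rows, since=None):
    """Static extract when `since` is None, incremental extract otherwise."""
    if since is None:
        return list(rows)
    # ISO dates compare correctly as strings.
    return [r for r in rows if r[3] > since]


def cleanse(rows):
    """Replace synonyms, drop rows with missing data, remove duplicates."""
    seen, clean = set(), []
    for cid, name, state, updated in rows:
        if state is None:
            continue  # assumed policy: skip incomplete records
        row = (cid, name.strip(), STATE_SYNONYMS.get(state, state), updated)
        if row not in seen:
            seen.add(row)
            clean.append(row)
    return clean


def load(conn, rows):
    """Drop the index, bulk-load the rows, then rebuild the index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_dim "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, state TEXT, updated_at TEXT)"
    )
    conn.execute("DROP INDEX IF EXISTS idx_customer_state")
    conn.executemany("INSERT OR REPLACE INTO customer_dim VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_customer_state ON customer_dim(state)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, cleanse(extract(SOURCE_ROWS)))  # initial load from a static extract
loaded = conn.execute(
    "SELECT customer_id, state FROM customer_dim ORDER BY customer_id"
).fetchall()
incremental = cleanse(extract(SOURCE_ROWS, since="2017-02-15"))
```

The load step drops the index before the bulk insert and re-creates it afterwards, which mirrors the note that index maintenance makes loading CPU intensive.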
Several Common Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational Data Store
• Logical Data Mart and @ctive Warehouse

Generic Two-Level Architecture
• Operational Databases / One company-wide Warehouse
• Benefit: single integrated view of organizational data
• Problem: periodic extraction – data is not completely current in warehouse

Independent Data Mart
• Multiple data marts – mini-warehouses, limited in scope
• No single consolidated warehouse
• Benefits: easier to create than one integrated warehouse
• Problems: redundancy, extra work in ETL for each data mart, potential lack of consistency, complex querying across multiple data marts
• Users of individual marts must themselves provide an integrated view – this is difficult and does not add up to having a single warehouse with well-defined known structure.

Dependent Data Mart and Operational Data Store
• Data loaded
  – from Operational Data Store to single Data Warehouse
  – from Data Warehouse to Data Marts
• Benefits: single ETL – no redundancy

Logical Data Mart and @ctive Warehouse
• Data marts are logical views of the warehouse.
• Works well when data warehouse is not too large.
• Used in e-commerce applications.
• Problems: performance degrades with increasing size of the warehouse
• Benefits: data in marts always current, no redundancy in storage/ETL

Data Warehouse Structure
Star-schema:
• Dimension tables – (often de-normalized for performance reasons) describe major business subjects + Time Period.
• Fact table – an associative entity of the dimensions. Contains factual and quantitative summary data.
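
A star schema of this kind can be sketched with Python's built-in sqlite3 module. The table names, columns and sample values below are made up for illustration; the point is only the shape: three dimension tables with simple integer keys, and one fact table whose key is the combination of the dimension keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: three dimension tables and one fact table.
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    period_key   INTEGER REFERENCES period_dim(period_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO product_dim VALUES (?, ?)", [(1, "Turkey"), (2, "Stuffing")])
conn.executemany("INSERT INTO period_dim VALUES (?, ?, ?)", [(1, 2016, 3), (2, 2016, 4)])
conn.executemany("INSERT INTO store_dim VALUES (?, ?)", [(1, "Boston"), (2, "Albany")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 200.0),
    (1, 2, 1, 50, 1000.0),
    (1, 2, 2, 40, 800.0),
    (2, 2, 1, 30, 90.0),
])

# Sales by product and period, summed over all stores.
rows = conn.execute("""
    SELECT p.description, t.year, t.quarter, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN period_dim  t ON t.period_key  = f.period_key
    GROUP BY p.description, t.year, t.quarter
    ORDER BY p.description, t.year, t.quarter
""").fetchall()

for row in rows:
    print(row)
```

Typical decision-support queries join the fact table to one or more dimensions and aggregate, as the final query does here.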
Examples (From MDM)
Fact table provides statistics for sales broken down by product, period and store dimensions.

Issues:
• Dimension table keys must be surrogate (non-intelligent and non-business related) for the following reasons
  – Object descriptions may change over time, e.g. a decision to change the size of the product with business number 20.
  – Length/format consistency
  – Across multiple organizational databases, the same product may have different identification numbers/primary keys
• Granularity of Fact Table – what level of detail do you want?
  – Transactional grain – finest level – enter every transaction into warehouse
  – Aggregated grain – more summarized – enter just summary data
  – Finer grain => better analysis capability
  – More dimension tables => more rows in fact table
• Modeling dates:

Technologies

Data Mining
Knowledge discovery using a blend of statistical, AI, and computer graphics techniques
  – Explain observed events or conditions
    o why sudden increase in turkey sales?
  – Confirm hypotheses
    o do turkey sales increase in November?
    o do more students take Literature courses as sophomores than juniors?
  – Explore data for new or unexpected relationships
    o what else are the customers that buy turkeys in November likely to buy?
    o which group of customers is likely to be interested in a product?

Data visualization – representing data in graphical/multimedia formats for analysis. Often used in conjunction with data mining. Helps identify trends and patterns.

Big Data
  - evolving term
  - usually refers to voluminous amounts of structured, semi-structured and unstructured data
  - can be mined for information

Analytics
  o Systematic analysis and interpretation of data, typically using mathematical, statistical, and computational tools, to improve our understanding of a real-world domain.
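
As a very small illustration of the exploratory question raised under Data Mining above (what else are customers who buy turkeys likely to buy?), the sketch below counts co-occurrences in shopping baskets. The baskets are invented for the example and the bought_with function is a hypothetical helper; real data mining tools use far more elaborate techniques.

```python
from collections import Counter

# Made-up baskets for illustration only; each set is one customer transaction.
BASKETS = [
    {"turkey", "stuffing", "cranberries"},
    {"turkey", "stuffing"},
    {"turkey", "gravy"},
    {"bread", "milk"},
    {"stuffing", "milk"},
]


def bought_with(item, baskets):
    """For each other product, the share of `item` baskets that also contain it."""
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return {}
    counts = Counter(p for b in with_item for p in b if p != item)
    return {p: n / len(with_item) for p, n in counts.items()}


shares = bought_with("turkey", BASKETS)

# Print the products most often bought together with turkey first.
for product, share in sorted(shares.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{product}: {share:.2f}")
```

Note that this needs transactional grain: once the warehouse stores only aggregated sales, the basket-level relationship can no longer be recovered.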
Big data characteristics
• The Five Vs of Big Data
  – Volume – much larger quantity of data than typical for relational databases
  – Variety – lots of different data types and formats
  – Velocity – data comes at very fast rate (e.g. mobile sensors, web click stream)
  – Veracity – traditional data quality methods don’t apply; how to judge the data’s accuracy and relevance?
  – Value – big data is valuable to the bottom line, and for fostering good organizational actions and decisions
- Schema on Read, rather than Schema on Write
  o Schema on Write – preexisting data model, how traditional databases are designed (relational databases)
  o Schema on Read – data model determined later, depends on how you want to use it
  o Capture and store the data, and worry about how you want to use it later
- Data Lake
  o A large integrated repository for internal and external data that does not follow a predefined schema
  o Capture everything, dive in anywhere, flexible access

NoSQL = Not Only SQL databases
• A category of recently introduced data storage and retrieval technologies not based on the relational model
• Supports schema on read
• Largely open source
• BASE – basically available, soft state, eventually consistent
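
The schema-on-read idea described above can be sketched as follows. This is a toy example under assumptions: the raw records, their field names and the read_with_schema helper are all hypothetical, and a real data lake would hold files or object storage rather than a Python list.

```python
import json

# Raw events captured as-is, with no predefined schema
# (hypothetical click-stream and sensor records).
RAW_LINES = [
    '{"type": "click", "user": "u1", "url": "/home", "ts": 1}',
    '{"type": "sensor", "device": "d7", "temp_c": 21.5, "ts": 2}',
    '{"type": "click", "user": "u2", "url": "/cart", "ts": 3, "referrer": "/home"}',
]


def read_with_schema(lines, record_type, fields):
    """Apply a schema at read time: keep one record type, project the wanted fields."""
    out = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("type") == record_type:
            # Fields a record does not carry come back as None.
            out.append({f: rec.get(f) for f in fields})
    return out


# Two different "schemas" imposed on the same stored data.
clicks = read_with_schema(RAW_LINES, "click", ["user", "url", "referrer"])
temps = read_with_schema(RAW_LINES, "sensor", ["device", "temp_c"])

print(clicks)
print(temps)
```

With schema on write, the two record types would have needed separate, predefined tables before any data could be stored; here the structure is decided only when a question is asked.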