Data Warehousing: Introduction
Data Warehouse: Architecture

What is a Data Warehouse?
• A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.
• It usually contains historical data derived from transaction data, but it can include data from other sources.
• It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

Data Warehouse Environment
In addition to a relational database, a data warehouse environment includes:
• An extraction, transportation, transformation, and loading (ETL) solution
• An online analytical processing (OLAP) engine
• Client analysis tools and other applications that manage the process of gathering data and delivering it to business users

Characteristics of a Data Warehouse
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
• Subject Oriented
• Integrated
• Nonvolatile
• Time Variant

Subject Oriented
• Data warehouses are designed to help you analyze data.
• For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?"
• This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated
• Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format.
• They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile
• Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
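As a minimal sketch of the nonvolatile idea above, the following Python example appends each load as new rows instead of overwriting earlier ones, so past states remain available for analysis. The table name `price_history`, its columns, and the helper `load_snapshot` are hypothetical names chosen for illustration; they do not come from any particular warehouse product.

```python
import sqlite3

def load_snapshot(conn, load_date, rows):
    """Append one batch of source rows; existing warehouse rows are never updated."""
    conn.executemany(
        "INSERT INTO price_history (product_id, price, load_date) VALUES (?, ?, ?)",
        [(product_id, price, load_date) for product_id, price in rows],
    )
    conn.commit()

# In-memory database standing in for the warehouse (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE price_history (product_id INTEGER, price REAL, load_date TEXT)"
)

# Two loads: the price changed in the source system between them.
load_snapshot(conn, "2024-01-01", [(1, 9.99)])
load_snapshot(conn, "2024-02-01", [(1, 11.49)])

# Both values are still present, so the change over time can be analyzed.
history = conn.execute(
    "SELECT load_date, price FROM price_history "
    "WHERE product_id = 1 ORDER BY load_date"
).fetchall()
print(history)
```

An OLTP system would typically run an UPDATE here and keep only the current price; the append-only load is what keeps the history.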
Time Variant
• In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive.
• A data warehouse's focus on change over time is what is meant by the term time variant.

Contrasting OLTP and Data Warehousing Environments
Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
• Workload
• Data modifications
• Schema design
• Typical operations
• Historical data

Workload
• Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
• OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

Data Modifications
• A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
• In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

Schema design
• Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
• OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.

Typical operations
• A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."
• A typical OLTP operation accesses only a handful of records.
For example, "Retrieve the current order for this customer."

Historical data
• Data warehouses usually store many months or years of data. This is to support historical analysis.
• OLTP systems usually store data from only a few weeks or months. The OLTP system stores historical data only as needed to successfully meet the requirements of the current transaction.

Data Warehouse Applications
• As discussed before, a data warehouse helps business executives organize, analyze, and use their data for decision making. A data warehouse serves as part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields:
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing

Strategic uses of data warehousing
Industry | Functional areas of use | Strategic use
Airline | Operations; marketing | Crew assignment, aircraft deployment, mix of fares, analysis of route profitability, frequent flyer program promotions
Banking | Product development; operations; marketing | Customer service, trend analysis, product and service promotions, reduction of IS expenses
Credit card | Product development; marketing | Customer service, new information service, fraud detection
Health care | Operations | Reduction of operational expenses
Investment and insurance | Product development; operations; marketing | Risk management, market movements analysis, customer tendencies analysis, portfolio management
Retail chain | Distribution; marketing | Trend analysis, buying pattern analysis, pricing policy, inventory control, sales promotions, optimal distribution channel
Telecommunications | Product development; operations; marketing | New product and service promotions, reduction of IS budget, profitability analysis
Personal care | Distribution; marketing | Distribution decisions, product promotions, sales decisions, pricing policy
Public sector | Operations | Intelligence gathering

Functions of Data Warehouse Tools and Utilities
• Data Extraction: gathering data from multiple heterogeneous sources.
• Data Cleaning: finding and correcting errors in the data.
• Data Transformation: converting the data from legacy format to warehouse format.
• Data Loading: sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
• Refreshing: propagating updates from the data sources to the warehouse.

Disadvantages of data warehouses
• Data warehouses are not the optimal environment for unstructured data.
• Because data must be extracted, transformed, and loaded into the warehouse, there is an element of latency in data warehouse data.
• Over their life, data warehouses can have high costs. Maintenance costs are high.
• Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.
• There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed. Or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.
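The tool functions listed earlier (extraction, cleaning, transformation, loading) can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two in-memory "sources", their field names (`cust`, `amount_usd`, `customer_name`, `amount_cents`), and the `sales` table are all hypothetical, and a real warehouse would use dedicated ETL tooling rather than hand-written functions.

```python
import sqlite3

# Hypothetical source data: two systems with different field names and units.
SOURCE_A = [
    {"cust": " alice ", "amount_usd": "120.50"},
    {"cust": "bob", "amount_usd": None},  # missing amount: a data error
]
SOURCE_B = [
    {"customer_name": "Alice", "amount_cents": 3000},
]

def extract():
    """Extraction: gather raw rows from every source, tagged with their origin."""
    return [("A", r) for r in SOURCE_A] + [("B", r) for r in SOURCE_B]

def clean(tagged_rows):
    """Cleaning: drop rows whose amount is missing."""
    amount_key = {"A": "amount_usd", "B": "amount_cents"}
    return [(src, r) for src, r in tagged_rows if r[amount_key[src]] is not None]

def transform(tagged_rows):
    """Transformation: convert each source format to (customer, amount in USD)."""
    out = []
    for src, r in tagged_rows:
        if src == "A":
            out.append((r["cust"].strip().title(), float(r["amount_usd"])))
        else:
            out.append((r["customer_name"].strip().title(), r["amount_cents"] / 100))
    return out

def load(conn, rows):
    """Loading: sort the rows, insert them, and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount_usd REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", sorted(rows))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_customer ON sales (customer)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))

# A typical warehouse-style query: summarize over all loaded rows.
totals = conn.execute(
    "SELECT customer, SUM(amount_usd) FROM sales GROUP BY customer"
).fetchall()
print(totals)
```

Note how the transformation step also performs the integration described earlier: the naming conflict (`cust` versus `customer_name`) and the unit mismatch (dollars versus cents) are resolved before loading.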
Data Warehousing Typology
• The virtual data warehouse: the end users have direct access to the data stores, using tools enabled at the data access layer.
• The central data warehouse: a single physical database contains all of the data for a specific functional area.
• The distributed data warehouse: the components are distributed across several physical databases.

The architecture
[Figure: Typical architecture of a data warehouse. Components shown: operational data sources 1 to n; operational data store (ODS); load manager; warehouse manager; query manager; DBMS holding meta-data, detailed data, lightly summarized data, and highly summarized data; archive/backup data; end-user access tools (reporting, query, application development, and EIS (executive information system) tools; OLAP (online analytical processing) tools; data mining).]

The main components
• Operational data sources: The data in the DW is supplied from mainframe operational data sources such as hierarchical and network databases, proprietary file systems, private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
• Operational data store (ODS): A repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
• Load manager: Also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.
• Warehouse manager: Performs all the operations associated with the management of the data in the warehouse.
The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
• Query manager: Also called the backend component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.
• Detailed, lightly and highly summarized data; archive/backup data
• Meta-data
• End-user access tools: These can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.

Data Warehousing - Schemas
• A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse also requires a schema to be maintained.

Fact table
• In data warehousing, a fact table consists of the measurements, metrics, or facts of a business process. It is located at the center of a star schema or a snowflake schema, surrounded by dimension tables. Where multiple fact tables are used, these are arranged as a fact constellation schema.

Database schema for a data warehouse
• Star schema
• Snowflake schema

Star Schema
• Each dimension in a star schema is represented with only one dimension table.
• This dimension table contains the set of attributes.
• The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
[Figure: star schema for sales data with the dimensions time, item, branch, and location]
• Each dimension has only one dimension table and each table holds a set of attributes.
For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, the cities "Vancouver" and "Victoria" are both in the Canadian province of British Columbia, so the entries for these cities repeat the same values for the attributes province_or_state and country.

Snowflake Schema
• Note: Due to normalization in the snowflake schema, redundancy is reduced; the schema therefore becomes easier to maintain and saves storage space.

Criterion | Snowflake Schema | Star Schema
Ease of maintenance / change | No redundancy, so snowflake schemas are easier to maintain and change. | Has redundant data and hence is less easy to maintain/change.
Ease of use | More complex queries and hence less easy to understand. | Lower query complexity and easy to understand.
Query performance | More foreign keys and hence longer query execution time (slower). | Fewer foreign keys and hence shorter query execution time (faster).
Type of data warehouse | Good to use for the data warehouse core to simplify complex relationships (many:many). | Good for data marts with simple relationships (1:1 or 1:many).
Joins | Higher number of joins. | Fewer joins.
Dimension table | A snowflake schema may have more than one dimension table for each dimension. | A star schema contains only a single dimension table for each dimension.
When to use | When the dimension table is relatively big in size, snowflaking is better as it reduces space. | When the dimension table contains fewer rows, a star schema can be chosen.
Normalization / de-normalization | Dimension tables are in normalized form but the fact table is in denormalized form. | Both dimension and fact tables are in denormalized form.

Queries?
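To make the star versus snowflake comparison concrete, here is a small sketch using Python's built-in `sqlite3` module. It models only the location dimension, using the attribute set given in the slides; the table names (`star_location`, `snow_location`, `snow_city`, `sales_fact`), the measure `dollars_sold`, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized table for the location dimension.
CREATE TABLE star_location (
    location_key INTEGER PRIMARY KEY,
    street TEXT, city TEXT, province_or_state TEXT, country TEXT
);
-- Snowflake: the same dimension normalized into two tables.
CREATE TABLE snow_city (
    city_key INTEGER PRIMARY KEY,
    city TEXT, province_or_state TEXT, country TEXT
);
CREATE TABLE snow_location (
    location_key INTEGER PRIMARY KEY,
    street TEXT,
    city_key INTEGER REFERENCES snow_city(city_key)
);
-- Fact table shared by both variants (measure name is an assumption).
CREATE TABLE sales_fact (location_key INTEGER, dollars_sold REAL);
""")

# Star: province and country are repeated on every location row.
conn.executemany("INSERT INTO star_location VALUES (?, ?, ?, ?, ?)", [
    (1, "1 Main St", "Vancouver", "British Columbia", "Canada"),
    (2, "2 Oak St", "Vancouver", "British Columbia", "Canada"),
    (3, "3 Bay St", "Victoria", "British Columbia", "Canada"),
])
# Snowflake: each city's province and country are stored once.
conn.executemany("INSERT INTO snow_city VALUES (?, ?, ?, ?)", [
    (1, "Vancouver", "British Columbia", "Canada"),
    (2, "Victoria", "British Columbia", "Canada"),
])
conn.executemany("INSERT INTO snow_location VALUES (?, ?, ?)", [
    (1, "1 Main St", 1), (2, "2 Oak St", 1), (3, "3 Bay St", 2),
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", [
    (1, 100.0), (2, 50.0), (3, 25.0),
])

# Star query: a single join from the fact table to the dimension table.
star_result = conn.execute("""
    SELECT l.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN star_location l ON f.location_key = l.location_key
    GROUP BY l.city ORDER BY l.city
""").fetchall()

# Snowflake query: the same question needs one more join.
snow_result = conn.execute("""
    SELECT c.city, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN snow_location l ON f.location_key = l.location_key
    JOIN snow_city c ON l.city_key = c.city_key
    GROUP BY c.city ORDER BY c.city
""").fetchall()

print(star_result)
print(snow_result)
```

Both queries return the same totals per city. The star variant stores the city, province, and country on three rows and answers with one join; the snowflake variant stores them on two rows and answers with two joins, which matches the trade-off between redundancy and join count summarized in the comparison.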