Download Chapter 4-021112

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION Chapter 4 Data Warehouse Implementation 4.1 Data Warehouse Scheme i) Star scheme is the multidimensional view of data that is expressed using relational database semantics provided by the database schema. It is organized around a central table (fact table) joined to a few smaller tables (dimension table) using foreign key references. ii) There are three basic elements in the dimensional model projected by star scheme; (1) a central table called the fact table, (2) two or more smaller tables called the dimensional tables, and (3) and a set of relationships called the star join. Figure 1: Components of star scheme The basic premise of star schema can be divided into : i) facts – core data elements being analyzed. Fact tables contains raw numeric items that represent relevant business facts such as units of individual items sold, dollars sold and dollars cost. Facts table are presummarized and aggregated along business dimensions. ii) dimensions – attributed about the facts. Dimension table defines business dimensions contains a non compound primary key. Example of dimensions tables are the product purchased and the period of purchase. Dimensions table typically represents the majority of data elements and are heavily indexed. Page 1 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION Figure 2: Star schema example (sales) Fact table provides statistics for sales broken down by product, period and store dimensions - In the typical star schema, the fact table is much larger than any of its dimensions tables. This point becomes an important consideration of the performance issues associated with star schemas. - In a typical business analysis problem: Find the share of total sales represented by each product in different markets, categories and periods, compared with the same period a year ago. To do so, you could calculate the percentage – each number is of the total of its column, a simple and common concept. - However, in a classic relational database these calculations and display would require definition of a separate view, requiring over 20 SQL commands. The star schema is designed to overcome this limitation of the 2 dimensional relational models. - A star schema can be created for every industry – consumer packaged goods, retail, telecommunications, transportation, insurance, health care, manufacturing, banking etc. Page 2 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION Figure 3: Star schema with sample data 4.1.1 Potential performance problems with star schemas Although most experts agree that star schemas are the preferred modeling method for data warehousing, there are numbers of issues associated with its implementation: i) Indexing problem - requires multiple metadata definitions adds to design complexity and slowness in performance - addition and deletion of levels in the hierarchy will require physical modification of the affected table, which is a time-consuming process that limits flexibility - increases size of index, thus impacting both performance and scalability. * solution – to make sure for one dimensions only have one key column ii) Level indicator problem iii) Pair wise join problem iv) Star schema join problem 4.2 Data segregation strategy i) When a data warehousing project is first initiated, it may have a mixture of operational and analytical/informational objectives. This mixture is a recipe for disaster. Redefine the project to concentrate on non-operational, informational Page 3 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION needs only. The primary reason for the existence of the data warehouse in the first place is to segregate operational processing from reporting. ii) Segregation also refers to partitioning the disk to implement the physical design. iii) The level of dimension may support a partial order or a total order and can be viewed via a directed path, a hierarchy, or a net. Figure 4: Segregation of dimensions i) Product Dimension Company | Product Type | Product ii) Time Dimension iii) Location Dimension Year Planet Month Season | Day Country Continent Hour AM/PM | Minute | Second State Region Zip Country code City 4.3 Aggregation i) Aggregation is the process of calculating summary data from detail records. It is often tempting to reduce the size of fact tables by aggregating data into summary records when the fact table is created. ii) However, when data is summarized in the fact table, detailed information is no longer directly available to the analyst. If detailed information is needed, the detail rows that were summarized will have to be identified and located, possibly in the source system that provided the data. iii) Fact table data should be maintained at the finest granularity possible. Aggregating data in the fact table should only be done after considering the consequences. iV) Mixing aggregated and detailed data in the fact table can cause issues and complications when using the data warehouse. For example, a sales order often contains several line items and may contain a discount, tax, or shipping cost that is applied to the order total Page 4 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION instead of individual line items, yet the quantities and item identification are recorded at the line item level. v) Summarization queries become more complex in this situation, and tools such as Analysis Services often require the creation of special filters to deal with the mixture of granularity. vi) There are two approaches that can be used in this situation (aggregation). One approach is to allocate the order level values to line items based on value, quantity, or shipping weight. Another approach is to create two fact tables, one containing data at the line item level, the other containing the order level information. vii) The order identification key should be carried in the detail fact table so the two tables can be related. The order table can then be used as a dimension table to the detail table, with the order-level values considered as attributes of the order level in the dimension hierarchy. As mentioned in chapter 3, there are four basic steps in designing data warehouse from data warehouse scheme: 1. Choosing the data mart for the small group of end users we deal with. 2. Fact table granularity (the smallest defined level of data in the table) is determined. 3. Fact table dimensions are selected. 4. Determine the facts for the table. In most cases, the granularity is at the transaction level, so the fact is the amount. 4.4 Data Distribution and Marketing ( Berson pg 294) - Sometimes the shape of the distribution of data can be calculated by an equation rather than just represented by the histogram. This is what is called a data distribution. Like a histogram, a data distribution can be described by a variety statistics. Page 5 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION - Many data distributions are well described by just two numbers, the mean and the variance. The mean is something most people are familiar with; the variance however, can be problematic. The easiest way to think about it is that it measures the average distance of each predictor value from the mean value all the records in the database. If the variance is high, the values are all over the place and very different. If the variance is low, most of the data values are fairly close to the mean. Data distribution can project the trend of marketing and this trend of marketing can be used to project the expected trend and action to be taken to stimulate the expected trend. 4.5 Physical Design (Source : http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96520/physical.htm) 4.5.1 Moving from Logical to Physical Design i) Logical design is what you draw before building your warehouse (usually using ERD or Enterprise Modeling for Data warehouse). Physical design is the creation of the database with SQL statements. It is a representation of actual database structure to be built, such as the tables, columns and relationship. ii) During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure. Physical design decisions are mainly driven by query performance and database maintenance aspects. For example, choosing a partitioning strategy that meets common query requirements enables partition pruning, a way of narrowing a search before performing it. iii) During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The Page 6 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION unique identifier (UID) distinguishes between one instance of an entity and another. Figure 5: Logical Design Compared with Physical Design in ER Model iv) During the physical design process, you translate the expected schemas into actual database structures. At this time, you have to map Entities to tables: - Relationships to foreign key constraints - Attributes to columns - Primary unique identifiers to primary key constraints - Unique identifiers to unique key constraints v) During database design we concentrate on ‘streamlining” the logical data model into a workable physical data model. We may even decide to build two databases – in which case we should spawn two physical data models from one logical data model. The process of streamlining is called denormalization, a regrouping of data that often involves collapsing two or more entities into one table. Page 7 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION Further ref: http://www.rdbprime.com/Oracle/Oracle_Docs/Oracle10gDB_Server/server.101/b10736/physical.htm http://www.microsoft.com/technet/prodtechnol/sql/2000/reskit/part5/c1761.mspx?mfr=true http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.7893 Exercise : 1. Define star schema. (5 marks) 2. Star schema is considered logical design or physical design? (1 mark) Explain your answer. (4 marks) 3. Every DM is composed of three basic elements: a central table/fact table; the dimensional tables; and the set of relationships. Describe the three elements. (6 marks) 4. The star scheme consists of various interconnect elements. Draw the star scheme and label all the elements. (10 Marks) 5. In your lab, develop the star schema with sample data as per Figure 3. Page 8 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION 6. Create star schema from Batta case study below (10 marks) and develop the schema as your lab exercise: Batta is one of the world's leading footwear retailers and manufacturers with operations across five continents. Batta's strength lies in its worldwide presence. While local companies are self-governing, each one benefits from its link to the international organization for backoffice systems, product innovations and sourcing. Although Batta operates in a wide variety of markets, climates and buying power Batta companies share the same leadership points. Two important ones are product concept development and constant improvement of business processes in order to offer customers great value and the best possible service. Batta is interested in building a data warehouse to analyze their retail sales worth 1.5 billion and quantity of shoes they sold. They would like to have a detail of the analysis by retailers and suppliers on each of their product and at specific time when the sales occurs. They have product categories for suppliers that are: expensive, mid-range and low-cost. Their product can be categorized into permanent goods, seasonal and temporary existence. Batta shoes come in different sizes, colors and material combinations of boots, sandals, sneakers, slippers and shoes accessories. Tips: Refer Chapter 3.3.3 Designing a DW Fact Tables Marking scheme : ½ mark for each dimension table. (Max 2 Marks) 1 mark for complete attributes in each table ( Max 4 Marks) ½mark for each connection from fact table to dimension table. ( Max 2 Marks) ½ mark for each foreign key underlined and ½ mark for primary key underlined. (Max 2 Marks) Page 9 of 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION 7. You are required to draw a star schema for ‘Snow Boutique Holdings’ considering all information given by the following paragraph. (10 marks) An international winter clothing boutique franchise company is interested in data warehouse implementation by executing a data mart in their holding company. The CEO of ‘Snow Boutique’ said that he want to have analysis of the sales for his entire store to support their strategic marketing plan. He wants to know how many products they have sell in their stores for specific period of time such as by day, month, quarter and year. He also wants to have the figure of sales qty, amount and the cost for particular store, product and period. The product of their stores such as sweater, shoes and gloves are varies by color and size and the store is located per city which is under responsibility of a manager contactable by telephone number. 8. Design Star Scheme for Pharmacy Care Systems. (10 marks) Pharmacy Care Systems allows pharmacists to analyze the large amounts of clinical and administrative data related to pharmacy order. The requirements: Director of pharmacy want to identify the potential overuse of certain drugs in the formulary. Specifically, he want to know when and which physician are overprescribing and patients who are receiving these drugs. He can enter a few key variables into the computer and produce a list of overused drugs, the characteristics of the patients that received them, and the characteristics of the physicians that prescribed them. A clinical coordinator want to know the cost of care and final outcome for each patient that received a clinical consultation from your pharmacists. You enter the names of your pharmacists in the computer and a list of patients, their final outcomes and total costs of pharmacy appears. Page 10 of 10

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Chapter 4-021112