Download Chapter 4-021112

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
Chapter 4
Data Warehouse Implementation
4.1 Data Warehouse Scheme
i)
Star scheme is the multidimensional view of data that is expressed using relational
database semantics provided by the database schema. It is organized around a central
table (fact table) joined to a few smaller tables (dimension table) using foreign key
references.
ii)
There are three basic elements in the dimensional model projected by star scheme;
(1) a central table called the fact table,
(2) two or more smaller tables called the dimensional tables, and
(3) and a set of relationships called the star join.
Figure 1: Components of star scheme
The basic premise of star schema can be divided into :
i) facts – core data elements being analyzed. Fact tables contains raw numeric items
that represent relevant business facts such as units of individual items sold, dollars
sold and dollars cost. Facts table are presummarized and aggregated along business
dimensions.
ii) dimensions
–
attributed
about
the
facts.
Dimension
table defines business
dimensions contains a non compound primary key. Example of dimensions tables are
the product purchased and the period of purchase. Dimensions table typically
represents the majority of data elements and are heavily indexed.
Page 1 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
Figure 2: Star schema example (sales)
Fact table provides
statistics for sales broken
down by product, period
and store dimensions
- In the typical star schema, the fact table is much larger than any of its dimensions
tables. This point becomes an important consideration of the performance issues
associated with star schemas.
- In a typical business analysis problem: Find the share of total sales represented by
each product in different markets, categories and periods, compared with the same
period a year ago. To do so, you could calculate the percentage – each number is of
the total of its column, a simple and common concept.
- However, in a classic relational database these calculations and display would require
definition of a separate view, requiring over 20 SQL commands. The star schema is
designed to overcome this limitation of the 2 dimensional relational models.
- A star schema can be created for every industry – consumer packaged goods, retail,
telecommunications, transportation, insurance, health care, manufacturing, banking
etc.
Page 2 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
Figure 3: Star schema with sample data
4.1.1 Potential performance problems with star schemas
Although most experts agree that star schemas are the preferred modeling method for
data warehousing, there are numbers of issues associated with its implementation:
i) Indexing problem
- requires multiple metadata definitions adds to design complexity and slowness in
performance
- addition and deletion of levels in the hierarchy will require physical modification of
the affected table, which is a time-consuming process that limits flexibility
- increases size of index, thus impacting both performance and scalability.
* solution – to make sure for one dimensions only have one key column
ii) Level indicator problem
iii) Pair wise join problem
iv) Star schema join problem
4.2 Data segregation strategy
i)
When a data warehousing project is first initiated, it may have a mixture of
operational and analytical/informational objectives. This mixture is a recipe for
disaster. Redefine the project to concentrate on non-operational, informational
Page 3 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
needs only. The primary reason for the existence of the data warehouse in the first
place is to segregate operational processing from reporting.
ii) Segregation also refers to partitioning the disk to implement the physical design.
iii) The level of dimension may support a partial order or a total order and can be viewed
via a directed path, a hierarchy, or a net.
Figure 4: Segregation of dimensions
i) Product Dimension
Company
|
Product Type
|
Product
ii) Time Dimension
iii) Location Dimension
Year
Planet
Month Season
|
Day
Country Continent
Hour AM/PM
|
Minute
|
Second
State Region
Zip Country
code
City
4.3 Aggregation
i) Aggregation is the process of calculating summary data from detail records. It is often
tempting to reduce the size of fact tables by aggregating data into summary records
when the fact table is created.
ii) However, when data is summarized in the fact table, detailed information is no longer
directly available to the analyst. If detailed information is needed, the detail rows that
were summarized will have to be identified and located, possibly in the source system
that provided the data.
iii) Fact table data should be maintained at the finest granularity possible. Aggregating data
in the fact table should only be done after considering the consequences.
iV) Mixing aggregated and detailed data in the fact table can cause issues and complications
when using the data warehouse. For example, a sales order often contains several line
items and may contain a discount, tax, or shipping cost that is applied to the order total
Page 4 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
instead of individual line items, yet the quantities and item identification are recorded at
the line item level.
v) Summarization queries become more complex in this situation, and tools such as Analysis
Services often require the creation of special filters to deal with the mixture of
granularity.
vi) There are two approaches that can be used in this situation (aggregation). One approach
is to allocate the order level values to line items based on value, quantity, or shipping
weight. Another approach is to create two fact tables, one containing data at the line
item level, the other containing the order level information.
vii) The order identification key should be carried in the detail fact table so the two tables
can be related. The order table can then be used as a dimension table to the detail table,
with the order-level values considered as attributes of the order level in the dimension
hierarchy.
As mentioned in chapter 3, there are four basic steps in designing data warehouse from data
warehouse scheme:
1. Choosing the data mart for the small group of end users we deal with.
2. Fact table granularity (the smallest defined level of data in the table) is determined.
3. Fact table dimensions are selected.
4. Determine the facts for the table. In most cases, the granularity is at the transaction
level, so the fact is the amount.
4.4 Data Distribution and Marketing
( Berson pg 294)
- Sometimes the shape of the distribution of data can be calculated by an equation
rather than just represented by the histogram. This is what is called a data distribution.
Like a histogram, a data distribution can be described by a variety statistics.
Page 5 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
- Many data distributions are well described by just two numbers, the mean and the
variance. The mean is something most people are familiar with; the variance however,
can be problematic. The easiest way to think about it is that it measures the average
distance of each predictor value from the mean value all the records in the database.
If the variance is high, the values are all over the place and very different. If the
variance is low, most of the data values are fairly close to the mean.
Data distribution can project the trend of marketing and this trend of marketing can be
used to project the expected trend and action to be taken to stimulate the expected
trend.
4.5 Physical Design
(Source : http://download-west.oracle.com/docs/cd/B10501_01/server.920/a96520/physical.htm)
4.5.1 Moving from Logical to Physical Design
i)
Logical design is what you draw before building your warehouse (usually using
ERD or Enterprise Modeling for Data warehouse). Physical design is the creation
of the database with SQL statements. It is a representation of actual database
structure to be built, such as the tables, columns and relationship.
ii) During the physical design process, you convert the data gathered during the
logical design phase into a description of the physical database structure. Physical
design decisions are mainly driven by query performance and database
maintenance aspects. For example, choosing a partitioning strategy that meets
common query requirements enables partition pruning, a way of narrowing a
search before performing it.
iii)
During the logical design phase, you defined a model for your data warehouse
consisting of entities, attributes, and relationships. The entities are linked
together using relationships. Attributes are used to describe the entities. The
Page 6 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
unique identifier (UID) distinguishes between one instance of an entity and
another.
Figure 5: Logical Design Compared with Physical Design in ER Model
iv) During the physical design process, you translate the expected schemas into
actual database structures. At this time, you have to map Entities to tables:
- Relationships to foreign key constraints
- Attributes to columns
- Primary unique identifiers to primary key constraints
- Unique identifiers to unique key constraints
v) During database design we concentrate on ‘streamlining” the logical data model
into a workable physical data model. We may even decide to build two databases
– in which case we should spawn two physical data models from one logical data
model. The process of streamlining is called denormalization, a regrouping of
data that often involves collapsing two or more entities into one table.
Page 7 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
Further ref:
http://www.rdbprime.com/Oracle/Oracle_Docs/Oracle10gDB_Server/server.101/b10736/physical.htm
http://www.microsoft.com/technet/prodtechnol/sql/2000/reskit/part5/c1761.mspx?mfr=true
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.7893
Exercise :
1. Define star schema.
(5 marks)
2. Star schema is considered logical design or physical design? (1 mark) Explain your
answer. (4 marks)
3. Every DM is composed of three basic elements: a central table/fact table; the dimensional
tables; and the set of relationships. Describe the three elements.
(6 marks)
4. The star scheme consists of various interconnect elements. Draw the star scheme and
label all the elements.
(10 Marks)
5. In your lab, develop the star schema with sample data as per Figure 3.
Page 8 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
6.
Create star schema from Batta case study below (10 marks) and develop the schema as
your lab exercise:
Batta is one of the world's leading footwear retailers and manufacturers with operations
across five continents. Batta's strength lies in its worldwide presence. While local companies
are self-governing, each one benefits from its link to the international organization for backoffice
systems,
product
innovations
and
sourcing.
Although Batta operates in a wide variety of markets, climates and buying power Batta
companies share the same leadership points. Two important ones are product concept
development and constant improvement of business processes in order to offer customers
great value and the best possible service.
Batta is interested in building a data warehouse to analyze their retail sales worth 1.5
billion and quantity of shoes they sold. They would like to have a detail of the analysis by
retailers and suppliers on each of their product and at specific time when the sales occurs.
They have product categories for suppliers that are: expensive, mid-range and low-cost.
Their product can be categorized into permanent goods, seasonal and temporary existence.
Batta shoes come in different sizes, colors and material combinations of boots, sandals,
sneakers, slippers and shoes accessories.
Tips: Refer Chapter 3.3.3 Designing a DW Fact Tables
Marking scheme :
½ mark for each dimension table. (Max 2 Marks)
1 mark for complete attributes in each table ( Max 4 Marks)
½mark for each connection from fact table to dimension table. ( Max 2 Marks)
½ mark for each foreign key underlined and ½ mark for primary key underlined. (Max 2 Marks)
Page 9 of 10
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 4 DATA WAREHOUSE IMPLEMENTATION
7.
You are required to draw a star schema for ‘Snow Boutique Holdings’ considering all
information given by the following paragraph. (10 marks)
An international winter clothing boutique franchise company is interested in data warehouse
implementation by executing a data mart in their holding company. The CEO of ‘Snow
Boutique’ said that he want to have analysis of the sales for his entire store to support their
strategic marketing plan.
He wants to know how many products they have sell in their stores for specific period of
time such as by day, month, quarter and year. He also wants to have the figure of sales qty,
amount and the cost for particular store, product and period.
The product of their stores such as sweater, shoes and gloves are varies by color and size
and the store is located per city which is under responsibility of a manager contactable by
telephone number.
8. Design Star Scheme for Pharmacy Care Systems. (10 marks)
Pharmacy Care Systems allows pharmacists to analyze the large amounts of clinical
and administrative data related to pharmacy order.
The requirements:
Director of pharmacy want to identify the potential overuse of certain drugs in the
formulary.
Specifically,
he
want
to
know
when
and
which
physician
are
overprescribing and patients who are receiving these drugs. He can enter a few key
variables into the computer and produce a list of overused drugs, the characteristics
of the patients that received them, and the characteristics of the physicians that
prescribed them.
A clinical coordinator want to know the cost of care and final outcome for each patient
that received a clinical consultation from your pharmacists. You enter the names of
your pharmacists in the computer and a list of patients, their final outcomes and total
costs of pharmacy appears.
Page 10 of 10