Summarizing Data with CUBE and ROLLUP

SQL ROLLUP and CUBE commands
• A quick, efficient way to summarize the data stored in your database
• Offer a valuable tool for gaining some quick-and-dirty insight into your data
• ROLLUP and CUBE are SQL extensions; both are available in SQL Server 6.5 (and above) and Oracle 8i (and above), while MySQL supports ROLLUP but not CUBE
CUBE Operation

Pet Table:

Type     Store    Number
Dog      Miami    12
Cat      Miami    18
Turtle   Tampa    4
Dog      Tampa    14
Cat      Naples   9
Dog      Naples   5
Turtle   Naples   1

SQL:
select Type, Store, sum(Number)
from Pet
group by Type, Store
WITH CUBE
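Note: the trailing WITH CUBE form above is the older SQL Server syntax. As a minimal sketch, the equivalent query in SQL:1999 grouping syntax (accepted by SQL Server 2008 and later, Oracle, and PostgreSQL 9.5 and later) would be:

-- Same result set as the WITH CUBE query above:
-- one row per (Type, Store) pair plus all subtotal and grand-total rows
SELECT Type, Store, SUM(Number) AS Total
FROM Pet
GROUP BY CUBE (Type, Store);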
Analyzing the data, you'll notice that our chain
has 27 cats, 31 dogs and 5 turtles spread
among our three stores. Our Miami store has
the largest number of pets in stock with a
whopping inventory of 30 pets.
Result:

Type     Store    Number
Cat      Miami    18
Cat      Naples   9
Cat      NULL     27
Dog      Miami    12
Dog      Naples   5
Dog      Tampa    14
Dog      NULL     31
Turtle   Naples   1
Turtle   Tampa    4
Turtle   NULL     5
NULL     Miami    30
NULL     Naples   15
NULL     Tampa    18
NULL     NULL     63
ROLLUP Operation

Utilizing the ROLLUP operator instead of the CUBE operator eliminates the results that contain a NULL in the first column, except for the grand-total row.
Pet Table:

Type     Store    Number
Dog      Miami    12
Cat      Miami    18
Turtle   Tampa    4
Dog      Tampa    14
Cat      Naples   9
Dog      Naples   5
Turtle   Naples   1
SQL:
select Type, Store, sum(Number)
from Pet
group by Type, Store
WITH ROLLUP
Result:

Type     Store    Number
Cat      Miami    18
Cat      Naples   9
Cat      NULL     27
Dog      Miami    12
Dog      Naples   5
Dog      Tampa    14
Dog      NULL     31
Turtle   Naples   1
Turtle   Tampa    4
Turtle   NULL     5
NULL     NULL     63
We're not particularly interested in the total number of pets at each store -- we'd just like to know our statewide inventory of each species along with the standard GROUP BY data.
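As with CUBE, the WITH ROLLUP form used above works in MySQL and older SQL Server; a minimal sketch of the equivalent in SQL:1999 grouping syntax (newer SQL Server, Oracle, and PostgreSQL) is:

-- Same result set as the WITH ROLLUP query above:
-- detail rows, per-Type subtotals, and the grand total only
SELECT Type, Store, SUM(Number) AS Total
FROM Pet
GROUP BY ROLLUP (Type, Store);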
Source: http://databases.about.com/od/sql/l/aacuberollup.htm

• Show how to use Excel to do CUBE and ROLLUP operations
Aggregations
• An aggregation simply means a summarized, typically additive, value.
• The level of aggregation in a star schema depends on the scenario.
• Many star schemas are aggregated to some base level, called the grain, although this is becoming somewhat less common as developers rely on cube-building engines to summarize to a base level of granularity.
Cartesian-Product Operation – Example
• Relations r, s: [figure omitted]
• r × s: [figure omitted]
• σA=C(r × s): [figure omitted]

Natural Join – Example
• Relations r, s: [figure omitted]
• r ⋈ s: [figure omitted] (the join attributes share the same domain)
• Note: the join operation is very slow.
Database Design?
• Is there any problem with this design? [figure omitted]

Normalization
[figure omitted: instructor and department tables]
Comparison

Normalization:
+ Reduces space
+ Small table size
+ Reduces update, insert, and delete anomalies
+ Reduces data entry
+ Uses fewer indexes
+ Fast indexing, since numeric keys are used for the indexes
+ Good for updates, inserts, and deletes
- Increases the number of joins across multiple tables

Denormalization:
+ Fast queries: fewer tables to query
+ Reduces the number of joins (fast query processing time)
+ Good for data retrieval
+ Easy for end users to understand the data
- Increases data redundancy
- Big table size
- Increases update, insert, and delete anomalies
Which table is much easier to understand?

Creating a view that looks like this, however, still uses joins in the background and therefore does not achieve the best performance on the query. Fortunately, there is a better way.
Designing the Star Schema Database

Motivation
• Creating a star schema database is one of the most important steps in creating a data warehouse.
• It is important to understand how to move from a standard, online transaction processing (OLTP) system to a final star schema.
For example:
• Assume a company sells only two products: dog food and cat food.
• Each day, the company records the sales of each product.
• At the end of a couple of days, the data looks like this:
• Each day contains several transactions. This is the data as stored in a standard OLTP system.
• However, the data warehouse might not record this level of detail. Instead, it could summarize, or aggregate, the data to daily totals.
• The records in the data warehouse might look something like this:
Pet Table:

Date        Order Number   Dog Food   Cat Food
4/24/2012   1              5          2
4/24/2012   2              3          0
4/24/2012   3              2          6
4/24/2012   4              2          2
4/24/2012   5              3          3
4/25/2012   1              3          7
4/25/2012   2              2          1
4/25/2012   3              4          0

Quantity Sold:

Date        Dog Food   Cat Food
4/24/2012   15         13
4/25/2012   9          8
• This summarization of data reduces the number of records by aggregating the individual transaction records into daily records that show the number of each product purchased each day.
• In this simple example, it is easy to derive the summary table simply by running a query against the Pet table:

select date, sum([Dog Food]) as [Dog Food],
       sum([Cat Food]) as [Cat Food]
from Pet
group by date

• However, many complexities enter the picture; these will be discussed later.
OLTP Systems
• Online Transaction Processing (OLTP) systems are standard, normalized databases.
• OLTP systems are optimized for inserts, updates, and deletes; in other words, transactions.
• Transactions in this context can be thought of as the entry, update, or deletion of a record or set of records.
• They minimize repeated data, and they limit the number of indexes.
Flat Structure in Excel

Problem: the header information is repeated in each detail record. [figure omitted]
Normalization
• Database normalization is the process of moving repeated information (fields) into a separate table.
• Normalization first removes repeated records in a table. For example, the following order table contains much repeated information and is not recommended: [figure omitted]

If there were twenty repeated sets of fields for detail records, the table would be unable to handle an order for twenty-one or more products. If an order has just one product ordered, all the other fields are useless.

The first step in the normalization process is to break the repeated fields into a separate table, and end up with this: [figure omitted]

Now, an order can have any number of detail records. This design:
• Reduces the data entry required
• Reduces the size of an OrderDetail record
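As an illustrative sketch only (the CustomerID and OrderDate columns are assumptions, not from the original; OrderID, LineItem, ProductID, and QuantityOrdered match the detail records shown next), the resulting two-table design could be declared like this:

-- Header table: one row per order; the repeated header data lives here only
CREATE TABLE OrderHeader (
    OrderID    INT PRIMARY KEY,
    CustomerID INT NOT NULL,   -- assumed column, for illustration
    OrderDate  DATE NOT NULL   -- assumed column, for illustration
);

-- Detail table: any number of line items per order
CREATE TABLE OrderDetail (
    OrderID         INT NOT NULL REFERENCES OrderHeader (OrderID),
    LineItem        INT NOT NULL,
    ProductID       CHAR(5) NOT NULL,
    QuantityOrdered INT NOT NULL,
    PRIMARY KEY (OrderID, LineItem)
);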
ER-Diagram

To reduce repeated data, the new detail records might look like this:

OrderID   LineItem   ProductID   QuantityOrdered
12345     1          A13J2      200
12345     2          J44RR      300
12345     3          GA1AE      1000
OLTP schemas store as little data as possible to
speed inserts, updates, and deletes. Therefore,
any number that can be calculated at query
time is calculated and not stored.
OLTP Advantages
• OLTP allows for the minimization of data entry
– Only the primary key value from the OrderHeader table, the primary key of the Product table, and the order quantity are stored
– Built for data inserts, updates, and deletes, but not data retrieval
• OLTP reduces the size of an OrderDetail record.
– help speed inserts, updates, and deletes.
• Most of the fields that link to other tables are numeric.
– Queries generally perform much better against numeric fields than
they do against text fields.
– Replacing a series of text fields with a numeric field can help speed
queries.
– Numeric fields also index faster and more efficiently.
• With normalization, there are frequently fewer indexes per
table.
– With fewer indexes to maintain, inserts, updates, and deletes run
faster.
OLTP Disadvantages
• With a normalized structure, queries must utilize
joins across multiple tables to get all the data.
• Joins tend to be slower than reading from a single
table, so minimizing the number of tables in a
query will boost performance.
• The data in an OLTP system is not user friendly.
– Most customers don't know how to make sense of the normalized structure of the database, which makes it hard to query.
– Joins are somewhat mysterious, and complex table structures are difficult for the average customer to use.
OLTP Disadvantages
• The fewer indexes in the database, the faster inserts, updates, and
deletes will be.
• The fewer indexes in the database, the slower select queries will
run.
• For the purposes of data retrieval, a higher number of correct
indexes helps speed retrieval.
• Since one of the design goals is to speed transactions by minimizing the number of indexes, OLTP databases trade faster transactions for slower data retrieval.
• This is one reason for creating two separate
database structures:
– an OLTP system for transactions
– an OLAP system for data retrieval.
OLTP Systems
• OLTP allows for the minimization of data entry
• Developers minimize the number of indexes in an
OLTP system.
• Indexes are important but they slow down
inserts, updates, and deletes.
• Therefore, most schemas have just enough
indexes to support lookups and other necessary
queries.
• Over-indexing can significantly decrease
performance.
Reason to Denormalize
• If customers want to spend the majority of their time performing
analysis by looking at the data, the IT group should support their
desire for fast, easy queries.
• On the other hand, maintaining the speed requirements of the
transaction-processing activities is critical.
• If these two requirements seem to be in conflict, they are, at least
partially.
• Many companies have solved this by having a second copy of the
data in a structure reserved for analysis.
• This copy is more heavily indexed, and it allows customers to
perform large queries against the data without impacting the
inserts, updates, and deletes on the main data.
• This copy of the data is often not just more heavily indexed, but
also denormalized to make it easier for customers to understand.
Reason to Denormalize
• Users query a lot. Therefore:
– We need to build many indexes to speed up queries
– We need to denormalize tables to reduce join operations and remove users' confusion about the table structure
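A minimal sketch of the indexing side of this idea (the table and index names here are hypothetical, not from the original): the transactional copy keeps only the indexes it needs for lookups, while the analysis copy adds indexes that would slow transactions.

-- OLTP copy: just enough indexing to support lookups
CREATE INDEX IX_OrderDetail_Product
    ON OrderDetail (ProductID);

-- Analysis copy (hypothetical denormalized table SalesHistory):
-- extra indexes speed ad hoc queries without touching the OLTP tables
CREATE INDEX IX_SalesHistory_Date
    ON SalesHistory (OrderDate);
CREATE INDEX IX_SalesHistory_ProductDate
    ON SalesHistory (ProductName, OrderDate);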
How Humans View Information
• How many bottles of Aniseed Syrup were sold
last week?
• Are overall sales of Condiments up or down this
year compared to previous years?
• On a quarterly and then monthly basis, are Dairy
Product sales cyclical?
• In what regions are sales down this year
compared to the same period last year? What
products in those regions account for the greatest
percentage of the decrease?
How Humans View Information
• Tracking a measure
• Aggregating data (sum, count)
• Time involved
• By certain conditions

The measures and aggregations belong in the Fact Table; time and the other conditions belong in Dimension Tables.
Star Schema
• What people want to see (events, measures)
– Fact Table (e.g., Sales)
• How they want to see it
– Dimension Tables (view data by time, by location, by product)
Dimension Table (fat, short table)
• All dimension tables should have a single-field primary key.
• This key is typically a surrogate key and is often just an identity column, consisting of an automatically incrementing number.
• The value of the primary key is meaningless; hence the surrogate key.
• The real information is stored in the other fields.
– These fields do not contain codes that link to other tables. Because the fields contain full descriptions, the dimension tables are often short and fat; they contain many large fields.
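A minimal sketch of such a dimension table, using SQL Server's IDENTITY syntax; the Category, SubCategory, and Brand columns are taken from the queries later in this document, while ProductName and the column types are assumptions:

CREATE TABLE ProductDimension (
    ProductID   INT IDENTITY(1,1) PRIMARY KEY, -- meaningless surrogate key
    ProductName VARCHAR(100) NOT NULL,         -- full description, not a code
    SubCategory VARCHAR(50) NOT NULL,
    Category    VARCHAR(50) NOT NULL,
    Brand       VARCHAR(50) NOT NULL
);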
Dimension Table (fat, short table)
• Short and fat.
• Cannot compare in size to a normal fact table.
• For example:
– Assume a company has 30,000 products in the product table.
– The company may track sales for these products each day for several years.
– Assume the company actually sells only 3,000 products on any given day.
– If it tracks these sales each day for ten years, the fact table has:
• 3,000 products sold x 365 days/year x 10 years ≈ 11,000,000 records
• The dimension table, with its 30,000 records, will be short compared to the fact table.
Fact Table (skinny, long table)
• Keeps fact or measure data by conditions (long)
– Measures are numeric and additive across some or all of the dimension tables.
– For example, sales are numeric, and users can look at total sales for a product, or category, or subcategory, and by any time period. The sales figures are valid no matter how the data is sliced.
• Uses the primary key of each dimension table (skinny)

Fact Table (skinny, long table)

Long table: holds up to the number of records represented by the product of the counts in all the dimension tables. Therefore, there are many, many records.

Skinny table: holds the fields which are the primary keys (as foreign keys) of the dimension tables. These fields are typically integer values. In addition, the measures are also numeric. Therefore, each record is generally much narrower than those in the dimension tables.
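Continuing the sketch, a matching fact table holds only the dimension keys and the numeric measure. The names follow the SalesFact and TimeDimension tables used in the queries below; the column types are assumptions:

CREATE TABLE SalesFact (
    TimeID       INT NOT NULL REFERENCES TimeDimension (TimeID),
    ProductID    INT NOT NULL REFERENCES ProductDimension (ProductID),
    SalesDollars DECIMAL(12,2) NOT NULL, -- numeric, additive measure
    PRIMARY KEY (TimeID, ProductID)      -- one row per product per time period
);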
Fact Table (skinny, long table)
• 10 years of daily data
• 200 stores
• 500 products

Fact table size: 365,000,000 records (3,650 days x 200 stores x 500 products)

• This large number of records makes the fact table long.
• If we add another dimension, such as a dimension of 10,000 customers, the number of records in the fact table can increase by up to a factor of 10,000.
Query 1: the total sales for a particular year and month for a particular category, e.g., the total SalesDollars for March 1999 for Category = 'Brass Goods'

SQL query:
SELECT Sum(SF.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension T, ProductDimension P, SalesFact SF
WHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductID AND
      P.Category = 'Brass Goods' AND T.Month = 3 AND T.Year = 1999

To drill down to a subcategory, the SQL would change to look like this:

SELECT Sum(SF.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension T, ProductDimension P, SalesFact SF
WHERE P.ProductID = SF.ProductID AND T.TimeID = SF.TimeID AND
      P.SubCategory = 'Widgets' AND T.Month = 3 AND T.Year = 1999;

Query 2: the total SalesDollars for March 1999 by category

SQL query:
SELECT P.Category AS Category, Sum(SF.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension T, ProductDimension P, SalesFact SF
WHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductID AND
      T.Month = 3 AND T.Year = 1999
GROUP BY P.Category

Query 3: the total SalesDollars by month and by brand

SQL query:
SELECT T.Month AS Month, P.Brand AS Brand,
       Sum(SF.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension T, ProductDimension P, SalesFact SF
WHERE T.TimeID = SF.TimeID AND P.ProductID = SF.ProductID
GROUP BY T.Month, P.Brand
Snowflake Schemas

A dimension table is fat, so it can be broken out into separate tables to save space. This is a more normalized structure, but it leads to more difficult queries and slower response times, and it makes it difficult to drill down into the data.

Snowflake Schemas
• Snowflaking increases the number of joins and can slow queries.
• Since the purpose of an OLAP system is to speed queries, snowflaking is usually not productive.
• Some people try to normalize the dimension tables to save space.
• However, in the overall scheme of the data warehouse, the dimension tables usually account for only about 1% of the total storage.
• Any space savings from normalizing, or snowflaking, are negligible.
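For illustration only (these snowflaked table names are hypothetical), pulling Category out of the product dimension would look like this, and every query that filters or groups by category then pays for an extra join:

-- Snowflaked design: Category moved out of the product dimension
CREATE TABLE CategoryDim (
    CategoryID   INT PRIMARY KEY,
    CategoryName VARCHAR(50) NOT NULL
);

CREATE TABLE ProductDim (
    ProductID   INT PRIMARY KEY,
    ProductName VARCHAR(100) NOT NULL,
    CategoryID  INT NOT NULL REFERENCES CategoryDim (CategoryID)
);

-- The category query from earlier now needs one more join:
SELECT C.CategoryName, SUM(SF.SalesDollars) AS SumOfSalesDollars
FROM SalesFact SF, ProductDim P, CategoryDim C
WHERE SF.ProductID = P.ProductID AND P.CategoryID = C.CategoryID
GROUP BY C.CategoryName;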
Fact Granularity
• The granularity, or frequency, of the data is
determined by the lowest level of granularity of
each dimension table.
• For example, a fact table may store weekly or
monthly totals for individual products.
• The lower the granularity, the more records that
will exist in the fact table.
• The granularity also determines how far users can
drill down without returning to the base,
transaction-level data.
Fact Table Size
• 500 products sold in 200 stores and tracked for
10 years
500 x 200 x 10 x 365
– could produce 365,000,000 records in a fact table with
a daily grain (maximum size for the table)
– Star schemas do not store zero values unless zero has
some significance.
– So, if a product did not sell at a particular store for a
particular day, the system would not store a zero
value.
– The fact table contains only the records that have a
value.
Fact Table Size
• The lower the granularity, the larger the fact table.
• In the previous example, moving from a daily to weekly
grain would reduce the potential number of records to
only slightly more than 52,000,000 records (500
products x 200 stores x 10 years x 52 weeks)
• The data types for the fields in the fact table are
numeric, which can require less storage space than the
long descriptions we find in the dimension tables.
• Be aware that each added dimension can greatly increase the size of the fact table.
– For example, adding a customer dimension table that has 20 possible values (records) would raise the potential number of records to 7.3 billion:
500 x 200 x 10 x 365 x 20 = 7,300,000,000 records
Reference
• Designing the Star Schema Database
http://www.ciobriefings.com/whitepapers/StarSchema.asp
• Summarizing Data with CUBE and ROLLUP
http://databases.about.com/od/sql/l/aacuberollup.htm