CO6002 Advanced Database Management
March 30, 2011
The AWISDM project can be found by using the shortcut on the desktop or by following the path
C:\Users\0817804\Documents\Visual Studio 2008\Projects\aj_AWISDM\aj_AWISDM. The user name for the
remote desktop is 0817804 and the password is P@55word.
Contents
Table of Figures
Table of Tables
Pilot InternetSales Data Mart Design
Starting with dimension tables
Database design as a star schema model
Date
ProductSubcategory
Territory
Customer
Surrogate Keys
Date Attributes
Normalisation and de-normalisation
Comparing the terms star schema and snow-flake schema
Schemas in Data Warehouses
Slowly Changing Dimensions
Design the fact table
Fact Table considerations and Granularity
Estimate the size of the Data Mart
Design, develop and save an SSIS Package
Integrate data from the four data sources into a staging database
Load data to the data mart
The key features of the ETL process
Business Intelligence Applications
Slicing and Dicing the Cube
Reports
Ad hoc Queries
OLAP
Data Mining
Decision Support
Table of Figures
Figure 1 Four types of dimension tables
Figure 2 Star Schema structure for database design
Figure 3 Entity relationship model of fact and dimension tables
Figure 4 Starter Dimensions
Figure 5 Date table with the addition of surrogate keys
Figure 6 ProductSubCategory dimension with additional surrogate keys
Figure 7 The Territory Dimension
Figure 8 The Customer dimension
Figure 9 Star schema model
Figure 10 AWISDM Data Mart as a Star Schema
Figure 11 The benefits of a Snowflake Schema
Figure 12 Snowflake Schema
Figure 13 AWISDM as Snowflake Schema
Figure 14 Fact Table
Figure 15 Entity Relationship Diagram for AWISDM
Figure 16 Concept design for integrating the four data sources
Figure 17 Data is imported from the four sources
Figure 18 the ETL Process
Figure 19 Loading the data into the Data Mart
Figure 20 A flow diagram to deal with error checking
Figure 21 Data flow to check for and correct errors
Figure 22 Output to Fact Table
Figure 23 BI Platform
Figure 24 Microsoft Office Integration
Figure 25 Single Data Point
Figure 26 one Dimension of Data
Figure 27 Two Dimensions of Data
Figure 28 Three Dimensional Data
Figure 29 Decision Support
Table of Tables
Table 1 Snowflake versus Star Schema
Table 2 AWISDM Dimension Tables Sizes
Pilot InternetSales Data Mart Design
In this section we explore the dimensional modelling of the pilot data mart for AdventureWorks Ltd.
There is broad agreement in the fields of data warehousing and business intelligence that the
dimensional model is the preferred structure. Data warehousing centres on facts and the fact table.
Facts are usually numeric and are a measurement of an event.
Figure 1 Four types of dimension tables
Starting with dimension tables
Dimensions are the nouns of the business world; they describe the context surrounding measurement
events. A single dimension that is shared across multiple fact tables is called a conformed
dimension. A junk dimension is a convenient way of grouping flags and indicators. Degenerate
dimensions are common when the grain of the fact table is at a unit level. Role-playing dimensions
are recycled for multiple applications within the same database.
Database design as a star schema model
In its initial state, the database consists of four tables with no relationships, shown in Figure 2
Star Schema structure for database design.
Figure 2 Star Schema structure for database design
A fact table at the centre of a star schema structure, surrounded by four dimensions:
ProductSubCategory, Customer, Territory and Date.
Figure 3 Entity relationship model of fact and dimension tables
Fact Table: ID, CustomerKey, TerritoryID, DateID, ProductID, ProdSubCategory, OrderValue
DimDate: DateID (PK), Date key, Date, Day, MonthID, MonthDescrition, QuarterID, QuarterDescription,
YearID, YearDescription, MonthlyTimeSpan, QuarterTimeSpan, MonthEndDate, QuarterEndDate, YearEndDate,
FullDate, Day of the week, Day number in month, Day number overall, Week number in year, Week number
overall, Quarter, FiscalPeriod, Last day in the month flag, EventKey, Holiday, Weekend
DimProductSubCategory: ProductID, ProductSubCategoryName, ProductCategoryName, ProductName, Brand,
ProductCategory, ProductDescrition
DimTerritory: TerritoryID, TerritoryRegion, State, Province, City
DimCustomer: CustomerKey (PK), MaritalStatus, BirthDate, Education, Occupation, AddressLine1, City,
StateProvince, CountryRegionName, PostalCode, FirstName, MiddleName, LastName, Phone
Relationships: each of DimProductSubCategory, DimTerritory, DimDate and DimCustomer is related to the
Fact Table (has / is of).
The first step in developing our dimension tables is to add surrogate keys to them.
Figure 4 Starter Dimensions
This basic starting point requires us to add surrogate keys to the dimension tables.
Date
Figure 5 Date table with the addition of surrogate keys
DimDate: DateID (PK), Date key, Date, Day, MonthID, MonthDescrition, QuarterID, QuarterDescription,
YearID, YearDescription, MonthlyTimeSpan, QuarterTimeSpan, MonthEndDate, QuarterEndDate, YearEndDate,
FullDate, Day of the week, Day number in month, Day number overall, Week number in year, Week number
overall, Quarter, FiscalPeriod, Last day in the month flag, EventKey, Holiday, Weekend
We can see from the table shown in Figure 5 Date table with the addition of surrogate keys that a
number of surrogate, or non-natural, keys have been added to the date dimension.
ProductSubcategory
Figure 6 ProductSubCategory dimension with additional surrogate keys
DimProductSubCategory: ProductID, ProductSubCategoryName, ProductCategoryName, ProductName, Brand,
ProductCategory, ProductDescrition
The ProductSubCategory dimension with additional keys, as shown in Figure 6 ProductSubCategory
dimension with additional surrogate keys.
Territory
The territory dimension with additional surrogate keys, shown in Figure 7 The Territory Dimension.
Figure 7 The Territory Dimension
DimTerritory: TerritoryID, TerritoryRegion, State, Province, City
Customer
Figure 8 The Customer dimension
DimCustomer: CustomerKey (PK), MaritalStatus, BirthDate, Education, Occupation, AddressLine1, City,
StateProvince, CountryRegionName, PostalCode, FirstName, MiddleName, LastName, Phone
Finally we have the customer dimension with its surrogate keys, as shown in Figure 8 The Customer
dimension.
Surrogate Keys
A surrogate key is usually created to simplify the key structure. That is to say, it is an artificial
substitute for a natural primary key. The data identified by a natural key may change with time,
which is known as a slowly changing dimension. Surrogate keys are neither intelligent nor business
specific and are most commonly system-generated. Their purpose is to ensure that rows are unique
within a database table; typically they are numeric fields.
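The idea of a system-generated key can be sketched in a few lines. The example below is only an illustration, assuming an in-memory SQLite database rather than the SQL Server instance used for the AWISDM; the column name CustomerSK and the sample customers are hypothetical.

```python
import sqlite3

# In-memory database used purely for illustration (the AWISDM itself is on SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE DimCustomer (
        CustomerSK     INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (hypothetical name)
        CustomerNumber TEXT NOT NULL,                      -- natural key from the source system
        LastName       TEXT,
        City           TEXT
    )
    """
)

# Made-up sample rows; the surrogate key is never supplied by the source data.
sample_customers = [("C001", "Smith", "Chester"), ("C002", "Jones", "Leeds")]
conn.executemany(
    "INSERT INTO DimCustomer (CustomerNumber, LastName, City) VALUES (?, ?, ?)",
    sample_customers,
)

surrogate_keys = [
    row[0] for row in conn.execute("SELECT CustomerSK FROM DimCustomer ORDER BY CustomerSK")
]
print(surrogate_keys)  # the database assigns 1, 2, ... with no business meaning
```

The key values carry no meaning about the customer, which is what allows the descriptive attributes to change without disturbing the rows that reference them.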
Date Attributes
Data mart solutions often include a role-playing dimension to record the passage of time: the “Date”
dimension. It is necessary to provide attributes in this dimension beyond just the date field, which
on its own would only allow users to view sales for a specific date or to specify a range of dates;
this limits the scope for data analysis.
Adding fields such as the day of the week, month name, fiscal calendar quarter or a public holiday
flag enables much more flexibility. For example, a user would be able to compare sales data for
a particular day in a particular month to the same day in the previous year, or compare sales totals
from one week to the next. This additional functionality allows business decision makers to draw
considerably more useful conclusions from the data.
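As a rough sketch of how such attributes can be derived from a single date, the function below builds one row of a date dimension. The attribute names loosely follow the DimDate columns, the ISO week number is an assumed convention, and public holidays are left out because they would need a separate calendar.

```python
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

def date_attributes(d: date) -> dict:
    """Derive descriptive date-dimension attributes from one calendar date."""
    return {
        "FullDate": d.isoformat(),
        "DayOfWeek": DAY_NAMES[d.weekday()],
        "DayNumberInMonth": d.day,
        "WeekNumberInYear": d.isocalendar()[1],  # ISO week number (an assumed convention)
        "Quarter": (d.month - 1) // 3 + 1,       # calendar quarter, not fiscal
        "Year": d.year,
        "Weekend": d.weekday() >= 5,
    }

row = date_attributes(date(2011, 3, 30))
print(row)
```

With attributes like these stored once per date, a query can group sales by day of the week or quarter without recalculating anything at query time.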
Normalisation and de-normalisation
The dimension tables of a data mart contain de-normalised data, whilst the fact tables contain
fully normalised data. Whilst this does not optimise the database’s file size, it does increase the
speed at which queries on the data execute. If the database were fully normalised (for example, if
the customer table were normalised), additional slow lookups for marital status, profession and
education would have to be performed.
The fact tables are fully normalised; they contain only time-period data and the relevant surrogate
key for each dimension.
The dimension tables are de-normalised, usually to the second normal form, and contain surrogate
keys to improve query speeds.
Comparing the terms star schema and snow-flake schema
“Star and snowflake schema designs are mechanisms to separate facts and dimensions into separate
tables. Snowflake schemas further separate the different levels of a hierarchy into separate tables.”
(IBM, 2011)
Joins
Snowflake Schema: Higher number of joins
Star Schema: Fewer joins

Ease of Use
Snowflake Schema: More complex queries and hence less easy to understand
Star Schema: Less complex queries and easy to understand

Query Performance
Snowflake Schema: More foreign keys and hence more query execution time
Star Schema: Fewer foreign keys and hence less query execution time

Ease of maintenance or change
Snowflake Schema: No redundancy and hence easier to maintain and change
Star Schema: Has redundant data and hence less easy to maintain/change

Type of data warehouse
Snowflake Schema: Good to use for small data warehouses/data marts
Star Schema: Has redundant data and hence less easy to maintain/change

Dimension table
Snowflake Schema: It may have more than one dimension table for each dimension
Star Schema: Contains only a single dimension table for each dimension

Dimension Table Normalisation
Snowflake Schema: Third Normal Form
Star Schema: Second Normal Form (de-normalised)
Table 1 Snowflake versus Star Schema
Figure 9 Star schema model
Labels in the figure: primary keys, foreign keys, fact tables, dimension tables, star schemas,
snowflake schemas.
A hierarchy is a set of levels having many-to-one relationships between each other, and the set of
levels collectively makes up a dimension. In a relational database, the different levels of a
hierarchy can be stored in a single table (as in Figure 9 Star schema model) or in separate
tables (as in a snowflake schema).
• Many-to-one relationships
• Balanced and unbalanced hierarchies
• Ragged hierarchies
Schemas in Data Warehouses
Figure 10 AWISDM Data Mart as a Star Schema
A central Fact table surrounded by the Date, Customer, Product Subcategory and Territory dimensions.
A star schema model consists of one or more fully normalised fact tables (3NF) that reference any
number of de-normalised dimension tables. The dimension tables feature redundant data in the
most granular form and are in second normal form (2NF). This increases the simplicity of the
database and reduces the complexity of queries made upon it. A snowflake schema is a variation on
this and features fully normalised dimension tables (3NF). This reduces the amount of file space
needed to store the database and reduces the number of places data would need to be altered if an
update is required, but this comes at a cost: a reduction in query performance.
The facts that the data warehouse helps to analyse are classified along different dimensions. The
fact table holds the main data. It includes a large amount of aggregated data, such as price and
units sold. There may be multiple fact tables in a star schema.
Dimension tables, which are usually smaller than fact tables, include the attributes that describe the
facts. Most often these dimensions are segregated into a separate table for each dimension.
Dimension tables are joined to the fact table(s) as needed.
Dimension tables have a simple primary key, while fact tables have a set of foreign keys which make
up a compound primary key consisting of a combination of relevant dimension keys.
It is common for dimension tables to consolidate redundant data in the most granular column, and
they are thus rendered in second normal form. Fact tables are usually in third normal form because
all data depends on either one dimension or all of them, not on combinations of a few dimensions.
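The trade-off between the two schemas can be illustrated with a small experiment. The sketch below assumes an in-memory SQLite database and made-up rows; the table names StarProductSubCategory and SnowProductSubCategory are hypothetical and exist only so that both layouts can sit side by side. The same question (total order value per product category) needs one join against the star layout and two against the snowflake layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Star layout: the category name is repeated on every sub-category row.
    CREATE TABLE StarProductSubCategory (
        ProductID INTEGER PRIMARY KEY, ProductSubCategoryName TEXT, ProductCategoryName TEXT);
    -- Snowflake layout: the category level is split out into its own table.
    CREATE TABLE DimProductCategory (
        ProductCategoryID INTEGER PRIMARY KEY, ProductCategoryName TEXT);
    CREATE TABLE SnowProductSubCategory (
        ProductID INTEGER PRIMARY KEY, ProductSubCategoryName TEXT,
        ProductCategoryID INTEGER REFERENCES DimProductCategory);
    CREATE TABLE Fact (ProductID INTEGER, OrderValue REAL);

    INSERT INTO DimProductCategory VALUES (1, 'Bikes'), (2, 'Accessories');
    INSERT INTO StarProductSubCategory VALUES
        (1, 'Road Bikes', 'Bikes'), (2, 'Mountain Bikes', 'Bikes'), (3, 'Helmets', 'Accessories');
    INSERT INTO SnowProductSubCategory VALUES
        (1, 'Road Bikes', 1), (2, 'Mountain Bikes', 1), (3, 'Helmets', 2);
    INSERT INTO Fact VALUES (1, 100.0), (2, 250.0), (3, 30.0), (1, 50.0);
    """
)

star_totals = conn.execute(
    """
    SELECT d.ProductCategoryName, SUM(f.OrderValue)
    FROM Fact f
    JOIN StarProductSubCategory d ON d.ProductID = f.ProductID
    GROUP BY d.ProductCategoryName ORDER BY d.ProductCategoryName
    """
).fetchall()

snowflake_totals = conn.execute(
    """
    SELECT c.ProductCategoryName, SUM(f.OrderValue)
    FROM Fact f
    JOIN SnowProductSubCategory s ON s.ProductID = f.ProductID
    JOIN DimProductCategory c ON c.ProductCategoryID = s.ProductCategoryID
    GROUP BY c.ProductCategoryName ORDER BY c.ProductCategoryName
    """
).fetchall()

print(star_totals)
print(snowflake_totals)
```

Both queries return the same totals; the difference is that the snowflake layout stores each category name once, at the price of the extra join.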
Figure 11 The benefits of a Snowflake Schema
The benefit of using the snowflake schema, as shown in Figure 11 The benefits of a Snowflake
Schema, is that the storage requirements are lower, since the snowflake schema eliminates many
duplicate values from the dimensions themselves.
Figure 12 Snowflake Schema
The main design is a star schema, but the ProductCategory can be given more depth by using a
snowflake schema for this element of the design, extending the granularity of the
DimProductSubCategory dimension.
Figure 13 AWISDM as Snowflake Schema
The figure repeats the Fact Table and the DimDate, DimCustomer, DimTerritory and
DimProductSubCategory tables of Figure 3, and adds a further table (has / is of):
DimProductCategory: ProductCategoryID (PK), ProductSubCategoryID, ProductCategoryName, ProductID
Additional column labels in the figure: OrderDate, Birthdate, OrderValue, SalesAmount,
SalesTerritoty, SalesRegion, MS, Addr1, City, Country, Postcode, ProductSubCategoryName,
AWISProjDate, TerritoryRegion, FirstName, LastName
Slowly Changing Dimensions
This is a dimensional problem that occurs with time: the attributes of a record change over time.
This can be seen in particular in the Customer table of the AWISDM.
For example, if the customer number were used as a natural key and a customer changed their
address, it would cause numerous problems. Firstly, if the record were simply updated it would not
be possible to track where previous items had been delivered; it would be as though the customer
had only a single address and had always lived there.
By using a surrogate key, a new record for the customer can be added to the table with the new
address, and all new queries will use this new record; the old data remains intact. It would be
possible to create a composite natural key (customer number and address), but if not all parts of
this natural key exist in the fact table it may not be possible to join on the enlarged key, so
another ID field is required: the surrogate key.
Although there is only one customer, they would be treated dimensionally as two separate
customers. The problem of slowly changing dimensions is one that needs addressing to avoid the
deletion of data that may be required for statistical analysis. For example, if the company wanted
to analyse sales by region and a customer’s address had simply been overwritten, all sales made
before the alteration would become assigned to the new region.
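The approach described above, adding a new row under a new surrogate key rather than overwriting the old one, is often called a Type 2 change. A minimal sketch follows, using plain Python lists in place of database tables; the names CustomerSK and IsCurrent, the customer number and the regions are all hypothetical.

```python
from collections import defaultdict

dim_customer = []  # dimension rows, one per version of a customer
fact_sales = []    # (CustomerSK, OrderValue) pairs

def add_customer_version(customer_number, region):
    """Mark any current row for this customer as expired and append a new version."""
    for row in dim_customer:
        if row["CustomerNumber"] == customer_number and row["IsCurrent"]:
            row["IsCurrent"] = False
    surrogate_key = len(dim_customer) + 1  # simple system-generated key
    dim_customer.append(
        {"CustomerSK": surrogate_key, "CustomerNumber": customer_number,
         "Region": region, "IsCurrent": True}
    )
    return surrogate_key

def current_key(customer_number):
    """Return the surrogate key of the customer's current dimension row."""
    return next(r["CustomerSK"] for r in dim_customer
                if r["CustomerNumber"] == customer_number and r["IsCurrent"])

add_customer_version("C001", "North West")
fact_sales.append((current_key("C001"), 100.0))   # sale made at the old address
add_customer_version("C001", "London")            # the customer moves
fact_sales.append((current_key("C001"), 40.0))    # sale made at the new address

region_by_key = {r["CustomerSK"]: r["Region"] for r in dim_customer}
sales_by_region = defaultdict(float)
for key, value in fact_sales:
    sales_by_region[region_by_key[key]] += value
print(dict(sales_by_region))
```

Because each sale keeps the surrogate key that was current when it was made, the earlier sale stays with the old region instead of moving to the new one.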
Design the fact table
Fact Table: ID (PK), CustomerKey (PK, FK1), TerritoryID (PK, FK4), DateID (PK, FK2),
ProductID (PK, FK3), ProdSubCategory (PK), OrderValue
Figure 14 Fact Table
Figure 15 Entity Relationship Diagram for AWISDM
The Fact Table is related to DimDate, DimProductSubCategory, DimTerritory and DimCustomer, with the
same tables and columns as shown in Figure 3 Entity relationship model of fact and dimension tables.
Fact Table considerations and Granularity
The fact table in the AWISDM requires a daily sales total for a customer in each product
sub-category and region. The ETL process is currently unaffected by the source granularity as the
incoming data is all at the same level of detail. If one of the data sources listed its sales on an
individual order basis rather than as a daily total, for example, this data would have to be
aggregated by date to match the coarser level of detail in the other data.
All subsequent sales data will need to at least match this granularity. If one of the data sources
begins to produce a coarser level of detail, for example grouping sales data by product category
instead of sub-category, all of the data being imported to the data mart would have to be brought
to that coarser level of granularity to ensure consistency.
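To illustrate the aggregation step, the sketch below rolls individual orders up to one daily total per customer, sub-category and territory. The order rows are invented, and the column order is an assumption based on the Australian source schema.

```python
from collections import defaultdict

# Hypothetical order-level rows:
# (OrderDate, CustomerKey, ProdSubCategory, SalesTerritory, OrderValue)
orders = [
    ("2011-03-01", 11000, "Road Bikes", "Australia", 100.0),
    ("2011-03-01", 11000, "Road Bikes", "Australia", 50.0),
    ("2011-03-01", 11000, "Helmets", "Australia", 30.0),
    ("2011-03-02", 11000, "Road Bikes", "Australia", 75.0),
]

def aggregate_to_daily(order_rows):
    """Sum order values so that each key appears once per day (the fact table grain)."""
    totals = defaultdict(float)
    for order_date, customer, subcategory, territory, value in order_rows:
        totals[(order_date, customer, subcategory, territory)] += value
    return dict(totals)

daily_totals = aggregate_to_daily(orders)
for key, total in sorted(daily_totals.items()):
    print(key, total)
```

Aggregating upwards like this is always possible; the reverse is not, which is why the coarsest source sets the grain for the whole data mart.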
Estimate the size of the Data Mart
The fact table in the AWISDM currently has seven fields with an average field length of 38.9 bytes
(one field of 4 bytes, five fields of 52 bytes each, and one of 8 bytes). Assuming that there are
500,000 rows in the table, this gives a total table size of
7 fields x 38.9 bytes/field x 500,000 rows ≥ 136,150,000 bytes (≥ 132,959 Kilobytes, or ≥ 129.84
Megabytes)
The dimension tables are much smaller and typically contain a fraction of the number of rows found
in the fact table. The dimension tables in the AWISDM are of the following sizes, shown in Table 2
AWISDM Dimension Tables Sizes.
Table Name               Average Field Size   Field Count   Row Count   File Size
DimCustomer              41.4 bytes           14            18,484      ≥ 10.22 Megabytes
DimDate                  14.4 bytes           27            1,158       ≥ 0.43 Megabytes
DimTerritory             52 bytes             5             10          ≥ 2.54 Kilobytes
DimProductSubCategory    52 bytes             7             37          ≥ 13.15 Kilobytes
Table 2 AWISDM Dimension Tables Sizes
As shown above, the data mart fact table is far larger than its dimension tables and will continue
to grow at a much faster rate: as more and more sales data is added each day, the dimension tables
will remain relatively unchanged.
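The size estimates in this section can be reproduced with a short calculation. The sketch below uses the field counts, average field sizes and row counts quoted above and assumes 1 Kilobyte = 1,024 bytes; the helper name table_bytes is our own.

```python
KB = 1024
MB = 1024 * 1024

def table_bytes(field_count, avg_field_bytes, row_count):
    """Estimated table size: fields per row x average bytes per field x rows."""
    return field_count * avg_field_bytes * row_count

# Fact table: one 4-byte field, five 52-byte fields and one 8-byte field per row.
fact_row_bytes = 1 * 4 + 5 * 52 + 1 * 8
fact_exact = fact_row_bytes * 500_000            # using the exact row width
fact_rounded = table_bytes(7, 38.9, 500_000)     # using the rounded 38.9-byte average

# Dimension tables: (field count, average field size in bytes, row count)
dimensions = {
    "DimCustomer": (14, 41.4, 18_484),
    "DimDate": (27, 14.4, 1_158),
    "DimTerritory": (5, 52, 10),
    "DimProductSubCategory": (7, 52, 37),
}
dimension_bytes = {name: table_bytes(*spec) for name, spec in dimensions.items()}

print(f"Fact table: {fact_rounded / MB:.2f} Megabytes (rounded average)")
print(f"Fact table: {fact_exact / MB:.2f} Megabytes (exact row width)")
for name, size in dimension_bytes.items():
    print(f"{name}: {size / KB:.2f} Kilobytes")
```

The small gap between the two fact table figures comes only from rounding the average field length to 38.9 bytes; the exact row width is 272 bytes.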
Design, develop and save an SSIS Package
Design an ETL Process to:
Integrate data from the four data sources into a staging database
Figure 16 Concept design for integrating the four data sources
Sales data from four data stores (D1 to D4, covering Australia, UK, US and Germany) flows into
Aj_AWISDM.
The four data sources are loaded into the transform process. For the UK sales a replacement takes
place, replacing “tyre” with “tire”. All four data sources are then formatted to ensure that the
data types are consistent throughout the process; the outputs from these conversions are then
combined using a Union All transformation to align the data being imported. See Figure 17 Data is
imported from the four sources.
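The package itself is built in SSIS, but the same three steps (spelling replacement, data type conversion and union) can be sketched in Python to show the logic. The simplified column set, the ISO date format and the sample rows are assumptions made for the illustration, not the actual source layouts.

```python
from datetime import date

def standardise_row(row, source):
    """Convert one raw source row (all strings) into a common staging shape."""
    subcategory = row["ProdSubCategory"]
    if source == "UK":
        # UK spelling is replaced so that sub-category names match the other sources.
        subcategory = subcategory.replace("Tyre", "Tire").replace("tyre", "tire")
    return {
        "Source": source,
        "CustomerKey": int(row["CustomerKey"]),
        "ProdSubCategory": subcategory,
        "OrderDate": date.fromisoformat(row["OrderDate"]),  # assumes ISO-formatted dates
        "OrderValue": float(row["OrderValue"]),
    }

def union_all(*row_sets):
    """Concatenate row sets that share the same columns, keeping every row."""
    return [row for rows in row_sets for row in rows]

# Hypothetical raw extracts from two of the four sources.
uk_raw = [{"CustomerKey": "11000", "ProdSubCategory": "Tyres and Tubes",
           "OrderDate": "2011-03-01", "OrderValue": "24.99"}]
australia_raw = [{"CustomerKey": "11001", "ProdSubCategory": "Road Bikes",
                  "OrderDate": "2011-03-01", "OrderValue": "1200"}]

staging = union_all(
    [standardise_row(r, "UK") for r in uk_raw],
    [standardise_row(r, "Australia") for r in australia_raw],
)
for staged_row in staging:
    print(staged_row)
```

Converting every source to the same column names and types before the union is what allows the combined rows to be loaded into a single staging table.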
Figure 17 Data is imported from the four sources
AdventureWorks Ltd InternetSales Data Mart (AWISDM)
Source schemas shown in the figure, each with a matching compatibility schema:
Australian sales: CustomerKey, ProdSubCategory, OrderDate, SalesTerritory, OrderValue
UK sales (XML) / German sales: CustomerKey, ProdSubCategory (ProductSubCat), OrderDate, SalesRegion,
MS, Birthdate, Addr1, City, Country, Postcode, SalesAmount
US sales: CustomerKey, ProductSubCategoryName, AWISProjDate, TerritoryRegion, BirthDate, FirstName,
MiddleName, LastName, SalesAmount
Figure 18 the ETL Process
Labels in the figure: Extract, Transact, Load, MS Access database, Staging Database.
Column labels: CustomerKey, ProdSubCategory, OrderDate, SalesRegion, MS, Birthdate, Addr1, City,
Country, Postcode, SalesAmount
Load data to the data mart
Figure 19 Loading the data into the Data Mart
Steps shown in the figure:
Import Australia Sales Compatibility, Import Australia Sales, Import German Sales, Import UK Sales,
Import US Sales Compatibility, Import US Sales
Consolidate and clean UK detail, Consolidate and clean Australia detail, Consolidate and clean German
detail, Consolidate and clean US detail
UK Prime, Australia Prime, German Prime, US Prime
Aj_AWISDM Starter Dimensions
The key features of the ETL process
The key features of the ETL process are in the acronym: we first extract the data from the disparate
data sources, we then transform this data by converting the data types and error checking it, and
finally we load the data into the data mart.
Tasks designed to validate the data
Figure 20 A flow diagram to deal with error checking
Flow shown in the figure: Start; read data from the data stores (Australia, German, UK and US);
decision “Is data validated”, with a “No” branch.
Figure 21 Data flow to check for and correct errors
Figure 22 Output to Fact Table
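The error-checking flow can be sketched as a function that separates valid rows from rejected ones before anything reaches the fact table. The two rules used here (a numeric CustomerKey and a non-negative numeric OrderValue) are illustrative assumptions, not the checks configured in the package.

```python
def validate(row):
    """Return a list of problems found in one staged row (empty if the row is valid)."""
    errors = []
    if not str(row.get("CustomerKey", "")).isdigit():
        errors.append("CustomerKey is not numeric")
    try:
        if float(row.get("OrderValue", "")) < 0:
            errors.append("OrderValue is negative")
    except (TypeError, ValueError):
        errors.append("OrderValue is not a number")
    return errors

def split_valid_and_rejected(rows):
    """Route each row either to the load set or to an error set with its reasons."""
    valid, rejected = [], []
    for row in rows:
        problems = validate(row)
        if problems:
            rejected.append((row, problems))
        else:
            valid.append(row)
    return valid, rejected

# Hypothetical staged rows, two of which should fail validation.
staged_rows = [
    {"CustomerKey": "11000", "OrderValue": "24.99"},
    {"CustomerKey": "ABC", "OrderValue": "10.00"},
    {"CustomerKey": "11002", "OrderValue": "n/a"},
]
valid_rows, rejected_rows = split_valid_and_rejected(staged_rows)
print(len(valid_rows), "valid,", len(rejected_rows), "rejected")
```

Keeping the rejected rows together with their reasons means they can be corrected and re-submitted rather than silently dropped.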
Business Intelligence Applications.
Figure 23 BI Platform
As we can see from Figure 23 BI Platform, the Business Intelligence system is a collection of
Microsoft SQL Server services. The BI model is used to draw data from disparate sources and present
a single unified view of the data in the form of a data mart. The BI toolkit allows complex and
demanding data processes, such as data conversions and error validation, to be undertaken simply.
The output data mart would then be interrogated by the user utilising the Microsoft Office suite of
applications, some of which are shown in Figure 24 Microsoft Office Integration.
Figure 24 Microsoft Office Integration
Labels in the figure: Visio, Word (doc), Excel (xls), PowerPoint (PPT), Access (mdb, accdb), XML,
Project, Outlook.
The use and application of the Office suite provides business decision makers with a familiar
interface from which to build questions and queries of the data mart.
Slicing and Dicing the Cube
We start with a data point, D1.
Figure 25 Single Data Point
We then extend the data point into one dimension of data, where each data point (D1, D2, D3)
carries:
• Customer ID
• Region ID
• Product ID
• Sales Value
• Date ID
Figure 26 one Dimension of Data
Figure 27 Two Dimensions of Data
Then we construct a cube; examples of this would be:
Cubic on Date: X = Date, Y = Customer, Z = Product
Cubic on Amount: X = Customer, Y = Product, Z = Amount
Figure 28 Three Dimensional Data
Finally we build our cube, with the following dimensions:
• Date: Date ID, Date
• Geographic: Region, Country
• Product: Product ID, Product Name
• Customer: Customer ID, Customer Name
Reports
Reports are simple to produce using Excel, Access or even Word.
Ad hoc Queries
Ad hoc queries become simple to create, as the star schema data mart is specifically designed with
these in mind.
OLAP
As already stated a number of times, “in a Data Warehouse decision support environment we are
interested in the big picture” (Saeed, 2011); we want to look at the data from a macro level
instead of the micro level. For a macroscopic view aggregates are used.
In this example we look at the sales volume, i.e. the number of items sold, as a function of:
o product
o time
o geography
Note that all three of them are dimensions.
To proceed with the analysis, a cube structure is first created such that each dimension of the
cube corresponds to an identified dimension, and within each dimension is the corresponding
hierarchy. The example further shows how the dimensions are “rolled out”, i.e. province into
divisions, then division into district, then district into city and finally cities into zones.
Note that weeks can be rolled into a year, and at the same time months can be rolled into quarters
and quarters rolled into years. Based on these three dimensions a cube is created.
Cube Operations
• Rollup: summarise data
e.g. given sales data, summarise sales for last year by product category and region
• Drill down: get more details
e.g. given summarised sales as above, find the breakdown of sales by city within each region, or
within province or county
• Slice and dice: select and project
e.g. sales of soft-bicycles in Chester during last quarter
• Pivot: change the view of data
These operations determine how we view the data stored in a cube. There are four fundamental cube
operations, which are:
(i) rollup
(ii) drill down
(iii) slice and dice
(iv) pivoting
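The four operations can be sketched over a handful of fact rows held in memory. This is only an illustration in plain Python, not how an OLAP server stores or queries a cube; the rows, field names and helper functions are all hypothetical.

```python
from collections import defaultdict

FIELDS = ("year", "quarter", "category", "subcategory", "region", "city", "sales")
raw_facts = [
    (2010, "Q4", "Bikes", "Road Bikes", "North West", "Chester", 1200.0),
    (2010, "Q4", "Bikes", "Mountain Bikes", "North West", "Liverpool", 800.0),
    (2010, "Q4", "Accessories", "Helmets", "North West", "Chester", 90.0),
    (2010, "Q3", "Bikes", "Road Bikes", "South East", "London", 1500.0),
]
rows = [dict(zip(FIELDS, fact)) for fact in raw_facts]

def rollup(rows, *dims):
    """Summarise sales by the chosen dimension levels."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["sales"]
    return dict(totals)

def slice_dice(rows, **conditions):
    """Select only the rows matching every given dimension value."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

def pivot(totals):
    """Swap the two axes of a two-dimensional summary."""
    return {(b, a): value for (a, b), value in totals.items()}

by_category_region = rollup(rows, "category", "region")          # rollup
by_category_city = rollup(rows, "category", "region", "city")    # drill down one level
chester_q4 = slice_dice(rows, city="Chester", quarter="Q4")      # slice and dice
by_region_category = pivot(by_category_region)                   # pivot

print(by_category_region)
print(by_category_city)
print(sum(r["sales"] for r in chester_q4))
print(by_region_category)
```

Drilling down is simply a rollup with one more level of the hierarchy included, which is why the same helper serves both operations.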
Data Mining
Knowledge discovery and data mining is the process of automatically searching large volumes of data
for patterns using tools such as classification, association rule mining, clustering, etc.
Data mining commonly involves four classes of task:
(i) Classification
Arranges the data into predefined groups. For example, an email program might attempt
to classify an email as legitimate or spam. Common algorithms include nearest neighbour,
naive Bayes classifier and neural network.
(ii) Clustering
Is like classification but the groups are not predefined, so the algorithm will try to group
similar items together.
(iii) Regression
Attempts to find a function which models the data with the least error. One possible
method is genetic programming.
(iv) Association rule learning
Searches for relationships between variables. For example, a store might gather data on
what each customer buys. Using association rule learning, AdventureWorks Ltd can work
out what products are frequently bought together, which is useful for marketing
purposes. This is sometimes referred to as “market basket analysis”.
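As a small illustration of market basket analysis, the sketch below counts how often pairs of products appear in the same basket (support) and how often one product is bought given another (confidence). The baskets are invented, and a real data mining tool would use a dedicated algorithm rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: the set of products in each customer order.
baskets = [
    {"Road Bike", "Helmet", "Bottle"},
    {"Road Bike", "Helmet"},
    {"Mountain Bike", "Helmet"},
    {"Road Bike", "Bottle"},
]

def pair_support(baskets):
    """Fraction of baskets containing each pair of products."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
road_bike_to_helmet = confidence(baskets, "Road Bike", "Helmet")
print(support)
print(road_bike_to_helmet)
```

Pairs with high support and high confidence are the candidates for cross-selling or bundled promotions.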
Decision Support
Figure 29 Decision Support
The figure shows the data sources D1 Australia, D2 Germany, D3 UK and D4 US feeding the following
stages: Original Data, Data Integration & Selection, Target Data, Pre-processing, Processed Data,
Model Construction, Patterns, Interpretation, Knowledge.