A COURSEWARE OF ONLINE ANALYTICAL PROCESSING CUBE
A Project
Presented to the faculty of the Department of Computer Science
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Computer Science
by
Sudha Chakravarthy
FALL
2014
© 2014
Sudha Chakravarthy
ALL RIGHTS RESERVED
A COURSEWARE ON ONLINE ANALYTICAL PROCESSING CUBE
A Project
by
Sudha Chakravarthy
Approved by:
__________________________________, Committee Chair
Dr. Meiliu Lu
__________________________________, Second Reader
Dr. Ying Jin
____________________________
Date
Student: Sudha Chakravarthy
I certify that this student has met the requirements for format contained in the University
format manual, and that this project is suitable for shelving in the Library and credit is to
be awarded for the project.
__________________________, Graduate Coordinator
Dr. Nikrouz Faroughi
Department of Computer Science
___________________
Date
Abstract
of
A COURSEWARE OF ONLINE ANALYTICAL PROCESSING CUBE
by
Sudha Chakravarthy
OLAP is short for Online Analytical Processing. It provides a multidimensional view
of data and answers multidimensional queries using database tables. OLAP allows
users to analyze and view data along different dimensions and helps them make strategic
and tactical decisions based on the information stored in data warehouses.
The main objective of this project is to provide students with a web-based interactive
tutorial on OLAP and its operations with live examples and exercises. The courseware on
OLAP will be available to all interested users around the world.
The courseware helps students learn the concepts of OLAP in practice by running
open sample queries on example data. Students also get an opportunity to work on
exercises based on the examples demonstrated to them. The courseware uses data that has
been extracted from various sources and loaded into SQL tables so that open queries can
be run on the data to demonstrate the OLAP operations. The
website is built with HTML, PHP and MySQL.
Anyone with internet access is free to use this tool; however, certain portions of the
website have limited access and may ask for the user's credentials.
The OLAP courseware is integrated with two other courseware packages created earlier so
that students have easy access to all the concepts regarding data mining and data
warehousing from a single website. The OLAP courseware was used and evaluated
positively in the CSC 177 Data Mining and Data Warehousing class at CSU Sacramento in
fall 2014.
_______________________, Committee Chair
Dr. Meiliu Lu
_______________________
Date
ACKNOWLEDGEMENTS
I would like to take this opportunity to thank all the people who have helped me
through this process.
My heartfelt thanks and a special mention go to Dr. Meiliu Lu for giving me the
opportunity to work under her on my master's project. She has been very patient and
encouraging and guided me through the entire process. Her detailed feedback was really
helpful during my project design and development. My sincere thanks to Prof. Ying Jin
for being my second reader. I also take this opportunity to thank Dr. Nikrouz Faroughi
for his review of the project.
Finally, I would like to thank my husband for his full support, motivation and
encouragement during this endeavor.
TABLE OF CONTENTS
Acknowledgements
List of Tables
List of Figures
Chapter
1. INTRODUCTION
2. BACKGROUND
    2.1 Need for the Tool
    2.2 Scope
    2.3 Project Development and Design
    2.4 Technology Used
3. OLAP LEARNING TOOL ARCHITECTURE
4. ROAD MAP FOR OLAP COURSEWARE
    4.1 Multidimensional Modeling
    4.2 OLAP Cube
    4.4 Data Models in the OLAP World
    4.5 OLAP Operations
    4.6 Examples
    4.7 Exercise
        4.7.1 Authentication
    4.8 Quiz
5. LEARNING BY EXAMPLES
    5.1 Data Source
    5.2 Data Preprocessing
    5.3 Purpose of the Data Mart
    5.4 Data Mart Design
        5.4.1 Concept Hierarchy
    5.5 Open Queries and Results of OLAP Operations
6. COURSEWARE INTEGRATION
7. COURSEWARE EVALUATION
8. CONCLUSION
Appendix A Rollup Report
Appendix B Roll Down Report
Appendix C Slicing Report
Appendix D Scoping Report
Appendix E Dicing Report
Bibliography
LIST OF TABLES
Tables
Table 1 Cube representing recorded temperatures
Table 2 Rollup operation on recorded temperatures
Table 3 Roll down operation - Reduction of time hierarchy
Table 4 Slicing operation on mild dimension
Table 5 Dicing operation on cool and mild dimensions
Table 6 Preprocessed movie and award data set
Table 7 Fact table
Table 8 Year dimension table
Table 9 Movie dimension table
Table 10 Award type dimension table
Table 11 Award Categories dimension table
Table 12 Names dimension table
Table 13 Single table dataset
Table 14 Roll up using single table dataset
Table 15 Roll up result using star schema
Table 16 Atonement movie awards
Table 17 Roll down on atonement movie
Table 18 Slice operation on movie and year
Table 19 Dice operation on movie, year and name
Table 20 Scope operation using year dimension
LIST OF FIGURES
Figures
1. Data warehouse as a buffer between OLTP and OLAP
2. OLAP courseware architecture
3. Courseware main page
4. Steps in the OLAP creation process
5. OLAP cube with TIME and PRODUCT as dimensions
6. Pivot
7. Exercise page
8. Snapshot of Quiz page
9. Data Mart Design
10. Star schema for movie and awards data mart
11. Concept Hierarchy
12. Snapshot of the integrated courseware
13. Feedback results for questions 1, 5 and 6
14. Feedback for questions 7, 8 and 9
Chapter 1
INTRODUCTION
A data warehouse is a centralized repository that stores data from multiple information
sources and transforms it into a common, multidimensional data model for efficient
querying and analysis. Data warehouses contain consolidated data from several
databases and other data sources, of varying sizes, collected over long periods of time.
Data collected over long periods of time and stored in a single repository is
called historical data. Data warehousing has become generally accepted as the best
approach for providing an integrated, consistent source of data for use in data analysis
and business decision making. However, data warehousing can present complex issues
and require significant time and resources to implement [1]. For such a large data
warehouse, query throughput and response times are very important. To facilitate
complex analyses, data in a data warehouse is typically modeled in a multidimensional
fashion. Modeling data in a multidimensional manner makes it simpler, more
expressive and easier to understand, so that users can make decisions easily. As
the name indicates, multidimensional modeling concentrates on dimensions, facts and
measures.
Data can be processed in two ways: Online Transaction Processing (OLTP) and
Online Analytical Processing (OLAP). OLTP deals with data that is used for transactions
and has many operations applied to it. For example, an ATM
works with the current data of concurrent users; the PIN, the amount
withdrawn and the balance are the data elements. OLTP allows users to insert, update and
delete data, otherwise known as small online transactions.
OLAP deals with large amounts of historical data, answers multidimensional queries and
gives users a way to view data along different dimensions. Since OLAP deals
with a large amount of historical data, response time is an essential measure. For
example, if an airline company wants to set a new price for a flight, it can use five years
of historical flight reservation data, such as peak hours of travel and cost, to help
decide the price. The data in an OLAP database is
usually organized in a schema such as a star schema.
A data warehouse acts as a buffer between OLTP and OLAP. OLTP systems contain the data of
concurrent users. OLAP systems work with historical data collected over long periods of
time and model that data for easier viewing, helping users make better decisions. The data
warehouse holds both the historical data and the concurrent data. It reconciles the
data and processes it so that OLAP systems can derive their data from it. Figure 1
shows how a data warehouse acts as a buffer between OLAP and OLTP.
[Figure: OLTP systems (concurrent/operational data) -> data warehouse (reconciled data) -> OLAP systems (historical/derived data)]
Figure 1 Data warehouse as a buffer between OLTP and OLAP.
Figure 1 shows that the data warehouse obtains incoming data from multiple
OLTP sources and supplies it as reconciled data to OLAP applications.
This courseware focuses on OLAP and its operations and aims to
provide a practical learning experience about OLAP for any user. This chapter
gave an overview of data warehouses, OLTP and OLAP systems. Chapter 2
provides background information, the need for the courseware and the scope of the project.
Chapter 3 presents the architecture of the OLAP courseware. Chapter 4 discusses the
implementation and design of the courseware in detail. Chapter 5 explains
examples of OLAP operations with demonstrations and supporting code.
Chapter 6 explains the need for integrating the courseware with the other courseware
created for the data mining and data warehousing course. Chapters 7 and 8 cover the
evaluation of the courseware by students taking the data mining and data warehousing
course and the conclusion of the report.
Chapter 2
BACKGROUND
OLAP helps analyze the historical data stored in a data warehouse over long periods of
time across various dimensions. Users can thus view the data along various
dimensions, analyze it and make strategic decisions.
2.1 Need for the Tool
The main motivation behind creating this courseware is that there is no live schema that
demonstrates to students how OLAP operations work. Many websites
explain OLAP operations theoretically. Though the theoretical explanations are
clear enough for a user to understand the concepts of OLAP, a practical demonstration
gives users insight into how OLAP operations work
in the real world.
2.2 Scope
The courseware is for users who already have some knowledge of databases
and would like to learn in depth about OLAP and its operations. OLAP comprises
several operations that help the user view the data in the data warehouse along different
dimensions. The courseware offers users self-paced study of the topic by
providing examples followed by exercises. The example for each OLAP
operation is demonstrated using open sample queries. The answers for each OLAP
operation are retrieved from the database and displayed as tables for easier
viewing. The examples are followed by exercises that encourage the user to
work out questions based on the examples demonstrated. The examples and
exercises are followed by an online quiz that helps users evaluate their knowledge of the
topic.
2.3 Project Development and Design
The development of the project took place in two phases:
a) RESEARCH AND ANALYSIS: Before developing the courseware, considerable
research was done to understand the concepts of OLAP and its operations. To
demonstrate the OLAP operations, I also had to choose a dataset that would interest
students in the courseware. The dataset chosen to illustrate the examples is a
movie and award dataset, created from IMDb and Wikipedia. The
dataset shows the movies and nominees that won awards in
each year. I chose another dataset, the car and truck sales data, for the
exercise questions.
b) DESIGN AND DEVELOPMENT: The design of the project was divided into
three parts. The first part is the front end,
where users see the layout of the courseware with information and tabs
leading to separate pages that explain the related concepts. The second
part is the datasets used in the project. These datasets underwent certain data
preprocessing steps before being loaded into the database. The third part is the back end,
where data from the database is displayed to the user based on the query issued. The
datasets are loaded into their respective tables in the database. The example queries
demonstrating each OLAP operation are executed against the data, and the results
are displayed to the user as tables.
The framework of the courseware can be divided into three parts:
a) Front end
b) Data
c) Back end
2.4 Technology Used
This section describes the technologies used to develop the OLAP
courseware. As mentioned above, the courseware can be divided into three layers; the
technologies used in each layer are discussed below.
FRONT END: The front end is the main page that users first see when they
access the courseware from any web browser. The technologies used to build
the front end were PHP and HTML. The main reason for choosing PHP is its
compatibility with MySQL, which makes data retrieval and information display easy.
The main features of PHP include:
• Open source
• Compatibility with MySQL
• Online library support
• Easily embedded into HTML
BACK END: The back end is not accessed by the user directly; it is accessed
through the front end. The back-end technologies used in this project are:
MS Access was used to load all the data as a dataset. Loading the data into MS Access
makes it easy to insert the data into the database tables. All data cleansing was
performed in MS Access.
MySQL Workbench was used to create the database and tables. Data from the respective
Excel sheets was loaded into the database tables. Workbench has a feature called
Import/Export data, which allows the data in Excel sheets to be imported directly into the
database tables. Right-clicking a table name lets us copy the query
that creates the table and the inserts made into it. PHP and MySQL
are widely used on the California State University, Sacramento campus, so a command
prompt was used to access MySQL on remote servers.
The main features of MySQL for this project include:
• Ease of use: Only basic knowledge of SQL is required
• Security: MySQL protects sensitive data from intruders
• Scalability: MySQL can handle large amounts of data, for example 300-400 million rows
Chapter 3
OLAP LEARNING TOOL ARCHITECTURE
This chapter describes the architecture implemented in the courseware. The architecture
of the OLAP courseware is divided into four layers for the convenience
of the users who access it: the presentation layer, the logic layer, the
database layer and the data layer. Figure 2
shows the four layers:
[Figure: block diagram with the labels Browser (IE, Mozilla Firefox), PHP/HTML, Database (MySQL) and Dataset (MS Excel) across the presentation, logic, database and data layers]
Figure 2 OLAP courseware architecture
Presentation layer: The presentation layer is the outermost layer of the courseware. The
web interface that serves as the front end is built using PHP and HTML. The
presentation layer is organized into tabs that give access to the information. The
subject is organized into introduction, multidimensional modeling, OLAP operations in
detail, examples, exercise, quiz and references. The content of the courseware includes
diagrams and attachments that support the information provided.
This user-friendly web interface also supports report generation in tabular form
for easy understanding by the user.
Logic layer: This layer acts as a middleman between the data warehouse and
the web interface and takes care of everything that happens behind the web
interface. The logic layer, also called the business logic layer,
retrieves information and passes the information sent from the user interface to the
other layers in the architecture. It also checks the consistency of the data presented on
the web interface. A connection was configured between the
front end and the database. The code that establishes the connection is:
<?php
// mysqli replaces the mysql_* functions, which were removed in PHP 7; the arguments are placeholders.
$link = mysqli_connect("server url", "username", "password") or die(mysqli_connect_error());
echo "Connected to MySQL<br />";
?>
Database layer: This is the layer where the stored data is retrieved for display to the
user in tabular form. The database layer was built using MySQL
Workbench. The courseware uses the movie and award data as the main data for
illustrating the OLAP operations. The schema (database) is created and the data is
loaded into individual tables. When the user presses the "execute query"
button under an illustration of an OLAP operation, this layer processes the
query and retrieves the information stored in the relevant tables. This layer also
stores the results of the analyzed data.
Data layer: This layer contains the raw dataset extracted from
various sources across the web. The raw dataset is stored in MS Excel files. The
data undergoes some preprocessing before it is loaded into the database. All the files had
to be carefully analyzed to choose the correct attributes for the final data. Moreover,
the data had to be cleaned and missing values filled in. Careful decisions had to be
made about which data to keep and which to delete.
A layered architecture offers several advantages. By
adding a business logic layer, the courseware's framework is made
thinner, which allows information to flow quickly to the users. The courseware
can be accessed by any number of users, so the system has to be scalable, which the
n-tier architecture helps achieve. A scalable website helps sustain good
performance. The n-tier architecture also helps secure the data stored in the
database.
This chapter described the framework of the courseware. The
upcoming chapters provide in-depth information about the courseware and its
components.
Chapter 4
ROAD MAP FOR OLAP COURSEWARE
This chapter describes the OLAP courseware in detail and explains
how to use it. The courseware may not
provide every feature a user wants, but it is a learning platform for
users to obtain basic knowledge of OLAP and its operations. The courseware contains
six main tabs that cover all the information about OLAP.
Figure 3 Courseware main page
The layout of the courseware is simple and user friendly. The following sections
explain the main topics covered in the courseware.
4.1 Multidimensional Modeling
Normally data is distributed across several data sources that are not compatible with each
other. An example of incompatible data is customer age, which is stored as a birth date for
purchases made online but as an age category for in-store sales.
OLAP helps extract data from various data sources and make them
compatible, ensuring that the meaning of the data in one data store matches that in all
other data stores. OLAP collects historical data and makes it compatible so that users can
analyze it and make tactical decisions; without OLAP it would be hard to obtain reports
such as "What are the popular products purchased by customers between the ages of 20 and 30?"
This type of data organization is called multidimensional modeling.
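The customer age example can be made concrete with a small sketch. It is written in Python rather than the courseware's PHP only so that it runs without a server. The function names `age_on` and `age_category`, the bracket boundaries and the sample records are illustrative assumptions and are not taken from the courseware.

```python
from datetime import date

def age_on(birth_date, today):
    """Return the age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

def age_category(age):
    """Map an age in years to a bracket label (boundaries are assumed)."""
    if age < 20:
        return "under 20"
    if age <= 30:
        return "20-30"
    return "over 30"

# Hypothetical records: online purchases store a birth date,
# in-store sales already store an age category.
online = [{"product": "soap", "birth_date": date(1990, 5, 17)}]
in_store = [{"product": "soap", "age_category": "20-30"}]

today = date(2014, 12, 1)  # fixed date so the result is reproducible

# Bring both sources to the same representation (an age category).
reconciled = [
    {"product": r["product"],
     "age_category": age_category(age_on(r["birth_date"], today))}
    for r in online
] + in_store
print(reconciled)
```

Once both sources use the same age category, a report such as the one on customers aged 20 to 30 can be computed over the combined records.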
[Figure: Extract data from sources -> Transform data -> Import to database -> Build cube -> Provide report]
Figure 4 Steps in the OLAP creation process
To help the user analyze and view the data, OLAP uses tables from the databases.
This lets the user view the data in a multidimensional way and
analyze and query it to make logical and tactical decisions.
Data that is to be viewed multidimensionally should be stored as an array or cube.
4.3 OLAP Cube
OLAP cubes are designed mainly to retrieve data efficiently. A cube can be
defined as a group of data cells arranged by dimensions and measures. Dimensions
define how we want to view the data, for example time and product; measures
are items that can be counted, summarized or aggregated, such as sales amount.
A cube can also link data from different dimensions and helps the user view the data
along different dimensions.
[Figure: cube with Time and Product axes]
Figure 5 OLAP cube with TIME and PRODUCT as dimensions.
A data cube is created as follows:
Select dimensions: First, select a set of dimensions, which is a subset of the
attributes. For example, in the cube above, time and product were selected as
dimensions, which gives a two-dimensional data cube.
Select concept hierarchy levels: Next, select a subset for each dimension, for example
weeks for time and soap for product. Finally, populate the cube by selecting a
measure.
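The steps above can be sketched in a few lines. This is a minimal illustration in Python, not the courseware's implementation. The fact rows and their values are made up, with time (at the week level) and product as the two dimensions and sales amount as the measure.

```python
from collections import defaultdict

# Hypothetical sales facts: (week, product, sales_amount).
facts = [
    ("week 1", "soap", 10.0),
    ("week 1", "shampoo", 4.0),
    ("week 2", "soap", 7.5),
    ("week 1", "soap", 2.5),
]

# Each cell of the cube is addressed by one value per dimension
# and holds the aggregated measure (here, the total sales amount).
cube = defaultdict(float)
for week, product, amount in facts:
    cube[(week, product)] += amount

for (week, product), total in sorted(cube.items()):
    print(week, product, total)
```

The two facts for soap in week 1 fall into the same cell, so the cube holds three cells for the four input rows.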
4.4 Data Models In The OLAP World
Cubes can be stored in three different models:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
a) ROLAP:
To provide fast response times to OLAP applications, super-relational features are added
to a traditional RDBMS to form a super-relational database system. The data in this model
is stored in relational databases. Relational OLAP follows dimensional modeling, which
organizes data into measures stored in fact tables and dimensions stored in
dimension (satellite) tables.
ROLAP uses data marts. Data marts are databases that share many of the features of data
warehouses but are smaller in scope.
Data can be stored in two ways in a relational model:
i) Star schema: Storing data in a star schema helps overcome usability and
performance problems. A star schema consists of one fact table and several dimension
tables.
ii) Snowflake schema: A snowflake schema is created to avoid the redundancy and wasted
storage caused by the denormalized structure of dimension tables. A snowflake schema
is a normalized star schema.
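As a minimal sketch of a star schema, the following example builds one fact table and two dimension tables and joins them. It uses Python's standard sqlite3 module with an in-memory database instead of MySQL so that it is self-contained. The table names `fact_sales`, `dim_time` and `dim_product`, their columns and the rows are hypothetical and do not reproduce the courseware's movie and award data mart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, week TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_time    VALUES (1, 'week 1'), (2, 'week 2');
INSERT INTO dim_product VALUES (1, 'soap'), (2, 'shampoo');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 4.0), (2, 1, 7.5);
""")

# A typical star join: the fact table is joined to each dimension table
# and the measure is aggregated per combination of dimension values.
rows = conn.execute("""
    SELECT t.week, p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_time t    ON t.time_id = f.time_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY t.week, p.name
    ORDER BY t.week, p.name
""").fetchall()
print(rows)
```

In a snowflake schema, a dimension table such as `dim_product` would itself be split into further normalized tables, at the cost of extra joins.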
b) MOLAP:
MOLAP is a special purpose data model in which multidimensional data and operations
are mapped directly.
c) HOLAP:
HOLAP is a combination of ROLAP and MOLAP. HOLAP allows storing part of the
data in a MOLAP repository and another part in a ROLAP repository, utilizing
the advantages of each.
4.5 OLAP Operations
This section covers the main OLAP operations. First, we need to
understand certain concepts. In a multidimensional data model, the data is organized into
multiple dimensions, and each dimension contains multiple levels of abstraction defined
by a concept hierarchy. This organization gives users the flexibility to view data
from various perspectives. For example, suppose we have
attributes such as day and temperature. Climbing up the concept hierarchy for day
gives us week, and grouping the temperature values into subsets
gives us hot, mild or cool temperatures.
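A concept hierarchy can be thought of as a mapping from a detailed value to a more general one. The sketch below, in Python, illustrates this for the day and temperature attributes. The function names and the cut-off values of 65 and 80 degrees are assumptions made for illustration only.

```python
def week_of(day):
    """Climb the time hierarchy: day number (1-based) -> week number."""
    return (day - 1) // 7 + 1

def temperature_level(degrees):
    """Map a reading to a level; the cut-off values are assumed."""
    if degrees >= 80:
        return "hot"
    if degrees >= 65:
        return "mild"
    return "cool"

print(week_of(1), week_of(7), week_of(8))
print(temperature_level(85), temperature_level(70), temperature_level(60))
```

Days 1 to 7 map to week 1 and day 8 starts week 2, so several detailed values collapse into one value at the higher level.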
OLAP provides a user-friendly platform for interactive data analysis. OLAP
operations help display views of the data along different dimensions, thus
allowing interactive query and analysis of the data.
a) ROLL UP
The roll up operation performs aggregation on a data cube, either by climbing up a
concept hierarchy for a dimension or by dimension reduction.
To understand roll up, consider the following cube with
temperatures of certain days recorded weekly:
Table 1 Cube representing recorded temperatures
If we want to set levels in temperature, such as mild and cool, for the above
cube, we need to group the columns and add the values according to the concept
hierarchy. As a result we obtain the cube below:
Table 2 Rollup operation on recorded temperatures
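In SQL, a roll up of this kind corresponds to grouping by the higher level of the hierarchy and aggregating. The following sketch uses Python's standard sqlite3 module. The `readings` table, its rows and the cut-off values for the temperature levels are hypothetical and do not reproduce the values of Table 1 or Table 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (week INTEGER, day INTEGER, temperature INTEGER)")
# Hypothetical readings; they do not reproduce Table 1.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, 1, 64), (1, 2, 70), (1, 3, 72), (2, 8, 68), (2, 9, 75), (2, 10, 85)],
)

# Roll up: replace each temperature by its level (assumed cut-offs)
# and count the days that fall into each level per week.
rollup = conn.execute("""
    SELECT week,
           CASE WHEN temperature >= 80 THEN 'hot'
                WHEN temperature >= 65 THEN 'mild'
                ELSE 'cool' END AS temp_level,
           COUNT(*) AS days
    FROM readings
    GROUP BY week, temp_level
    ORDER BY week, temp_level
""").fetchall()
print(rollup)
```

The six detailed rows are reduced to four aggregated rows, one per week and temperature level.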
b) ROLL DOWN
The roll down operation is the opposite of the roll up operation. It is also
called drill down. Its purpose is to navigate from
less detailed data to more detailed data.
The roll down operation can be realized in two ways:
a) Stepping down a concept hierarchy for a dimension
b) Introducing new dimensions
To perform roll down on the cube represented by Table 1,
we can descend the time hierarchy from the level of week to the more detailed level of
day. New dimensions can also be added to the cube, as drill down adds more detail to
the data. The table below shows the result of the drill down operation performed on that
cube.
Table 3 Roll down operation - Reduction of time hierarchy
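Stepping down the time hierarchy can be sketched as adding the finer level to the grouping of a query. The example below uses Python's standard sqlite3 module with a hypothetical `readings` table and made-up values. It is an illustration under these assumptions, not the query used in the courseware.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (week INTEGER, day INTEGER, temperature INTEGER)")
# Hypothetical readings, two days per week.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, 1, 64), (1, 2, 70), (2, 8, 68), (2, 9, 75)],
)

# Week level: one aggregated row per week.
by_week = conn.execute(
    "SELECT week, AVG(temperature) FROM readings GROUP BY week ORDER BY week"
).fetchall()

# Roll down: add the finer 'day' level to the grouping to expose the detail.
by_day = conn.execute(
    "SELECT week, day, AVG(temperature) FROM readings "
    "GROUP BY week, day ORDER BY week, day"
).fetchall()
print(by_week)
print(by_day)
```

The week-level result has two rows, while the day-level result has one row for each recorded day.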
c) SLICING
Slice performs a selection on one dimension of the given cube, resulting in a
sub cube. For example, if we select on a single dimension, i.e. temperature =
mild, in the example cube shown in Table 1, we obtain the following cube:
Table 4 Slicing operation on mild dimension
d) DICING
The dicing operation performs a selection on two or more dimensions and obtains a sub
cube. For example, applying the selection time as day 3 or day 4 and temperature as cool
or mild to the original cube, we get a two dimensional subcube as shown in the table
below:
Table 5 Dicing operation on cool and mild dimensions
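The slice and dice operations described above can be sketched in a few lines of Python.
The cube below is hypothetical: it maps a (time, temperature level) pair to a made-up
count, and the function names are chosen only for illustration.

```python
# Hypothetical cube: a value per (time, temperature level) cell.
cube = {
    ("day1", "cool"): 2, ("day1", "mild"): 1, ("day1", "hot"): 0,
    ("day2", "cool"): 0, ("day2", "mild"): 3, ("day2", "hot"): 1,
    ("day3", "cool"): 1, ("day3", "mild"): 2, ("day3", "hot"): 2,
    ("day4", "cool"): 0, ("day4", "mild"): 1, ("day4", "hot"): 4,
}


def slice_cube(cube, temperature):
    """Slice: fix the temperature dimension to a single value."""
    return {time: value for (time, temp), value in cube.items() if temp == temperature}


def dice_cube(cube, times, temperatures):
    """Dice: select on two dimensions at once, keeping both in the result."""
    return {
        (time, temp): value
        for (time, temp), value in cube.items()
        if time in times and temp in temperatures
    }


mild_slice = slice_cube(cube, "mild")
sub_cube = dice_cube(cube, {"day3", "day4"}, {"cool", "mild"})
print(mild_slice)
print(sub_cube)
```

The slice loses the temperature dimension because it is fixed to one value, whereas the
dice keeps both dimensions and only narrows their ranges.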
e) Pivot:
Pivot, also called rotate, changes the dimensional orientation of the cube. It rotates the
axes of the data so that the data can be viewed from different perspectives. Pivot
regroups the data along different dimensions. The figure below illustrates pivot.
[Figure axes: TEMPERATURE, TIME]
Figure 6 Pivot
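A pivot can be sketched as swapping the row and column axes of a two-dimensional view.
The snippet below is only an illustration; the table contents and the function name
pivot are hypothetical.

```python
def pivot(table):
    """Rotate a two-dimensional view by swapping its row and column axes."""
    rotated = {}
    for row_key, columns in table.items():
        for column_key, value in columns.items():
            rotated.setdefault(column_key, {})[row_key] = value
    return rotated


# Hypothetical view with TIME on the rows and TEMPERATURE on the columns.
by_time = {"week1": {"cool": 1, "mild": 3}, "week2": {"cool": 2, "mild": 2}}
by_temperature = pivot(by_time)
print(by_temperature)
```

Pivoting twice returns the original orientation, which shows that the operation changes
only the presentation and not the data.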
f) Scoping:
The scoping OLAP operation restricts the view of database objects to a specified subset.
g) More OLAP operations:
There are some more operations in the OLAP world, such as:
Screening: Restricting the view of a set of data by screening some of the members of a
dimension.
Drill Across: Accesses more than one fact table, where the fact tables are linked by
common dimensions.
Drill Through: Drills down from the bottom level of a data cube to its underlying
relational tables.
4.6 Examples
The examples page in the courseware contains live demonstrations of some of the OLAP
operations. I have chosen a movie and award data mart to demonstrate the examples. Data
from sources such as IMDb and Wikipedia was collected in an Excel sheet and then
underwent preprocessing operations such as data cleaning. After data cleaning, the data
was loaded into SQL tables using MySQL Workbench and organized into a star schema by
defining fact and dimension tables. The OLAP operations are then demonstrated by running
queries on the database, with the results reported in tabular form. The queries are
displayed to the users as open queries for their learning. The examples page lists the
OLAP operations; clicking an operation shows its demonstration on a separate page.
Chapter 5 provides a detailed explanation of the examples page in the courseware.
4.7. Exercise
This section of the courseware lets the students do exercises based on the examples
demonstrated to them. A separate data set was collected for this purpose; it contains
information on car and truck sales in the USA for particular years. Users can download
the data and add more data to it. They can also perform data mining operations on the
data. The users can create their own database and answer the questions provided for each
OLAP operation, using MySQL or any other database platform to execute queries on the
data. The exercise part was designed to provide a practical learning experience for the
users. The picture below shows a snapshot of the exercise page.
Figure 7 Exercise page
4.7.1 Authentication
The exercises come with answers and queries, but only privileged users are able to view
the answers.
4.8 Quiz
Finally, the courseware has a quiz section, which gives users the opportunity to evaluate
themselves by answering a set of quiz problems.
Figure 8 Snapshot of Quiz page
Chapter 5
LEARNING BY EXAMPLES
As we saw in chapter 4, the examples page and the exercise page of the OLAP courseware
provide practical learning about OLAP and its operations. On the examples page, each OLAP
operation is linked to a separate page that explains what that operation is. Open queries
are provided to demonstrate each OLAP operation, so that data can be viewed along various
dimensions with the result displayed in tabular form. On the exercise page a data set is
provided to the users, so that they get an opportunity to work with the data and answer
the questions posted on the webpage under each OLAP operation.
5.1 Data Source
The data set for the examples page is called movie and awards. It was collected from IMDb
and Wikipedia and describes movies and the awards they won. The data set for the exercise
page is called car and truck sales data. The collected data is stored in an Excel file.
5.2 Data Preprocessing
The initial dataset contains ambiguous and incomplete entries, so the data has to be
cleaned and preprocessed. It also has to be carefully analyzed to choose the correct
attributes for the final data, so the file containing the data must be free of erroneous
data. Data cleaning can be done in the following ways:
a) Blank substitution
Here the dataset has incomplete data: some of the attribute values are left blank. To
fill in the missing values, I had to search on Google and find the values that matched
the attributes of the exact rows and columns.
b) Special characters
Some of the data in the dataset contained special characters like "?" in place of real
data. As mentioned above, these values have to be replaced by finding the exact data that
matches the other data in the file.
c) Unwanted data
It is important to delete or discard unwanted data that is not required for the creation
of the data mart. Careful decisions had to be made about which data had to be kept and
which had to be deleted.
d) Deleting unwanted attributes
The dataset also contained some unwanted attributes that had to be removed. MySQL was
very useful for deleting unwanted data.
The main purpose of performing data cleaning is that the data can be sorted and easily
loaded into database tables.
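The cleaning steps above can be sketched with a short Python script. This is only an
illustration under stated assumptions: the column names, the sample rows and the notes
attribute are hypothetical, and the project itself used Excel and MySQL rather than
Python for this work.

```python
import csv
import io

# Hypothetical raw export; the column names and values are made up.
RAW = """movie,year,award,notes
Film A,2005,Academy Award,x
Film B,?,Golden Globe,y
Film C,2007,,z
"""

# Values treated as missing: blanks and the "?" placeholder.
MISSING = {"", "?"}


def clean(raw_text, unwanted=("notes",)):
    """Drop unwanted attributes and split rows into complete and incomplete ones."""
    complete, incomplete = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        row = {key: value.strip() for key, value in row.items() if key not in unwanted}
        if any(value in MISSING for value in row.values()):
            incomplete.append(row)  # to be looked up and filled in by hand
        else:
            complete.append(row)
    return complete, incomplete


complete, incomplete = clean(RAW)
print(complete)
print(incomplete)
```

Rows flagged as incomplete are not discarded automatically, which mirrors the manual
lookup of missing values described in steps a) and b).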
Table 6 Preprocessed movie and award data set
5.3 Purpose of the Data Mart
A data warehouse is a large store of data accumulated from a wide range of sources within
a company and used to guide management decisions. A data mart is a segment of a data
warehouse that provides data for reporting. For example, when you want to go to a
restaurant, you would like to gather important information such as when the restaurant is
open, what cuisines it offers, and how highly it is rated. This project helps users learn
how OLAP operations work, using the movie and award data mart as an example resource. The
main motive behind choosing a movie and award dataset is that people are generally
interested in movies, so practical and interesting examples can increase their interest
in the courseware. This data mart contains Hollywood movies released from the year 2000
to the year 2013 that won awards in certain specific categories. The award types, such as
Golden Globe and Academy Award, are also collected in the data mart, and the award
categories, namely best actor, best actress, etc., are sorted and arranged. Over 700 rows
of such data were collected and preprocessed.
5.4 Data Mart Design
The main objective here is to provide a data mart design that processes queries easily
and answers the users' questions quickly. To make this possible we need a simple schema,
so the movie and award data mart is designed as a star schema. The block diagram below
shows the basic design of the data mart.
[Block diagram boxes: COLLECT DATA; DRAW DIMENSION TABLES; DESIGN CONCEPT HIERARCHY; DRAW FACT TABLES; DRAW STAR SCHEMA]
Figure 9 Data Mart Design
Choosing a star schema has several advantages over using a single table to perform the
OLAP operations. The two main reasons why a star schema approach was used are:
a) High efficiency in query performance, because queries retrieve exactly the necessary
rows from the fact table.
b) Simple structure.
Figure 10 Star schema for movie and awards data mart
The above figure shows the star schema for the movie and awards data mart with
"awardcollectiontable" as the fact table. A fact table consists of the measurements,
metrics or facts, and it is located at the center of a star schema surrounded by
dimension tables.
A snapshot of the fact table is shown below.
Table 7 Fact table
We can see that this fact table contains measures and foreign keys that refer to primary
keys in the dimension tables. That is, the primary keys (idyear, idmovie, idaward,
idcategory) of the dimension tables appear as foreign keys in the fact table. So the
above table ties together the entire data of the data mart, and its foreign keys
reference the data stored in the other tables. The foreign key columns contain key values
instead of the original data; this is done to prevent duplication and loss of data.
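The relationship between the fact table and its dimension tables can be sketched with
Python's built-in sqlite3 module. SQLite is used here only as a stand-in for MySQL so
that the sketch is self-contained. The table and key names follow the queries shown in
this chapter, but the column types, the sample rows and the reduced set of three
dimensions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Years  (idYear  INTEGER PRIMARY KEY, year  INTEGER);
CREATE TABLE movie  (idMovie INTEGER PRIMARY KEY, movie TEXT);
CREATE TABLE Awards (idAward INTEGER PRIMARY KEY, AwardType TEXT);
-- The fact table stores only key values that point into the dimension tables.
CREATE TABLE AwardCollectionFact (
    idYear  INTEGER REFERENCES Years(idYear),
    idMovie INTEGER REFERENCES movie(idMovie),
    idAward INTEGER REFERENCES Awards(idAward)
);
INSERT INTO Years  VALUES (1, 2005), (2, 2006);
INSERT INTO movie  VALUES (1, 'Film A'), (2, 'Film B');
INSERT INTO Awards VALUES (1, 'Academy Award'), (2, 'Golden Globe');
INSERT INTO AwardCollectionFact VALUES (1, 1, 1), (1, 1, 2), (2, 2, 1);
""")

# Joining on the keys recovers the readable values for every fact row.
rows = conn.execute("""
    SELECT Y.year, M.movie, A.AwardType
    FROM AwardCollectionFact AS ACF
    JOIN Years  AS Y ON ACF.idYear  = Y.idYear
    JOIN movie  AS M ON ACF.idMovie = M.idMovie
    JOIN Awards AS A ON ACF.idAward = A.idAward
    ORDER BY Y.year, M.movie, A.AwardType
""").fetchall()
print(rows)
```

Each movie title and award type is stored once in its dimension table, and the fact
table repeats only the small integer keys.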
The tables surrounding the fact table are called dimension tables. In data warehousing,
a dimension table is one of a set of companion tables to a fact table. There are five
dimension tables surrounding the fact table "awardcollectionfact": year, movie, awards,
award categories and winners. The year table, with idyear as primary key, consists of all
the years from 2001 to 2013. It references the movies released in these years and the
awards given to movies released only in the given year range.
Table 8 Year dimension table
The movie dimension table, with idmovie as primary key, consists of the list of names of
movies that received awards from the year 2001 to 2013. The table contains about 80
movies.
Table 9 Movie dimension table
The awards dimension table, with IdAward as primary key, consists of the classification
of the award given to the award winner for a particular movie in a respective field.
Eight main award types were chosen for this data mart. The most popular award types were
chosen because they can add to the users' interest in the courseware.
Table 10 Award type dimension table
The award categories dimension table, with Idcategories as primary key, consists of the
different categories of award, e.g. best director and best actor, given for work in a
particular movie in a particular year.
Table 11 Award categories dimension table
The final dimension table is the "names" dimension table. This table has idnames as
primary key and consists of a list of the names of winners who received awards for their
work in a movie. The table consists of 356 winner names.
Table 12 Names dimension table
All these tables are queried and the results show how OLAP operations work.
5.4.1 Concept Hierarchy
We have already seen a detailed explanation of concept hierarchy in chapter 4. Here we
look at the concept hierarchy of the movie and award database.
Figure 11 Concept Hierarchy
The above figure shows the concept hierarchy of the data mart. It shows that the year,
movie and names dimensions can climb up the concept hierarchy to award category and award
types. All these dimensions finally roll up to form the award collection.
5.5 Open Queries and Results of OLAP Operations
Now we need to understand how the OLAP operations work using the above database. Once the
tables are loaded into the database, open queries are executed to retrieve information
and demonstrate to the users how the OLAP operations work.
a) ROLL UP:
As we saw in chapter 4, a roll up operation summarizes data by performing aggregation on
a data cube in one of the following ways: 1) by climbing up a concept hierarchy for a
dimension, or 2) by dimension reduction.
We are going to see query executions for roll up in two ways:
i) Roll up using a single table
To understand what a single table is, we need to look at the snapshot below.
Table 13 Single table dataset
The table above contains data that summarizes all award winners from different award
categories for movies in a particular year. To obtain the total number of awards for a
movie in a particular year from the table above, we can use the following query:
SELECT
year, movie, COUNT(awardtype)
FROM
OLAP.AwardCollection
GROUP BY
year, movie WITH ROLLUP;
In the above query, the select clause selects the year and movie dimensions, and
COUNT(awardtype) counts the awards in each group. The from clause specifies that the year
and movie dimensions are selected from the AwardCollection table of the OLAP database
created using MySQL. The group by clause groups the result set by year and movie, and the
WITH ROLLUP modifier adds summary rows: a subtotal for each year, in which movie is NULL,
and a grand total, in which both year and movie are NULL. The table below shows the query
output for roll up using a single table.
Table 14 Roll up using single table dataset
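To make the rows produced by the roll up concrete, the sketch below reproduces the same
grouping levels with Python's built-in sqlite3 module. SQLite does not support the
WITH ROLLUP modifier, so the subtotal levels are written out with UNION ALL; this is a
swapped-in technique for illustration, not the query used in the courseware. The sample
rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AwardCollection (year INTEGER, movie TEXT, awardtype TEXT);
INSERT INTO AwardCollection VALUES
    (2005, 'Film A', 'Academy Award'),
    (2005, 'Film A', 'Golden Globe'),
    (2005, 'Film B', 'Academy Award'),
    (2006, 'Film C', 'Golden Globe');
""")

# One SELECT per aggregation level: (year, movie), then year only, then everything.
rows = conn.execute("""
    SELECT year, movie, COUNT(awardtype) FROM AwardCollection GROUP BY year, movie
    UNION ALL
    SELECT year, NULL, COUNT(awardtype) FROM AwardCollection GROUP BY year
    UNION ALL
    SELECT NULL, NULL, COUNT(awardtype) FROM AwardCollection
""").fetchall()

for row in rows:
    print(row)
```

Rows with None in the movie position are per-year subtotals, and the row with None in
both positions is the grand total.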
ii) Roll up using star schema
We have already seen the star schema of the movie and award database earlier in this
chapter. To obtain the total number of awards for a movie in a particular year using the
star schema, we use the following query:
SELECT
Y.year, M.movie,
COUNT(A.AwardType) as NumAwards
FROM
OLAP.AwardCollectionFact as ACF , OLAP.Years as Y, OLAP.categories as C,
OLAP.movie as M, OLAP.Awards as A
WHERE
ACF.idYear=Y.idYear and ACF.idcategory=C.idcategories and
ACF.idMovie=M.idMovie and ACF.idAward=A.idAward
GROUP BY
year, movie with ROLLUP;
In the above query, the select clause selects the year and movie dimensions.
COUNT(A.AwardType) counts the awards in each group, and the alias NumAwards gives a
temporary name to the column that displays this count. The from clause lists the
AwardCollectionFact table and the dimension tables of the OLAP database created using
MySQL. As discussed earlier in this chapter, the where clause matches the primary key of
each dimension table with the corresponding foreign key in the AwardCollectionFact table.
This helps prevent duplication and loss of data. Finally, the group by clause with the
ROLLUP modifier groups the result set by year and movie and adds summary rows for each
year and for the overall total. The table below shows the query output for roll up using
the star schema.
Table 15 Roll up result using star schema
b) ROLL DOWN
As we saw before, the roll down operation is the reverse of the roll up operation. The
operation can be performed in either of two ways:
i) By stepping down a concept hierarchy
ii) By introducing a new dimension
In order to demonstrate roll down in the courseware, roll up was first performed for a
particular movie. The movie "Atonement" was taken as the example.
The following query shows how roll up is performed on the database to obtain the number
of awards won by the movie "Atonement".
SELECT
Y.year, M.movie,
COUNT(A.AwardType) as NumAwards
FROM
OLAP.AwardCollectionFact as ACF , OLAP.Years as Y, OLAP.categories as C,
OLAP.movie as M, OLAP.Awards as A
WHERE
ACF.idYear=Y.idYear and ACF.idcategory=C.idcategories and
ACF.idMovie=M.idMovie and ACF.idAward=A.idAward and M.movie='Atonement'
GROUP BY
year, movie with ROLLUP;
In the above query, the select clause again selects the year and movie dimensions, and
COUNT(A.AwardType) with the alias NumAwards counts the awards in each group. The from
clause lists the AwardCollectionFact table and the dimension tables of the OLAP database
created using MySQL, and the where clause matches the primary key of each dimension table
with the corresponding foreign key in the AwardCollectionFact table. The where clause
also restricts the selection to the movie "Atonement". Finally, the group by clause
groups the result set by year and movie. The table below shows the query output for
finding the number of awards won by the movie Atonement using roll up.
Table 16 Atonement movie awards
From the above table, we see that the movie Atonement has won a total of 28 awards. Now
we apply roll down on the same movie to obtain more detailed data about it. The query
below shows how to perform roll down on the movie Atonement, so that it travels down the
concept hierarchy and obtains the finer details of the movie.
SELECT
category, movie, year,
COUNT(AwardType) as NumAwards, name
FROM OLAP.categories as C
LEFT JOIN
(SELECT ACF.idcategory as idc, category as ca, movie, year, AwardType,name
FROM OLAP.AwardCollectionFact as ACF, OLAP.movie as M, OLAP.Years as Y,
OLAP.Awards as A, OLAP.categories as C
WHERE ACF.idMovie=M.idMovie and M.movie='Atonement' and
ACF.idYear=Y.idYear and ACF.idAward=A.idAward and
ACF.idcategory=C.idcategories)
as FACT on C.idcategories=FACT.idc
GROUP BY category,year, movie;
The query is a nested query, so let us first look at the inner query, i.e. the
parenthesized SELECT. It selects idcategory from the awardcollectionfact table, together
with category, movie, year, awardtype and name, from the awardcollectionfact table and
the dimension tables movie, years, awards and categories. Its where clause restricts the
rows to the movie Atonement. The outer query selects category, movie and year from the
categories table and performs a left join with the inner query. A left join returns all
rows from the left table (the categories table) together with the matching rows from the
right side (the inner query over awardcollectionfact), and gives NULL where there is no
matching record on the right side. The result is grouped by category, year and movie
using the group by clause. The snapshot below is the output of the above query.
Table 17 Roll down on atonement movie
From the above table we see the value 0 in NumAwards for categories that have no matching
record in awardcollectionfact, because COUNT ignores the NULL values produced by the left
join.
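The zero counts can be reproduced with a small, self-contained sketch using Python's
built-in sqlite3 module in place of MySQL. The table and column names follow the query
above, while the reduced table layout and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (idcategories INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE AwardCollectionFact (idcategory INTEGER, AwardType TEXT);
INSERT INTO categories VALUES
    (1, 'Best Actor'), (2, 'Best Director'), (3, 'Best Picture');
INSERT INTO AwardCollectionFact VALUES
    (1, 'Golden Globe'), (3, 'Academy Award'), (3, 'Golden Globe');
""")

# Every category is kept by the LEFT JOIN; unmatched ones get NULL for AwardType,
# and COUNT(column) skips NULL values, so their count is 0.
rows = conn.execute("""
    SELECT C.category, COUNT(F.AwardType) AS NumAwards
    FROM categories AS C
    LEFT JOIN AwardCollectionFact AS F ON C.idcategories = F.idcategory
    GROUP BY C.category
    ORDER BY C.category
""").fetchall()
print(rows)
```

With an inner join instead of a left join, the category without awards would disappear
from the result rather than appear with a count of 0.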
c) Slicing
The slice operation selects one dimension of a given cube and gives us a new sub cube.
Suppose that from our movie data mart we just need to slice which movie won awards and in
which year; then we apply the following query:
SELECT
Y.year, M.movie
FROM
OLAP.AwardCollectionFact as ACF , OLAP.Years as Y, OLAP.movie as M
WHERE ACF.idYear=Y.idYear and ACF.idMovie=M.idMovie
GROUP BY year, movie;
In the above query, the select clause selects the year and movie dimensions. The from
clause specifies that these dimensions are selected from the AwardCollectionFact table
and the dimension tables years and movie. The where clause matches the foreign keys of
the fact table with the primary keys of these dimension tables. Finally, the group by
clause groups the result set by the year and movie dimensions. The table below shows the
query output for slicing.
Table 18 Slice operation on movie and year
d) Dicing
The dice operation performs a selection on two or more dimensions of a given cube and
gives us a new sub cube. Suppose that from our movie data mart we want to dice the movies
and the corresponding years for which Leonardo DiCaprio won awards; then we apply the
following query:
SELECT
year, movie, name
FROM
OLAP.AwardCollectionFact as ACF, OLAP.Years as Y, OLAP.movie as M
WHERE ACF.idYear=Y.idYear and ACF.idMovie=M.idMovie and Y.Year > 2003 and
name like '%Leon%' order by year;
In the above query, the select clause selects the year, movie and name dimensions. The
from clause specifies that these dimensions are selected from the AwardCollectionFact
table and the dimension tables years and movie. The where clause matches the foreign and
primary keys between these tables and performs the selection needed for the output: here,
the movies after the year 2003 for which Leonardo DiCaprio won awards, found by matching
the name against the pattern '%Leon%'. The table below shows the output of the dice
operation:
Table 19 Dice operation on movie, year and name
e) Scoping
This operation restricts the view of the database objects to a specified subset. Here we
have restricted the scope of view of the database to Academy Awards within a certain
range of years, namely the years after 2003 and before 2006. Scoping is advantageous when
we have a large amount of data and want to limit access to a certain subset. The query
below shows how scoping works:
SELECT
year, movie, name, AwardType
FROM
OLAP.AwardCollectionFact as ACF , OLAP.Years as Y, OLAP.movie as M,
OLAP.Awards as A
WHERE ACF.idYear=Y.idYear and ACF.idMovie=M.idMovie and
ACF.idAward=A.idAward and A.AwardType='Academy Award'
and Y.Year > 2003 and Y.Year < 2006 order by Year, movie;
In the above query, the select clause selects the year, name, movie and awardtype
dimensions. The from clause specifies that these dimensions are selected from the
AwardCollectionFact table and the dimension tables years, awards and movie. The where
clause matches the foreign and primary keys between these tables and performs the
selection needed for the output. Here we scope the view of the database to Academy Awards
in the years after 2003 and before 2006; because the comparisons are strict, the boundary
years 2003 and 2006 are themselves excluded. The table below shows the output of the
scope operation:
Table 20 Scope operation using year dimension
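The effect of the strict comparisons in the scoping query can be checked with a short
sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and a
hypothetical Years table filled with the years 2001 to 2013. The inclusive BETWEEN
variant is shown only for comparison and is not part of the courseware query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Years (idYear INTEGER PRIMARY KEY, year INTEGER)")
conn.executemany(
    "INSERT INTO Years (year) VALUES (?)", [(y,) for y in range(2001, 2014)]
)

# Strict bounds, as in the scoping query above.
strict = [row[0] for row in conn.execute(
    "SELECT year FROM Years WHERE year > 2003 AND year < 2006 ORDER BY year"
)]

# Inclusive bounds, shown for comparison.
inclusive = [row[0] for row in conn.execute(
    "SELECT year FROM Years WHERE year BETWEEN 2003 AND 2006 ORDER BY year"
)]

print(strict)
print(inclusive)
```

If the boundary years are meant to be part of the scope, the inclusive form is the one
that returns them.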
Chapter 6
COURSEWARE INTEGRATION
In order to provide the users with complete knowledge of the important concepts of data
mining and data warehousing, this courseware is integrated with two other coursewares
created earlier for the data mining and data warehousing course. The coursewares that are
integrated together are:
a) A courseware for datawarehousing
(url: http://athena.ecs.csus.edu/~enroll/enrollDW/Intro.php)
b) A courseware on Extract, Transform and Load (ETL) process
(url: http://athena.ecs.csus.edu/~web_etl/etl/)
c) A courseware of ONLINE ANALYTICAL PROCESSING CUBE
(url: http://athena.ecs.csus.edu/~OLAP/OLAP/introduction.php)
The coursewares were integrated mainly so that users can access all the important topics
of data mining and data warehousing on a single platform. The snapshot below shows the
webpage with the integrated courseware.
Figure 12 Snapshot of the integrated courseware
Chapter 7
COURSEWARE EVALUATION
The previous chapters discussed the contents of the courseware. This chapter discusses
the assessment of the courseware, which is a major element of its success. The main
motive of the courseware is that users should find it effective for understanding the
topics of OLAP, so that it is easier for them to learn the topic in detail. To make sure
that the courseware met this goal, an evaluation was conducted to assess how well the
users understood the courseware.
The courseware was mainly prepared for the course CSc 177, Data Mining and Data
Warehousing. Students of this course were introduced to the courseware with a
demonstration. After the demonstration, the students were given the task of understanding
the courseware and evaluating it. The feedback provided by the students was positive and
encouraging. The students were asked to give feedback on the survey questions below:
1. Think about your learning experience in using the OLAP courseware. How would you rate
your learning experience?
2. How would you rate the look and feel of the courseware?
3. How user friendly is the tool?
4. What do you think about the tool's features?
5. How was the courseware presented to you?
6. How would you rate the illustrations and examples?
7. Are the topics covered in this courseware organized properly?
8. Did the examples help you in completing the exercises easily?
9. Is learning the OLAP concepts using this courseware easier than reading a book?
The students gave positive feedback on most of the questions and also gave suggestions on
how to improve the courseware or what to add to it.
The graphs below show the statistics for some of the questions from the students'
feedback:
Figure 13 Feedback results for questions 1, 5 and 6
Figure 14 Feedback for questions 7, 8 and 9
The students were also given an opportunity to comment on the courseware. They felt that
the overall experience of the courseware was good and that it helped in understanding the
fundamentals of OLAP. They also felt that the exercises at the end helped them think and
apply what they had understood from the examples demonstrated. Students also suggested
adding animations or videos showing how the OLAP operations work.
The suggestions from the feedback will be included in future work, which will be covered
in the next chapter.
Chapter 8
CONCLUSION
A static textbook presentation of the key concepts of the OLAP cube does not give users
an opportunity to learn the concepts through interactive exercises and demonstrations.
Although there are many websites that explain the concepts of OLAP clearly, there are no
live demonstrations of the OLAP operations. An e-learning courseware that illustrates the
key concepts of OLAP and its operations was therefore necessary.
The main objective of this project was to develop a web based interactive courseware that
supports a unified learning experience. The objective and scope of the project as stated
earlier were achieved. The courseware now illustrates the key concepts of the OLAP cube
using example demonstrations. The courseware is freely accessible to any user. It is also
designed to be user friendly, and one does not require a lot of time to learn how to use
it. The courseware also allows users to understand the concepts of data mining and apply
them to the raw data that is provided for the exercise section.
This project gave me an opportunity to understand the concepts of OLAP deeply and to work
with MySQL and PHP, thus enhancing my skills in both.
Appendix A Rollup Report
Appendix B Roll Down Report
Appendix C Slicing Report
Appendix D Scoping Report
Appendix E Dicing Report