A COURSEWARE FOR DATA WAREHOUSING
Manashree Laxmikant Kulkarni
B.E., Rashtrasant Tukdoji Maharaj Nagpur University, 2006
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL
2010
A COURSEWARE FOR DATA WAREHOUSING
A Project
by
Manashree Laxmikant Kulkarni
Approved by:
__________________________________, Committee Chair
Dr. Meiliu Lu
__________________________________, Second Reader
Dr. Du Zhang
____________________________
Date
Student: Manashree Laxmikant Kulkarni
I certify that this student has met the requirements for format contained in the University format
manual, and that this project is suitable for shelving in the Library and credit is to be awarded for
the project.
__________________________, Graduate Coordinator
Dr. Nikrouz Faroughi
Department of Computer Science
___________________
Date
Abstract
of
A COURSEWARE FOR DATA WAREHOUSING
by
Manashree Laxmikant Kulkarni
Data warehousing is one of the important approaches for data integration and data
preprocessing. The objective of this project is to develop a web-based interactive courseware to
help beginning data warehouse designers to reinforce the key concepts of data warehousing
using a case study approach.
The case study builds a bottom-up data warehouse for a university student enrollment
prediction data mining system. This data warehouse generates summary reports that serve as input
data files for a data mining system that predicts future student enrollment.
The data sources include: (1) the enrollment data from California State University at
Sacramento, and (2) the related public data of California. In the courseware, we build the data
warehouse systematically using a set of four demonstrations covering the following data
warehousing topics: fundamentals, design principles, building an enterprise data warehouse using
an incremental approach, and aggregation. Every demonstration can generate data
reports for end users on request.
We integrated the courseware with an introductory data warehousing and data mining
class. This class of 20 students evaluated the effectiveness of the tool. One result of this
evaluation is the addition of a feedback link to the courseware website for end users.
__________________________________, Committee Chair
Dr. Meiliu Lu
______________________
Date
ACKNOWLEDGMENTS
I would like to express my deep and sincere gratitude to my project advisor, Dr. Meiliu
Lu for her support and guidance throughout the project. I thank her for giving me an opportunity
to work on a unique idea and put it into reality. She provided me valuable advice and untiring
help during the development of the project. Her detailed and constructive comments were very
beneficial not only during the phase of website development for the project but also during the
phase of report writing. Without her encouragement and personal guidance, the success of this
project would not have been possible.
My sincere thanks to Dr. Du Zhang for his detailed review and productive remarks on the
project report. I also thank Dr. Nikrouz Faroughi for his review and advice for the successful
completion of this project.
My warm thanks to the University Library at California State University, Sacramento for
providing me with books and resources helpful in my project research.
I owe my deepest thanks to my family for their love and support throughout my life. I am
indebted to my father, Late Mr. Laxmikant Kulkarni, whose faith and hard work provided me the
encouragement and support to pursue my Master’s degree. My loving thanks to my mother, Mrs.
Jayashree Kulkarni, my brother, Mr. Ashish Kulkarni, and my grandfather, Mr. K. K Panse for
their love and constant support during difficult moments. I owe my loving thanks to my husband,
Mr. Prasad Shah; without his support and understanding, this project would have been
impossible.
I extend my thanks to all those who have helped me directly or indirectly in the
completion of this project. Last but not least, I thank the Almighty for all the blessings.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures
Chapter
1. INTRODUCTION
2. BACKGROUND
3. COURSEWARE DESIGN
4. ENROLLMENT DATA WAREHOUSE DESIGN
   Interviewing
   Purpose of Enrollment Data Warehouse
   Enrollment Case Study Data mart Design
   Enrollment Case Study Data mart Refinement
   Enrollment Data Reporting
5. ENTERPRISE DATA WAREHOUSE
   Enterprise Data Warehouse for Enrollment Case Study
   Incremental Approach on Enrollment EDW
   Enrollment Case Study EDW Design
   Enrollment EDW Data Reporting
6. AGGREGATION ON ENROLLMENT DATA WAREHOUSE
   Aggregation
   Performance Parameter
   Aggregate Schema Design on Enrollment Case Study
   Performance Analysis
7. COURSEWARE EVALUATION
8. CONCLUSION
Appendix A. Enrollment Report
Appendix B. Enrollment with Socioeconomic Report
Appendix C. Enrollment Prediction Report
Appendix D. Documentation on Courseware Website
Bibliography
LIST OF TABLES
Table 1 Data Warehouse and Database
Table 2 Enrollment Summary Report
Table 3 User-desired Query Report
Table 4 Enrollment Prediction Report
Table 5 Query Output without Aggregation
Table 6 Query Output with Aggregation
LIST OF FIGURES
Figure 1 A Courseware for Data Warehousing
Figure 2 Data Warehousing Vs. Flat Files
Figure 3 Framework of Courseware
Figure 4 Courseware Demonstrations: Demo A and Demo B
Figure 5 Initial Data mart on Enrollment
Figure 6 Time Dimension Table
Figure 7 Student Classification Dimension Table
Figure 8 Enrollment Fact Table
Figure 9 Enrollment Star Schema
Figure 10 Refined Enrollment Data mart
Figure 11 Enrollment Graph for Computer Science Department
Figure 12 Snapshot of User Input
Figure 13 Courseware Demonstration: Demo C
Figure 14 Socioeconomic Dimension Table
Figure 15 Prediction Fact Table
Figure 16 Enrollment EDW
Figure 17 Aggregate Functions
Figure 18 Aggregate Design Methodology
Figure 19 Enrollment Aggregate Table
Figure 20 Enrollment Aggregate Schema
Figure 21 Feedback Component
Chapter 1
INTRODUCTION
Every institution, small or large, needs to make use of large volumes of historical data.
An analytical prediction model built on this data can support essential management functions such
as decision making and planning. The data warehouse plays a critical role in data
preprocessing and data integration. It allows fast retrieval of input data for data mining
and data analysis. The output of data reporting, data analysis, and data mining tools supports
management planning for budget analysis, resource allocation, forecasting, prediction, and other
business processes [1, 2].
A data warehouse is a store of historical data for a business, an experiment, or any other
enterprise. It consists of data selectively extracted from a primary source or from any other
source related to the primary data [3]. It reduces the cost per analysis because its structures are
simpler and more standardized than those of application databases. A data warehouse is an Online
Analytical Processing (OLAP) system [4, 2] that helps an enterprise make business
decisions and answer analytical questions crucial to a business process. For these analytical
tasks, a data warehouse is therefore more useful than Online Transaction
Processing (OLTP) systems [4].
The main idea of this courseware project is to provide a quick learning tool for data
warehousing. The courseware is a 3-tier web application entitled “The Courseware for Data
Warehousing”. It explains the basic concepts, design principles, and performance enhancement
techniques of data warehousing. This application is an e-learning tool integrated into the course
website for a Computer Science course, CSc 177: Data Warehousing and Data Mining, at
California State University, Sacramento. The courseware supplements the data warehousing
topics of this course, such as aggregation. We explain the topics in the courseware in depth and
allow students to explore them.
The courseware also provides a quick reference for students who have not taken any
course on data warehousing topics. The tool supports the course material with illustrative
examples, interactive demonstrations, and visual diagrams that accompany the topic explanations.
This gives students both interest in and insight into the learning process. The students can assess
their understanding of data warehousing via interactive quizzes provided at the end of each
demonstration.
The courseware provides a systematic method for designing a data warehouse. We
develop the data warehouse on a case study solely for the purpose of education. The case study
uses the student enrollment data from California State University, Sacramento. In the courseware,
we demonstrate the steps to build a data warehouse for the enrollment data. This tool not only
illustrates the data warehouse design process but also reveals some of the incorrect practices
that can occur during the process. We identify ways to avoid these incorrect practices.
In our case study, we build an enterprise data warehouse for the student enrollment data
of the College of Engineering and Computer Science at California State University, Sacramento.
The data sources for this project are the student enrollment data from California State
University, Sacramento, and the enrollment-related social and economic data of the State of
California.
The main purpose of designing a data warehouse is to prepare input data for an existing
data mining system. The data stored in a data warehouse is preprocessed data that forms the
input for the data mining tools [3, 2].
In our case study, we build the enrollment data warehouse that contains the preprocessed
enrollment data. Summary reports retrieve the preprocessed data from the data warehouse.
Data reporting tools generate these user-defined summary reports. The reported data can be
the input to the data mining tools. These tools perform data mining on the input data and provide
the desired results, such as student enrollment predictions [1].
Moreover, the data warehouse can store the data mining results and
generate summary reports on these results. In our case study, we design the enrollment data
warehouse so that it can generate summary reports on the student enrollment predictions. The
summary reports provide statistics essential for decisions on college budget analysis, new
faculty hiring, course demand, facility provision, etc. The summary reports help identify data
patterns and predict potential data values. This data warehousing technique can be valuable to
any enterprise for accurate estimation, forecasting, resource allocation, budget analysis, better
management planning, decision making, and improvement in business performance measures such as
productivity, ROI (Return On Investment), profit, etc. [5, 2].
Figure 1 shows a snapshot of the courseware tool’s introduction page. You can visit the
courseware at the following URL: http://gaia.ecs.csus.edu/~enroll/enrollDW/Intro.php.
The courseware divides the topics into four demonstrations. The first demonstration,
Demo A, explains how to identify the purpose and the user requirements of a data warehouse. It
demonstrates the design of a simple data mart. The second demonstration, Demo B, explains
the purpose of refining a data mart and demonstrates the refinement process
while staying consistent with the preceding design. The third demonstration, Demo C,
shows how to build an enterprise data warehouse by extending the data mart design from
the previous demonstration. The fourth demonstration, Demo D, introduces the aggregation
technique for improving the performance of the data warehouse. In addition, it compares
the performance of the data warehouse with and without aggregation. Furthermore, the courseware
emphasizes the generation of summary reports. Each demonstration provides interactive user
sessions to generate summary reports according to the user's specifications. The user sessions
take the user requirements as input and generate the reports the user asks for. These
demonstrations also explain query development and query execution in data reporting.
Figure 1 A Courseware for Data Warehousing
As a part of this project, we carry out a study on the effectiveness of the courseware tool.
We integrated the courseware with a data warehousing and data mining class in Spring 2010. This
class of 20 students evaluated the first version of the courseware. The integration of the
courseware into a data warehousing class and the subsequent courseware evaluations substantiate
the success of this tool.
In this chapter, we presented an overview of the project on the courseware for data
warehousing. We introduced the case-based approach of building the data warehouse on
enrollment data. In the next chapter, we explain the background of the courseware. In
Chapters 3 through 6, we describe the design of the courseware website and explain the four
demonstrations of the courseware. In Chapter 7, we summarize the results and feedback on the
courseware tool. In Chapter 8, we conclude the project report and discuss future
possibilities for the courseware.
Chapter 2
BACKGROUND
In the first chapter, we introduced the data warehousing concept and the significance of
the data warehouse to a business process. In this chapter, we provide a comprehensive description
of the enrollment case study and the enrollment data sources used in our courseware.
The idea of the case study originated from a thesis on “Enrollment projection through
data mining” by Svetlana S. Aksenova [1]. In her project report, the author presents a remarkable
use of data mining tools to build enrollment projection models. We noticed that this
process supplies the historical enrollment data to the data mining tools in the form of flat files.
This process also includes preprocessing of a large amount of data. Preprocessing a
large amount of data from flat files is time consuming and labor intensive. Hence, we
consider developing a data warehouse on the enrollment data. With a data warehouse, the data
mining tools can consume the data directly without repeated preprocessing.
In addition, we take into account that the required data changes with dynamic user
needs. The data warehouse overcomes the disadvantages of continually processing and repeatedly
inputting data from flat files. Figure 2 shows the difference between supplying data to the data
mining tools from a data warehouse and from flat files.
Figure 2 Data Warehousing Vs. Flat Files
Before designing any data warehouse, designers define the purpose of the data
warehouse. The purpose of the data warehouse identifies the management questions, user
requirements, and enterprise measurements. In our case study, the management of the University
might need information on the factors that affect the enrollment data or on the effect of the
unemployment rate on enrollment. Many questions might arise, such as what the
enrollment headcount was for the last year. These questions relate either to the overall business
process or to an individual transaction [4]. Queries executed on the data warehouse, often in
large numbers, retrieve this information. There is also a possibility that the nature of
management questions changes with time. To meet these dynamic and continuing
management and user requirements, there is a need to store a large amount of historical data in a
form that is efficient and easy to retrieve, such as a data warehouse.
The user requirements help determine the historical data that needs to be stored in the
data warehouse. The interviewing process [4, 2] identifies these requirements. In our case study,
there are two goals of building the enrollment data warehouse:
(1) Enrollment reporting: Users should be able to generate summary reports. These reports
display the relationships and interdependencies among various attributes of the historical data
sources. The reports help to answer various management questions related to enrollment data.
They retrieve selected data on the basis of the conditions in a user query.
(2) Enrollment prediction: The data mining project takes the reports or the preprocessed data
from the data warehouse as input and performs data analysis. The purpose is to predict
student enrollment counts for the forthcoming years using data mining and analysis. Analysts
identify the data mining algorithms [3] that produce a negligibly small error in the predicted
values. The difference between the real value and the predicted value gives the error value. If
this error value is acceptably small, the predictions can stand in for real values as the
forecasted student enrollment values. The management uses this forecasted data
in the decision-making process. The decision making includes budget planning, curriculum
planning, faculty hiring, resource allocation, income evaluation from tuition, etc. [1, 6].
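The error check described under goal (2) can be sketched as follows. This is a minimal illustration in Python rather than the project's own code; the function names, the 5% tolerance, and the sample headcounts are all assumptions invented for the example.

```python
def prediction_errors(actual, predicted):
    """Return (absolute error, relative error) for each pair of values."""
    errors = []
    for real, guess in zip(actual, predicted):
        abs_err = abs(real - guess)          # real value minus predicted value
        rel_err = abs_err / real if real else float("inf")
        errors.append((abs_err, rel_err))
    return errors


def is_acceptable(actual, predicted, tolerance=0.05):
    """True when every relative error is within the assumed tolerance."""
    return all(rel <= tolerance
               for _, rel in prediction_errors(actual, predicted))


# Hypothetical headcounts, invented for illustration only.
actual_headcount = [400, 420, 450]
predicted_headcount = [390, 430, 445]

errors = prediction_errors(actual_headcount, predicted_headcount)
acceptable = is_acceptable(actual_headcount, predicted_headcount)
```

What counts as "acceptably small" is a decision for the analysts; the tolerance parameter only makes that decision explicit.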
Historical Data: The historical data is stored in a data warehouse as preprocessed data. In our
case study, we use two sources of historical data:
1. Enrollment data and other enrollment-related data from the University [1, 6]
2. Socioeconomic data from the State of California that influences enrollment [1]
The data collected from the College of Engineering and Computer Science for the last 30
years includes enrollment values per semester for graduate and undergraduate students. The data
collected from the State of California also covers the last 30 years and includes socioeconomic
figures such as the employment rate, population, income, etc. The enrollment data from the Computer
Science department and the socioeconomic data from the State are the only real data [1, 6].
Enrollment values for the other departments are generated in Excel spreadsheets using the RAND()
and RANDBETWEEN() functions [7] for courseware purposes only. The real data is mostly
numeric data available in the form of flat files, such as Excel spreadsheets, and in other online
operational systems.
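For readers who prefer a script to a spreadsheet, placeholder enrollment values can be generated along the same lines. The sketch below uses Python's `random.randint`, which, like RANDBETWEEN, returns an integer between two inclusive bounds; it is not the procedure used in the project. The function name, department labels, and bounds are hypothetical.

```python
import random


def synthetic_enrollment(departments, years, low, high, seed=0):
    """Return {(department, year): headcount} filled with random integers.

    A fixed seed makes the placeholder data reproducible between runs.
    """
    rng = random.Random(seed)
    return {(dept, year): rng.randint(low, high)   # bounds are inclusive
            for dept in departments
            for year in years}


# Hypothetical department labels and bounds, for illustration only.
placeholder = synthetic_enrollment(["DEPT_A", "DEPT_B"], [2008, 2009],
                                   low=100, high=500, seed=1)
```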
We classify this data along spatial and temporal dimensions to preprocess and prepare
the data for the data loading process [5, 2]. The spatial attributes include department, college,
and location, and the temporal attributes consist of term and year. There are several ways to load
data into a data warehouse. In this project, we follow these steps for the data loading process:
(1) Convert all flat files into a single format, comma-separated values (.csv) files.
(2) Execute the MySQL statement below on the data warehouse [8]:
-- name of the input flat file
LOAD DATA LOCAL INFILE 'enrolldata.csv'
-- name of the target table
INTO TABLE Enroll_Fact
-- fields in the file are separated by commas
FIELDS TERMINATED BY ','
-- names of the table columns to fill
(new_students, transferred_students, continuing_students, returning_students);
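The same two loading steps can be tried without a MySQL server. The sketch below uses Python's `csv` and `sqlite3` modules in place of MySQL's LOAD DATA, so it illustrates the idea rather than reproducing the project's setup. The table and column names follow the statement above; the CSV rows are invented.

```python
import csv
import io
import sqlite3

# Stand-in for enrolldata.csv; the values are made up for illustration.
CSV_TEXT = "120,30,400,15\n110,25,410,12\n"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Enroll_Fact (
        new_students         INTEGER,
        transferred_students INTEGER,
        continuing_students  INTEGER,
        returning_students   INTEGER
    )
""")

# Parse each comma-separated record and convert its fields to integers.
rows = [tuple(int(value) for value in record)
        for record in csv.reader(io.StringIO(CSV_TEXT))]

conn.executemany(
    "INSERT INTO Enroll_Fact (new_students, transferred_students, "
    "continuing_students, returning_students) VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

# Quick sanity check: number of rows loaded and total new students.
loaded = conn.execute(
    "SELECT COUNT(*), SUM(new_students) FROM Enroll_Fact"
).fetchone()
```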
From the historical data, the university data supports enrollment report generation. The
university and state data together provide input for the data mining tools. Hence, the data
warehouse provides an efficient way of preprocessing, reporting, and analyzing the historical data.
One might ask why data warehousing is needed when databases already organize data much more
efficiently than flat files. Table 1 gives a general idea of the differences between a data
warehouse and a database [4].
Differences           Database                                  Data Warehouse
Process type          Transactions                              Analytical queries and report generation
Query type            Read and write (Insert, Update, Delete)   Read only (Select)
Data                  Current data                              Historical and current data
Purpose/Application   Execution of business process             Measurement of business process
Table 1 Data Warehouse and Database
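The "Query type" row of Table 1 can be made concrete with a small sketch. It uses Python's `sqlite3` module; the `registration` table and its rows are hypothetical and exist only to contrast a transactional workload (small writes) with an analytical one (a read-only summary).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE registration (student_id INTEGER, year INTEGER, status TEXT)"
)

# Database-style (transactional) work: individual inserts and updates.
conn.execute("INSERT INTO registration VALUES (1, 2009, 'new')")
conn.execute("INSERT INTO registration VALUES (2, 2009, 'continuing')")
conn.execute("INSERT INTO registration VALUES (3, 2010, 'new')")
conn.execute("UPDATE registration SET status = 'continuing' WHERE student_id = 1")
conn.commit()

# Warehouse-style (analytical) work: a read-only query that summarizes
# many rows to measure the process, here headcount per year.
headcount_by_year = conn.execute(
    "SELECT year, COUNT(*) FROM registration GROUP BY year ORDER BY year"
).fetchall()
```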
In this chapter, we developed a detailed understanding of the objective of building a data
warehouse for the enrollment case study. In the next chapter, we present the structure of the
courseware. We explain the 3-tier architecture and the components of the courseware website.
Chapter 3
COURSEWARE DESIGN
In this chapter, we describe the courseware architecture in detail. The courseware, based
on the principles of n-tier web applications [9], is a 3-tier web application that is conveniently
accessible to data warehouse learners around the world. The three tiers employed in this project
are the presentation tier (the web interface), the logic tier, and the data tier [9].
Figure 3 Framework of Courseware
Presentation Tier: The web interface, written in PHP, HTML, and JavaScript, gives
structure to this tool. The structure organizes the subject matter into introduction, demonstrations,
quizzes, and references. It presents a series of steps for building a successful data warehouse. The
interactive interface supports report generation, knowledge assessment, tool evaluation,
and interactive illustrations. The web interface displays the illustrative examples and visual
diagrams that support the topics.
Logic Tier: This tier manages the execution behind the web interface. It controls the
flow of data from the data warehouse to the web display. This tier is responsible for the business
logic. The business logic comprises database services, such as query structure and procedures, and
user services. It takes care of server-side code execution such as input validation, content
display, database security, etc. This tier is also responsible for data access. It performs
computation and evaluation, and makes decisions, on the historical enrollment data and the
enrollment prediction values in our case study.
Data Tier: This tier includes the data warehouse parameters, the data sources to be stored
in the data warehouse, and other data-related functions. In our case study, the primary data is
the student enrollment data from California State University, Sacramento, and the secondary
data is the enrollment-related socioeconomic data obtained from California state agencies.
This tier stores these primary and secondary data sources. It also stores the results of data
analysis and data mining executed against the data warehouse. It integrates existing data
sources, newly generated data, and data operations for the data warehouse relevant to that business
process [9, 8].
The courseware tool has the advantages of a 3-tier architecture, such as integration of data and
services, high performance due to client-server technology, and improved security. Consequently,
we get a more robust application.
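The separation of the three tiers can be illustrated with a very small sketch. The courseware itself is written in PHP, HTML, and JavaScript on top of MySQL; the example below is Python with `sqlite3` and shows only how responsibilities are divided, not the courseware's code. All function names, the `enrollment` table, and its values are hypothetical.

```python
import sqlite3


# Data tier: owns storage and nothing else.
def open_warehouse():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE enrollment (year INTEGER, headcount INTEGER)")
    conn.executemany("INSERT INTO enrollment VALUES (?, ?)",
                     [(2008, 500), (2009, 540)])   # invented values
    return conn


# Logic tier: validates user input and decides which query to run.
def headcount_for_year(conn, year_text):
    if not year_text.isdigit():
        raise ValueError("year must be a number")
    row = conn.execute(
        "SELECT headcount FROM enrollment WHERE year = ?", (int(year_text),)
    ).fetchone()
    return row[0] if row else None


# Presentation tier: turns the result into markup for the browser.
def render(year_text, headcount):
    if headcount is None:
        return f"<p>No data for {year_text}</p>"
    return f"<p>Headcount in {year_text}: {headcount}</p>"


conn = open_warehouse()
page = render("2009", headcount_for_year(conn, "2009"))
```

Because input validation and the parameterized query live in the logic tier, the presentation code never touches the database directly, which is one way the security benefit mentioned above can arise.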
The presentation part of the courseware website shows how to design the enrollment
data warehouse through a set of four demonstrations. The demonstrations cover the following
topics: (1) fundamentals of data warehouse, (2) data warehouse design principles, (3) building an
enterprise data warehouse using an incremental approach, and (4) aggregation. Each
demonstration presents a detailed description of building the data warehouse via a set of steps.
Every step has text, a diagram, and ready-to-run queries. Furthermore, the courseware outlines the
theory behind each subject and provides a set of quiz problems for self-evaluation. In the
upcoming chapters, we discuss these demonstrations in detail.
Chapter 4
ENROLLMENT DATA WAREHOUSE DESIGN
In this chapter, we explain the first two demonstrations in the courseware tool. In Demo
A, we show the design process of the initial data mart for the enrollment data warehouse. This
demonstration explains how to define the objective for building a data warehouse using the
interviewing process. Demo B shows how to refine the initial data mart designed in the previous
demonstration. Both demonstrations have an interactive facility to generate summary reports
against the enrollment data mart. Figure 4 shows the design steps included in Demo A and Demo
B from the courseware website.
Figure 4 Courseware Demonstrations: Demo A and Demo B
Through these demonstrations, we commence the design of the data warehouse using an
enrollment case study. Using the case study approach, we describe the principles and techniques
crucial for the data warehouse design.
Interviewing
Before any design process, we should be acquainted with the purpose of building the
data warehouse. We design the data warehouse for a business process. We identify the business
process and its parameters during the process of interviewing [10, 2].
Interviewing is the technique of talking to people who know the process well.
Generally, the management or the end users of the data warehouse are the suitable
interviewees, and the interviewer is the designer or the group of designers
of the data warehouse. The interviewers prepare a list of questions that help identify the
purpose. Interviewing takes place throughout the design process. The
designer, according to his/her design needs, decides on the number of interviewing phases. The
first phase takes place before outlining the initial design. The results of this interview
provide a skeleton for the data warehouse. The second phase mostly occurs before
proceeding to the physical design. The third phase can occur during the refinement of the data
warehouse design. Further interviewing phases can occur depending on how often the design needs
to be refined. Interview phases also occur during the evaluation stage of the data warehouse. If
the users are completely satisfied with the results of the data warehouse design, there may be no
need to carry out further interviews.
The courseware, integrated with the data warehousing course, aims at designing an
enrollment data warehouse. The end user of the enrollment data warehouse is the end user of the
courseware. Hence, we start the interview process for the initial data mart design with the
instructor of the course, CSc 177: Data Warehousing and Data Mining. A few question-and-answer
sessions held for the first-phase interview helped initiate the design of the enrollment data
warehouse. These interview sessions produced answers on enrollment data selection, query
execution, the format of summary reports, the identification of time and space dimensions for data
classification, the trade-off between memory and performance for the data warehouse,
etc. Some examples of the interview questions are:
1. What enrollment data do the end users desire?
2. Into what categories should the enrollment data be classified, or in which format do the end
users desire the summary reports?
3. What attributes related to enrollment should the query result display?
The interviews are what enable the data warehouse to answer user and management questions:
it is during the interviews that we find out the relevant facts that interest the end
users and learn the fine details of the business process.
In our case study, we use dimensional modeling principles to design the data warehouse. A
dimensional model consists of a group of fact tables and dimension tables. The interviewing
process helps identify the grain of the fact table and the attributes of the dimension tables. For
example, for generating reports from the data warehouse, interviewing determines whether the
reports should be on a monthly, quarterly, or yearly basis. Another interviewing exercise would be
the generation of a refined dimensional model from a draft dimensional model. The interview would
take place between the end users of the draft dimensional model and the designers. The feedback on
the draft model would help the designers include the missing attributes and refine the model
effectively.
Purpose of Enrollment Data Warehouse
In the previous section, we determined how to collect the user requirements through the
interviewing process. The user requirements for the enrollment data warehouse demand the
preparation of pre-processed data as input for data mining tools and a facility for users
to generate summary reports categorized by term, year, degree and college. There
are two types of summary reports: (1) enrollment reports for graduate and undergraduate
students for the last 30 years; and (2) demographic factors for each year's enrollment data. The
combination of types (1) and (2) forms the input data for a data mining system that outputs
enrollment prediction reports for the next five years. Hence, we start the design process of the
enrollment data warehouse with these requirements in mind.
Enrollment Case Study Data mart Design
In this section, we start designing an initial data mart for the enrollment data warehouse.
The first phase of interviews gives us a clear idea of the user requirements. The basic user
requirement is to generate summary reports categorized by year and term, and by student
classification on degree. The user also needs the enrollment headcount of the students, with each
enrollment classified as new, continuing, returning or transferred. With this knowledge of the data,
we can design the draft schema for the data mart. Figure 5 shows the draft dimensional model for the
enrollment data.
We design the data mart using the dimensional modeling principles [11, 12, 4]. The
dimensional model classifies the data related to the process into facts and dimensions. These
principles facilitate efficient use of physical space.
Figure 5 Initial Data mart on Enrollment
From the interviews, we learn that the user needs to generate a report of the enrollment count
(enrollment count is the measurement) in a particular year (year is an attribute). Analyzing the
query, we notice that the time parameter breaks the measurement into useful subsets (filter by
year). Hence, we identify the first dimension for the enrollment data mart as the time dimension.
Dimensions segregate the measurements into useful subsets. When designing a dimension
table, the attributes that qualify queries or break out measurements into useful subsets are
grouped together into one dimension table [10]. According to dimensional modeling principles,
dimension tables are short and wide, i.e. they can have a large number of columns. The dimension
table clusters the attributes of that dimension. Hence, each column of the dimension table
corresponds to an attribute of the dimension [2].
Figure 6 Time Dimension Table
We design the time dimension as a table with the attributes year and term as its columns.
Each dimension table has a primary key [10] that makes each row unique for enrollment
classification. Figure 6 shows the Time Dimension table. Similarly, we can design a Student
Classification Dimension table as shown in Figure 7.
We create the primary key of each dimension table while loading the historical
data. In this demonstration, MySQL AUTO_INCREMENT generates unique keys in the MySQL
tables [8]. For more informative reporting, the dimension tables should be rich in attributes.
The design of a dimension table also determines the relation of the dimension to the facts and its
appearance in the reports. By a similar approach, we can identify other dimensions in the
dimensional model.
Figure 7 Student Classification Dimension Table
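The two dimension tables described above can be sketched as follows. This is a minimal illustration rather than the courseware's actual schema: the table and column names (time_dim, student_class_dim, time_id and so on) are assumptions, since Figures 6 and 7 are not reproduced here, and Python's built-in sqlite3 module stands in for MySQL so that the example runs without a server. SQLite's AUTOINCREMENT plays the role of MySQL's AUTO_INCREMENT.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical dimension tables: one surrogate key plus descriptive attributes.
conn.executescript("""
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- generated surrogate key
    year    INTEGER NOT NULL,
    term    TEXT    NOT NULL,
    UNIQUE (year, term)                         -- one row per year/term pair
);
CREATE TABLE student_class_dim (
    student_class_id INTEGER PRIMARY KEY AUTOINCREMENT,
    degree           TEXT NOT NULL UNIQUE
);
""")

# Keys are not supplied on load; the database generates them.
conn.executemany(
    "INSERT INTO time_dim (year, term) VALUES (?, ?)",
    [(2000, "Spring"), (2000, "Fall"), (2001, "Spring")],
)
conn.executemany(
    "INSERT INTO student_class_dim (degree) VALUES (?)",
    [("undergraduate",), ("graduate",)],
)

time_rows = conn.execute(
    "SELECT time_id, year, term FROM time_dim ORDER BY time_id"
).fetchall()
class_rows = conn.execute(
    "SELECT student_class_id, degree FROM student_class_dim ORDER BY student_class_id"
).fetchall()
print(time_rows)
print(class_rows)
```

Each loaded row receives the next key value automatically, which is the behavior the demonstration relies on when loading the historical data.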
In the draft dimensional model, we declare the quantity (i.e. enrollment count) as facts.
The facts recorded are the enrollment counts for newly enrolled students, continuing students,
transferred students, returning students and the total number of students enrolled. According to
the dimensional modeling principles, the facts are the measurements that evaluate the process.
They are mostly numerical in nature. The fact table groups the measurements (referred to as facts)
and the attributes of the facts.
The fact table gives not only the required measurements but also the relationship among
the dimensions and the measurements. The enrollment fact table has foreign keys referencing the
dimension tables via the Time Dimension ID and the Student Classification ID. The primary
key of the fact table is a concatenated key involving a subset of the foreign keys. The fact table is
the dependent table in the schema design. Fact tables are narrow and deep, i.e. they can have a
large number of rows. Each row in the fact table gives the facts at the same level of detail.
Figure 8 shows the columns of the enrollment fact table: the types of enrollments, the
primary key and the foreign keys that reference the dimension tables.
Figure 8 Enrollment Fact Table
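A minimal sketch of such a fact table is given below, again with sqlite3 standing in for MySQL. The table and column names are hypothetical, and the concatenated primary key over the two foreign keys follows the description in the text rather than Figure 8, which is not reproduced here. The single valid row reuses the values reported in Table 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical names; the measurements are the enrollment counts.
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE enrollment_fact (
    time_id              INTEGER NOT NULL REFERENCES time_dim (time_id),
    student_class_id     INTEGER NOT NULL REFERENCES student_class_dim (student_class_id),
    new_students         INTEGER,
    transferred_students INTEGER,
    returning_students   INTEGER,
    continuing_students  INTEGER,
    eligible_to_continue INTEGER,
    PRIMARY KEY (time_id, student_class_id)  -- concatenated key over the foreign keys
);
""")

conn.execute("INSERT INTO time_dim VALUES (1, 2000, 'Fall')")
conn.execute("INSERT INTO student_class_dim VALUES (1, 'graduate')")
conn.execute("INSERT INTO enrollment_fact VALUES (1, 1, 39, 114, 117, 31, 156)")


def try_insert(row):
    """Return True if the fact row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_ok = try_insert((1, 1, 1, 1, 1, 1, 1))  # same grain as the existing row
orphan_ok = try_insert((99, 1, 1, 1, 1, 1, 1))    # no time_dim row has time_id 99
print(duplicate_ok, orphan_ok)
```

The concatenated key rejects a second row at the same grain, and the foreign keys reject a fact that points at a missing dimension row, which is what makes the fact table the dependent table in the schema.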
In addition, the enrollment fact table contains the "eligible to continue" count, which is related
to the "continuing students" attribute. We do this to avoid having a separate dimension table for
this attribute. If we designed such a dimension table for "eligible to continue", it would have the
same rows as the fact table and would cause data redundancy.
Figure 9 shows a sketch of the initial data mart design for enrollment in the form of a star
schema. The star schema displays the relationship among different entities. A star schema is a set
of tables in a relational database designed according to the principles of dimensional modeling
[10]. It is the simplest kind of data warehouse schema, in which one or more fact tables reference
one or more dimension tables [2].
We design the enrollment star schema to optimize queries that access large amounts of data.
It consists of one fact table holding the enrollment facts, and the dimension tables linked to the fact
table through the corresponding foreign keys. Queries against such a schema include a variety of
combinations of dimensions and facts. Hence, the star schema not only uses the capabilities of the
RDBMS but also adds the ability to answer a variety of management or end user questions [2].
Figure 9 Enrollment Star Schema
After designing the enrollment star schema, we load the historical data from the flat files
into the corresponding tables using the data loading process described earlier. Thus, we get the
enrollment data mart for the data warehouse design. We can generate summary reports against
this enrollment data mart. We use MySQL queries to retrieve this data. This is the last step of
Demo A in the courseware.
Demo A gives the end users the facility to extract data from the enrollment data mart
according to their requirements. Suppose the end user wants to generate a report for
Computer Science graduate students for Fall 2000. The query that generates the summary report for
this inquiry takes the conditional values 'graduate', 'Fall', and '2000' as user input. The courseware
uses MySQL queries to generate summary reports. The MySQL query formed for this inquiry has
a structure similar to the one below:
SELECT Time Dimension Table Year, Time Dimension Table Term, New Students, Transferred
    Students, Returning Students, Students Eligible to Continue, Continuing Students
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table
    ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table
    ON Enrollment Student Class ID = Student Class ID
WHERE Enrollment ID
    IN (SELECT Enrollment ID FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID FROM Time Dimension Table
                WHERE year = 2000 AND term = 'Fall')
        AND Enrollment Student Class ID
            IN (SELECT Student Class ID FROM Student Classification Dimension Table
                WHERE degree = 'graduate'))
Table 2 shows the summary report generated by the courseware for this inquiry.
Year  Term  New  Transferred  Returning  Eligible to Continue  Continuing Students
2000  Fall  39   114          117        156                   31
Table 2 Enrollment Summary Report
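The report query can be made runnable on a toy data mart as sketched below. The schema names are assumptions, sqlite3 stands in for MySQL, and only the Fall 2000 graduate row repeats the values of Table 2; the other fact rows are invented for illustration. The sketch filters directly on the joined dimension columns, which expresses the same conditions as the nested IN subqueries in the query structure shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE enrollment_fact (
    time_id INTEGER REFERENCES time_dim (time_id),
    student_class_id INTEGER REFERENCES student_class_dim (student_class_id),
    new_students INTEGER, transferred_students INTEGER, returning_students INTEGER,
    eligible_to_continue INTEGER, continuing_students INTEGER,
    PRIMARY KEY (time_id, student_class_id)
);
INSERT INTO time_dim VALUES (1, 2000, 'Spring'), (2, 2000, 'Fall');
INSERT INTO student_class_dim VALUES (1, 'undergraduate'), (2, 'graduate');
-- Only the last row repeats Table 2; the first three rows are made-up filler.
INSERT INTO enrollment_fact VALUES
    (1, 1, 10, 20, 30, 40, 50),
    (1, 2, 11, 21, 31, 41, 51),
    (2, 1, 12, 22, 32, 42, 52),
    (2, 2, 39, 114, 117, 156, 31);
""")


def summary_report(year, term, degree):
    """Return the enrollment counts for one year, term and degree."""
    sql = """
        SELECT t.year, t.term, f.new_students, f.transferred_students,
               f.returning_students, f.eligible_to_continue, f.continuing_students
        FROM enrollment_fact AS f
        INNER JOIN time_dim AS t ON f.time_id = t.time_id
        INNER JOIN student_class_dim AS s ON f.student_class_id = s.student_class_id
        WHERE t.year = ? AND t.term = ? AND s.degree = ?
    """
    # User input is passed as bound parameters, not pasted into the SQL text.
    return conn.execute(sql, (year, term, degree)).fetchall()


report = summary_report(2000, "Fall", "graduate")
print(report)
```

Passing the user's values as bound parameters keeps the query text fixed while the conditional values change from one report to the next.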
Enrollment Case Study Data mart Refinement
In Demo B, we incrementally refine the enrollment data mart model by iterating the steps
of the design process from Demo A. Refinement helps meet additional user requirements such as
omission of old data values or integration of new data sources. The main purpose of refining is to
get all the relevant data into the data mart in conformance with the initially designed model. The
purposes of refining the enrollment data mart are as follows [12, 5, 2]:
1. Increasing the capability to answer management questions over other departments
2. Including missing data such as tuition fees
3. Expanding the data model structure to get the effect of socioeconomic factors on the
enrollment values
We need to expand the enrollment data mart gradually to cover other departments in the
college. While refining the data mart, we design it so that it scales easily to
other colleges in the California State universities. Hence, we require another dimension, the
Academic dimension, for the enrollment data mart. The refinement needs a second phase of
interviewing. We identify the attributes of the academic dimension during this second phase.
This stage of refinement gives an opportunity to include new data that was missing in the
data mart previously. The steps of designing the initial data mart are critical because we iterate
these steps on the initial design to refine the model with more relevant subject areas. In the
refinement process, we iterate the following steps from Demo A:
1. Identify the relevant data related areas.
2. Determine attributes and relations between different areas by the process of interviewing.
3. Load the new data such that it conforms to the data model.
4. Iterate these steps until all the areas relevant to the data are covered [2].
We design the new dimension, the Academic Dimension, and append it to the model such
that it conforms to the initial data mart design. The attributes of the dimension comprise
department, college and location. To establish the relationship between the academic unit
dimension and the measurement (enrollment data), we need a foreign key that references the
academic unit dimension table. We add this new reference key to the enrollment
fact table. The primary key of the enrollment fact table is then the concatenated key of the reference
keys to the time dimension, student classification dimension and academic dimension tables. The
star schema includes the updated fact table with the new dimension table. Figure 10 shows the
refined dimensional model.
Figure 10 Refined Enrollment Data Mart
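One way to picture this refinement step is the sketch below, which rebuilds a simplified fact table with the additional reference key. All names and row values are illustrative assumptions, and sqlite3 stands in for MySQL; the academic dimension columns follow the filters used in the Demo B query (department, college, university). SQLite cannot change a primary key in place, so the sketch copies the rows into a new table, whereas a MySQL implementation might alter the existing table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Initial data mart: the key covers time and student classification only.
CREATE TABLE enrollment_fact (
    time_id INTEGER, student_class_id INTEGER, new_students INTEGER,
    PRIMARY KEY (time_id, student_class_id)
);
INSERT INTO enrollment_fact VALUES (1, 1, 12), (1, 2, 39);

-- New conformed dimension added during refinement (hypothetical columns).
CREATE TABLE academic_dim (
    academic_id INTEGER PRIMARY KEY,
    department TEXT, college TEXT, university TEXT
);
INSERT INTO academic_dim VALUES
    (1, 'Computer Science', 'Engineering and Computer Science',
     'California State University, Sacramento'),
    (2, 'Mechanical', 'Engineering and Computer Science',
     'California State University, Sacramento');

-- Refined fact table: the concatenated key now includes the academic key.
CREATE TABLE enrollment_fact_refined (
    time_id INTEGER, student_class_id INTEGER,
    academic_id INTEGER REFERENCES academic_dim (academic_id),
    new_students INTEGER,
    PRIMARY KEY (time_id, student_class_id, academic_id)
);

-- Existing rows are assumed to belong to Computer Science (academic_id 1).
INSERT INTO enrollment_fact_refined
    SELECT time_id, student_class_id, 1, new_students FROM enrollment_fact;

-- A second department can now share the same time and classification keys.
INSERT INTO enrollment_fact_refined VALUES (1, 2, 2, 47);
""")

refined_rows = conn.execute("""
    SELECT time_id, student_class_id, academic_id, new_students
    FROM enrollment_fact_refined
    ORDER BY time_id, student_class_id, academic_id
""").fetchall()
print(refined_rows)
```

Under the original two-part key, the Mechanical row would have collided with the Computer Science row for the same term and classification; the three-part key is what lets the data mart grow across departments.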
We use historical data to refine the data warehouse. We load the data from the
departments in the College of Engineering and Computer Science in such a
way that it conforms to the refined data mart design. We load real data for the Computer Science
department and generate data for all other departments in the College of Engineering and Computer
Science. This data is randomly generated, for experimental purposes only, using data generation
tools such as the Microsoft Excel 2007 RAND() and RANDBETWEEN() functions [7].
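For readers without Excel, the same kind of placeholder data can be produced with a few lines of Python. This is a hedged analogue, not the procedure used in the project: the function name, the department list and the bounds 20 and 200 are invented for illustration. Like RANDBETWEEN, random.randint includes both bounds.

```python
import random


def generate_headcounts(departments, terms, low, high, seed=0):
    """Return a made-up headcount for every (department, year, term) combination."""
    rng = random.Random(seed)  # fixed seed so the synthetic data is reproducible
    return {
        (dept, year, term): rng.randint(low, high)  # both bounds are inclusive
        for dept in departments
        for (year, term) in terms
    }


synthetic = generate_headcounts(
    departments=["Civil", "Mechanical", "Electrical"],
    terms=[(2004, "Spring"), (2004, "Fall")],
    low=20,    # hypothetical bounds, not taken from the project
    high=200,
)
for key, count in sorted(synthetic.items()):
    print(key, count)
```

Seeding the generator makes the experimental data repeatable, which helps when the same synthetic rows must be reloaded after a schema change.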
Enrollment Data Reporting
The data mart is ready to respond to user queries and generate summary reports of
type (1) as stated in section 4.2 of this chapter. Figure 11 shows one such report in the form
of a graph of the enrollment values. The graph shows the total number of undergraduate students
enrolled in the Computer Science department for the past 30 years. Similarly, we can generate reports
in the form of text input for the data mining system [3].
Figure 11 Enrollment Graph for Computer Science Department
The end users generate a variety of enrollment summary reports. Various queries execute
against the enrollment data mart and display these user-desired reports. The reports display
data accurately by using INNER JOINs in query languages like MySQL/SQL [8]. The data
access time depends on the query structure and the database table hierarchy. The queries govern
the generation of summary reports. The designers optimize these queries to improve the speed of
data access and the performance of the data warehouse [see chapter 6]. Query optimization makes
the data warehouse efficient so that the end users can view the reports within a few milliseconds.
The Demo A and Demo B reports in the courseware let the user input
values for the generation of enrollment reports. Figure 12 shows a snapshot of such user input in
the Demo B reports. In this query, the user wants to know the new student enrollment count for
the Mechanical department in the College of Engineering and Computer Science at
California State University, Sacramento, for the Spring 2004 semester.
Figure 12 Snapshot of User Input
The query executed against the refined data mart is as follows:
SELECT Academic Dimension Table Department, Time Dimension Table Year, Time
    Dimension Table Term, New Students, Transferred Students
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table ON Enrollment Student Class ID =
    Student Class ID
INNER JOIN Academic Dimension ON Enrollment Academic ID = Academic ID
WHERE Enrollment ID
    IN (SELECT Enrollment ID
        FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID
                FROM Time Dimension
                WHERE year = 2004 AND term = 'Spring')
        AND Enrollment Student Class ID
            IN (SELECT Student Class ID
                FROM Student Classification Dimension
                WHERE degree = 'undergraduate')
        AND Enrollment Academic ID
            IN (SELECT Academic ID
                FROM Academic Dimension
                WHERE university = 'California State University, Sacramento'
                AND college = 'Engineering and Computer Science'
                AND department = 'Mechanical'))
Table 3 shows the resultant output retrieved by this query.
Department  Year  Term    New  Transferred
Mechanical  2004  Spring  47   87
Table 3 User-desired Query Report
In this chapter, we incorporate only the student enrollment data to generate the summary
reports. In the next chapter, we extend the dimensional modeling design to build an enterprise
data warehouse. In the enterprise data warehouse for enrollment data, we include the
socioeconomic data together with the enrollment data. The reports generated against the
enterprise data warehouse not only display the facts but also show users the effect of
socioeconomic factors on the student enrollment values. In addition, the next chapter elucidates
Demo C of the courseware.
Chapter 5
ENTERPRISE DATA WAREHOUSE
In this chapter, we visit the third demonstration, Demo C, of the courseware tool. Demo
C illustrates the design process of the enterprise data warehouse for the enrollment case study in a
systematic way. The design process clarifies how to expand the dimensional modeling design
over an enterprise and conform to the design of enrollment data warehouse devised so far. Demo
C provides a user interactive facility to generate enrollment summary reports against the
enterprise data warehouse. Figure 13 shows the design steps of Demo C from the courseware
website.
Figure 13 Courseware Demonstration: Demo C
The first section introduces the enterprise data warehouse for the enrollment case
study. It identifies the data sources valuable for the enrollment enterprise. The subsequent section
describes the methodology of designing the enterprise data warehouse for the enrollment case
study. The concluding section shows the increased capability of the enterprise data warehouse for
data reporting and data analysis over a wide range of data sources such as enrollment data and
socioeconomic data.
Enterprise Data Warehouse for Enrollment Case Study
An enterprise data warehouse (EDW) is a warehouse for the enterprise data and other
relevant data. The EDW optimizes data for analyzing, querying, and reporting purposes [10, 12,
2].
The EDW mainly integrates data from various systems. This
data in combination is more valuable and can satisfy user queries that are unanswerable by any
individual operational system. The EDW updates the data periodically. Consequently, the underlying
architecture of the EDW develops query processing support that offers efficiency and
performance to the data warehouse [10, 2].
The best EDW designs are built from schemas that form an integrated
series of conformed dimension tables and transaction-grained fact tables. They develop a business
into a complete analytical warehouse [12, 5, 2].
The goal of the EDW for the enrollment case study is to
provide consistent and accurate enrollment-related information in an organized and secure
manner. In our case study, the enterprise is the university. Researchers, executive-level
managers, administrators and enterprise owners are some of the end users of the EDW. The
enrollment data becomes easily and quickly accessible to the end users via the enrollment EDW.
Query processing and analysis against the enrollment EDW present the impact of social and
economic factors in the state of California on the student enrollment statistics of the university.
The courseware tool uses the incremental approach described in the next section to
design the enrollment EDW.
Incremental approach on Enrollment EDW
There are two approaches to designing an enterprise data warehouse. The first approach
is the traditional approach, in which the design is complete before any data is loaded into the data
warehouse; that is, the data is loaded in the final stages. The second
approach is the incremental approach, in which the EDW is built one subject area at a time. Unlike
the traditional approach, the data is loaded for each subject area design individually. The design
keeps iterating through frequent feedback cycles with the users [10, 12, 2].
In our case study on enrollment analysis, we design the EDW using an incremental
approach. The earlier demonstrations cover the subject area of enrollment analysis. Demo C
extends the design by adding a new subject area, enrollment analysis using socioeconomic
data, to our data warehouse. We use the dimensional modeling principles to extend the design
for the enrollment EDW.
Enrollment Case Study EDW Design
We begin with the process of interviewing [10, 2] to identify the socioeconomic factors
that influence the enrollment statistics of the universities in California. The data collected
consists of attributes like population, employment rate, graduation rate and tuition fees. These
attributes, categorized by year, form the new dimension for socioeconomic data. Figure 14
shows the resulting Socioeconomic dimension table.
Figure 14 Socioeconomic Dimension Table
Data mining tools and techniques [3] applied to the
historical data, i.e. the enrollment data and socioeconomic data combined, can help predict student
enrollment values for the coming years [1]. These predictions need to be stored in the data
warehouse. We create a new fact table, the Prediction fact table, to store the forecast results of data
mining from [1]. According to Aksenova (2007), the data mining results include the
predicted values and the residual values for new students, transferred students, returning students
and continuing students [1]. These values define the grain of the new fact table.
The fact table must be related to the relevant data. Hence, it
references the dimension tables using foreign keys. The primary key of the fact table identifies each
data row uniquely. Figure 15 shows the prediction fact table.
Figure 15 Prediction Fact Table
The star schema for the enrollment EDW consists of two fact tables along with their
respective dimension tables. The dimensions for the prediction fact table are the time dimension,
academic unit dimension, student classification dimension and socioeconomic dimension. Some
of the dimensions of the prediction fact table are common to the enrollment fact table. Hence,
the two fact tables share these dimension tables. Figure 16 shows the star schema for the
enrollment EDW.
We load the socioeconomic data and prediction data [1] from the historical data sources
into the socioeconomic dimension table and the prediction fact table respectively. Correspondingly,
we load the reference keys to the dimension tables into the prediction fact table.
Figure 16 Enrollment EDW
Demo C shows how to build a series of interlocking star schemas [4] where each star
schema corresponds to one subject area. This completes the design of the enrollment EDW
using an incremental approach.
The next section shows the importance of building the enrollment EDW. It explains
how the EDW provides value to the organization. The data reporting and data analysis performed
against the EDW verify that the enrollment EDW provides a consistent and pertinent view of
enterprise data [2].
Enrollment EDW Data Reporting
The enrollment data warehouse is ready for testing and deployment. Testing evaluates
data reporting and ETL processing on the enrollment and prediction data. It makes the enrollment
EDW ready to respond to user queries and generate summary reports not only of type (1) but also
of type (2) as stated in section 4.2 of chapter 4. It ensures quality, consistency and correctness
in the user-desired data reports generated by user queries [5].
In the enrollment case study, we write queries in the MySQL query language and then
test them for data reporting purposes. The following example illustrates the query logic used to
retrieve data as required by the end users. Suppose the user needs to compare the actual
enrollment value and the predicted enrollment value for newly enrolled graduate students in Fall
2000 for the College of Engineering and Computer Science. One way to form the query is as
follows:
SELECT Student Classification Dimension Degree, Academic Dimension Department, Time
    Dimension Year, Term, New Enrollment Count, New Predicted Value
FROM Prediction Fact
INNER JOIN Socioeconomic Dimension
    ON (Prediction Socioeconomic ID = Socioeconomic ID)
, Enrollment Fact
INNER JOIN Student Classification Dimension
    ON (Enrollment Student Classification ID = Student Classification ID)
INNER JOIN Academic Dimension
    ON (Enrollment Academic ID = Academic ID)
INNER JOIN Time Dimension
    ON (Enrollment Time ID = Time ID)
WHERE Enrollment ID
    IN (SELECT Enrollment ID FROM Enrollment Fact
        WHERE Enrollment Time ID
            IN (SELECT Time ID FROM Time Dimension WHERE year = 2000 AND term = 'fall')
        AND Enrollment Academic ID
            IN (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento')
        AND Enrollment Student Classification ID
            IN (SELECT Student Classification ID FROM Student Classification Dimension
                WHERE degree = 'Graduate')
    )
AND Prediction Fact ID
    IN (SELECT Prediction Fact ID FROM Prediction Fact
        WHERE Prediction Time ID
            IN (SELECT Time ID FROM Time Dimension WHERE year = 2000 AND term = 'fall')
        AND Prediction Socioeconomic ID
            IN (SELECT Socioeconomic ID FROM Socioeconomic Dimension WHERE year = 2000)
        AND Prediction Academic ID
            IN (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento')
        AND Prediction Student Classification ID
            IN (SELECT Student Classification ID FROM Student Classification Dimension
                WHERE degree = 'Graduate')
    )
AND Enrollment Academic ID = Prediction Academic ID
AND Enrollment Time ID = Prediction Time ID
AND Enrollment Student Classification ID = Prediction Student Classification ID;
Table 4 shows the report on the actual enrollment values and the predicted enrollment
values obtained from this query.
Department            Actual New Students Enrolled  Predicted number of new students
Computer Science      39                            38
Civil                 32                            187
Mechanical            73                            29
Electrical            70                            158
Computer Engineering  130                           85
Table 4 Enrollment Prediction Report
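The comparison of actual and predicted values across the two fact tables can be sketched on a reduced schema as follows. The names are assumptions, sqlite3 stands in for MySQL, and the socioeconomic dimension is left out to keep the example short. Three of the rows from Table 4 are loaded as sample data. The two fact tables are joined on the keys of the dimensions they share.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE academic_dim (academic_id INTEGER PRIMARY KEY, department TEXT, college TEXT);

-- Two fact tables keyed by the same three conformed dimensions.
CREATE TABLE enrollment_fact (
    time_id INTEGER, student_class_id INTEGER, academic_id INTEGER,
    new_students INTEGER,
    PRIMARY KEY (time_id, student_class_id, academic_id)
);
CREATE TABLE prediction_fact (
    time_id INTEGER, student_class_id INTEGER, academic_id INTEGER,
    new_predicted INTEGER,
    PRIMARY KEY (time_id, student_class_id, academic_id)
);

INSERT INTO time_dim VALUES (1, 2000, 'Fall');
INSERT INTO student_class_dim VALUES (1, 'graduate');
INSERT INTO academic_dim VALUES
    (1, 'Computer Science', 'Engineering and Computer Science'),
    (2, 'Civil', 'Engineering and Computer Science'),
    (3, 'Mechanical', 'Engineering and Computer Science');
-- Values copied from three rows of Table 4.
INSERT INTO enrollment_fact VALUES (1, 1, 1, 39), (1, 1, 2, 32), (1, 1, 3, 73);
INSERT INTO prediction_fact VALUES (1, 1, 1, 38), (1, 1, 2, 187), (1, 1, 3, 29);
""")

COMPARE_SQL = """
    SELECT a.department, e.new_students, p.new_predicted
    FROM enrollment_fact AS e
    INNER JOIN prediction_fact AS p
            ON p.time_id = e.time_id
           AND p.student_class_id = e.student_class_id
           AND p.academic_id = e.academic_id
    INNER JOIN time_dim AS t ON t.time_id = e.time_id
    INNER JOIN student_class_dim AS s ON s.student_class_id = e.student_class_id
    INNER JOIN academic_dim AS a ON a.academic_id = e.academic_id
    WHERE t.year = ? AND t.term = ? AND s.degree = ? AND a.college = ?
    ORDER BY a.academic_id
"""
comparison = conn.execute(
    COMPARE_SQL, (2000, "Fall", "graduate", "Engineering and Computer Science")
).fetchall()
print(comparison)
```

Because both fact tables reference the same dimension rows, matching them on the shared keys is enough to line up the actual and predicted counts for each department.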
Such prediction reports can give the predicted values and the actual values for the past
years. These reports can form input for data mining tools to predict the enrollment values for
future years.
This chapter concludes the design of enterprise data warehouse for enrollment case study.
To summarize, the courseware provided steps to build an enterprise data warehouse for the
enrollment analysis case study. In the next chapter, we discuss the performance of the enterprise
data warehouse and describe the performance improving technique called aggregation.
Chapter 6
AGGREGATION ON ENROLLMENT DATA WAREHOUSE
In the earlier chapters, we presented the three demonstrations of the courseware tool.
These demonstrations illustrated how to build an enterprise data warehouse (EDW) using a case
study. In this chapter, we explain the final demonstration, Demo D, of the courseware tool. This
demonstration provides an example of improving the data warehouse performance and the
method to implement it. We accomplish this by using aggregation on the data warehouse.
Aggregation
An aggregate value is the result of an aggregate function. The aggregate functions are the
mathematical functions such as sum, average, maximum, minimum or any user defined function
[13]. Figure 17 shows the aggregate functions.
Figure 17 Aggregate Functions
Aggregation is the process of designing the schema for aggregate values in data warehousing
[14]. Figure 18 summarizes this process. To implement aggregation on the data
warehouse, we start by identifying the aggregates vital for the enterprise. The aggregate
valuable for the enrollment data warehouse is the total enrollment headcount. The second step
is to design the enrollment aggregate schema for the aggregate values. Next, we calculate the
totals using aggregate functions. Finally, we load the calculated enrollment values into the
aggregate fact table.
Figure 18 Aggregate Design Methodology
The idea of aggregation is to improve the performance of the data warehouse. For this
purpose, we identify the parameters that help improve data warehouse performance in the next
section. We discuss how the aggregation influences the performance parameters of a data
warehouse.
Performance Parameter
Data reporting demands that the end users receive quick and accurate results. The data
reports should reach the user within a few milliseconds. This is one of the key aspects of
data warehouse performance. Hence, the performance parameter of the data
warehouse is the query execution time. To improve the performance of the data warehouse, we
need to decrease the query execution time. We can accomplish this by designing aggregate tables.
In our case study, the enrollment fact table contains the enrollment values, which are
atomic in nature. The management or the end users perform a number of data reporting operations
on the data warehouse. Let us assume that the majority of the data reports generated consist of
aggregate values. For example, the users query the total enrollment for the
year 2000.
Each time the user needs this data, the data warehouse needs to execute the following
query [8]:
SELECT Time Dimension Table Year, SUM (New Students + Transferred Students + Returning
Students + Continuing Students) AS total enrollment
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table
ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table
ON Enrollment Student Class ID = Student Class ID
INNER JOIN Academic Dimension
ON (Enrollment Academic ID = Academic ID)
WHERE Enrollment ID
IN (SELECT Enrollment ID FROM Enrollment Fact Table
WHERE Enrollment Time ID
IN (SELECT Time ID FROM Time Dimension Table
WHERE year = 2000)
AND Enrollment Academic ID
IN (SELECT Academic ID FROM Academic Dimension
WHERE college ='Engineering and Computer Science'
AND university ='California State University, Sacramento'))
GROUP BY Time Dimension year
Each time we execute this query, the query accesses many data rows even if the values in
these rows are not required by the end users in the query result. The query reads all the
enrollment values from each row and hence, takes a lot of time to execute.
The sum function over a particular year gives us the total enrollment count for that
particular year. The sum is the addition of enrollment values for both the terms (fall, spring) of a
year plus the enrollment values for graduate and undergraduate students. It also has to sum up the
count of new, transferred, returning and continuing students for that year. The query needs time
to perform the calculation for summing up all the enrollment values.
Hence, the execution time for this query is the sum of the time (in milliseconds) to
connect to the server where the data warehouse is located, the time (in milliseconds) to
retrieve all the required cells, and the time (in milliseconds) to calculate the sum function.
To improve the efficiency of the data warehouse, we need to make these queries run
faster. We can reduce the execution time in the following two ways:
1. By eliminating the time to calculate the sum function
2. By reducing the time required to retrieve the rows
Aggregate tables implement these two ways to reduce the query execution time. The
queries run faster if they read only the pre-calculated aggregate values instead of performing
calculation on a number of rows. We calculate the aggregate of the commonly requested values
and store them in some table. By doing this, the query has to access only a few rows in aggregate
tables instead of accessing all the data rows containing enrollment values.
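The idea can be illustrated with the sketch below, which precomputes yearly totals once and then answers the same question from either the base fact table or the aggregate table. The schema names and all row values are invented for illustration, sqlite3 stands in for MySQL, and the aggregate table stores the year directly instead of referencing a yearly time dimension, which is a simplification of the design discussed in this chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE enrollment_fact (
    time_id INTEGER, student_class_id INTEGER,
    new_students INTEGER, transferred_students INTEGER,
    returning_students INTEGER, continuing_students INTEGER,
    PRIMARY KEY (time_id, student_class_id)
);
CREATE TABLE enrollment_year_agg (year INTEGER PRIMARY KEY, total_enrollment INTEGER);

INSERT INTO time_dim VALUES (1, 2000, 'Spring'), (2, 2000, 'Fall'), (3, 2001, 'Spring');
INSERT INTO enrollment_fact VALUES
    (1, 1, 10, 5, 3, 100), (1, 2, 4, 2, 1, 40),
    (2, 1, 20, 6, 2, 110), (2, 2, 8, 3, 1, 45),
    (3, 1, 12, 4, 3, 105);

-- Calculate the sum aggregate once and load it into the aggregate table.
INSERT INTO enrollment_year_agg
    SELECT t.year,
           SUM(f.new_students + f.transferred_students
               + f.returning_students + f.continuing_students)
    FROM enrollment_fact AS f
    INNER JOIN time_dim AS t ON f.time_id = t.time_id
    GROUP BY t.year;
""")


def total_from_base(year):
    """Join and sum the atomic fact rows on every call."""
    return conn.execute("""
        SELECT SUM(f.new_students + f.transferred_students
                   + f.returning_students + f.continuing_students)
        FROM enrollment_fact AS f
        INNER JOIN time_dim AS t ON f.time_id = t.time_id
        WHERE t.year = ?
    """, (year,)).fetchone()[0]


def total_from_aggregate(year):
    """Read one precomputed row; no join or summation at query time."""
    return conn.execute(
        "SELECT total_enrollment FROM enrollment_year_agg WHERE year = ?", (year,)
    ).fetchone()[0]


print(total_from_base(2000), total_from_aggregate(2000))
```

Both functions return the same total, but the second reads a single row, which corresponds to the two savings listed above: no sum to calculate and fewer rows to retrieve.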
In the next section, we design the aggregate tables and the aggregate schema for the
enrollment case study.
Aggregate Schema Design on Enrollment Case Study
The aggregate schema is the star schema of aggregate tables. Each aggregate table is a
fact table that holds the aggregate values for one aggregate function [13]. The process starts by
identifying the frequently queried aggregates. The interviewing process identifies the aggregate
functions that the users query frequently. Let us assume that the most required aggregate for our
enrollment data warehouse is the enrollment headcount grouped by year. We design an aggregate
table for the sum function on enrollment values.
Figure 19 Enrollment Aggregate Table
Figure 19 shows the sum aggregate fact table. An aggregate fact table is similar to a base
fact table except that the facts are the aggregate values. The aggregate fact table consists of the
foreign keys that reference the dimension tables. In other words, the aggregate values are stored
in the fact tables categorized by the dimensions.
The interview process helps us identify the dimensions for the aggregate table. We derive
these dimensions from the base dimensions: socioeconomic dimension, student classification
dimension, time dimension and academic unit dimension. We identify that the users require
aggregates calculated on a yearly basis. Hence, the time dimension for this schema should have a
primary key for each year. In this schema, we do not need the student classification dimension
because the aggregate value includes both the undergraduate and graduate student counts. We
can modify the academic dimension as per user needs. Here, we do not modify the academic
dimension or the socioeconomic dimension. We design the aggregate schema by connecting these
dimension tables to the aggregate fact table.
An aggregate star schema [14] is similar to the base schema, with the aggregate table as the
fact table. The dimension tables conform to the base schema design. The base dimension tables
guide the design of the aggregate dimension tables. Hence, we can reuse the same dimension tables or
redesign them with modifications if required. The primary key of the aggregate
fact table is the combination of the foreign keys that reference the dimension tables [14].
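As a rough illustration of the composite key described above, the following sketch declares an aggregate fact table whose primary key combines the foreign keys to its dimension tables. It uses Python's built-in sqlite3 module as a stand-in for the MySQL database of the courseware. All table and column names are hypothetical, and the sample values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
-- Hypothetical names; the aggregate time dimension holds one row per year.
CREATE TABLE agg_time_dim      (year_id INTEGER PRIMARY KEY, year INTEGER NOT NULL UNIQUE);
CREATE TABLE academic_dim      (academic_id INTEGER PRIMARY KEY, college TEXT, university TEXT);
CREATE TABLE socioeconomic_dim (socio_id INTEGER PRIMARY KEY, unemployment_rate REAL);

CREATE TABLE enrollment_agg_fact (
    year_id          INTEGER NOT NULL REFERENCES agg_time_dim(year_id),
    academic_id      INTEGER NOT NULL REFERENCES academic_dim(academic_id),
    socio_id         INTEGER NOT NULL REFERENCES socioeconomic_dim(socio_id),
    total_enrollment INTEGER NOT NULL,            -- the pre-computed SUM
    PRIMARY KEY (year_id, academic_id, socio_id)  -- combination of the foreign keys
);
""")

# Placeholder dimension rows and one aggregate row.
conn.execute("INSERT INTO agg_time_dim VALUES (1, 2000)")
conn.execute("INSERT INTO academic_dim VALUES "
             "(1, 'Engineering and Computer Science', 'California State University, Sacramento')")
conn.execute("INSERT INTO socioeconomic_dim VALUES (1, 5.0)")
conn.execute("INSERT INTO enrollment_agg_fact VALUES (1, 1, 1, 7292)")

# A second row with the same key combination violates the composite primary key.
try:
    conn.execute("INSERT INTO enrollment_agg_fact VALUES (1, 1, 1, 9999)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

Because the key is the combination of dimension references, the table can hold at most one aggregate value per combination of year, academic unit, and socioeconomic row.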
Figure 20 Enrollment Aggregate Schema
Figure 20 shows the aggregate schema for the sum function. In this manner, we can
design the aggregate schema for the other desired aggregate functions. After designing the
aggregate schema, we calculate the sum aggregate for each year and load these values in the fact
table. A query executed on the aggregate table should return the same output as the
equivalent query on the base schema. The output should be accurate and consistent.
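The load step and the consistency requirement can be sketched as follows. This is a minimal example under stated assumptions: it uses Python's built-in sqlite3 module instead of MySQL, the schema is reduced to a time dimension and a fact table, the table and column names are hypothetical, and the enrollment counts are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE enrollment_fact (
    enrollment_id INTEGER PRIMARY KEY,
    time_id INTEGER REFERENCES time_dim(time_id),
    new_students INTEGER, transferred_students INTEGER,
    returning_students INTEGER, continuing_students INTEGER
);
CREATE TABLE enrollment_agg_fact (year INTEGER PRIMARY KEY, total_enrollment INTEGER);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, 2009, "fall"), (2, 2010, "spring"), (3, 2010, "fall")])
# Made-up sample counts: (time_id, new, transferred, returning, continuing)
conn.executemany(
    "INSERT INTO enrollment_fact (time_id, new_students, transferred_students,"
    " returning_students, continuing_students) VALUES (?, ?, ?, ?, ?)",
    [(1, 10, 5, 2, 40), (2, 12, 4, 3, 45), (3, 8, 6, 1, 50), (3, 20, 2, 0, 30)],
)

# The yearly sum as it would be computed against the base schema.
BASE_QUERY = """
SELECT t.year,
       SUM(f.new_students + f.transferred_students
           + f.returning_students + f.continuing_students) AS total_enrollment
FROM enrollment_fact AS f
JOIN time_dim AS t ON f.time_id = t.time_id
GROUP BY t.year
ORDER BY t.year
"""

# One-time load: compute the yearly sums once and store them in the aggregate table.
conn.execute("INSERT INTO enrollment_agg_fact (year, total_enrollment) " + BASE_QUERY)

base_rows = conn.execute(BASE_QUERY).fetchall()
agg_rows = conn.execute(
    "SELECT year, total_enrollment FROM enrollment_agg_fact ORDER BY year").fetchall()
print(base_rows == agg_rows)  # True: both schemas give the same answer
```

Comparing the two result sets after every load is a simple way to check that the aggregate table stays accurate and consistent with the base schema.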
When we implement aggregation, the most important task is the maintenance of the
aggregate tables [14]. The addition of new values or the deletion of old values changes the aggregate
value of the attribute.
In our case study, we would add new enrollment values for the upcoming years. The
aggregate table should reflect these additions of enrollment values. The new aggregate values for
the upcoming years need to be calculated and loaded into the aggregate table. This task of
maintaining and keeping the aggregate table current is crucial in aggregation. Short programs
such as stored procedures, triggers, or application code help maintain the aggregate table in the
data warehouse. These programs execute every time a change to the values in the base schema would
affect the aggregate schema. The maintenance is an ongoing process. Aggregate tables are
refreshed either when a new row is added or when an existing one is updated.
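One possible way to automate this maintenance is a trigger on the base fact table. The sketch below is an assumption-laden illustration rather than the courseware's actual code: it runs on Python's built-in sqlite3 module instead of MySQL, uses hypothetical table and column names, stores a single headcount per fact row, and handles only insertions (updates and deletions would need similar triggers).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE enrollment_fact (
    enrollment_id INTEGER PRIMARY KEY,
    time_id INTEGER REFERENCES time_dim(time_id),
    headcount INTEGER NOT NULL
);
CREATE TABLE enrollment_agg_fact (
    year INTEGER PRIMARY KEY,
    total_enrollment INTEGER NOT NULL
);

-- Keep the yearly aggregate current whenever a base row is added.
CREATE TRIGGER agg_after_insert AFTER INSERT ON enrollment_fact
BEGIN
    -- Create the aggregate row for this year if it does not exist yet.
    INSERT INTO enrollment_agg_fact (year, total_enrollment)
    SELECT t.year, 0 FROM time_dim AS t
    WHERE t.time_id = NEW.time_id
      AND NOT EXISTS (SELECT 1 FROM enrollment_agg_fact AS a WHERE a.year = t.year);
    -- Add the new headcount to the stored sum.
    UPDATE enrollment_agg_fact
    SET total_enrollment = total_enrollment + NEW.headcount
    WHERE year = (SELECT year FROM time_dim WHERE time_id = NEW.time_id);
END;
""")

conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, 2010, "spring"), (2, 2010, "fall"), (3, 2011, "spring")])
# Made-up headcounts for the upcoming terms.
conn.executemany("INSERT INTO enrollment_fact (time_id, headcount) VALUES (?, ?)",
                 [(1, 100), (2, 150), (3, 80)])

agg_rows = conn.execute(
    "SELECT year, total_enrollment FROM enrollment_agg_fact ORDER BY year").fetchall()
print(agg_rows)  # [(2010, 250), (2011, 80)]
```

The trigger does incremental work per inserted row, so the aggregate never has to be recomputed from scratch when enrollment values for a new year arrive.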
Performance Analysis
The aggregate tables store pre-aggregated values that would otherwise be computed
during query execution. Aggregation reduces the need for inner joins and GROUP BY clauses in
queries. A query without aggregation has to scan numerous rows and join the related values to
produce the result. A query against the aggregate tables, on the other hand, reads only a small
number of rows. We design the aggregate schema so that each row of the
aggregate table summarizes an average of 20 rows of the base table [2, 14].
Now, let us compare the performance of the query execution on the data warehouse with
and without aggregation. Considering the previous example, the user needs to know the total
number of students enrolled in the year 2000 in the College of Engineering and Computer Science
at CSUS.
We discussed the query formed against the base schema before. This time we add one
more column for the function COUNT (*) as scanned rows. The query is as follows [8]:
SELECT t.year,
       SUM(f.new_students + f.transferred_students
           + f.returning_students + f.continuing_students) AS total_enrollment,
       COUNT(*) AS scanned_rows
FROM enrollment_fact_table AS f
INNER JOIN time_dimension_table AS t
    ON f.enrollment_time_id = t.time_id
INNER JOIN student_classification_dimension_table AS s
    ON f.enrollment_student_class_id = s.student_class_id
INNER JOIN academic_dimension AS a
    ON f.enrollment_academic_id = a.academic_id
WHERE f.enrollment_id
    IN (SELECT enrollment_id FROM enrollment_fact_table
        WHERE enrollment_time_id
            IN (SELECT time_id FROM time_dimension_table
                WHERE year = 2000)
        AND enrollment_academic_id
            IN (SELECT academic_id FROM academic_dimension
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento'))
GROUP BY t.year;
This query executes on the enrollment star schema (the base star schema), which does not
implement aggregation. Table 5 shows the output of this query.
Year    Total enrollment    Scanned rows
2000    7292                20
Table 5 Query Output without Aggregation
The COUNT (*) function returns the number of rows in each group. Here, the count shows the
number of rows accessed for each resultant row (i.e., for each year). The total enrollment count for
a particular year is the sum of enrollment values for two types of degree students (graduate and
undergraduate), for two terms (fall and spring) and for five programs of the college (Civil,
Mechanical, Computer Science, Computer Engineering, and Electrical Engineering). Thus, the
total number of rows accessed to obtain the total enrollment count for year 2000 is 2 * 2 * 5 = 20.
Now, let us form the query against the aggregate schema that would output the same result on
enrollment.
SELECT year, total_enrollment, COUNT(*) AS scanned_rows
FROM enrollment_aggregate_fact
WHERE year = 2000
GROUP BY year, total_enrollment;
Year    Total enrollment    Scanned rows
2000    7292                1
Table 6 Query Output with Aggregation
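The row counts in Tables 5 and 6 can be reproduced on a toy data set. The sketch below is only an illustration: it uses Python's built-in sqlite3 module rather than MySQL, flattens the dimensions into the fact table for brevity, uses hypothetical table and column names, and assigns a placeholder headcount of 10 to every base row, so the totals differ from the real ones.

```python
import sqlite3
from itertools import product

TERMS = ["fall", "spring"]
LEVELS = ["undergraduate", "graduate"]
PROGRAMS = ["Civil", "Mechanical", "Computer Science",
            "Computer Engineering", "Electrical Engineering"]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment_fact (year INTEGER, term TEXT, level TEXT,
                              program TEXT, headcount INTEGER);
CREATE TABLE enrollment_agg_fact (year INTEGER PRIMARY KEY, total_enrollment INTEGER);
""")

# One base row per (term, level, program) combination: 2 * 2 * 5 = 20 rows.
conn.executemany(
    "INSERT INTO enrollment_fact VALUES (2000, ?, ?, ?, 10)",
    list(product(TERMS, LEVELS, PROGRAMS)),
)
# Pre-compute the yearly sum into the aggregate table.
conn.execute("""INSERT INTO enrollment_agg_fact
                SELECT year, SUM(headcount) FROM enrollment_fact GROUP BY year""")

base = conn.execute("""SELECT year, SUM(headcount), COUNT(*)
                       FROM enrollment_fact WHERE year = 2000
                       GROUP BY year""").fetchone()
agg = conn.execute("""SELECT year, total_enrollment, COUNT(*)
                      FROM enrollment_agg_fact WHERE year = 2000
                      GROUP BY year, total_enrollment""").fetchone()
print(base)  # (2000, 200, 20)
print(agg)   # (2000, 200, 1)
```

Strictly speaking, COUNT(*) reports how many rows fall into each group rather than how many rows the engine physically reads, but it serves here as a convenient proxy for the work each query has to do.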
We execute both queries several times and record the execution time of each run. We then
calculate the mean of these observations. The first query takes approximately 0.050 milliseconds
to execute. The second query, which runs against the aggregate table, needs about 0.030
milliseconds on the enrollment data warehouse, roughly 40 percent less than the first. We carried
out such executions against the enrollment data warehouse for a variety of other queries. These
experiments and observations verified that aggregation reduces the query execution time and
improves the performance of the enrollment data warehouse.
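A measurement of this kind could be scripted along the following lines. This is a hedged sketch, not the procedure used in the project: it times queries on an in-memory SQLite database through Python's sqlite3 module, the helper mean_runtime_ms and all table names are hypothetical, and the data is synthetic. Absolute timings will differ from the MySQL figures reported above and vary from run to run.

```python
import sqlite3
import time
from statistics import mean

def mean_runtime_ms(conn, sql, params=(), repeats=50):
    """Run `sql` `repeats` times; return (mean elapsed milliseconds, last result)."""
    samples = []
    rows = None
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples), rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment_fact (year INTEGER, headcount INTEGER);
CREATE TABLE enrollment_agg_fact (year INTEGER PRIMARY KEY, total_enrollment INTEGER);
""")
# Synthetic data: 20 base rows per year, mirroring the 20:1 ratio in the text.
conn.executemany("INSERT INTO enrollment_fact VALUES (?, ?)",
                 [(year, 10 + i) for year in range(1990, 2011) for i in range(20)])
conn.execute("""INSERT INTO enrollment_agg_fact
                SELECT year, SUM(headcount) FROM enrollment_fact GROUP BY year""")

base_ms, base_rows = mean_runtime_ms(
    conn,
    "SELECT year, SUM(headcount) FROM enrollment_fact WHERE year = ? GROUP BY year",
    (2000,))
agg_ms, agg_rows = mean_runtime_ms(
    conn,
    "SELECT year, total_enrollment FROM enrollment_agg_fact WHERE year = ?",
    (2000,))

print(f"base: {base_ms:.4f} ms, aggregate: {agg_ms:.4f} ms")
```

Averaging over many repetitions smooths out caching and scheduling noise, which matters when individual runs take only a fraction of a millisecond.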
This chapter concludes the discussion on courseware demonstrations. In the next chapter,
we discuss the evaluation of the courseware and offer a perspective on this project.
Chapter 7
COURSEWARE EVALUATION
In the earlier chapters, we completed the discussion on the contents of the courseware. In
this chapter, we present the assessment of the courseware, which substantiates the operational
success of the courseware tool. The success depends on how effective the end users (learners or
students) find the courseware tool in understanding data warehousing. As a part of this
project, we carried out a study to test the effectiveness of the courseware tool.
We integrated the courseware with an introductory data warehousing and data mining
class in Spring 2010. We introduced the courseware as an eLearning tool to the students of the
Computer Science course CSc 177: Data Warehousing and Data Mining, at California State University,
Sacramento. This class of 20 students evaluated the first version of the courseware. The students
were upper division undergraduate and graduate students of the Computer Science
Department. We conducted a survey on the courseware in this class. The students stayed personally
engaged in using the courseware to understand the fundamentals of data warehousing. The overall
assessment from this student group on this courseware was extremely encouraging.
We received positive feedback from the survey takers. The survey takers found that the
courseware is very accessible and helpful for understanding the fundamentals of data warehousing.
They also found the figures and examples supportive. According to them, the courseware
complemented the course lectures very well. In addition, the students were able to follow the
steps and illustrations in the courseware very easily. They found the simplicity and natural
progression of the courseware website useful for learning. The quizzes in the courseware were
handy for them when reviewing for tests.
We also obtained constructive feedback from the students on the courseware. The
feedback suggests that the results generated from the demo required further verification. It would
be beneficial to integrate a data-preprocessing component and a data-mining component into the
courseware. Improvements to the enrollment prediction system, the data mining system, and the
application for the enterprise data warehouse would also be advantageous.
Based on the input from this student group, we added an on-line feedback component for
the tool users. Figure 21 shows a snapshot of this component. This component collects tool
evaluation data from the users, providing us with a quantitative measurement of the degree of user
satisfaction. It also allows the users to offer constructive suggestions to us on an ongoing basis.
We believe that this component is necessary for the success of a developing courseware. It makes
the courseware more efficient and durable while offering it scope for improvement.
Figure 21 Feedback Component
Chapter 8
CONCLUSION
Although there are other online courseware tools for various learning topics, such as
(Kevin Woods, 2007) [9], we have not found an on-line courseware devoted exclusively to data
warehousing. This tool presents the whole development life cycle of a data
warehouse using a case study with a set of supplementary examples. The main advantages of
the courseware are its usefulness, scope, and accessibility to beginning data warehouse
designers and developers.
Through this courseware, we presented the comprehensive design and functionality of a
web-based tool for learning fundamental concepts of data warehousing. The courseware
demonstrates the importance of data warehousing in an enterprise. It offers a systematic method
to design a data warehouse using a case-based approach. In this case study, we develop the data
warehouse for the university using a bottom-up approach [2, 10, 12]. The data sources include:
(1) the enrollment data from California State University at Sacramento, and (2) the related public
data of California [1]. The courseware not only provides the enrollment data-warehouse design
for the university but also demonstrates the capability of data warehouse for data reporting, data
mining and data analysis on these data sources.
The courseware further illuminates the performance parameters of a data warehouse. It
validates improvement in the data warehouse performance by comparing the performance
parameters (query execution time) on the data warehouse with and without implementing
aggregation.
Finally, we substantiate the success of the courseware by integrating the courseware in
the data warehousing class and obtaining continuous feedback from the students. A feedback link
in the website contributes to the ongoing evaluation of courseware from the online users.
The courseware provides enormous opportunities for development. There are many areas
for future research work extending this project, which include strengthening of the case study
structure, refinement of concept description and web presentation, and addition of new
components on other related topics. The list of case study topics to be added includes ETL, data
mining, and data preprocessing [3].
This project allowed me to combine the theory and implementation of data warehousing
principles into a great learning experience. It offered practice in data generation, design,
real-time data collection, data loading, data extraction and data analysis. It also provided an
opportunity to develop a 3-tier application using PHP, HTML, JavaScript and MySQL from
scratch. In addition, it provided a foundation for future professional growth in technical
areas such as data warehousing and web development. Future work for this project can add
new topics, such as ETL, to the courseware.
APPENDICES
APPENDIX A
Enrollment Report
Enrollment report generated by the courseware on the data warehouse for the last 5 years for
undergraduate students enrolled in the College of Engineering
Department            Year  Term    New  Transferred  Continuing  Returning
Electrical            2006  Fall    41   67           21          86
Civil                 2006  Spring  89   127          47          28
Mechanical            2006  Fall    64   28           57          84
Electrical            2006  Spring  57   61           132         50
Computer Engineering  2006  Fall    120  137          45          109
Computer Science      2006  Fall    20   80           320         54
Mechanical            2006  Spring  149  102          44          122
Civil                 2006  Fall    80   119          42          148
Computer Engineering  2006  Spring  103  61           118         101
Computer Science      2006  Spring  20   50           380         76
Electrical            2007  Spring  25   117          106         95
Computer Engineering  2007  Fall    123  96           63          46
Computer Science      2007  Fall    25   85           321         32
Mechanical            2007  Spring  51   89           65          142
Civil                 2007  Fall    61   90           43          70
Computer Engineering  2007  Spring  44   92           25          114
Computer Science      2007  Spring  45   44           381         45
Electrical            2007  Fall    99   43           39          60
Civil                 2007  Spring  94   106          29          38
Mechanical            2007  Fall    86   113          109         49
Civil                 2008  Fall    35   80           79          113
Computer Engineering  2008  Spring  149  145          11          91
Computer Science      2008  Spring  100  38           375         6
Electrical            2008  Fall    81   56           19          136
Civil                 2008  Spring  90   27           89          147
Mechanical            2008  Fall    118  107          15          78
Electrical            2008  Spring  136  63           104         125
Computer Engineering  2008  Fall    139  85           14          44
Computer Science      2008  Fall    45   89           321         5
Mechanical            2008  Spring  107  88           116         50
Civil                 2009  Spring  33   74           38          101
Mechanical            2009  Fall    53   86           33          119
Electrical            2009  Spring  29   148          17          27
Computer Engineering  2009  Fall    99   89           74          52
Computer Science      2009  Fall    340  86           322         30
Mechanical            2009  Spring  94   94           30          29
Civil                 2009  Fall    150  140          11          38
Computer Engineering  2009  Spring  117  104          82          54
Computer Science      2009  Spring  320  44           366         60
Electrical            2009  Fall    150  148          27          30
Computer Engineering  2010  Fall    71   83           45          89
Computer Science      2010  Fall    340  82           322         10
Mechanical            2010  Spring  98   28           27          50
Civil                 2010  Fall    73   122          37          114
Computer Engineering  2010  Spring  149  78           45          31
Computer Science      2010  Spring  260  49           344         50
Electrical            2010  Fall    98   121          26          125
Civil                 2010  Spring  138  125          91          50
Mechanical            2010  Fall    49   65           43          96
Electrical            2010  Spring  76   120          40          46
APPENDIX B
Enrollment with Socioeconomic Report
Enrollment report generated by the courseware on the data warehouse for new graduate students
for the last 5 years with the socioeconomic factors
Department  Year  Term    Unemployment rate  Tuition ($)  BS graduate rate  New Enrolled
Electrical  2006  Fall    6                  1008         51                97
CSc         2006  Spring  6                  1008         51                22
Civil       2006  Spring  6                  1008         51                111
Mechanical  2006  Fall    6                  1008         51                129
Comp Engg.  2006  Fall    6                  1008         51                54
Electrical  2006  Spring  6                  1008         51                95
CSc         2006  Fall    6                  1008         51                30
Civil       2006  Fall    6                  1008         51                58
Mechanical  2006  Spring  6                  1008         51                45
Comp Engg.  2006  Spring  6                  1008         51                128
Civil       2007  Spring  6                  1125         51                56
Mechanical  2007  Fall    6                  1125         51                114
Comp Engg.  2007  Fall    6                  1125         51                28
Electrical  2007  Spring  6                  1125         51                68
CSc         2007  Fall    6                  1125         51                32
Civil       2007  Fall    6                  1125         51                120
Mechanical  2007  Spring  6                  1125         51                119
Comp Engg.  2007  Spring  6                  1125         51                49
Electrical  2007  Fall    6                  1125         51                117
CSc         2007  Spring  6                  1125         51                13
Mechanical  2008  Fall    5.9                1230         50                27
Comp Engg.  2008  Fall    5.9                1230         50                150
Electrical  2008  Spring  5.9                1230         50                58
CSc         2008  Fall    5.9                1230         50                200
Civil       2008  Fall    5.9                1230         50                113
Mechanical  2008  Spring  5.9                1230         50                61
Comp Engg.  2008  Spring  5.9                1230         50                126
Electrical  2008  Fall    5.9                1230         50                27
CSc         2008  Spring  5.9                1230         50                430
Civil       2008  Spring  5.9                1230         50                59
CSc         2009  Fall    5.8                1335         41                230
Civil       2009  Fall    5.8                1335         41                109
Mechanical  2009  Spring  5.8                1335         41                49
Comp Engg.  2009  Spring  5.8                1335         41                116
Electrical  2009  Fall    5.8                1335         41                64
CSc         2009  Spring  5.8                1335         41                340
Civil       2009  Spring  5.8                1335         41                99
Mechanical  2009  Fall    5.8                1335         41                44
Comp Engg.  2009  Fall    5.8                1335         41                31
Electrical  2009  Spring  5.8                1335         41                40
Civil       2010  Fall    5.8                1440         32                87
Mechanical  2010  Spring  5.8                1440         32                43
Comp Engg.  2010  Spring  5.8                1440         32                135
Electrical  2010  Fall    5.8                1440         32                56
CSc         2010  Spring  5.8                1440         32                345
Civil       2010  Spring  5.8                1440         32                85
Mechanical  2010  Fall    5.8                1440         32                126
Comp Engg.  2010  Fall    5.8                1440         32                36
Electrical  2010  Spring  5.8                1440         32                133
CSc         2010  Fall    5.8                1440         32                654
APPENDIX C
Enrollment Prediction Report
Enrollment prediction report generated by the courseware on the data warehouse for undergraduate
students for the last 5 years
Department            Year  Term    Total predicted  Total Enrolled
Electrical            2006  Spring  185              300
Computer Science      2006  Fall    525.686          474
Computer Engineering  2006  Fall    145              411
Civil                 2006  Spring  114              291
Mechanical            2006  Fall    95               233
Computer Science      2006  Spring  479.003          526
Computer Engineering  2006  Spring  288              383
Electrical            2006  Fall    90               215
Mechanical            2006  Spring  341              417
Civil                 2006  Fall    321              389
Electrical            2007  Spring  179              343
Computer Science      2007  Fall    528.076          463
Computer Engineering  2007  Fall    201              328
Civil                 2007  Spring  470              267
Mechanical            2007  Fall    552              357
Computer Science      2007  Spring  475.377          515
Computer Engineering  2007  Spring  544              275
Electrical            2007  Fall    289              241
Mechanical            2007  Spring  205              347
Civil                 2007  Fall    160              264
Computer Engineering  2008  Fall    192              282
Civil                 2008  Spring  467              353
Mechanical            2008  Fall    374              318
Computer Science      2008  Spring  467.678          519
Computer Engineering  2008  Spring  236              396
Electrical            2008  Fall    35               292
Mechanical            2008  Spring  308              361
Civil                 2008  Fall    135              307
Electrical            2008  Spring  206              428
Computer Science      2008  Fall    527.456          460
Civil                 2009  Spring  35               246
Mechanical            2009  Fall    370              291
Computer Science      2009  Spring  444.786          790
Computer Engineering  2009  Spring  155              357
Electrical            2009  Fall    107              355
Mechanical            2009  Spring  451              247
Civil                 2009  Fall    309              339
Electrical            2009  Spring  305              221
Computer Science      2009  Fall    525.891          778
Computer Engineering  2009  Fall    79               314
Computer Science      2010  Spring  425              703
Computer Engineering  2010  Spring  402              303
Electrical            2010  Fall    439              370
Mechanical            2010  Spring  119              203
Civil                 2010  Fall    392              346
Electrical            2010  Spring  359              282
Computer Science      2010  Fall    516.457          754
Computer Engineering  2010  Fall    418              288
Civil                 2010  Spring  221              404
Mechanical            2010  Fall    246              253
APPENDIX D
Documentation on Courseware Website
Please see the attached CD-ROM containing the code files for the Courseware website
design in HTML, PHP, JavaScript and MySQL.
BIBLIOGRAPHY
1. Svetlana S. Aksenova, “Enrollment projection through data mining”, MS project report,
CSUS, 2005.
2. Prof. Lu, CSc -177 Lecture Notes, Spring 2010. Course Website:
http://gaia.ecs.csus.edu/~mei/177/csc177.html
3. Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd Edition,
Morgan Kaufmann Publishers, 2006.
4. Christopher Adamson, Michael Venerable, “Data Warehouse Design Solutions”, Wiley
Publishing Inc., 1998.
5. Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite, “The Data Warehouse
Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data
Warehouses”, Wiley Publishing Inc., 1998.
6. Computer Science Reports, Office of Institutional Research, California State University,
Sacramento [Online]. Available: http://www.oir.csus.edu/Reports/FactBook/DEPT/CSC.cfm
7. Microsoft Excel Support [Online]. Available: http://office.microsoft.com/en-us/excel-help/
8. MySQL Reference Manual [Online]. Available: http://dev.mysql.com/doc/refman/5.0/en/
9. Kevin C. Woods, “XML data representation and transformations for bioinformatics”, MS
project report, CSUS, 2007.
10. Imhoff, Galemmo and Geiger, “Mastering Data Warehouse Design”, Wiley Publishing Inc.,
2003.
11. W. H. Inmon, “Building the Data Warehouse”, John Wiley & Sons, Inc, NY, 2005.
12. Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit: The Complete Guide to
Dimensional Modeling”, Wiley Publishing Inc., 2003.
13. Jim Gray et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By,
Cross-Tab, and Sub-Totals”, Kluwer Academic Publishers, 1997.
14. Christopher Adamson, “Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance”, Wiley Publishing Inc., 2006.