A COURSEWARE OF ONLINE ANALYTICAL PROCESSING CUBE

A Project
Presented to the faculty of the Department of Computer Science
California State University, Sacramento

Submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE in Computer Science

by Sudha Chakravarthy

FALL 2014

© 2014 Sudha Chakravarthy
ALL RIGHTS RESERVED

A COURSEWARE ON ONLINE ANALYTICAL PROCESSING CUBE

A Project by Sudha Chakravarthy

Approved by:
Dr. Meiliu Lu, Committee Chair
Dr. Ying Jin, Second Reader

Student: Sudha Chakravarthy

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

Dr. Nikrouz Faroughi, Graduate Coordinator
Department of Computer Science

Abstract
of
A COURSEWARE OF ONLINE ANALYTICAL PROCESSING CUBE
by Sudha Chakravarthy

OLAP is short for Online Analytical Processing. It provides a multidimensional view of data and answers multidimensional queries by using database tables. OLAP allows users to analyze and view data along different dimensions, and it helps them make strategic and tactical decisions based on the information stored in data warehouses. The main objective of this project is to provide students with a web-based interactive tutorial on OLAP and its operations, with live examples and exercises. The courseware on OLAP will be available to all interested users around the world. It helps students learn the concepts of OLAP in practice by running open sample queries on example data. Students also get an opportunity to work on exercises based on the examples demonstrated to them.
The courseware uses data that was extracted from various sources and loaded into SQL tables, so that open queries can be run on the data to demonstrate the OLAP operations. The website was built with HTML, PHP and MySQL. Anyone with internet access is free to use the tool; however, certain portions of the website have limited access and may ask for the user's credentials.

The OLAP courseware is integrated with two other coursewares created earlier, so that students have easy access to all the concepts of data mining and data warehousing from a single website. The OLAP courseware was used and evaluated positively in the CSC 177 Data Mining and Data Warehousing class at CSU Sacramento in fall 2014.

Dr. Meiliu Lu, Committee Chair

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank all the people who have helped me through this process. My heartfelt thanks and a special mention go to Dr. Meiliu Lu for giving me the opportunity to work under her on my master's project. She has been very patient and encouraging, and she guided me through the entire process. Her detailed feedback was really helpful during the design and development of my project. My sincere thanks to Prof. Ying Jin for being my second reader. I also take this opportunity to thank Dr. Nikrouz Faroughi for his review of the project. Finally, I would like to thank my husband for his full support, motivation and encouragement during this endeavor.

TABLE OF CONTENTS

Acknowledgements
List of Tables
List of Figures
Chapter
1. INTRODUCTION
2. BACKGROUND
    2.1 Need for the Tool
    2.2 Scope
    2.3 Project Development and Design
    2.4 Technology Used
3. OLAP LEARNING TOOL ARCHITECTURE
4. ROAD MAP FOR OLAP COURSEWARE
    4.1 Multidimensional Modeling
    4.2 OLAP Cube
    4.4 Data Models in the OLAP World
    4.5 OLAP Operations
    4.6 Examples
    4.7 Exercise
        4.7.1. Authentication
    4.8 Quiz
5. LEARNING BY EXAMPLES
    5.1 Data Source
    5.2 Data Preprocessing
    5.3 Purpose of the Data Mart
    5.4 Data Mart Design
        5.4.1 Concept Hierarchy
    5.5 Open Queries and Results of OLAP Operations
6. COURSEWARE INTEGRATION
7. COURSEWARE EVALUATION
8. CONCLUSION
Appendix A Rollup Report
Appendix B Roll Down Report
Appendix C Slicing Report
Appendix D Scoping Report
Appendix E Dicing Report
Bibliography

LIST OF TABLES

Table 1 Cube representing recorded temperatures
Table 2 Rollup operation on recorded temperatures
Table 3 Roll down operation - Reduction of time hierarchy
Table 4 Slicing operation on mild dimension
Table 5 Dicing operation on cool and mild dimensions
Table 6 Preprocessed movie and award data set
Table 7 Fact table
Table 8 Year dimension table
Table 9 Movie dimension table
Table 10 Award type dimension table
Table 11 Award Categories dimension table
Table 12 Names dimension table
Table 13 Single table dataset
Table 14 Roll up using single table dataset
Table 15 Roll up result using star schema
Table 16 Atonement movie awards
Table 17 Roll down on atonement movie
Table 18 Slice operation on movie and year
Table 19 Dice operation on movie, year and name
Table 20 Scope operation using year dimension

LIST OF FIGURES

1. Data warehouse as a buffer between OLTP and OLAP
2. OLAP courseware architecture
3. Courseware main page
4. Steps in the OLAP creation process
5. OLAP cube with TIME and PRODUCT as dimensions
6. Pivot
7. Exercise page
8. Snapshot of Quiz page
9. Data Mart Design
10. Star schema for movie and awards data mart
11. Concept Hierarchy
12. Snapshot of the integrated courseware
13. Feedback results for questions 1, 5 and 6
14. Feedback for questions 7, 8 and 9
Chapter 1
INTRODUCTION

A data warehouse is a centralized repository that stores data from multiple information sources and transforms it into a common, multidimensional data model for efficient querying and analysis. Data warehouses contain consolidated data of varying sizes from several databases and other data sources, collected over long periods of time. Data collected over long periods and stored in a single repository is called historical data. Data warehousing has become generally accepted as the best approach for providing an integrated, consistent source of data for data analysis and business decision making. However, data warehousing can present complex issues and requires significant time and resources to implement [1].

For a data warehouse of this size, query throughput and response times are very important. To facilitate complex analyses, data in a data warehouse is typically modeled in a multidimensional fashion. Modeling data multidimensionally makes it simpler, more expressive and easier to understand, so that users can make decisions easily. As the name indicates, multidimensional modeling concentrates on dimensions, facts and measures.

Data can be processed in two ways: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). OLTP deals with data used for transactions, on which several operations are applied. For example, an ATM holds the current data of concurrent users; the PIN, the amount withdrawn and the balance are its data elements. OLTP allows users to insert, update and delete data, operations otherwise known as small online transactions. OLAP deals with large amounts of historical data, answers multidimensional queries and lets users view data along different dimensions. Because OLAP deals with a large amount of historical data, response time is an essential measure.
For example, if an airline company wants to set a new price for a flight, five years of historical flight reservation data, such as peak travel hours and cost, can help the company decide on the price. The data in an OLAP database is usually organized in a schema such as a star schema.

A data warehouse acts as a buffer between OLTP and OLAP. OLTP contains the data of concurrent users. OLAP works with historical data collected over long periods of time, models it for easier viewing and helps users make their best decisions. The data warehouse holds both the historical data and the concurrent data; it reconciles the data and processes it so that OLAP systems can derive their data from it. The figure below shows how the data warehouse acts as a buffer between OLAP and OLTP.

[Figure 1 diagram: OLTP systems (concurrent/operational data) feed the data warehouse (reconciled data), which feeds OLAP systems (historical/derived data).]

Figure 1 Data warehouse as a buffer between OLTP and OLAP.

The figure shows that the data warehouse obtains incoming data from multiple OLTP sources and delivers it as reconciled data to OLAP applications.

This courseware mainly studies OLAP and its operations, and provides a practical learning experience about OLAP for any user. This chapter gave an overview of data warehouses, OLTP and OLAP systems. Chapter 2 provides background information, the need for the courseware and the scope of the project. Chapter 3 presents the architecture of the OLAP courseware. Chapter 4 discusses the implementation and design of the courseware in detail. Chapter 5 explains examples of OLAP operations with demonstrations and supporting code.
Chapter 6 explains the need for integrating the courseware with the other coursewares created for the data mining and data warehousing course. Chapters 7 and 8 cover the evaluation of the courseware by students taking the data mining and data warehousing course and the conclusion of the report.

Chapter 2
BACKGROUND

OLAP helps analyze the historical data stored in a data warehouse over long periods of time across various dimensions. Users can view the data along these dimensions, analyze it and make strategic decisions.

2.1 Need for the Tool

The main motivation behind creating this courseware is that there is no live schema that demonstrates to students how OLAP operations work. Many websites explain OLAP operations theoretically. Although the theoretical explanations are clear enough for a user to understand the concepts of OLAP, a practical demonstration gives users insight into how the OLAP operations work in the real world.

2.2 Scope

The courseware is for users who already have some knowledge of databases and would like to learn in depth about OLAP and its operations. OLAP can be divided into several operations that help the user view the data in the data warehouse along different dimensions. The courseware gives users a self-paced study of the topic by providing examples, followed by exercises. The examples for every OLAP operation are demonstrated using open sample queries. The results of each OLAP operation are retrieved from the database and displayed as tables for easier viewing. The examples are followed by exercises that encourage the user to work out questions based on the examples demonstrated. After the examples and exercises comes an online quiz that helps users evaluate their knowledge of the topic.
2.3 Project Development and Design

The development of the project took place in two phases:

a) RESEARCH AND ANALYSIS: Before the courseware was developed, a considerable amount of research was done to understand the concepts of OLAP and its operations. To demonstrate the OLAP operations, I also had to choose a dataset that would interest students in the courseware. The dataset chosen to illustrate the examples is a movie and award dataset, created from IMDb and Wikipedia. It shows the movies and nominees that won awards in the respective years. I chose another dataset, the car and truck sales data, for the exercise questions.

b) DESIGN AND DEVELOPMENT: The design of the project was divided into three parts. The first is the front end, where users see the layout of the courseware, with information and tabs leading to separate pages that explain the concepts covered. The second concerns the datasets used in the project, which underwent certain data mining processes before being loaded into the database. The third is the back end, where data from the database is displayed to the user based on the query submitted. The datasets are loaded into their respective tables in the database. Example queries demonstrating each OLAP operation are executed against the data, and the results are displayed to the user as tables.

The framework of the courseware can be divided into three parts:
a) Front end
b) Data
c) Back end

2.4 Technology Used

This section describes the various technologies used for developing the OLAP courseware.
As mentioned above, the courseware can be divided into three layers. The technologies used in each layer are discussed below.

FRONT END: The front end is the main page that users first see when they access the courseware from a web browser. The front end was built with PHP and HTML. The main reason for choosing PHP is its compatibility with MySQL, which makes data retrieval and information display easy. The main features of PHP include:
- Open source
- Compatibility with MySQL
- Online library support
- Easily embedded into HTML

BACK END: The back end is not accessed by the user directly; it is accessed through the front end. The back-end technologies used in this project are:

MS Access was used to load all the data as a dataset. Loading the data into MS Access makes it easy to insert the data into the database tables. All the data cleansing was performed in MS Access.

MySQL Workbench was used to create the database and tables. Data from the respective Excel sheets was loaded into the database tables. Workbench has a feature called Import/Export data, which allows the data in Excel sheets to be imported directly into the database tables. Right-clicking a table name lets us copy the query that creates the table and the inserts made into it.

PHP and MySQL are widely used on the California State University, Sacramento campus, and a command prompt was used to access MySQL on remote servers. The main features of MySQL for this project include:
- Ease of use: only basic knowledge is required to work with SQL.
- Security: MySQL protects sensitive data from intruders.
- Scalability: MySQL can handle almost any amount of data, for example 300-400 million rows.

Chapter 3
OLAP LEARNING TOOL ARCHITECTURE

This chapter describes the architecture implemented in the courseware.
The architecture of the OLAP courseware is divided into four separate layers for the convenience of the users who access it: the presentation layer, the logic layer, the database layer and the data layer. The block diagram below shows the four layers of the courseware.

[Figure 2 diagram: browser (IE, Firefox); presentation layer (PHP/HTML); logic layer; database layer (MySQL); data layer (dataset in MS Excel).]

Figure 2 OLAP courseware architecture

Presentation layer: The presentation layer is the outermost layer of the courseware. The web interface that serves as the front end is built with PHP and HTML. The presentation layer is organized into tabs that give access to the information. The subject is organized into introduction, multidimensional modeling, OLAP operations in detail, examples, exercise, quiz and references. The content of the courseware includes visual diagrams and attachments that support the information provided. This user-friendly web interface also supports report generation in tabular form so that results are easy to understand.

Logic layer: This layer acts as a middleman between the data warehouse and the web interface, and takes care of everything that happens behind the web interface. The logic layer, also called the business logic layer, retrieves information and processes the information sent from the user interface to the other layers in the architecture. It also checks the consistency of the data presented on the web interface. Configuration was performed to establish a connection between the front end and the database.
The code that establishes the connection is:

    <?php
    // Open a connection with the legacy mysql extension (removed in PHP 7;
    // mysqli_connect() is the current equivalent). Arguments are placeholders.
    mysql_connect("server url", "username", "password") or die(mysql_error());
    echo "Connected to MySQL<br />";
    ?>

Database layer: This is the layer from which all stored data is retrieved and displayed to the user in tabular form. The database layer was built with MySQL Workbench. The courseware uses the movie and award data as the main data for illustrating the OLAP operations. The schema (database) is created and the data is loaded into individual tables. When the user presses the "execute query" button under an illustration of an OLAP operation, this layer processes the query and retrieves the information stored in the necessary tables. This layer also stores the results of the analyzed data.

Data layer: This layer contains the raw dataset extracted from various sources across the web. The raw dataset is stored in MS Excel files and undergoes some preprocessing before it is loaded into the database. All the files had to be carefully analyzed to choose the correct attributes for the final data. The data also had to be cleaned, and missing values had to be filled in. Careful decisions had to be made about which data to keep and which to delete.

A layered architecture offers several advantages. Adding the business logic layer makes the courseware's framework thinner and allows information to flow quickly to the users. The courseware can be accessed by any number of users, so the system has to be scalable, which the n-tier architecture achieves. A scalable website helps ensure good performance. The n-tier architecture also provides security for the data stored in the database.

This chapter provided information about the framework of the courseware.
The upcoming chapters provide in-depth information about the courseware and its components.

Chapter 4
ROAD MAP FOR OLAP COURSEWARE

This chapter describes the OLAP courseware in detail and explains how to use it. The courseware may not provide every feature a user wants, but it is a learning platform on which users can obtain basic knowledge of OLAP and its operations. The courseware contains six main tabs that cover all the information about OLAP.

Figure 3 Courseware main page

The layout of the courseware is simple and user friendly. The following sections explain the main topics covered in the courseware.

4.1 Multidimensional Modeling

Data is normally distributed across several data sources that are not compatible with each other. An example of incompatible data is customer age, which is stored as a birth date for purchases made online and as an age category for in-store sales. OLAP helps extract data from various data sources and make it compatible, ensuring that the meaning of the data in one data store matches that in all other data stores. OLAP collects historical data and makes it compatible so that users can analyze it and make tactical decisions. Without OLAP it would be hard to obtain reports such as "What are the popular products purchased by customers between the ages of 20 and 30?" This type of data organization is called multidimensional modeling.

[Figure 4 diagram: extract data from sources, transform data, import to database, build cube, provide report.]

Figure 4 Steps in the OLAP creation process

To help the user analyze and view the data, OLAP uses tables from the databases. This lets the user view the data in a multidimensional way, and analyze and query it to make logical and tactful decisions. Data that is to be viewed multidimensionally should be stored as an array or cube.
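The customer age example can be sketched in a few lines. The snippet below is an illustration only, written in Python rather than the courseware's PHP/MySQL stack. The bracket boundaries, the record layout, the reference date and the product names are all assumptions made for this example and are not data from the courseware.

```python
from collections import Counter
from datetime import date

# Hypothetical age brackets; the courseware does not specify any.
BRACKETS = [(0, 19, "under 20"), (20, 30, "20-30"), (31, 200, "over 30")]


def age_on(birth_date, today):
    """Whole years between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def bracket_for(age):
    """Map an exact age to its bracket label."""
    for low, high, label in BRACKETS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


def harmonize(online_sales, store_sales, today):
    """Return (age_bracket, product) pairs from both sources.

    online_sales: (birth_date, product) pairs, age stored as a birth date.
    store_sales:  (age_bracket, product) pairs, age already categorized.
    """
    rows = [(bracket_for(age_on(born, today)), product)
            for born, product in online_sales]
    rows.extend(store_sales)
    return rows


# Made-up sample records, for illustration only.
online = [(date(1990, 5, 17), "soap"), (date(1975, 1, 2), "shampoo")]
store = [("20-30", "soap"), ("over 30", "soap")]

rows = harmonize(online, store, today=date(2014, 12, 1))
# Report: most popular product among customers aged 20 to 30.
popular_20_30 = Counter(p for bracket, p in rows if bracket == "20-30")
print(popular_20_30.most_common(1))
```

Once both sources share the same age representation, the report on popular products for a given age group becomes a simple filter and count.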
4.3 OLAP Cube

OLAP cubes are designed to retrieve data efficiently. A cube is a group of data cells arranged by dimensions and measures. Dimensions describe how we want to view the data, for example time and product. Measures are items that can be counted, summarized or aggregated, such as sales amount. A cube can also link data from different dimensions, and it lets the user view the data along different dimensions.

[Figure 5 diagram: a cube with time on one axis and product on the other.]

Figure 5 OLAP cube with TIME and PRODUCT as dimensions.

A data cube can be created in the following steps:
- Select dimensions: choose a set of dimensions, which is a subset of the attributes. For example, the cube above uses time and product as dimensions, which makes it a two-dimensional data cube.
- Select concept hierarchy levels: choose a level for each dimension, for example weeks for time and soap for product.
- Populate the cube by selecting a measure.

4.4 Data Models in the OLAP World

Cubes can be stored in three different models:
- Relational OLAP (ROLAP)
- Multidimensional OLAP (MOLAP)
- Hybrid OLAP (HOLAP)

a) ROLAP: To provide fast response times to OLAP applications, super relational features are added to a traditional RDBMS to form a super relational database system. The data in this model is stored in relational databases. Relational OLAP follows dimensional modeling, which organizes data into measures stored in fact tables and dimensions stored in dimension (satellite) tables. ROLAP uses data marts. Data marts are databases that share many of the features of data warehouses but are smaller in scope. Data can be stored in two ways in a relational model:

i) Star schema: Storing data in a star schema overcomes usability and performance problems. A star schema consists of one fact table and several dimension tables.
ii) Snowflake schema: A snowflake schema is created to avoid the redundancy and wasted storage caused by the denormalized structure of the dimension tables. A snowflake schema is a normalized star schema.

b) MOLAP: MOLAP is a special-purpose data model in which multidimensional data and operations are mapped directly.

c) HOLAP: HOLAP is a combination of ROLAP and MOLAP. HOLAP allows one part of the data to be stored in a MOLAP repository and another part in a ROLAP repository, using the advantages of each.

4.5 OLAP Operations

This section covers the main OLAP operations. Before that, we need to understand certain concepts. In a multidimensional data model, the data is organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by a concept hierarchy. This organization gives users the flexibility to view data from various perspectives. For example, take attributes such as day and temperature. Climbing up the concept hierarchy for day gives week, and grouping individual temperature readings into categories gives hot, mild or cool temperatures.

OLAP provides a user-friendly platform for interactive data analysis. OLAP operations display views of the data along different dimensions, allowing interactive querying and analysis of the data.

a) ROLL UP

The roll up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. To understand roll up, consider the following cube of temperatures recorded on certain days of each week:

Table 1 Cube representing recorded temperatures

If we want to set temperature levels such as mild and cold in the above cube, we need to group the columns and add the values according to the concept hierarchy.
As a result we obtain the cube below:

Table 2 Rollup operation on recorded temperatures

b) ROLL DOWN

The roll down operation is the opposite of the roll up operation. It is also called the drill down operation. Its purpose is to navigate from less detailed data to more detailed data. Roll down can be realized in two ways:
i) Stepping down a concept hierarchy for a dimension
ii) Introducing new dimensions

To perform roll down on the cube represented by Table 1, we can descend the time hierarchy from the level of week to the more detailed level of day. New dimensions can also be added to the cube, since drill down adds more detail to the data. The table below shows the result of the drill down operation performed on the above cube.

Table 3 Roll down operation - Reduction of time hierarchy

c) SLICING

Slice performs a selection on one dimension of the given cube, resulting in a sub cube. For example, if we select on a single dimension, temperature = mild, in the example cube shown in Table 1, we obtain the following cube:

Table 4 Slicing operation on mild dimension

d) DICING

The dicing operation performs a selection on two or more dimensions and obtains a sub cube. For example, applying the selection time = day 3 or day 4 and temperature = cool or mild to the original cube gives a two-dimensional sub cube, as shown in the table below:

Table 5 Dicing operation on cool and mild dimensions

e) Pivot: Pivot, also called rotate, changes the dimensional orientation of the cube. It rotates the axes of the data so that the data can be viewed from different perspectives. Pivot groups data with different dimensions. The figure below illustrates pivot.

[Figure 6 diagram: the TEMPERATURE and TIME axes of the cube are swapped.]

Figure 6 Pivot

f) Scoping: The scoping operation restricts the view of database objects to a specified subset.
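Roll up, roll down, slice and dice can be tried on a small table. The following is a minimal sketch, not the courseware's own queries: it uses Python's built-in sqlite3 module in place of MySQL, and the table name `readings`, its columns and every temperature value are invented for illustration, since the contents of Tables 1 to 5 are not reproduced here.

```python
import sqlite3

# In-memory SQLite stands in for the courseware's MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (week INTEGER, day INTEGER, "
    "temperature INTEGER, level TEXT)"
)
# Made-up readings: week and day form the time hierarchy,
# level is the temperature category (cool, mild, hot).
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        (1, 1, 64, "cool"), (1, 2, 70, "mild"), (1, 3, 71, "mild"),
        (1, 4, 65, "cool"), (2, 8, 85, "hot"), (2, 9, 72, "mild"),
    ],
)


def run(sql):
    """Execute a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Roll up: aggregate days into weeks and temperatures into levels.
roll_up = run(
    "SELECT week, level, COUNT(*) FROM readings "
    "GROUP BY week, level ORDER BY week, level"
)

# Roll down: descend the time hierarchy from week to day.
roll_down = run(
    "SELECT day, temperature FROM readings WHERE week = 1 ORDER BY day"
)

# Slice: select on a single dimension (temperature level = mild).
slice_mild = run(
    "SELECT week, day, temperature FROM readings "
    "WHERE level = 'mild' ORDER BY day"
)

# Dice: select on two dimensions (day 3 or 4, level cool or mild).
dice = run(
    "SELECT day, level, temperature FROM readings "
    "WHERE day IN (3, 4) AND level IN ('cool', 'mild') ORDER BY day"
)

for name, rows in [("roll up", roll_up), ("roll down", roll_down),
                   ("slice", slice_mild), ("dice", dice)]:
    print(name, rows)
```

In this sketch roll up is a GROUP BY at a coarser level, roll down is the same data read at the finer level, and slice and dice differ only in how many dimensions appear in the WHERE clause.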
g) More OLAP operations: There are some more operations that exist in the OLAP world, like:

Screening: Restricting the view of a set of data by screening some of the members of a dimension.
Drill Across: Drill across accesses more than one fact table linked by common dimensions.
Drill Through: Drill through drills down from the bottom level of a data cube to its underlying relational tables.

4.6 Examples

The examples page in the courseware contains live demonstrations of some of the OLAP operations. I have chosen a movie and award data mart to demonstrate the examples. Data from various sources like IMDb and Wikipedia was collected in an Excel sheet. The data then underwent preprocessing operations such as data cleaning. After data cleaning, the data was loaded into SQL tables using MySQL Workbench. Then the data was organized into a star schema by defining fact and dimension tables. The OLAP operations are then demonstrated by running queries on the database, and the resulting answers or reports are presented in tabular form. The queries are displayed to the users as open queries for their learning. The example page lists the OLAP operations; when an operation is clicked, its demonstration is shown in a separate page. Chapter 5 will provide a detailed explanation of the examples page in the courseware.

4.7 Exercise

This section of the courseware helps the students to do exercises based on the examples demonstrated to them. A separate data set was collected for this purpose, containing information on car and truck sales for particular years in the USA. Users can download the data and add more data to it. They can also perform data preprocessing operations on the data. The users can create their own database and answer the OLAP operation questions, following the demonstrations provided in the examples page.
Questions for each OLAP operation are provided, and users can use MySQL or any other database platform to execute queries on the data. The exercise part was designed to provide a practical learning experience for the users. The below picture shows a snapshot of the exercise page.

Figure 7 Exercise page

4.7.1 Authentication

The exercises come with answers and queries, but only privileged users are able to view the answers.

4.8 Quiz

Finally, the courseware has a section called quiz, which gives users the opportunity to evaluate themselves by answering a set of quiz problems.

Figure 8 Snapshot of Quiz page

Chapter 5 LEARNING BY EXAMPLES

As we saw in chapter 4, the examples page and the exercise page of the OLAP courseware provide practical learning about OLAP and its operations to the user. In the examples page, each OLAP operation is linked to a separate page that explains what that operation is. Open queries are provided to demonstrate each OLAP operation, so that data can be viewed in various dimensions and the result is displayed in tabular form. In the exercise page a data set is provided to the users, so that they get an opportunity to work with the data and answer the questions posted in the webpage under each OLAP operation.

5.1 Data Source

The data set for the example page is called movie and awards. The data was collected from IMDb and Wikipedia. This data set describes the movies and the awards they won. The data set for the exercise page is called car and truck sales data. The collected data is stored in an Excel file.

5.2 Data Preprocessing

The initial data set contains ambiguous data, so it has to be cleaned and preprocessed. It also has to be carefully analyzed to choose the correct attributes for the final data, so the file containing the data must be free of erroneous data.
Data cleaning was done in the following ways:

a) Blank substitution
This addresses the problem of incomplete data, where some of the attribute values are left blank. In order to fill in the missing values, I had to research them online (using Google) and find the values that matched the attributes of the corresponding rows and columns.

b) Special characters
Some of the data in the data set contained special characters like "?" in place of real data. As mentioned above, these values have to be replaced by finding the exact data that matches the other data in the file.

c) Unwanted data
It is important to delete or discard data that is not required for the creation of the data mart. Careful decisions had to be made about which data to keep and which to delete.

d) Deleting unwanted attributes
The data set also contained some unwanted attributes that had to be removed. MySQL was very useful for deleting unwanted data.

The main purpose of data cleaning is that the data can then be sorted and easily loaded into database tables.

Table 6 Preprocessed movie and award data set

5.3 Purpose of the Data Mart

A data warehouse is a large store of data, accumulated from a wide range of sources within a company and used to guide management decisions. A data mart is a segment of a data warehouse that provides data for reporting. For example, when you want to go to a restaurant, you would like to gather important information such as when the restaurant is open, what cuisines it offers, and which restaurant has the highest rating. This project helps the users learn how OLAP operations work, with the movie and award data mart as an example resource. The main motive behind choosing a movie and award data set is that many people are interested in movies, so practical and interesting examples can increase their interest in learning from the courseware.
This data mart contains all the Hollywood movies released from the year 2000 till the year 2013 that won certain specific categories of awards. The type of each award, such as Golden Globe or Academy Award, is also recorded in the data mart. The award categories, namely best actor, best actress etc., are sorted and arranged. Over 700 rows of such data were collected and preprocessed.

5.4 Data Mart Design

The main objective here is to provide a data mart design that processes queries easily and answers the users' questions immediately. To make this possible we need a simple schema, so the movie and award data mart is designed as a star schema. The block diagram below shows a basic design of the data mart.

Figure 9 Data Mart Design (blocks: COLLECT DATA, DRAW DIMENSION TABLES, DESIGN CONCEPT HIERARCHY, DRAW FACT TABLES, DRAW STAR SCHEMA)

There are several advantages to choosing a star schema over a single table for performing the OLAP operations. The two main reasons why a star schema approach was used are:

a) High efficiency in query performance, because it retrieves exactly the necessary rows from the fact table.
b) Simple structure.

Figure 10 Star schema for movie and awards data mart

The above figure shows the star schema for the movie and awards data mart with the "awardcollectiontable" as the fact table. A fact table consists of the measurements, metrics or facts, and it is located at the center of a star schema surrounded by dimension tables. A snapshot of the fact table is shown below.

Table 7 Fact table

We can see that this fact table contains measures, and foreign keys which refer to primary keys in the dimension tables. Basically, the primary keys (idyear, idmovie, idaward, idcategory) of the dimension tables are foreign keys in the fact table.
So the entire data of the data mart is stored in the above table, and the foreign keys reference the data stored in the other tables. The foreign key columns contain key values instead of the original data; this is done to prevent duplication and loss of data.

The tables surrounding the fact table are called dimension tables. In data warehousing, the dimension tables are the set of tables that accompany a fact table. There are five dimension tables surrounding the fact table "awardcollectionfact". They are year, movie, awards, award categories and winners.

The year table consists of all the years from 2001 to 2013. The year table, with idyear as primary key, references the movies released in these years and the awards given to the movies released in this year range.

Table 8 Year dimension table

The movie dimension table, with idmovie as primary key, consists of the list of names of movies that received awards from the year 2001 to 2013. The table contains about 80 movies.

Table 9 Movie dimension table

The awards dimension table, with IdAward as primary key, consists of the classification of the award given to the award winner for a particular movie in a respective field. Eight main award types were chosen for this data mart. The most popular award types were chosen, because they can add interest to the users' learning of the courseware.

Table 10 Award type dimension table

The award categories dimension table, with Idcategories as primary key, consists of the different categories of award, e.g. best director, best actor etc., given for work in a particular movie in a particular year.

Table 11 Award categories dimension table

The final dimension table is called the "names" dimension table. This table has idnames as primary key and consists of a list of names of the winners who received awards for their work in a movie. The table consists of 356 winner names.

Table 12 Names dimension table

All these tables are queried, and the results show how OLAP operations work.
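The relationship between the fact table and its dimension tables can be sketched as follows. This is a minimal illustration, not the courseware's actual schema: it uses Python's built-in sqlite3 module instead of MySQL, takes the table and key names from the queries shown later in this chapter, leaves out the names dimension, and fills the tables with made-up placeholder rows ("Movie A", "Movie B").

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one primary key plus a descriptive attribute each.
CREATE TABLE Years      (idYear INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE movie      (idMovie INTEGER PRIMARY KEY, movie TEXT);
CREATE TABLE Awards     (idAward INTEGER PRIMARY KEY, AwardType TEXT);
CREATE TABLE categories (idcategories INTEGER PRIMARY KEY, category TEXT);

-- Fact table: only foreign keys pointing at the dimension tables.
CREATE TABLE AwardCollectionFact (
    idYear     INTEGER REFERENCES Years(idYear),
    idMovie    INTEGER REFERENCES movie(idMovie),
    idAward    INTEGER REFERENCES Awards(idAward),
    idcategory INTEGER REFERENCES categories(idcategories)
);

-- Placeholder rows, for illustration only.
INSERT INTO Years      VALUES (1, 2007), (2, 2008);
INSERT INTO movie      VALUES (1, 'Movie A'), (2, 'Movie B');
INSERT INTO Awards     VALUES (1, 'Academy Award'), (2, 'Golden Globe');
INSERT INTO categories VALUES (1, 'Best Director'), (2, 'Best Actor');
INSERT INTO AwardCollectionFact VALUES (1, 1, 1, 1), (1, 1, 2, 2), (2, 2, 1, 1);
""")

# Joining the fact table with its dimensions turns key values back into readable data.
rows = conn.execute("""
SELECT Y.year, M.movie, A.AwardType, C.category
FROM AwardCollectionFact AS ACF
JOIN Years      AS Y ON ACF.idYear = Y.idYear
JOIN movie      AS M ON ACF.idMovie = M.idMovie
JOIN Awards     AS A ON ACF.idAward = A.idAward
JOIN categories AS C ON ACF.idcategory = C.idcategories
ORDER BY Y.year, M.movie, A.AwardType
""").fetchall()

for row in rows:
    print(row)
```

Each descriptive value, such as an award type, is stored once in its dimension table, and the fact table repeats only the small key values.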
5.4.1 Concept Hierarchy

We have already seen a detailed explanation of concept hierarchy in chapter 4. Here we are going to see the concept hierarchy of the movie and award database.

Figure 11 Concept Hierarchy

The above figure shows the concept hierarchy of the data mart. It shows the year, movie and names dimensions, which can climb up the concept hierarchy to award category and award type. All these dimensions finally roll up to form the award collection.

5.5 Open Queries and Results of OLAP Operations

Now we need to understand how the OLAP operations work using the above database. Once the tables are loaded into the database, open queries are executed to retrieve information and demonstrate to the users how the OLAP operations work.

a) ROLL UP

As we saw in chapter 4, a roll up operation summarizes data by performing aggregation on a data cube in one of the following ways:

1) by climbing up a concept hierarchy for a dimension
2) by dimension reduction

We are going to see query executions for roll up in two ways:

i) Roll up using a single table

In order to understand what a single table is, we need to look at the snapshot below.

Table 13 Single table dataset

The table above contains data that summarizes all award winners from different award categories for movies in a particular year. In order to obtain the total number of awards for a movie in a particular year from the table above, we can use the following query:

    SELECT year, movie, COUNT(awardtype)
    FROM OLAP.AwardCollection
    GROUP BY year, movie WITH ROLLUP;

In the above query, the select clause selects the year and movie dimensions. COUNT(awardtype) counts the total number of award types. The from clause states that the year and movie dimensions are selected from the AwardCollection table of the OLAP database created using MySQL. The group by clause groups the result set by one or more columns, and WITH ROLLUP adds summary rows: a subtotal for each year and a grand total over all years.
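To see which rows a roll up of this kind produces, the grouping levels can be written out one by one. The sketch below is an illustration under stated assumptions: it runs on SQLite through Python's sqlite3 module, which has no WITH ROLLUP modifier, so the three levels (year and movie, year only, grand total) are combined with UNION ALL; the rows in the table are made-up placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE AwardCollection (year INTEGER, movie TEXT, awardtype TEXT)")
# Placeholder rows, for illustration only.
conn.executemany(
    "INSERT INTO AwardCollection VALUES (?, ?, ?)",
    [
        (2007, "Movie A", "Academy Award"),
        (2007, "Movie A", "Golden Globe"),
        (2007, "Movie B", "Academy Award"),
        (2008, "Movie C", "Golden Globe"),
    ],
)

# Emulation of GROUP BY year, movie WITH ROLLUP: one SELECT per grouping level.
rollup_sql = """
SELECT year, movie, COUNT(awardtype) AS NumAwards
FROM AwardCollection GROUP BY year, movie
UNION ALL
SELECT year, NULL, COUNT(awardtype)
FROM AwardCollection GROUP BY year
UNION ALL
SELECT NULL, NULL, COUNT(awardtype)
FROM AwardCollection
"""
rows = conn.execute(rollup_sql).fetchall()

for row in rows:
    print(row)
```

A NULL (None in Python) in the movie column marks the subtotal for a year, and NULL in both columns marks the grand total, which is how the summary rows appear in MySQL's WITH ROLLUP output as well.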
The below table shows the query output for roll up using a single table.

Table 14 Roll up using single table dataset

ii) Roll up using star schema

We have already seen the star schema of the movie and award database earlier in this chapter. In order to obtain the total number of awards for a movie in a particular year using the star schema, we have to use the following query:

    SELECT Y.year, M.movie, COUNT(A.AwardType) as NumAwards
    FROM OLAP.AwardCollectionFact as ACF, OLAP.Years as Y, OLAP.categories as C,
         OLAP.movie as M, OLAP.Awards as A
    WHERE ACF.idYear=Y.idYear and ACF.idcategory=C.idcategories
      and ACF.idMovie=M.idMovie and ACF.idAward=A.idAward
    GROUP BY year, movie with ROLLUP;

In the above query, the select clause selects the year and movie dimensions. COUNT(A.AwardType) counts the total number of award types, and we have given this count the alias NumAwards. The alias provides a temporary name for the column that displays the total number of awards. The from clause states that the year and movie dimensions are selected from the AwardCollectionFact table and the dimension tables of the OLAP database created using MySQL. We already discussed primary keys and foreign keys earlier in this chapter. In the where clause, the primary key of each dimension table is matched with the corresponding foreign key in the awardcollectionfact table. This helps prevent duplication of data and data loss. Finally, the group by clause groups the result set by one or more columns. The below table shows the query output for roll up using the star schema.

Table 15 Roll up result using star schema

b) ROLL DOWN

As we saw before, the roll down operation is the reverse of the roll up operation. The operation can be performed in either of two ways:

i) By stepping down a concept hierarchy
ii) By introducing a new dimension

In order to demonstrate roll down, a roll up for a particular movie was first performed in the courseware.
As an example, the movie "Atonement" was taken to demonstrate roll down. The following query shows how roll up is performed on the database to obtain the number of awards won by the movie "Atonement".

    SELECT Y.year, M.movie, COUNT(A.AwardType) as NumAwards
    FROM OLAP.AwardCollectionFact as ACF, OLAP.Years as Y, OLAP.categories as C,
         OLAP.movie as M, OLAP.Awards as A
    WHERE ACF.idYear=Y.idYear and ACF.idcategory=C.idcategories
      and ACF.idMovie=M.idMovie and ACF.idAward=A.idAward
      and M.movie='Atonement'
    GROUP BY year, movie with ROLLUP;

In the above query, the select clause selects the year and movie dimensions. COUNT(A.AwardType) counts the total number of award types, and we have given this count the alias NumAwards. The alias provides a temporary name for the column that displays the total number of awards. The from clause states that the year and movie dimensions are selected from the AwardCollectionFact table and the dimension tables of the OLAP database created using MySQL. The primary key of each dimension table is matched with the corresponding foreign key in the awardcollectionfact table. The selection of the movie "Atonement" is specified in the where clause. Finally, the group by clause groups the result set by one or more columns. The below table shows the query output for finding the number of awards won by the movie Atonement using roll up.

Table 16 Atonement movie awards

From the above table, we see that the movie Atonement has won a total of 28 awards. Now we apply roll down on the same movie to obtain more detailed data about it. The below query shows how to perform roll down on the movie Atonement, so that it travels down the concept hierarchy and obtains all the details of the movie.
    SELECT category, movie, year, COUNT(AwardType) as NumAwards, name
    FROM OLAP.categories as C
    LEFT JOIN
      (SELECT ACF.idcategory as idc, category as ca, movie, year, AwardType, name
       FROM OLAP.AwardCollectionFact as ACF, OLAP.movie as M, OLAP.Years as Y,
            OLAP.Awards as A, OLAP.categories as C
       WHERE ACF.idMovie=M.idMovie and M.movie='Atonement'
         and ACF.idYear=Y.idYear and ACF.idAward=A.idAward
         and ACF.idcategory=C.idcategories) as FACT
      on C.idcategories=FACT.idc
    GROUP BY category, year, movie;

When we take a look at the query, we find that it is a nested query. So let us first look at the inner query, the parenthesized SELECT. Its select clause selects the category key from the awardcollectionfact table and other attributes such as category, movie, year, awardtype and name; its from clause lists the awardcollectionfact table and the dimension tables movie, years, awards and categories. The where clause selects the movie Atonement.

We see that a left join operation is performed. The left join combines the categories table with the result of the inner query: it returns all rows from the left table (the categories table) with the matching rows from the right side (the result built from awardcollectionfact), and gives null when there are no matching records on the right side. The outer query selects category, movie and year. The left join thus returns the rows of the categories table matched with the awardcollectionfact data, and the group by clause groups the result by category, year and movie. The below snapshot is an output of the above query.

Table 17 Roll down on atonement movie

From the above table we see the value 0 for categories that have no matching records in awardcollectionfact.

c) Slicing

The slice operation performs a selection on one dimension of a given cube and gives us a new sub cube.
Suppose that from our movie data mart we just need to slice which movies won awards and in which year. Then we need to apply the following query:

    SELECT Y.year, M.movie
    FROM OLAP.AwardCollectionFact as ACF, OLAP.Years as Y, OLAP.movie as M
    WHERE ACF.idYear=Y.idYear and ACF.idMovie=M.idMovie
    GROUP BY year, movie;

In the above query, the select clause selects the year and movie dimensions. The from clause states that the year and movie dimensions are selected from the AwardCollectionFact table and the dimension tables years and movie. The where clause matches the foreign and primary keys of the above mentioned tables. Finally, the group by clause groups the result set by the year and movie dimensions. The below table shows the query output for slicing.

Table 18 Slice operation on movie and year

d) Dicing

The dice operation performs a selection on two or more dimensions of a given cube and gives us a new sub cube. Suppose that from our movie data mart we want to dice the movies and the corresponding years for which "Leonardo di caprio" won awards. Then we have to apply the following query:

    SELECT year, movie, name
    FROM OLAP.AwardCollectionFact as ACF, OLAP.Years as Y, OLAP.movie as M
    WHERE ACF.idYear=Y.idYear and ACF.idMovie=M.idMovie
      and Y.Year > 2003 and name like '%Leon%'
    order by year;

In the above query, the select clause selects the year, name and movie dimensions. The from clause states that these dimensions are selected from the AwardCollectionFact table and the dimension tables years and movie. The where clause matches the foreign and primary keys of the above mentioned tables and performs the selection needed for the output: the year must be later than 2003 and the name must contain "Leon". In this way we obtain the list of movies for which Leonardo di caprio has won awards.
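The difference between selecting on one dimension and on several can also be sketched outside SQL. The following Python fragment is an illustration only: the records, the placeholder names ("Movie A", "Winner X") and the helper `select` are assumptions made for the sketch. It follows the chapter 4 definitions, where a slice selects on one dimension and a dice selects on two or more.

```python
# Placeholder records, for illustration only.
records = [
    {"year": 2002, "movie": "Movie A", "name": "Winner X"},
    {"year": 2005, "movie": "Movie B", "name": "Winner X"},
    {"year": 2005, "movie": "Movie C", "name": "Winner Y"},
    {"year": 2010, "movie": "Movie D", "name": "Winner X"},
]


def select(rows, **predicates):
    """Keep the rows that satisfy every per-dimension predicate."""
    return [
        row for row in rows
        if all(test(row[dimension]) for dimension, test in predicates.items())
    ]


# Slice: selection on a single dimension (name).
sliced = select(records, name=lambda n: n == "Winner X")

# Dice: selection on two dimensions (year and name), like the where clause above.
diced = select(records, year=lambda y: y > 2003, name=lambda n: "Winner X" in n)

print([row["movie"] for row in sliced])
print([row["movie"] for row in diced])
```

The dice result is always contained in the corresponding slice result, because it applies the same condition on name plus a further condition on year.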
The below table shows the output of the dice operation:

Table 19 Dice operation on movie, year and name

e) Scoping

This operation restricts the view of the database objects to a specified subset. Here we have restricted the scope of view of the database to Academy Awards between certain years, say 2003 to 2006. Scoping is advantageous when we have a large amount of data and want to limit the access of information to a certain subset. The below query shows how scoping works:

    SELECT year, movie, name, AwardType
    FROM OLAP.AwardCollectionFact as ACF, OLAP.Years as Y, OLAP.movie as M,
         OLAP.Awards as A
    WHERE ACF.idYear=Y.idYear and ACF.idMovie=M.idMovie and ACF.idAward=A.idAward
      and A.AwardType='Academy Award' and Y.Year > 2003 and Y.Year < 2006
    order by Year, movie;

In the above query, the select clause selects the year, name, movie and awardtype dimensions. The from clause states that these dimensions are selected from the AwardCollectionFact table and the dimension tables years, awards and movie. The where clause matches the foreign and primary keys of the above mentioned tables and performs the selection needed for the output. Here we want to scope the view of the database to Academy Awards between the years 2003 and 2006 (the strict comparisons in the query exclude both end years). The below table shows the output of the scope operation:

Table 20 Scope operation using year dimension

Chapter 6 COURSEWARE INTEGRATION

In order to provide the users with complete knowledge of the important concepts of data mining and data warehousing, this courseware is integrated with two other coursewares created earlier for the data mining and data warehousing course.
The coursewares that are integrated together are:

a) A courseware for data warehousing (url: http://athena.ecs.csus.edu/~enroll/enrollDW/Intro.php)
b) A courseware on the Extract, Transform and Load (ETL) process (url: http://athena.ecs.csus.edu/~web_etl/etl/)
c) A courseware of ONLINE ANALYTICAL PROCESSING CUBE (url: http://athena.ecs.csus.edu/~OLAP/OLAP/introduction.php)

The coursewares were integrated so that users can access all the important topics of data mining and data warehousing in a single platform. The below snapshot shows the webpage with the integrated courseware.

Figure 12 Snapshot of the integrated courseware

Chapter 7 COURSEWARE EVALUATION

The previous chapters discussed the contents of the courseware. This chapter discusses the assessment made of the courseware, which is a major element in its success. The main motive of the courseware is that the users should find it effective for understanding the topics of OLAP, so that it is easier for them to learn the topic in detail. To make sure that the courseware reached the users, an evaluation was conducted to determine how well the users understood the courseware.

The courseware was mainly prepared for the course CSc 177, which is data mining and data warehousing. Students of this course were introduced to the courseware with a demonstration. After the demonstration the class was given the task of working through the courseware and evaluating it. The feedback provided by the students was positive and encouraging. The students were asked to give feedback on the survey questions below:

Think about your learning experience in using OLAP courseware, how would you rate your learning experience?
How would you rate the look and feel of the courseware?
How user friendly is the tool?
What do you think about the tools feature?
How was the courseware presented to you?
How would you rate the illustration and examples?
Are the topics covered in this courseware organized properly?
Did the examples help you in completing the exercises easily?
Is learning the OLAP concepts using this courseware easier than reading a book?

The students gave positive feedback on most of the questions and also gave suggestions on how to improve the courseware and what to add to it. The below graphs show the statistics for some of the questions from the students' feedback:

Figure 13 Feedback results for questions 1, 5 and 6

Figure 14 Feedback for questions 7, 8 and 9

The students were also given an opportunity to comment on the courseware. The students felt the overall experience of the courseware was good and helped in understanding the fundamentals of OLAP. They also felt that the exercises at the end helped them think and apply what they understood from the examples demonstrated. Students also suggested adding some animations or videos on how the OLAP operations work. The suggestions from the feedback will be included in future work, which will be covered in the next chapter.

Chapter 8 CONCLUSION

A static textbook presentation of the key concepts of the OLAP cube would not give users an opportunity to learn the concepts clearly through interactive exercises and demonstrations. Although there are a lot of websites that explain the concepts of OLAP clearly, there are no live demonstrations of the operations of OLAP. So an e-learning courseware that illustrates the key concepts of OLAP and its operations was necessary. The main objective of this project was to develop a web based interactive courseware that supports a unified learning experience.

As a conclusion to the project report, the objective and scope of the project as stated earlier were achieved. The courseware now illustrates the key concepts of the OLAP cube using example demonstrations. The courseware is freely accessible to any user.
The courseware is also designed to be user friendly, and one does not require a lot of time to learn how to use it. The courseware also allows users to understand the concepts of data mining and apply them to the raw data that is provided for the exercise section. This project gave me an opportunity to understand the concepts of OLAP deeply and to work with MySQL and PHP, thus enhancing my skills in both.

Appendix A Rollup Report

Appendix B Roll Down Report

Appendix C Slicing Report

Appendix D Scoping Report

Appendix E Dicing Report

Bibliography

1. Jiawei Han, Micheline Kamber, and Jian Pei, "Data Mining: Concepts and Techniques", 3rd ed., Morgan Kaufmann, 2012.
2. Chuck Ballard, Dirk Herreman, Don Schau, Rhonda Bell, Eunsaeng Kim and Ann Valencic, "Data Modeling Techniques for Data Warehousing".
3. M. Jarke, M. Lenzerini, Y. Vassiliou and P. Vassiliadis, "Fundamentals of Data Warehouses", 2nd, rev. and extended ed., 2003, XV, 224 p.
4. Ralph Kimball and Margy Ross, "The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling", 3rd ed., Wiley, 2013.
5. Data models in the OLAP world. [Online] Available: http://www.tutorialspoint.com/dwh/dwh_OLAP.htm
6. Andrei Pandre, OLAP operations. [Online] Available: http://apandre.wordpress.com/data/datacube/
7. Data Warehouse and OLAP. [Online] Available: http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-2.html