A COURSEWARE FOR DATA WAREHOUSING

Manashree Laxmikant Kulkarni
B.E., Rashtrasant Tukdoji Maharaj Nagpur University, 2006

PROJECT

Submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE in COMPUTER SCIENCE
at CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010

A COURSEWARE FOR DATA WAREHOUSING

A Project by Manashree Laxmikant Kulkarni

Approved by:

__________________________________, Committee Chair
Dr. Meiliu Lu

__________________________________, Second Reader
Dr. Du Zhang

____________________________
Date

Student: Manashree Laxmikant Kulkarni

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

__________________________, Graduate Coordinator
Dr. Nikrouz Faroughi
Department of Computer Science

___________________
Date

Abstract of A COURSEWARE FOR DATA WAREHOUSING by Manashree Laxmikant Kulkarni

Data warehousing is one of the important approaches for data integration and data preprocessing. The objective of this project is to develop a web-based interactive courseware that helps beginning data warehouse designers reinforce the key concepts of data warehousing using a case study approach. The case study is to build a bottom-up data warehouse for a university student enrollment prediction data mining system. This data warehouse is able to generate summary reports as input data files for a data mining system to predict future student enrollment. The data sources include: (1) the enrollment data from California State University at Sacramento, and (2) the related public data of California. In the courseware, we build the data warehouse systematically using a set of four demonstrations covering the following data warehousing topics: fundamentals, design principles, building an enterprise data warehouse using an incremental approach, and aggregation.
Every demonstration can generate data reports for the end users on request.

We integrate the courseware with an introductory data warehousing and data mining class. This class of 20 students evaluated the effectiveness of this tool. The addition of a feedback link to the courseware website for the end users is one of the results obtained from this evaluation.

_______________________, Committee Chair
Dr. Meiliu Lu

______________________
Date

ACKNOWLEDGMENTS

I would like to express my deep and sincere gratitude to my project advisor, Dr. Meiliu Lu, for her support and guidance throughout the project. I thank her for giving me an opportunity to work on a unique idea and put it into reality. She provided me valuable advice and untiring help during the development of the project. Her detailed and constructive comments were very beneficial not only during the phase of website development for the project but also during the phase of report writing. Without her encouragement and personal guidance, the success of this project would not have been possible.

My sincere thanks to Dr. Du Zhang for his detailed review and productive remarks on the project report. I also thank Dr. Nikrouz Faroughi for his review and advice for the successful completion of this project. My warm thanks to the University Library at California State University, Sacramento for providing me with books and resources helpful in my project research.

I owe my deepest thanks to my family for their love and support throughout my life. I am indebted to my father, Late Mr. Laxmikant Kulkarni, whose faith and hard work provided me the encouragement and support to pursue my Master’s degree. My loving thanks to my mother, Mrs. Jayashree Kulkarni, my brother, Mr. Ashish Kulkarni, and my grandfather, Mr. K. K Panse, for their love and constant support during difficult moments. I owe my loving thanks to my husband, Mr. Prasad Shah; without his support and understanding, this project would have been impossible.
I extend my thanks to all those who have helped me directly or indirectly in the completion of this project. Last but not least, thanks to the Almighty for all the blessings.

TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures

Chapter
1. INTRODUCTION
2. BACKGROUND
3. COURSEWARE DESIGN
4. ENROLLMENT DATA WAREHOUSE DESIGN
      Interviewing
      Purpose of Enrollment Data Warehouse
      Enrollment Case Study Data mart Design
      Enrollment Case Study Data mart Refinement
      Enrollment Data Reporting
5. ENTERPRISE DATA WAREHOUSE
      Enterprise Data Warehouse for Enrollment Case Study
      Incremental Approach on Enrollment EDW
      Enrollment Case Study EDW Design
      Enrollment EDW Data Reporting
6. AGGREGATION ON ENROLLMENT DATA WAREHOUSE
      Aggregation
      Performance Parameter
      Aggregate Schema Design on Enrollment Case Study
      Performance Analysis
7. COURSEWARE EVALUATION
8. CONCLUSION

Appendix A. Enrollment Report
Appendix B. Enrollment with Socioeconomic Report
Appendix C. Enrollment Prediction Report
Appendix D. Documentation on Courseware Website
Bibliography

LIST OF TABLES

Table 1 Data Warehouse and Database
Table 2 Enrollment Summary Report
Table 3 User-desired Query Report
Table 4 Enrollment Prediction Report
Table 5 Query Output without Aggregation
Table 6 Query Output with Aggregation

LIST OF FIGURES

Figure 1 A Courseware for Data Warehousing
Figure 2 Data Warehousing Vs. Flat Files
Figure 3 Framework of Courseware
Figure 4 Courseware Demonstrations: Demo A and Demo B
Figure 5 Initial Data mart on Enrollment
Figure 6 Time Dimension Table
Figure 7 Student Classification Dimension Table
Figure 8 Enrollment Fact Table
Figure 9 Enrollment Star Schema
Figure 10 Refined Enrollment Data mart
Figure 11 Enrollment Graph for Computer Science Department
Figure 12 Snapshot of User Input
Figure 13 Courseware Demonstration: Demo C
Figure 14 Socioeconomic Dimension Table
Figure 15 Prediction Fact Table
Figure 16 Enrollment EDW
Figure 17 Aggregate Functions
Figure 18 Aggregate Design Methodology
Figure 19 Enrollment Aggregate Table
Figure 20 Enrollment Aggregate Schema
Figure 21 Feedback Component

Chapter 1
INTRODUCTION

Every institution, small or big, needs to make use of large volumes of chronological data. An analytical prediction model for this data can support essential management functions such as decision making and planning. The data warehouse has been playing a critical role in data preprocessing and data integration. It allows speedy retrieval of input data for data mining and data analysis. The outcome of data reporting, data analysis and data mining tools supports management planning for budget analysis, resource allocation, forecasting, prediction, and other business processes [1, 2].

A data warehouse is a store of historical data for a business, an experiment or any other enterprise. It consists of selectively extracted data from a primary source or any other source inter-related with the primary data [3]. It reduces the cost per analysis because its structures are simpler and more standardized than those of the application databases. A data warehouse is an Online Analytical Processing (OLAP) system [4, 2] that is vital to an enterprise for making business decisions and responding to analytical questions crucial for a business process. Hence, for analyzing a business process, a data warehouse is more useful than the Online Transaction Processing (OLTP) systems [4].

The main idea of this courseware project is to provide a quick learning tool for data warehousing. The courseware is a 3-tier web application entitled “The Courseware for Data Warehousing”. It explains basic concepts, design principles, and performance enhancement techniques of data warehousing. This application is an e-learning tool integrated into a course website for a Computer Science course, CSc 177: Data Warehousing and Data Mining, at California State University, Sacramento. The courseware supplements the data warehousing topics of this course, such as aggregation. We explain the topics in the courseware in depth and allow students to explore. The courseware also provides a quick reference for students who have not taken any course on data warehousing topics. The tool supports the course material with illustrative examples, interactive demonstrations and visual diagrams that accompany the topic explanations. This engages students and gives them insight during the learning process. The students can assess their understanding of data warehousing via interactive quizzes provided at the end of each demonstration.

The courseware provides a systematic method for designing a data warehouse. We develop the data warehouse on a case study solely for the purpose of education. The case study uses the student enrollment data from California State University, Sacramento. In the courseware, we demonstrate steps to build a data warehouse for the enrollment data. This tool not only illustrates the data warehousing design process but also reveals some of the incorrect practices that occur during the process. We identify ways to avoid these incorrect practices effectively.

In our case study, we build an enterprise data warehouse for the student enrollment data of the College of Engineering and Computer Science at California State University, Sacramento.
The data sources for this project are the student enrollment data from California State University at Sacramento and the enrollment-related social and economic data of the State of California. The main intention of designing a data warehouse is to prepare input data for an existing data mining system. The data stored in a data warehouse is the preprocessed data that forms an input for the data mining tools [3, 2]. In our case study, we build the enrollment data warehouse that contains the preprocessed enrollment data. The summary reports retrieve the preprocessed data from the data warehouse. The data reporting tools generate such user-defined summary reports. The reported data can be the input to the data mining tools. These tools perform data mining on the input data and provide the desired results, such as student enrollment predictions [1]. Moreover, the data warehouse is capable of storing the data mining results and can generate summary reports for these results.

In our case study, we design the enrollment data warehouse so that it can generate summary reports on the student enrollment predictions. The summary reports provide statistics essential for decision making on college budget analysis, new faculty hiring, course demands, facility provisions, etc. The summary reports identify the data patterns and predict potential data values. This technique of data warehousing can be valuable to any enterprise for accurate estimation, forecasting, resource allocation, budget analysis, better management planning, decision-making and improvement in business performance measures like productivity, ROI (Return On Investment), profit, etc. [5, 2].

Figure 1 shows a snapshot of the courseware tool’s introduction page. You can visit the courseware at the following URL: http://gaia.ecs.csus.edu/~enroll/enrollDW/Intro.php. The courseware divides the topics into four demonstrations.
The first demonstration, Demo A, explains how to identify the purpose and the user requirements of a data warehouse. It demonstrates the design for a simple data mart. The second demonstration, Demo B, explains why a data mart needs refining. This section demonstrates the refining process of the data mart while staying consistent with the preceding design. The third demonstration, Demo C, shows the method of building an enterprise data warehouse by extending the data mart design from the former section. The fourth demonstration, Demo D, introduces the aggregation technique for improving the performance of the data warehouse. In addition, this section compares the performance of the data warehouse with and without aggregation. Furthermore, the topic emphasizes the generation of summary reports. Each demonstration provides interactive user sessions to generate summary reports as per the user specifications. The user sessions take the user requirements as input and generate user-desired reports. These demonstrations also explain query development and query execution in data reporting.

Figure 1 A Courseware for Data Warehousing

As a part of this project, we carry out a study on the effectiveness of the courseware tool. We integrate the courseware with a data warehousing and data mining class in Spring 2010. This class of 20 students evaluated the first version of the courseware. The integration of the courseware into a data warehousing class and the subsequent courseware evaluations substantiate the success of this tool.

In this chapter, we presented an overview of the project on the courseware for data warehousing. We introduced the case-based approach of building the data warehouse on enrollment data. In the next chapter, we explain the background of the courseware. In chapters 3 through 6, we describe the design of the courseware website and explain the four demonstrations of the courseware.
In chapter 7, we summarize the results and feedback on the courseware tool. In chapter 8, we conclude the project report and discuss future possibilities for the courseware.

Chapter 2
BACKGROUND

In the first chapter, we introduced the data warehousing concept and the significance of the data warehouse to a business process. In this chapter, we provide a comprehensive description of the enrollment case study and the enrollment data sources used in our courseware.

The idea of the case study originated from a thesis on “Enrollment projection through data mining” by Svetlana S. Aksenova [1]. In her project report, the author presents a remarkable use of data mining tools to build the enrollment projection models. We noticed that this process feeds the historical enrollment data to the data mining tools in the form of flat files. This process also included preprocessing of a large amount of data. Preprocessing a large amount of data from flat files is time consuming and labor intensive. Hence, we consider developing a data warehouse on the enrollment data. With a data warehouse, the data mining tools can consume the data directly without recurrent preprocessing activities. In addition, we also take into consideration that the data changes according to dynamic user needs. The data warehouse overcomes the disadvantages of continually processing and repeatedly inputting data from flat files. Figure 2 shows the difference between inputting data to the data mining tools from a data warehouse and from flat files.

Figure 2 Data Warehousing Vs. Flat Files

Before designing any data warehouse, designers define the purpose of the data warehouse. The purpose of the data warehouse identifies the management questions, user requirements and enterprise measurements. In our case study, the management of the University might need information on the factors that affect the enrollment data or the effect of the unemployment rate on the enrollment value.
Many questions might arise, such as what the enrollment headcount was for the last year. These questions relate either to the overall business process or to an individual transaction [4]. A large number of query transactions executed on a data warehouse retrieve this information. The nature of management questions may also change with time. To meet these dynamic and continuing management and user requirements, a large amount of historical data must be stored in an efficient, easy-to-retrieve form such as a data warehouse. The user requirements help determine which historical data needs to be stored in the data warehouse. The interviewing process [4, 2] identifies these requirements.

In our case study, there are two goals of building the enrollment data warehouse:

(1) Enrollment reporting: Users should be able to generate summary reports. These reports display the relationship and interdependency among various attributes of the historical data sources. The reports help to answer various management questions related to enrollment data. They retrieve selective data on the basis of the user conditions in a user query.

(2) Enrollment prediction: The data mining project takes the reports or the preprocessed data from the data warehouse as input and performs data analysis. The purpose is to predict values for the student enrollment count for the forthcoming years using data mining and analysis. Analysts identify the data mining algorithms [3] that produce a negligibly small error in prediction values. The difference between the real value and the predicted value gives the error value. If this error value is acceptably small, the predictions are as good as real values for the forecasted student enrollment values. The management needs to exploit this forecasted data in the decision-making process. The decision-making includes budget planning, curriculum planning, faculty hiring, resource allocation, income evaluation from tuition, etc. [1, 6].
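
The error value described in goal (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the headcounts are invented, the function name prediction_errors is hypothetical, and the mean absolute percentage error is one possible way to summarize the per-year differences; the project does not state which summary measure the analysts use.

```python
def prediction_errors(actual, predicted):
    """Return per-year absolute errors and the mean absolute percentage error.

    Assumption: the error is the difference between the real and the
    predicted headcount; the percentage summary is an illustrative choice.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Error value for each year: |real value - predicted value|.
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    # Average of the errors relative to the real values, as a percentage.
    mape = 100.0 * sum(e / a for e, a in zip(abs_errors, actual)) / len(actual)
    return abs_errors, mape


# Hypothetical headcounts, invented for illustration only.
actual = [400, 500, 250]
predicted = [380, 510, 250]

abs_errors, mape = prediction_errors(actual, predicted)
print(abs_errors)
print(round(mape, 2))
```

An analyst could compare such a summary across candidate data mining algorithms and keep the one with the smallest error.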

Historical Data: The historical data is stored in a data warehouse as preprocessed data. In our case study, we use two sources of historical data required for the enrollment data:

1. Enrollment data and other enrollment related data from the University [1, 6]
2. Socio-economic data from the State of California that influences enrollment [1]

The data collected from the College of Engineering and Computer Science covers the last 30 years and includes enrollment values per semester for graduate and undergraduate students. The data collected from the State of California also covers the last 30 years and includes socioeconomic figures such as the employment rate, population, income, etc. The enrollment data from the Computer Science department and the socio-economic data from the State are the only real data [1, 6]. Enrollment values for the other departments are generated in Excel spreadsheets using the RAND() and RANDBETWEEN() functions [7], for courseware purposes only. The real data is mostly numeric data available in the form of flat files such as Excel spreadsheets, and in other online operational systems. We classify this data into spatial and chronological dimensions to preprocess and prepare data for the data loading process [5, 2]. The spatial attributes include department, college and location, and the temporal attributes consist of term and year.

There are several ways of loading data into a data warehouse. In this project, we follow these steps for the data loading process:

(1) Convert all flat files into one format, comma-separated values (.csv) files.
(2) Execute the following MySQL statement on the data warehouse [8]:

    LOAD DATA LOCAL INFILE 'enrolldata.csv'    -- name of the flat file
    INTO TABLE Enroll_Fact                     -- name of the target table
    FIELDS TERMINATED BY ','                   -- column values are separated by commas
    (new_students, transferred_students,
     continuing_students, returning_students); -- names of the table columns

Of the historical data, the university data supports enrollment report generation. The university and state data together provide input for the data mining tools. Hence, the data warehouse provides an efficient way of preprocessing, reporting and analyzing the historical data.

One might ask: if databases already organize data much more efficiently than flat files, why use data warehousing? Table 1 gives a general idea of the differences between the data warehouse and the database [4].

    Differences           Database                          Data Warehouse
    Process type          Transactions                      Analytical queries and report generation
    Query type            Read and write                    Read only
                          (Insert, Update, Delete)          (Select)
    Data                  Current data                      Historical and current data
    Purpose/Application   Execution of business process     Measurement of business process

Table 1 Data Warehouse and Database

In this chapter, we developed a detailed understanding of the objective of building a data warehouse for the enrollment case study. In the next chapter, we provide the structure of the courseware. We explain the 3-tier architecture and components of the courseware website.

Chapter 3
COURSEWARE DESIGN

In this chapter, we describe the courseware architecture in detail. The courseware, based on the principles of n-tier web applications [9], is a 3-tier web application that is conveniently accessible to data warehouse learners around the world. The three tiers employed in this project are the web interface, the logic tier and the data tier [9].

Figure 3 Framework of Courseware

Presentation Tier: The web interface, written in PHP, HTML and JavaScript, gives structure to this tool.
The structure organizes the subject matter into introduction, demonstrations, quizzes and references. It presents a series of steps for building a successful data warehouse. The interactive interface supports report generation, knowledge assessment, tool evaluation, and interactive illustrations. The web interface displays the illustrative examples and visual diagrams that support the topics.

Logic Tier: This tier manages the execution behind the web interface. It controls the flow of data from the data warehouse to the web display. This tier is responsible for business logic. The business logic comprises the database services, such as query structure and procedures, and the user services. It takes care of server-side code execution such as input validation, content display, database security, etc. This tier is also responsible for data access. In our case study, it performs computation and evaluation, and makes decisions, on the historical enrollment data and enrollment prediction values.

Data Tier: This tier includes the data warehouse parameters, the data sources to be stored in the data warehouse and the other data related functions. In our case study, the primary data is the student enrollment data from California State University at Sacramento, and the secondary data is the enrollment related socioeconomic facts obtained from the California state agencies. This tier stores these primary and secondary data sources. This tier also stores the results of the data analysis and data mining executed against the data warehouse. It integrates existing data sources, newly generated data and data operations for the data warehouse relevant to that business process [9, 8].

The courseware tool has the advantages of a 3-tier architecture, such as integration of data and services, high performance due to client-server technology, and improved security. Consequently, we get a more robust application.
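
The separation of the three tiers can be sketched as follows. This is a minimal illustration, not the courseware's implementation: the courseware uses PHP and MySQL, while the sketch uses Python with an in-memory SQLite database as a stand-in, and the table, column and function names as well as the sample rows are hypothetical.

```python
import sqlite3


# Data tier: owns storage. In-memory SQLite stands in for the MySQL warehouse.
def open_data_tier():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE enrollment (year INTEGER, term TEXT, headcount INTEGER)"
    )
    # Hypothetical sample rows, invented for illustration only.
    conn.executemany(
        "INSERT INTO enrollment VALUES (?, ?, ?)",
        [(2008, "Fall", 120), (2009, "Spring", 110), (2009, "Fall", 135)],
    )
    return conn


# Logic tier: validates user input and runs a parameterized query.
def headcount_by_term(conn, year_text):
    if not year_text.isdecimal():
        raise ValueError("year must be a whole number")
    return conn.execute(
        "SELECT term, headcount FROM enrollment WHERE year = ? ORDER BY term",
        (int(year_text),),
    ).fetchall()


# Presentation tier: formats rows for display and knows nothing about SQL.
def render_report(year_text, rows):
    lines = [f"Enrollment for {year_text}"]
    lines += [f"{term}: {headcount}" for term, headcount in rows]
    return "\n".join(lines)


conn = open_data_tier()
report = render_report("2009", headcount_by_term(conn, "2009"))
print(report)
```

Because only the logic tier touches the query, input validation and query parameters stay in one place, which is one way to obtain the security benefit mentioned above.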

The presentation part of the courseware website shows how to design the enrollment data warehouse through a set of four demonstrations. The demonstrations cover the following topics: (1) fundamentals of data warehouse, (2) data warehouse design principles, (3) building an enterprise data warehouse using an incremental approach, and (4) aggregation. Each demonstration presents a detailed description of building the data warehouse via a set of steps. Every step has text, a diagram, and ready-to-go query runs. Furthermore, the courseware outlines the theory behind each subject and provides a set of quiz problems for self-evaluation. In the upcoming chapters, we discuss these demonstrations in detail.

Chapter 4
ENROLLMENT DATA WAREHOUSE DESIGN

In this chapter, we explain the first two demonstrations in the courseware tool. In Demo A, we show the design process of the initial data mart for the enrollment data warehouse. This demonstration explains how to define the objective for building a data warehouse using the interviewing process. Demo B shows how to refine the initial data mart designed in the previous demonstration. Both demonstrations have an interactive facility to generate summary reports against the enrollment data mart. Figure 4 shows the design steps included in Demo A and Demo B from the courseware website.

Figure 4 Courseware Demonstrations: Demo A and Demo B

Through these demonstrations, we begin the design of the data warehouse using an enrollment case study. Using the case study approach, we describe the principles and techniques crucial for the data warehouse design.

Interviewing

Before any design process, we should be acquainted with the purpose for building a data warehouse. We design the data warehouse for a business process. We identify the business process and the parameters of this process during the process of interviewing [10, 2]. Interviewing is the technique of talking to people who know the process well.
Generally, the management or the end users of the data warehouse are the suitable interviewees in the interview process, and the interviewer is the designer or the group of designers of the data warehouse. The interviewers form a list of questions that would assist purpose identification. This process of interviewing takes place throughout the design process. The designer, according to his/her design needs, decides on the number of interviewing phases. The first phase of interviews takes place before outlining the initial design. The results of this interview are useful for providing a skeleton for a data warehouse. The second phase mostly occurs before proceeding to the physical design. The third phase can occur during the refinement of the data warehouse design. Many interviewing phases can occur depending on how often the design needs to be refined. Interview phases also occur during the evaluation stage of the data warehouse. If the users are completely content with the results of the data warehouse design, there is possibly no need to carry out any further interviews.

The courseware integrated with the data warehousing course aims at designing an enrollment data warehouse. The end user of the enrollment data warehouse is the end user of the courseware. Hence, we start the interview process for the initial data mart design with the instructor of the course, CSc 177 Data Warehousing and Data Mining. A few question-answer sessions held for the first phase interview help initiate the design of the enrollment data warehouse. These interview sessions generated answers on enrollment data selection, query execution, the format of summary reports, identification of time and space dimensions for data classification, the trade-off between memory and performance for the data warehouse, etc. Some examples of the interview questions are:

1. What enrollment data do the end users desire?
2. Into what categories is the enrollment data classified, or in which format do the end users desire the summary reports?
3. What attributes related to enrollment should the query result display?

The data warehouse must be capable of answering the user and management questions. It is during the interview process that we find out the relevant facts that interest the end users and get the minute details of the business process.

In our case study, we use dimensional modeling principles to design a data warehouse. A dimensional model consists of a group of fact tables and dimension tables. The interviewing process helps identify the grain of the fact table and the attributes of the dimension tables. For example, for generating reports from the data warehouse, interviewing determines whether the reports should be on a monthly, quarterly or yearly basis. Another interviewing exercise would be the generation of a refined dimensional model from a draft dimensional model. The interview would take place between the end users of the draft dimensional model and the designers. The feedback on the draft model would help designers include the missing attributes and refine the model effectively.

Purpose of Enrollment Data Warehouse

In the previous section, we described how to collect the user requirements through the interviewing process. The user requirements for the enrollment data warehouse demand the preparation of pre-processed data as an input for data mining tools and a user facility to generate summary reports, categorized by term, year, degree and college. There are two types of summary reports: (1) enrollment reports for graduate and undergraduate students for the last 30 years; and (2) demographic factors on each year’s enrollment data. The combination of types (1) and (2) forms the input data for a data mining system to output enrollment prediction reports for the next 5 years.
Hence, we start the design process of the enrollment data warehouse in consideration of these requirements.

Enrollment Case Study Data Mart Design

In this section, we start designing an initial data mart for the enrollment data warehouse. The first phase of interviews gives us a clear idea of the user requirements. The basic user requirement is to generate summary reports categorized by year and term, and by student classification on degree. The user also needs the enrollment headcount of the students classified per enrollment as new, continuing, returned or transferred. With this knowledge of the data, we can design the draft schema for the data mart. Figure 5 shows the draft dimensional model for the enrollment data. We design the data mart using the dimensional modeling principles [11, 12, 4]. The dimensional model classifies the data related to the process into facts and dimensions. These principles facilitate efficient use of physical space.

Figure 5 Initial Data mart on Enrollment

From the interviews, we learn that the user needs to generate a report of the enrollment count (the enrollment count is the measurement) in a particular year (the year is an attribute). Analyzing the query, we notice that the time parameter breaks the measurement into useful subsets (filter by year). Hence, we identify the first dimension for the enrollment data mart as the time dimension. Dimensions segregate the measurements into useful subsets. When designing a dimension table, the attributes that qualify queries or break out measurements into useful subsets are held together in one dimension table [10]. According to dimensional modeling principles, dimension tables are short and wide, i.e. they have relatively few rows but can have a large number of columns. The dimension table clusters the attributes of that dimension. Hence, each column of the dimension table corresponds to an attribute of the dimension [2].
Figure 6 Time Dimension Table

We design the time dimension as a table with the attributes year and term as its columns. Every dimension table has a primary key [10] that makes each row unique for enrollment classification. Figure 6 shows the Time Dimension table. Similarly, we can design a Student Classification Dimension table as shown in Figure 7. We create the primary key while loading the historical data. In this demonstration, MySQL AUTO_INCREMENT generates unique keys in the MySQL tables [8]. For more informative reporting, the dimension tables should be rich in attributes. The design of a dimension table also determines the relation of the dimension to the facts and their appearance in the reports. By a similar approach, we can identify other dimensions in the dimensional model.

Figure 7 Student Classification Dimension Table

In the draft dimensional model, we declare the quantities (i.e. the enrollment counts) as facts. The facts recorded are the enrollment counts for newly enrolled students, continuing students, transferred students, returning students and the total number of students enrolled. According to the dimensional modeling principles, the facts are the measurements that evaluate the process. They are mostly numerical in nature. The fact table groups the measurements (referred to as facts) and the attributes of the facts.

The fact table gives not only the required measurements but also the relationship among the dimensions and the measurements. The enrollment fact table has foreign keys referencing the dimension tables via the Time Dimension ID and the Student Classification ID. The primary key of the fact table is a concatenated key involving a subset of the foreign keys. The fact table is the dependent table in the schema design. Fact tables are narrow and deep, i.e. they have few columns but can have a large number of rows. Each row in the fact table gives the facts at the same level of detail.
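The table design described above can be sketched in a few lines. This is a minimal illustration only, not the courseware's actual schema: the table and column names are hypothetical, the inserted counts are placeholders, and SQLite (through Python's standard sqlite3 module) stands in for MySQL so that the sketch runs without a database server. SQLite's AUTOINCREMENT plays the role of MySQL's AUTO_INCREMENT here.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for MySQL here.
DDL = """
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key (MySQL: AUTO_INCREMENT)
    year    INTEGER NOT NULL,
    term    TEXT    NOT NULL
);
CREATE TABLE student_class_dim (
    student_class_id INTEGER PRIMARY KEY AUTOINCREMENT,
    degree           TEXT NOT NULL
);
CREATE TABLE enrollment_fact (
    time_id              INTEGER NOT NULL REFERENCES time_dim(time_id),
    student_class_id     INTEGER NOT NULL REFERENCES student_class_dim(student_class_id),
    new_students         INTEGER NOT NULL,
    transferred_students INTEGER NOT NULL,
    returning_students   INTEGER NOT NULL,
    continuing_students  INTEGER NOT NULL,
    PRIMARY KEY (time_id, student_class_id)  -- concatenated key of the foreign keys
);
"""

def build_schema():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(DDL)
    return conn

conn = build_schema()
# The dimension keys are generated while the rows are loaded.
time_id = conn.execute(
    "INSERT INTO time_dim (year, term) VALUES (2000, 'Fall')").lastrowid
class_id = conn.execute(
    "INSERT INTO student_class_dim (degree) VALUES ('graduate')").lastrowid
# Placeholder counts, not real enrollment data.
conn.execute("INSERT INTO enrollment_fact VALUES (?, ?, 1, 2, 3, 4)",
             (time_id, class_id))
```

The dimension tables carry the descriptive attributes, while the fact table holds only keys and numeric measures, which is what keeps it narrow.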
Figure 8 shows the columns of the enrollment fact table: the types of enrollment, the primary key and the foreign keys that reference the dimension tables.

Figure 8 Enrollment Fact Table

In addition, the enrollment fact table contains the "eligible to continue" count related to the "continuing students" attribute. We do this to avoid having a separate dimension table for this attribute. If we designed such a dimension table for "eligible to continue", it would have the same rows as the fact table and would cause data redundancy.

Figure 9 shows a sketch of the initial data mart design for enrollment in the form of a star schema. The star schema displays the relationship among different entities. A star schema is a set of tables in a relational database designed according to the principles of dimensional modeling [10]. It is the simplest kind of data warehouse schema, in which one or more fact tables reference one or more dimension tables [2]. We design the enrollment star schema to optimize queries that access large amounts of data. It consists of one fact table stating enrollment facts, and the dimension tables linked to the fact table through the corresponding foreign keys. Queries against such a schema include a variety of combinations among dimensions and facts. Hence, the star schema not only uses RDBMS capabilities but also adds the ability to answer a variety of management or end user questions [2].

Figure 9 Enrollment Star Schema

After designing the enrollment star schema, we load the historical data from the flat files into the corresponding tables using the data loading process described earlier. Thus, we get the enrollment data mart for the data warehouse design. We can generate summary reports against this enrollment data mart. We use MySQL queries to retrieve this data. This is the last step of Demo A in the courseware. Demo A gives the end users the facility to extract data from the enrollment data mart according to their requirements.
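A report of this kind can be sketched as a function that joins the fact table to its dimensions and filters on dimension attributes supplied by the user. The sketch below is an assumption-laden illustration, not the courseware's code: the function name enrollment_report, the table and column names, and the rows are all made up, and SQLite stands in for MySQL.

```python
import sqlite3

def make_demo_mart():
    """Build a tiny in-memory star schema with made-up rows (not real CSUS data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
        CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
        CREATE TABLE enrollment_fact (
            time_id INTEGER REFERENCES time_dim(time_id),
            student_class_id INTEGER REFERENCES student_class_dim(student_class_id),
            new_students INTEGER, transferred_students INTEGER,
            PRIMARY KEY (time_id, student_class_id));
        INSERT INTO time_dim VALUES (1, 2000, 'Fall'), (2, 2001, 'Spring');
        INSERT INTO student_class_dim VALUES (1, 'graduate'), (2, 'undergraduate');
        INSERT INTO enrollment_fact VALUES (1, 1, 10, 5), (1, 2, 40, 20), (2, 1, 12, 6);
    """)
    return conn

def enrollment_report(conn, degree, term, year):
    """Join the fact table to its dimensions and filter on dimension attributes."""
    sql = """
        SELECT t.year, t.term, f.new_students, f.transferred_students
        FROM enrollment_fact AS f
        INNER JOIN time_dim AS t ON f.time_id = t.time_id
        INNER JOIN student_class_dim AS s ON f.student_class_id = s.student_class_id
        WHERE s.degree = ? AND t.term = ? AND t.year = ?
    """
    # Placeholders keep the user input out of the SQL text.
    return conn.execute(sql, (degree, term, year)).fetchall()

report = enrollment_report(make_demo_mart(), "graduate", "Fall", 2000)
```

Passing the user's values as query parameters rather than concatenating them into the SQL string is a design choice of this sketch; the source does not state how the courseware builds its queries.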
Let us suppose the end user wants to generate a report for Computer Science graduate students for Fall 2000. The query that generates the summary report for this inquiry takes the conditional values 'graduate', 'Fall', and '2000' as user input. The courseware uses MySQL queries to generate summary reports. The MySQL query formed for this inquiry has a structure similar to the one below:

    SELECT Time Dimension Table Year, Time Dimension Table Term, New Students,
           Transferred Students, Returning Students,
           Students Eligible to Continue, Continuing Students
    FROM Enrollment Fact Table
    INNER JOIN Time Dimension Table
        ON Enrollment Time ID = Time ID
    INNER JOIN Student Classification Dimension Table
        ON Enrollment Student Class ID = Student Class ID
    WHERE Enrollment ID IN
        (SELECT Enrollment ID FROM Enrollment Fact Table
         WHERE Enrollment Time ID IN
               (SELECT Time ID FROM Time Dimension Table
                WHERE year = 2000 AND term = 'Fall')
           AND Enrollment Student Class ID IN
               (SELECT Student Class ID FROM Student Classification Dimension Table
                WHERE degree = 'graduate'))

Table 2 shows the summary report generated by the courseware for this inquiry.

    Year   Term   New   Transferred   Returning   Eligible to Continue   Continuing Students
    2000   Fall   39    114           117         156                    31

Table 2 Enrollment Summary Report

Enrollment Case Study Data Mart Refinement

In Demo B, the enrollment data mart model is incrementally refined by iterating the steps of the design process from Demo A. Refinement helps meet additional user requirements such as the omission of old data values or the integration of new data sources. The main purpose of refining is to get all the relevant data into the data mart in conformance with the initially designed model. The purposes of refining the enrollment data mart are as follows [12, 5, 2]:

1. Increasing the capability to answer management questions over other departments
2. Including missing data such as tuition fees
3. Expanding the data model structure to capture the effect of socioeconomic factors on the enrollment values

We need to expand the enrollment data mart gradually over the other departments in the college. While refining the data mart, we design it such that it is easily scalable over other colleges in the California State University system. Hence, we require another dimension, the Academic dimension, for the enrollment data mart. The refinement needs a second phase of interviewing. We identify the attributes of the academic dimension during this second phase. This stage of refinement gives an opportunity to include new data that was previously missing from the data mart.

The steps of designing the initial data mart are critical because we iterate these steps on the initial design to refine the model with more relevant subject areas. In the refinement process, we iterate the following steps from Demo A:

1. Identify the relevant data-related areas.
2. Determine attributes and relations between different areas by the process of interviewing.
3. Load the new data such that it conforms to the data model.
4. Iterate these steps until all the areas relevant to the data are covered [2].

We design the new dimension, the Academic Dimension, and append it to the model such that it conforms to the initial data mart design. The attributes of the dimension comprise department, college and location. To establish the relationship between the academic dimension and the measurement (enrollment data), the enrollment fact table needs a foreign key that references the academic dimension table. We add this new reference key to the enrollment fact table. The primary key of the enrollment fact table is then the concatenated key of the reference keys to the time dimension, student classification dimension and academic dimension tables. The star schema includes the updated fact table with the new dimension table. Figure 10 shows the refined dimensional model.
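The refinement step above can be sketched as follows. This is an illustrative assumption, not the courseware's migration: all table and column names and all rows are hypothetical, the existing rows are assumed to belong to a single department, and SQLite stands in for MySQL. Because SQLite cannot change a primary key in place, the sketch rebuilds the fact table with the wider key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Initial data mart (Demo A), reduced to the key columns and one measure.
    CREATE TABLE enrollment_fact (
        time_id INTEGER, student_class_id INTEGER, new_students INTEGER,
        PRIMARY KEY (time_id, student_class_id));
    INSERT INTO enrollment_fact VALUES (1, 1, 10), (1, 2, 40);

    -- Refinement: new dimension with the attributes named in the text.
    CREATE TABLE academic_dim (
        academic_id INTEGER PRIMARY KEY AUTOINCREMENT,
        department TEXT, college TEXT, location TEXT);
    INSERT INTO academic_dim (department, college, location)
        VALUES ('Computer Science', 'Engineering and Computer Science', 'Sacramento');

    -- Refined fact table: the new foreign key joins the concatenated primary key.
    CREATE TABLE enrollment_fact_refined (
        time_id INTEGER, student_class_id INTEGER,
        academic_id INTEGER REFERENCES academic_dim(academic_id),
        new_students INTEGER,
        PRIMARY KEY (time_id, student_class_id, academic_id));

    -- Assumption: every existing row belongs to the one department loaded so far.
    INSERT INTO enrollment_fact_refined
        SELECT f.time_id, f.student_class_id, a.academic_id, f.new_students
        FROM enrollment_fact AS f
        CROSS JOIN academic_dim AS a
        WHERE a.department = 'Computer Science';
""")
refined_rows = conn.execute(
    "SELECT * FROM enrollment_fact_refined ORDER BY student_class_id").fetchall()
```

Rows for further departments can then be loaded with their own academic_id values without colliding with the existing rows, since the academic key is part of the primary key.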
Figure 10 Refined Enrollment Data Mart

We use historical data to refine the data warehouse. We load the data from the departments in the College of Engineering and Computer Science in such a way that it conforms to the refined data mart design. We load real data for the Computer Science department and generate data for all other departments in the College of Engineering and Computer Science. This data is randomly generated, for experimental purposes only, using the Microsoft Excel 2007 RAND() and RANDBETWEEN() functions [7].

Enrollment Data Reporting

The data mart is ready to respond to user queries to generate summary reports of type (1) as stated in section 4.2 of this chapter. Figure 11 shows one such report in the form of a graph of the enrollment values. The graph shows the total number of undergraduate students enrolled in the Computer Science department for the past 30 years. Similarly, we can generate reports in the form of text input for the data mining system [3].

Figure 11 Enrollment Graph for Computer Science Department

The end users generate a variety of enrollment summary reports. Various queries execute against the enrollment data mart and display these user-desired reports. The reports can display data accurately by using INNER JOINs in query languages like MySQL/SQL [8]. The data access time depends on the query structure and the database table hierarchy. The queries govern the generation of summary reports. The designers optimize these queries to improve the speed of data access and the performance of the data warehouse [see chapter 6]. Query optimization makes the data warehouse efficient so that the end users can view the reports within a few milliseconds.

The Demo A reports and Demo B reports in the courseware let the user input values for the generation of enrollment reports. Figure 12 shows one such snapshot of user input in Demo B reports.
In this query, the user wants to know the new student enrollment count at California State University, Sacramento for the Mechanical department in the College of Engineering and Computer Science for the Spring 2004 semester.

Figure 12 Snapshot of User Input

The query executed against the refined data mart is as follows:

    SELECT Academic Dimension Table Department, Time Dimension Table Year,
           Time Dimension Table Term, New Students, Transferred Students
    FROM Enrollment Fact Table
    INNER JOIN Time Dimension Table
        ON Enrollment Time ID = Time ID
    INNER JOIN Student Classification Dimension Table
        ON Enrollment Student Class ID = Student Class ID
    INNER JOIN Academic Dimension
        ON Enrollment Academic ID = Academic ID
    WHERE Enrollment ID IN
        (SELECT Enrollment ID FROM Enrollment Fact Table
         WHERE Enrollment Time ID IN
               (SELECT Time ID FROM Time Dimension
                WHERE year = 2004 AND term = 'Spring')
           AND Enrollment Student Class ID IN
               (SELECT Student Class ID FROM Student Classification Dimension
                WHERE degree = 'undergraduate')
           AND Enrollment Academic ID IN
               (SELECT Academic ID FROM Academic Dimension
                WHERE university = 'California State University, Sacramento'
                  AND college = 'Engineering and Computer Science'
                  AND department = 'Mechanical'))

Table 3 shows the output retrieved by this query.

    Department   Year   Term     New   Transferred
    Mechanical   2004   Spring   47    87

Table 3 User-desired Query Report

In this chapter, we incorporate only the student enrollment data to generate the summary reports. In the next chapter, we extend the dimensional modeling design to build an enterprise data warehouse. In the enterprise data warehouse for enrollment data, we include the socioeconomic data together with the enrollment data. The reports generated against the enterprise data warehouse not only display the facts but also show users the effect of socioeconomic factors on the student enrollment values. In addition, the next chapter describes Demo C of the courseware.
Chapter 5 ENTERPRISE DATA WAREHOUSE

In this chapter, we visit the third demonstration, Demo C, of the courseware tool. Demo C illustrates the design process of the enterprise data warehouse for the enrollment case study in a systematic way. The design process clarifies how to expand the dimensional modeling design over an enterprise while conforming to the design of the enrollment data warehouse devised so far. Demo C provides an interactive facility for users to generate enrollment summary reports against the enterprise data warehouse. Figure 13 shows the design steps of Demo C from the courseware website.

Figure 13 Courseware Demonstration: Demo C

The first section introduces the idea of the enterprise data warehouse for the enrollment case study. It identifies the data sources valuable for the enrollment enterprise. The subsequent section describes the methodology of designing the enterprise data warehouse for the enrollment case study. The concluding section shows the increased capability of the enterprise data warehouse for data reporting and data analysis over a wide range of data sources such as enrollment data and socioeconomic data.

Enterprise Data Warehouse for Enrollment Case Study

An enterprise data warehouse (EDW) is a warehouse for the enterprise data and other relevant data. The EDW optimizes data for analysis, querying, and reporting purposes [10, 12, 2]. The EDW mainly integrates data from various systems. This data in combination is more valuable and can satisfy user queries that are unanswerable by any single operational system. The EDW updates the data periodically. Consequently, the underlying architecture of the EDW develops query processing support that offers efficiency and performance to the data warehouse [10, 2]. The best designs of an EDW consist of schema designs. The schemas are an integrated series of conformed dimension tables and transaction-grained fact tables.
They develop a business into a complete analytical warehouse [12, 5, 2].

The goal of the EDW for the enrollment case study is to provide consistent and accurate enrollment-related information in an organized and secure manner. In our case study, the enterprise is the university. The researchers, executive-level managers, administrators and enterprise owners are some of the end users of the EDW. The enrollment data becomes easily and quickly accessible to the end users via the enrollment EDW. Query processing and analysis against the enrollment EDW present the impact of social and economic factors in the State of California on the student enrollment statistics of the university.

The courseware tool uses the incremental approach described in the next section to design the enrollment EDW.

Incremental Approach on Enrollment EDW

There are two approaches to designing an enterprise data warehouse. The first is the traditional approach, in which the design is complete before any data is loaded into the data warehouse; the data is loaded only in the final stages. The second is the incremental approach, in which the EDW is built one subject area at a time. Unlike the traditional approach, the data is loaded for each subject area design individually. The design continues to iterate through aggressive feedback cycles with the users [10, 12, 2].

In our case study on enrollment analysis, we design the EDW using an incremental approach. The former demonstrations comprise the subject area of enrollment analysis. Demo C increments the design by adding a new subject area, enrollment analysis using socioeconomic data, to our data warehouse. We use the dimensional modeling principles to increment the design for the enrollment EDW.
Enrollment Case Study EDW Design

We begin with the process of interviewing [10, 2] to identify the socioeconomic factors that influence the enrollment statistics of the universities in California. The data collected consists of attributes like population, employment rate, graduation rate and tuition fees. These attributes, categorized by year, form the new dimension for socioeconomic data. Figure 14 shows the Socioeconomic dimension table.

Figure 14 Socioeconomic Dimension Table

Data mining tools and techniques [3], applied to the combined historical enrollment and socioeconomic data, can help predict student enrollment values for the coming years [1]. These predictions need to be stored in the data warehouse. We create a new fact table, the Prediction fact table, to store the forecasted results of data mining from [1]. According to Svetlana Aksenova (2007), the data mining results include the predicted values and the residual values for new students, transferred students, returning students and continuing students [1]. These values form the facts of the new fact table. The fact table must be related to the relevant data. Hence, the fact table needs to reference the dimension tables using foreign keys. The primary key of the fact table identifies each data row uniquely. Figure 15 shows the prediction fact table.

Figure 15 Prediction Fact Table

The star schema for the enrollment EDW consists of two fact tables along with their respective dimension tables. The dimensions for the prediction fact table are the time dimension, academic unit dimension, student classification dimension and socioeconomic dimension. Some of the dimensions of the prediction fact table are common with the enrollment fact table. Hence, both fact tables share these dimension tables. Figure 16 shows the star schema for the enrollment EDW.
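The idea of two fact tables sharing dimension tables can be sketched as below. The sketch is a simplified illustration under stated assumptions: all names and rows are hypothetical, the student classification dimension is omitted for brevity, the residual is assumed to be the actual value minus the predicted value, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
    CREATE TABLE academic_dim (academic_id INTEGER PRIMARY KEY, department TEXT);
    CREATE TABLE socioeconomic_dim (
        socioeconomic_id INTEGER PRIMARY KEY, year INTEGER,
        population INTEGER, employment_rate REAL,
        graduation_rate REAL, tuition_fees REAL);
    CREATE TABLE enrollment_fact (
        time_id INTEGER REFERENCES time_dim(time_id),
        academic_id INTEGER REFERENCES academic_dim(academic_id),
        new_students INTEGER,
        PRIMARY KEY (time_id, academic_id));
    CREATE TABLE prediction_fact (
        time_id INTEGER REFERENCES time_dim(time_id),
        academic_id INTEGER REFERENCES academic_dim(academic_id),
        socioeconomic_id INTEGER REFERENCES socioeconomic_dim(socioeconomic_id),
        new_predicted REAL, new_residual REAL,
        PRIMARY KEY (time_id, academic_id, socioeconomic_id));
    -- Made-up rows for illustration only.
    INSERT INTO time_dim VALUES (1, 2000, 'Fall');
    INSERT INTO academic_dim VALUES (1, 'Dept A'), (2, 'Dept B');
    INSERT INTO socioeconomic_dim VALUES (1, 2000, 1000000, 0.9, 0.8, 1500.0);
    INSERT INTO enrollment_fact VALUES (1, 1, 50), (1, 2, 30);
    INSERT INTO prediction_fact VALUES (1, 1, 1, 48.0, 2.0), (1, 2, 1, 33.0, -3.0);
""")

# Both fact tables are joined through the dimension keys they share.
comparison = conn.execute("""
    SELECT a.department, e.new_students, p.new_predicted
    FROM enrollment_fact AS e
    INNER JOIN prediction_fact AS p
        ON p.time_id = e.time_id AND p.academic_id = e.academic_id
    INNER JOIN academic_dim AS a ON a.academic_id = e.academic_id
    INNER JOIN time_dim AS t ON t.time_id = e.time_id
    WHERE t.year = 2000 AND t.term = 'Fall'
    ORDER BY a.department
""").fetchall()
```

Because both fact tables use the same time_id and academic_id values, actual and predicted figures line up row by row without any translation between key spaces.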
We load the socioeconomic data and prediction data [1] from the historical data sources into the socioeconomic dimension table and the prediction fact table respectively. Correspondingly, we load the reference keys to the dimension tables into the prediction fact table.

Figure 16 Enrollment EDW

Demo C shows how to build a series of interlocking star schemas [4], where each star schema corresponds to one subject area. The design of the enrollment EDW using an incremental approach is now complete. The next section shows the importance of building the enrollment EDW. It explains how the EDW provides value to the organization. The data reporting and data analysis performed against the EDW verify that the enrollment EDW provides a consistent and pertinent view of enterprise data [2].

Enrollment EDW Data Reporting

The enrollment data warehouse is ready for testing and deployment. Testing evaluates data reporting and ETL processing on the enrollment and prediction data. It makes the enrollment EDW ready to respond to user queries and generate summary reports not only of type (1) but also of type (2), as stated in section 4.2 of chapter 4. It ensures quality, consistency and correctness in the user-desired data reports generated by user queries [5].

In the case study for enrollment, we write queries in the MySQL query language and then test the queries for data reporting purposes. The following example gives an idea of the query logic needed to retrieve data as required by the end users. Let us say the user needs to compare the actual enrollment value and the predicted enrollment value for newly enrolled graduate students in Fall 2000 for the College of Engineering and Computer Science.
One of the ways to form a query is as follows:

    SELECT Student Classification Dimension Degree, Academic Dimension Department,
           Time Dimension Year, Term, New Enrollment Count, New Predicted Value
    FROM Prediction Fact
    INNER JOIN Socioeconomic Dimension
        ON (Prediction Socioeconomic ID = Socioeconomic ID),
    Enrollment Fact
    INNER JOIN Student Classification Dimension
        ON (Enrollment Student Classification ID = Student Classification ID)
    INNER JOIN Academic Dimension
        ON (Enrollment Academic ID = Academic ID)
    INNER JOIN Time Dimension
        ON (Enrollment Time ID = Time ID)
    WHERE Enrollment ID IN
        (SELECT Enrollment ID FROM Enrollment Fact
         WHERE Enrollment Time ID IN
               (SELECT Time ID FROM Time Dimension
                WHERE year = 2000 AND term = 'fall')
           AND Enrollment Academic ID IN
               (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                  AND university = 'California State University, Sacramento')
           AND Enrollment Student Classification ID IN
               (SELECT Student Classification ID FROM Student Classification Dimension
                WHERE degree = 'Graduate'))
      AND Prediction Fact ID IN
        (SELECT Prediction Fact ID FROM Prediction Fact
         WHERE Prediction Time ID IN
               (SELECT Time ID FROM Time Dimension
                WHERE year = 2000 AND term = 'fall')
           AND Prediction Socioeconomic ID IN
               (SELECT Socioeconomic ID FROM Socioeconomic Dimension
                WHERE year = 2000)
           AND Prediction Academic ID IN
               (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                  AND university = 'California State University, Sacramento')
           AND Prediction Student Classification ID IN
               (SELECT Student Classification ID FROM Student Classification Dimension
                WHERE degree = 'Graduate'))
      AND Enrollment Academic ID = Prediction Academic ID
      AND Enrollment Time ID = Prediction Time ID
      AND Enrollment Student Classification ID = Prediction Student Classification ID;

Table 4 shows the report on the actual enrollment values and the predicted enrollment values obtained from this query.
    Department             Actual New Students Enrolled   Predicted number of new students
    Computer Science       39                             38
    Civil                  32                             187
    Mechanical             73                             29
    Electrical             70                             158
    Computer Engineering   130                            85

Table 4 Enrollment Prediction Report

Such prediction reports can give the predicted values and the actual values for the past years. These reports can form input for data mining tools to predict the enrollment values for future years.

This chapter concludes the design of the enterprise data warehouse for the enrollment case study. To summarize, the courseware provided steps to build an enterprise data warehouse for the enrollment analysis case study. In the next chapter, we discuss the performance of the enterprise data warehouse and describe the performance-improving technique called aggregation.

Chapter 6 AGGREGATION ON ENROLLMENT DATA WAREHOUSE

In the earlier chapters, we presented the three demonstrations of the courseware tool. These demonstrations illustrated how to build an enterprise data warehouse (EDW) using a case study. In this chapter, we explain the final demonstration, Demo D, of the courseware tool. This demonstration provides an example of improving the data warehouse performance and the method to implement it. We accomplish this by using aggregation on the data warehouse.

Aggregation

An aggregate value is the result of an aggregate function. The aggregate functions are mathematical functions such as sum, average, maximum, minimum or any user-defined function [13]. Figure 17 shows the aggregate functions.

Figure 17 Aggregate Functions

In data warehousing, aggregation is the process of designing the schema for aggregate values [14]. Figure 18 summarizes this process. To implement aggregation on the data warehouse, we start by identifying the aggregates vital for the enterprise. The aggregate valuable for the enrollment data warehouse is the total headcount of enrollment. The second step is to design the enrollment aggregate schema for the aggregate values.
Next, we calculate the sum totals using aggregate functions. Finally, we load the calculated enrollment values into the aggregate fact table.

Figure 18 Aggregate Design Methodology

The idea of aggregation is to improve the performance of the data warehouse. For this purpose, in the next section we identify the parameter that measures data warehouse performance and discuss how aggregation influences it.

Performance Parameter

Data reporting demands that the end users receive quick and accurate results. The data reports should reach the user within a few milliseconds. This is one of the key aspects of data warehouse performance. Hence, the performance parameter of the data warehouse is the query execution time. To improve the performance of the data warehouse, we need to decrease the query execution time. We can accomplish this by designing aggregate tables.

In our case study, the enrollment fact table contains the enrollment values, which are atomic in nature. The management or the end users perform a number of data reporting operations on the data warehouse. Let us assume that the majority of the data reports generated consist of aggregate values. For example, the users query to retrieve data on the total enrollment for the year 2000.
Each time the user needs this data, the data warehouse needs to execute the following query [8]:

    SELECT Time Dimension Table Year,
           SUM (New Students + Transferred Students + Returning Students
                + Continuing Students) AS total enrollment
    FROM Enrollment Fact Table
    INNER JOIN Time Dimension Table
        ON Enrollment Time ID = Time ID
    INNER JOIN Student Classification Dimension Table
        ON Enrollment Student Class ID = Student Class ID
    INNER JOIN Academic Dimension
        ON (Enrollment Academic ID = Academic ID)
    WHERE Enrollment ID IN
        (SELECT Enrollment ID FROM Enrollment Fact Table
         WHERE Enrollment Time ID IN
               (SELECT Time ID FROM Time Dimension Table
                WHERE year = 2000)
           AND Enrollment Academic ID IN
               (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                  AND university = 'California State University, Sacramento'))
    GROUP BY Time Dimension year

Each time we execute this query, it accesses many data rows even though the individual values in these rows are not required by the end users in the query result. The query reads all the enrollment values from each row and hence takes a long time to execute. The sum function over a particular year gives us the total enrollment count for that year. The sum adds the enrollment values across both terms (fall, spring) of the year and across both graduate and undergraduate students. It also has to add up the counts of new, transferred, returning and continuing students for that year. The query needs time to perform the calculation for summing up all the enrollment values. Hence, the execution time for this query is the sum of the time (in milliseconds) to connect to the server where the data warehouse is located, the time (in milliseconds) to retrieve all the required cells, and the time (in milliseconds) to calculate the sum function. To improve the efficiency of the data warehouse, we need to make these queries run faster.
We can reduce the execution time in the following two ways:

1. By eliminating the time to calculate the sum function
2. By reducing the number of rows retrieved, and thus the retrieval time

Aggregate tables implement both of these ways to reduce the query execution time. The queries run faster if they read only the pre-calculated aggregate values instead of performing calculations on a number of rows. We calculate the aggregates of the commonly requested values and store them in a table. By doing this, the query has to access only a few rows in the aggregate tables instead of accessing all the data rows containing enrollment values. In the next section, we design the aggregate tables and the aggregate schema for the enrollment case study.

Aggregate Schema Design on Enrollment Case Study

The aggregate schema is the star schema of aggregate tables. Each aggregate table is a fact table. It holds the aggregate values for one aggregate function [13]. The process starts by identifying the frequently queried aggregates. The interviewing process identifies the aggregate functions that the users query frequently. Let us assume that the most required aggregate for our enrollment data warehouse is the enrollment headcount grouped by year. We design an aggregate table for the sum function on enrollment values.

Figure 19 Enrollment Aggregate Table

Figure 19 shows the sum aggregate fact table. An aggregate fact table is similar to a base fact table except that the facts are the aggregate values. The aggregate fact table consists of the foreign keys that reference the dimension tables. In other words, the aggregate values are stored in the fact tables categorized by the dimensions. The interview process helps us identify the dimensions for the aggregate table. We derive these dimensions from the base dimensions: the socioeconomic dimension, student classification dimension, time dimension and academic unit dimension. We identify that the users require aggregates calculated on a yearly basis.
Hence, the time dimension for this schema should have a primary key for each year. In this schema, we do not need the student classification dimension because the aggregate value covers both the undergraduate and the graduate student count. We can modify the academic dimension as per user needs. Here, we do not modify the academic dimension or the socioeconomic dimension. We design the aggregate schema by connecting these dimension tables with the aggregate fact table.

An aggregate star schema [14] is similar to the base schema, with the aggregate table as the fact table. The dimension tables conform to the base schema design. The base dimension tables serve as a starting point for the aggregate dimension tables. Hence, we can reuse the same dimension tables or redesign them with modifications if required. The primary key of the aggregate fact table is the combination of the reference keys that reference the dimension tables [14].

Figure 20 Enrollment Aggregate Schema

Figure 20 shows the aggregate schema for the sum function. In this manner, we can design the aggregate schema for the other desired aggregate functions. After designing the aggregate schema, we calculate the sum aggregate for each year and load these values into the fact table. A query executed on the aggregate table should display the same output as the equivalent query on the base schema. The output should be accurate and consistent.

When we implement aggregation, the most important task is the maintenance of the aggregate tables [14]. The addition of new values or the deletion of old values changes the aggregate value of the attribute. In our case study, we would add new enrollment values for the upcoming years. The aggregate table should reflect these additions of enrollment values. The new aggregate values for the upcoming years need to be calculated and loaded into the aggregate table. This task of maintaining the aggregate table and keeping it current is crucial in aggregation.
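Loading and refreshing an aggregate table of this kind can be sketched as follows. The sketch rests on several assumptions that are not from the source: the names enrollment_year_agg and refresh_aggregate, the columns, and the rows are made up; the year itself is used as the aggregate's time key in place of a separate yearly time dimension table; the socioeconomic dimension is left out; and SQLite stands in for MySQL. It recomputes the yearly totals from the base schema and checks that the aggregate table returns the same output as the base query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
    CREATE TABLE enrollment_fact (
        time_id INTEGER REFERENCES time_dim(time_id),
        academic_id INTEGER,
        new_students INTEGER, transferred_students INTEGER,
        returning_students INTEGER, continuing_students INTEGER,
        PRIMARY KEY (time_id, academic_id));
    -- Aggregate fact table: one row per year and academic unit.
    CREATE TABLE enrollment_year_agg (
        year INTEGER, academic_id INTEGER, total_enrollment INTEGER,
        PRIMARY KEY (year, academic_id));
    -- Made-up base rows for illustration only.
    INSERT INTO time_dim VALUES (1, 2000, 'Fall'), (2, 2000, 'Spring'), (3, 2001, 'Fall');
    INSERT INTO enrollment_fact VALUES
        (1, 1, 10, 5, 3, 100), (2, 1, 8, 4, 2, 90), (3, 1, 12, 6, 1, 110);
""")

BASE_TOTALS = """
    SELECT t.year, f.academic_id,
           SUM(f.new_students + f.transferred_students
               + f.returning_students + f.continuing_students)
    FROM enrollment_fact AS f
    INNER JOIN time_dim AS t ON t.time_id = f.time_id
    GROUP BY t.year, f.academic_id
"""

def refresh_aggregate(conn):
    """Recompute the yearly totals from the base schema and reload them."""
    with conn:  # both statements are committed together
        conn.execute("DELETE FROM enrollment_year_agg")
        conn.execute("INSERT INTO enrollment_year_agg " + BASE_TOTALS)

refresh_aggregate(conn)
from_base = conn.execute(BASE_TOTALS + " ORDER BY t.year").fetchall()
from_agg = conn.execute(
    "SELECT year, academic_id, total_enrollment FROM enrollment_year_agg ORDER BY year"
).fetchall()
```

A full reload is the simplest way to keep the aggregate consistent with the base data; for large fact tables an incremental update of only the affected years would be a more economical alternative.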
Short programs such as stored procedures, triggers, or application code help maintain the aggregate table in the data warehouse. These programs execute every time a change to the values in the base schema affects the aggregate schema. The maintenance is an ongoing process: aggregate tables are refreshed whenever a new row is added or an existing row is updated.

Performance Analysis

The aggregate tables store pre-aggregated values that would otherwise be aggregated during query execution. Aggregation reduces the need for inner joins and GROUP BY clauses in queries. A query without aggregation has to scan numerous rows and join the related values to display the result. A query against the aggregate tables, on the other hand, reads only a small number of rows. We design the aggregate schema so that each row of the aggregate table summarizes, on average, 20 rows of the base table [2, 14].

Now, let us compare the performance of query execution on the data warehouse with and without aggregation. Considering the previous example, the user needs to know the total number of students enrolled in the year 2000 in the College of Engineering and Computer Science at CSUS. We discussed the query formed against the base schema before. This time we add one more column, COUNT(*), labeled as scanned rows.
The query is as follows [8]:

    SELECT Time_Dimension_Table.Year,
           SUM(New_Students + Transferred_Students
               + Returning_Students + Continuing_Students) AS Total_enrollment,
           COUNT(*) AS Scanned_rows
    FROM Enrollment_Fact_Table
    INNER JOIN Time_Dimension_Table
            ON Enrollment_Time_ID = Time_ID
    INNER JOIN Student_Classification_Dimension_Table
            ON Enrollment_Student_Class_ID = Student_Class_ID
    INNER JOIN Academic_Dimension
            ON (Enrollment_Academic_ID = Academic_ID)
    WHERE Enrollment_ID IN
          (SELECT Enrollment_ID
           FROM Enrollment_Fact_Table
           WHERE Enrollment_Time_ID IN
                 (SELECT Time_ID FROM Time_Dimension_Table WHERE year = 2000)
             AND Enrollment_Academic_ID IN
                 (SELECT Academic_ID FROM Academic_Dimension
                  WHERE college = 'Engineering and Computer Science'
                    AND university = 'California State University, Sacramento'))
    GROUP BY Time_Dimension_Table.Year

This query executes on the enrollment star schema (the base star schema), which does not implement aggregation. Table 5 shows the output of this query.

    Year    Total enrollment    Scanned rows
    2000    7292                20

Table 5 Query Output without Aggregation

The COUNT(*) function returns the number of base rows accessed for each result row (i.e., for each year). The total enrollment count for a particular year is the sum of the enrollment values for two types of degree students (graduate and undergraduate), for two terms (fall and spring) and for five programs of the college (Civil, Mechanical, Computer Science, Computer Engineering, and Electrical Engineering). Thus, the number of rows accessed to obtain the total enrollment count for the year 2000 is 2 * 2 * 5 = 20.

Now, let us form the query against the aggregate schema that outputs the same enrollment result.
    SELECT year, total_enrollment, COUNT(*) AS scanned_rows
    FROM Enrollment_Aggregate_Fact
    WHERE year = '2000'
    GROUP BY year;

    Year    Total enrollment    Scanned rows
    2000    7292                1

Table 6 Query Output with Aggregation

We execute both queries several times, note the execution time of each run, and calculate the mean of these observations. The first query takes about 0.050 milliseconds to execute. The second query, which uses aggregation, needs about 0.030 milliseconds on the enrollment data warehouse, roughly 40% less than the first. We carried out such executions against the enrollment data warehouse for a variety of other queries. These experiments and observations verified that aggregation reduces the query execution time and improves the performance of the enrollment data warehouse.

This chapter concludes the discussion on courseware demonstrations. In the next chapter, we discuss the evaluation of the courseware and provide a perspective on this project.

Chapter 7
COURSEWARE EVALUATION

In the earlier chapters, we completed the discussion on the contents of the courseware. In this chapter, we assess the courseware in order to substantiate its operational success. The success depends on how effective the end users (learners or students) find the courseware tool in understanding the data warehousing topic.

As a part of this project, we carried out a study to test the effectiveness of the courseware tool. We integrated the courseware with an introductory data warehousing and data mining class in Spring 2010. We introduced the courseware as an eLearning tool to the students of the Computer Science course CSc 177: Data Warehousing and Data Mining, at California State University, Sacramento.
This class of 20 students evaluated the first version of the courseware. The students were upper-division undergraduate and graduate students of the Computer Science Department. We conducted a survey on the courseware in this class. The students stayed personally engaged in using the courseware to understand the fundamentals of data warehousing.

The overall assessment of the courseware from this student group was extremely encouraging. We received positive feedback from the survey takers. They found the courseware very accessible and helpful for understanding the fundamentals of data warehousing. They also found the figures and examples supportive. According to them, the courseware complemented the course lectures very well. In addition, the students were able to follow the steps and illustrations in the courseware very easily. They found the simplicity and natural progression of the courseware website useful for learning. The quizzes in the courseware were handy for reviewing before tests.

We also obtained constructive feedback from the students on the courseware. The feedback suggests that the results generated from the demo required further verification. It would be beneficial to integrate a data-preprocessing component and a data-mining component into the courseware. Improvements in the enrollment prediction system, the data mining system and the application for the enterprise data warehouse would also be advantageous.

Based on the input from this student group, we added an online feedback component for the tool users. Figure 21 shows a snapshot of this component. This component collects tool evaluation data from the users, providing us with a quantitative measurement of the degree of user satisfaction. It also allows the users to offer constructive suggestions to us on an ongoing basis. We believe that this component is necessary for the success of a developing courseware.
It makes the courseware more efficient and durable while offering it scope for improvement.

Figure 21 Feedback Component

Chapter 8
CONCLUSION

Although there are other online courseware tools for various learning topics, such as that of Kevin Woods (2007) [9], we have not found an online courseware devoted exclusively to data warehousing such as this one. This tool covers the whole development life cycle of a data warehouse using a case study with a set of supplementary examples. The main advantages of the courseware are its usefulness, its scope, and its accessibility to beginning data warehouse designers and developers.

Through this courseware, we presented the comprehensive design and functionality of a web-based tool for learning the fundamental concepts of data warehousing. The courseware demonstrates the importance of data warehousing in an enterprise. It offers a systematic method to design a data warehouse using a case-based approach. In this case study, we develop the data warehouse for the university using a bottom-up approach [2, 10, 12]. The data sources include: (1) the enrollment data from California State University at Sacramento, and (2) the related public data of California [1]. The courseware not only provides the enrollment data warehouse design for the university but also demonstrates the capability of the data warehouse for data reporting, data mining and data analysis on these data sources.

The courseware further illuminates the performance parameters of a data warehouse. It validates the improvement in data warehouse performance by comparing a performance parameter (query execution time) on the data warehouse with and without aggregation.

Finally, we substantiate the success of the courseware by integrating it into the data warehousing class and obtaining continuous feedback from the students. A feedback link on the website contributes to the ongoing evaluation of the courseware by online users.
The courseware provides enormous opportunities for development. There are many areas for future research work extending this project, which include strengthening the case study structure, refining the concept descriptions and web presentation, and adding new components on other related topics. The list of case study topics to be added includes ETL, data mining and data preprocessing [3].

This project allowed me to combine the theory and implementation of data warehousing principles into a great learning experience. It offered practice in data generation, design, real-time data collection, data loading, data extraction and data analysis. It also provided an opportunity to develop a 3-tier application using PHP, HTML, JavaScript and MySQL from scratch. In addition, it provided a foundation for future professional progress in technical areas such as data warehousing and web development. Future work on this project can add new topics, such as ETL, to the courseware.

APPENDICES

APPENDIX A
Enrollment Report

Enrollment Report generated by Courseware on the data warehouse for last 5 years for undergraduate students enrolled in Engineering College

    Department            Year  Term    New  Transferred  Continuing  Returning
    Electrical            2006  Fall    41   67           21          86
    Civil                 2006  Spring  89   127          47          28
    Mechanical            2006  Fall    64   28           57          84
    Electrical            2006  Spring  57   61           132         50
    Computer Engineering  2006  Fall    120  137          45          109
    Computer Science      2006  Fall    20   80           320         54
    Mechanical            2006  Spring  149  102          44          122
    Civil                 2006  Fall    80   119          42          148
    Computer Engineering  2006  Spring  103  61           118         101
    Computer Science      2006  Spring  20   50           380         76
    Electrical            2007  Spring  25   117          106         95
    Computer Engineering  2007  Fall    123  96           63          46
    Computer Science      2007  Fall    25   85           321         32
    Mechanical            2007  Spring  51   89           65          142
    Civil                 2007  Fall    61   90           43          70
    Computer Engineering  2007  Spring  44   92           25          114
    Computer Science      2007  Spring  45   44           381         45
    Electrical            2007  Fall    99   43           39          60
    Civil                 2007  Spring  94   106          29          38
    Mechanical            2007  Fall    86   113          109         49
    Civil                 2008  Fall    35   80           79          113
    Computer Engineering  2008  Spring  149  145          11          91
    Computer Science      2008  Spring  100  38           375         6
    Electrical            2008  Fall    81   56           19          136
    Civil                 2008  Spring  90   27           89          147
    Mechanical            2008  Fall    118  107          15          78
    Electrical            2008  Spring  136  63           104         125
    Computer Engineering  2008  Fall    139  85           14          44
    Computer Science      2008  Fall    45   89           321         5
    Mechanical            2008  Spring  107  88           116         50
    Civil                 2009  Spring  33   74           38          101
    Mechanical            2009  Fall    53   86           33          119
    Electrical            2009  Spring  29   148          17          27
    Computer Engineering  2009  Fall    99   89           74          52
    Computer Science      2009  Fall    340  86           322         30
    Mechanical            2009  Spring  94   94           30          29
    Civil                 2009  Fall    150  140          11          38
    Computer Engineering  2009  Spring  117  104          82          54
    Computer Science      2009  Spring  320  44           366         60
    Electrical            2009  Fall    150  148          27          30
    Computer Engineering  2010  Fall    71   83           45          89
    Computer Science      2010  Fall    340  82           322         10
    Mechanical            2010  Spring  98   28           27          50
    Civil                 2010  Fall    73   122          37          114
    Computer Engineering  2010  Spring  149  78           45          31
    Computer Science      2010  Spring  260  49           344         50
    Electrical            2010  Fall    98   121          26          125
    Civil                 2010  Spring  138  125          91          50
    Mechanical            2010  Fall    49   65           43          96
    Electrical            2010  Spring  76   120          40          46

APPENDIX B
Enrollment with Socioeconomic Report

Enrollment Reports generated by Courseware on the data warehouse for new graduate students for last 5 years with the socioeconomic factors

    Department  Year  Term    Unemployment rate  Tuition ($)  BS graduate rate  New enrolled
    Electrical  2006  Fall    6                  1008         51                97
    CSc         2006  Spring  6                  1008         51                22
    Civil       2006  Spring  6                  1008         51                111
    Mechanical  2006  Fall    6                  1008         51                129
    Comp Engg.  2006  Fall    6                  1008         51                54
    Electrical  2006  Spring  6                  1008         51                95
    CSc         2006  Fall    6                  1008         51                30
    Civil       2006  Fall    6                  1008         51                58
    Mechanical  2006  Spring  6                  1008         51                45
    Comp Engg.  2006  Spring  6                  1008         51                128
    Civil       2007  Spring  6                  1125         51                56
    Mechanical  2007  Fall    6                  1125         51                114
    Comp Engg.  2007  Fall    6                  1125         51                28
    Electrical  2007  Spring  6                  1125         51                68
    CSc         2007  Fall    6                  1125         51                32
    Civil       2007  Fall    6                  1125         51                120
    Mechanical  2007  Spring  6                  1125         51                119
    Comp Engg.  2007  Spring  6                  1125         51                49
    Electrical  2007  Fall    6                  1125         51                117
    CSc         2007  Spring  6                  1125         51                13
    Mechanical  2008  Fall    5.9                1230         50                27
    Comp Engg.  2008  Fall    5.9                1230         50                150
    Electrical  2008  Spring  5.9                1230         50                58
    CSc         2008  Fall    5.9                1230         50                200
    Civil       2008  Fall    5.9                1230         50                113
    Mechanical  2008  Spring  5.9                1230         50                61
    Comp Engg.  2008  Spring  5.9                1230         50                126
    Electrical  2008  Fall    5.9                1230         50                27
    CSc         2008  Spring  5.9                1230         50                430
    Civil       2008  Spring  5.9                1230         50                59
    CSc         2009  Fall    5.8                1335         41                230
    Civil       2009  Fall    5.8                1335         41                109
    Mechanical  2009  Spring  5.8                1335         41                49
    Comp Engg.  2009  Spring  5.8                1335         41                116
    Electrical  2009  Fall    5.8                1335         41                64
    CSc         2009  Spring  5.8                1335         41                340
    Civil       2009  Spring  5.8                1335         41                99
    Mechanical  2009  Fall    5.8                1335         41                44
    Comp Engg.  2009  Fall    5.8                1335         41                31
    Electrical  2009  Spring  5.8                1335         41                40
    Civil       2010  Fall    5.8                1440         32                87
    Mechanical  2010  Spring  5.8                1440         32                43
    Comp Engg.  2010  Spring  5.8                1440         32                135
    Electrical  2010  Fall    5.8                1440         32                56
    CSc         2010  Spring  5.8                1440         32                345
    Civil       2010  Spring  5.8                1440         32                85
    Mechanical  2010  Fall    5.8                1440         32                126
    Comp Engg.  2010  Fall    5.8                1440         32                36
    Electrical  2010  Spring  5.8                1440         32                133
    CSc         2010  Fall    5.8                1440         32                654

APPENDIX C
Enrollment Prediction Report

Enrollment prediction report generated by Courseware on the data warehouse for undergraduate students for last 5 years

    Department            Year  Term    Total predicted  Total enrolled
    Electrical            2006  Spring  185              300
    Computer Science      2006  Fall    525.686          474
    Computer Engineering  2006  Fall    145              411
    Civil                 2006  Spring  114              291
    Mechanical            2006  Fall    95               233
    Computer Science      2006  Spring  479.003          526
    Computer Engineering  2006  Spring  288              383
    Electrical            2006  Fall    90               215
    Mechanical            2006  Spring  341              417
    Civil                 2006  Fall    321              389
    Electrical            2007  Spring  179              343
    Computer Science      2007  Fall    528.076          463
    Computer Engineering  2007  Fall    201              328
    Civil                 2007  Spring  470              267
    Mechanical            2007  Fall    552              357
    Computer Science      2007  Spring  475.377          515
    Computer Engineering  2007  Spring  544              275
    Electrical            2007  Fall    289              241
    Mechanical            2007  Spring  205              347
    Civil                 2007  Fall    160              264
    Computer Engineering  2008  Fall    192              282
    Civil                 2008  Spring  467              353
    Mechanical            2008  Fall    374              318
    Computer Science      2008  Spring  467.678          519
    Computer Engineering  2008  Spring  236              396
    Electrical            2008  Fall    35               292
    Mechanical            2008  Spring  308              361
    Civil                 2008  Fall    135              307
    Electrical            2008  Spring  206              428
    Computer Science      2008  Fall    527.456          460
    Civil                 2009  Spring  35               246
    Mechanical            2009  Fall    370              291
    Computer Science      2009  Spring  444.786          790
    Computer Engineering  2009  Spring  155              357
    Electrical            2009  Fall    107              355
    Mechanical            2009  Spring  451              247
    Civil                 2009  Fall    309              339
    Electrical            2009  Spring  305              221
    Computer Science      2009  Fall    525.891          778
    Computer Engineering  2009  Fall    79               314
    Computer Science      2010  Spring  425              703
    Computer Engineering  2010  Spring  402              303
    Electrical            2010  Fall    439              370
    Mechanical            2010  Spring  119              203
    Civil                 2010  Fall    392              346
    Electrical            2010  Spring  359              282
    Computer Science      2010  Fall    516.457          754
    Computer Engineering  2010  Fall    418              288
    Civil                 2010  Spring  221              404
    Mechanical            2010  Fall    246              253

APPENDIX D
Documentation on Courseware Website

Please see the attached CD-ROM containing the code files for the Courseware website design in HTML, PHP, JavaScript and MySQL.

BIBLIOGRAPHY

1. Svetlana S. Aksenova, “Enrollment projection through data mining”, MS project report, CSUS, 2005.
2. Prof. Lu, CSc 177 Lecture Notes, Spring 2010. Course Website: http://gaia.ecs.csus.edu/~mei/177/csc177.html
3. Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd Edition, Morgan Kaufmann Publishers, 2006.
4. Christopher Adamson, Michael Venerable, “Data Warehouse Design Solutions”, Wiley Publishing Inc., 1998.
5. Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite, “The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses”, Wiley Publishing Inc., 1998.
6. Computer Science Reports, Office of Institutional Research, California State University, Sacramento [Online]. Available: http://www.oir.csus.edu/Reports/FactBook/DEPT/CSC.cfm
7. Microsoft Excel Support [Online]. Available: http://office.microsoft.com/en-us/excel-help/
8. MySQL Reference Manual [Online]. Available: http://dev.mysql.com/doc/refman/5.0/en/
9. Kevin C. Woods, “XML data representation and transformations for bioinformatics”, MS project report, CSUS, 2007.
10. Imhoff, Galemmo and Geiger, “Mastering Data Warehouse Design”, Wiley Publishing Inc., 2003.
11. W. H. Inmon, “Building the Data Warehouse”, John Wiley & Sons, Inc., NY, 2005.
12. Ralph Kimball, Margy Ross, “The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling”, Wiley Publishing Inc., 2003.
13. Jim Gray et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Kluwer Academic Publishers, 1997.
14. Christopher Adamson, “Mastering Data Warehouse Aggregates: Solutions for Star Schema Performance”, Wiley Publishing Inc., 2006.