Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Microsoft Jet Database Engine wikipedia , lookup
Extensible Storage Engine wikipedia , lookup
Entity–attribute–value model wikipedia , lookup
Relational model wikipedia , lookup
Functional Database Model wikipedia , lookup
Clusterpoint wikipedia , lookup
Strategic University Management System Exploring Business Intelligence with Pentaho Marcella Wong CS 491 Abstract Business Intelligence, or BI, focuses on the consolidation, organization and reporting of data for analytical purposes. This project explores Pentaho, a third party open-source BI tool, through the build and operation of a practical BI environment for a university application. The environment consists of reporting and graphing capabilities that allows the user, most likely a university administrator, the ability to analyze databases consisting of student survey results and test scores. Users will have the ability to visualize, for example, where CSULA students stand compared to the rest of the nation in terms of certain standardized tests or how students and faculty members rate certain learning outcomes. Administrators will be able to identify which areas students are struggling in and excelling in. These features allow the user to hone in on the important trends and statistics regarding his/her institution, providing the basis for making informed decisions. Introduction Databases exist in all types of companies and institutions. They contain a lot of valuable information, but are often underutilized. Companies, for example, may want to know what impact price change will have on purchasing behavior. The answer can be found in their database. The issue is finding a smart and easy way to pull appropriate data from tables and present it in such a way that is easy to analyze and evaluate. For this project, we have applied the same concepts to a university environment. Thousands of student records are stored in school databases, including class grades and GPA’s. Analysis on student databases can give faculty members insight on, for instance, how much students actually learn by viewing the correlation between their class grades and their scores on an MFT exam. Through BI, those analyses can easily be made by using BI tools to select appropriate data, transform and operate on them and generate graphs and charts to let faculty evaluate for themselves how well our students are doing. For this project, the tools used for the handling of data, creation and editing of reports will be from Pentaho. Even though Pentaho can be integrated with other non-Pentaho tools, those provided by Pentaho are already compatible with each other so no further configuration of different modules will be necessary. The goal of this project is to introduce the capabilities that BI can offer. To do that, we will create a system to effectively turn raw data into a more visually appealing form. We will be using student information and the CSNS survey information provided by the CSULA Computer Science department. Various reports will be built with both types of data. Technological Background The technology behind Pentaho is their BI Platform, which is made up of the Pentaho server and an Eclipse-based Design Studio. In addition, the Pentaho Data Integration and Pentaho Report Designer tools are imperative to the completion of this project. The Solution Engine is the main component of the BI Platform. It controls the access to data and reports by connecting the data sources with user components. It reads and executes an Action Sequence. Then, it performs certain actions and accesses different types of data based on the specific Action Sequence. Pentaho BI Platform Pentaho Report Solution Engine Database Figure 1. The Solution Engine connects to the data source, takes the necessary data and produces a report. An Action Sequence is an XML document that defines the task that the solution engine performs. It contains components that the Solution Engine requires before execution, which includes inputs (from data source or user), data types, resource types and action types. Inputs are parameters that are taken either from users or from data sources. <inputs> <quarter type="string"> <sources> <request>quarter</request> </sources> <default-value><![CDATA[Spring]]></default-value> </quarter> </inputs> Figure 2. Declaring inputs in Action Sequences. In Figure 2, user input is required and the value is assigned to the variable, “quarter.” If no user input was provided, then the default value for “quarter” is “Spring.” Resource types include the additional information that the Solution Engine needs to know. <resources> <report-definition> <solution-file> <location>figure1.xml</location> <mime-type>text/xml</mime-type> </solution-file> </report-definition> </resources> Figure 3. Declaring resource types in an Action Sequence. In Figure 3, the type of resource shown is a “solution-file,” which is a file on the file system relative to the location of the Action Sequence document. In the example above, the file “figure1.xml” contains information that the Solution Engine needs. “Figure1.xml” is another XML document that contains the format of the report. How the data is displayed, the types of graphs used, etc., are all specified in the “Figure1.xml” document. It is created by using the Pentaho Report Designer tool. Types of action contains the steps the Solution Engine must follow in order to complete the task. <action-definition> <component-name>SQLLookupRule</component-name> <action-inputs> <quarter type="string"/> </action-inputs> <action-outputs> <prepared_component type="sql-query" mapping="chart"/> <Quarter type="string"/> <Percentage type="string"/> <ID type="string"/> </action-outputs> <component-definition> <query><![CDATA[SELECT Quarter, Percentage, ID FROM figure1 WHERE ID in ({quarter});]]> </query> <live><![CDATA[true]]></live> <driver><![CDATA[com.mysql.jdbc.Driver]]></driver> <user-id><![CDATA[root]]></user-id> <password><![CDATA[abcd]]></password> <connection> <![CDATA[jdbc:mysql://localhost:3307/projects]]> </connection> </component-definition> </action-definition> Figure 4. Action types in an Action Sequence. The Action Sequence in Figure 4 instructs the Solution Engine to go into the database based on the given connections and perform the SQL statement and return the results for the variables “Quarter,” “Percentage,” and “ID.” Once the Solution Engine gathers all the data and performs the instructed operations, a report is produced. System Overview As mentioned, several tools must be used to implement an entire BI system using Pentaho – Pentaho Data Integration, Pentaho Report Designer and Pentaho Design Studio. An important feature of BI is the ability to take raw data and present it in such a way that makes it meaningful. In this project, a set of student data and survey was fed into this system. Some fields include class grades, exam scores for student data and survey ratings for survey data. Data sources for this particular system are Excel sheets and database. The Pentaho Data Integration tool was used to join various Excel sheets with databases, perform operations on selected fields and return a reportable table. Several Excel files containing Collaborative, Written and Oral Communication ratings for each student were provided. Using the Data Integration tool, transformations were created. A transformation contains a set of rules that the tool must perform. In this case, the Excel files were combined to calculate the average rating for each attribute. The calculated average was loaded into the database as a separate table, which will be used later for reporting. Also with the same tool, jobs were created. Jobs are similar to process chains. For example, if more than one transformation needs to be executed, a job can execute all of them at once or in sequence. Once the transformations were complete, Report Designer was used to build the design for the report. From there, the table to be reported on is specified and the format and layout of the report is produced. Data, tables and charts are organized to meet reporting needs. When the appropriate layout is created, a .xml and .xaction file will be generated. Both files are fed into the Pentaho server and are read and executed by the Solution Engine. The XML file contains the layout information of the report (e.g. the type of chart and the header and footer information). The .xaction file is the Action Sequence that contains instructions to retrieving data. An Action Sequence contains connection information and SQL statements stating what to do in the database. The Action Sequence that Report Designer produces may not be enough to satisfy reporting needs, since it will only contain connection information. For situations where user input is required, Design Studio is needed. Some of the reports that have been created allow users to enter which quarter he/she wants to display. Reports like that require Design Studio to help correctly format .xaction file. Recall that an Action Sequence is an XML file as well. Directly typing the XML tags without the help of Design Studio can be tedious and error-prone. The completed XML and Action Sequence needs to be placed in the correct folders for the Pentaho sever to read and execute. The report can be viewed via web browser. Log into Pentaho and navigate to the Solutions and select the report to view. Design and Implementation The three tools that I have mentioned earlier that Pentaho provides will be used for both the flat file and database data. Flat Files Reports regarding course evaluations, course grades and GPA’s are provided in the form of flat files, which in this case, is in Excel format. Since there are several different types of reports that were loaded using flat files, the following discussion will be based on the report that displays the student rankings based on Oral and Written skills. The data must be operated on and loaded into a database in order for Pentaho to retrieve and run a report on it. The entire process – extracting, loading and managing of the data – is called a transformation (See Figure 5). By using the Data Integration tool, transformations were created for each report. For flat files, transformations include the following steps: Extract table/data from flat files Perform operations to meet reporting needs Load final table/data into a database There will be two transformations for this particular report – one for extraction and loading of Excel files and another for transforming the data. Figure 5. An example of a transformation. After data has been extracted from flat files and loaded into the database, the different types of files get joined together and loaded back into the database. The provided flat files came in eight different files. Four files were rankings based on the Spring and Fall quarters of 2006 and 2007 CS 437 class, and the other four from the Spring and Fall quarters of 2006 and 2007 CS 491 class. In total, eight classes were evaluated. The first transformation consists of loading all the files into the database as eight separate tables. A second transformation was created to alter and clean up the data so it can be ready for reporting (See Figure 5). As shown in Figure 5, the first level contains database input icons. A SQL statement can be explicitly written to tell the tool which attribute to use and what to do with it. In this case, a SQL statement was written to calculate the average ratings for all the tables. The next two levels are appending icons. It tells the tool to append one table to another. Before this, CS 437 and CS 491 rankings are still separate. The fourth level is a join icon, which joins the two classes on the “Quarter” attribute. The fifth level gets rid of any attributes that are unnecessary for reporting. Finally, the last level outputs a table that a report can be built on top of. See Figure 6 for the final output of the two transformations. Figure 6. This is the final table after undergoing some transformations. Since a reportable table now exists in the database, a report can be built using Report Designer. Many different elements can be incorporated into reports, with the most popular being graphs. This report includes a graph and a table displaying the graph data. To build the report, connection information is defined to have the graph refer to the table that was created from the transformations. Details of the reports, including type of graph, colors, and labels, were also defined. The table displaying the graph data was added by specifying which column the report needs to pull data out from. Report Designer’s main purpose is to format the look and feel of the report. Design Studio is where reports can be made a little more dynamic. The report created from Report Designer was brought into Design Studio to add filters. Instead of displaying all four quarters, the user can indicate which quarter(s) he/she wants to see. Once that is finished, entire process for making a report is complete. Figure 7. Schema for database, Surveys. Database CSNS survey information was gathered from a database. To demonstrate building reports from that, the discussion below will be based on the report displaying the average ratings of learning outcome 1 from different constituents. The schema of relative tables in the database is shown in Figure 7. Since the schema is more complex than the example from the flat file, the transformations were also much more intricate. Part of the transformation is shown in Figure 8. Data was retrieved using various types of joins. The transformation relevant to retrieving student information is shown in Figure 8. The alumni, faculty and employer information was also collected the same way. In the same transformation, the four types of newly gathered information are appended to one table, so a report can be built from it. Figure 8. Part of the transformation for retrieving data from survey information. Other than defining the transformations, the process for creating the report using Report Designer was the same as using flat files. By using transformations, data from various sources, tables and databases can be extracted and operated on together to output into one source. System Evaluation This BI system consists of 12 canned reports, each with the ability to display the report in pdf, csv, Excel, Word, and html formats. Six of them come from flat files and the other six from databases. Due to time constraints, only a couple of reports allow the users to filter out the type of information to be displayed. From transformation to report creation and generation, this system is able to demonstrate the power of BI in the reporting aspect. We were able to demonstrate the ability of an ETL tool and show an alternate and simpler way of collecting data. Moreover, this project demonstrates the ease and strengths of BI in that it can easily convert raw data into meaningful diagrams. Conclusion This project has achieved what was planned. We were able to demonstrate a different type of technology that was different from what we are familiar with. A BI system was successfully implemented and the resulting reports came out as expected. Even though we accomplished the task of learning a new technology, the experience was much more valuable. We realized that self-teaching a tool with poor documentation is a real challenge. Because of the lack of documentation and tutorials, performing simple operations on any of the three tools took a lot of research. Moreover, as time progressed and as we are more familiar with the programs, we discovered tricks and shortcuts. We not only reached our goal, but also gained valuable lessons. This project did not only prove that self-teaching a new technology is tough, but that experience plays a big role to the completion time of the final project. Appendices a) Pentaho Open Source Business Intelligence - http://www.pentaho.com