PENTAHO: AN OPEN SOURCE BUSINESS INTELLIGENCE SOLUTION

Roberto Guglielmi
Raffaella Parri

Abstract
Business Intelligence (BI) is the capability of an organization to understand its business (processes, customers, sources, systems and market opportunity areas) in order to take prompt and suitable strategic decisions. Nowadays BI solutions, whose purpose is to “provide the right information to the right person at the right moment”, are adopted by almost every company. BI solutions are based on a Data Warehouse, i.e., a system (with its associated processes and tools) that converts data produced by operational systems into information useful for analysis and processing tools. Data Warehouse systems aim to integrate, in a single database, information coming from different sources (both internal, i.e., from company operational systems, and external). Data is then made consistent and useful for companies, which can review and manage long-term trends by executing simple and quick queries. Therefore, one of the most common Business Intelligence applications is a software tool that produces reports and performs analyses on huge amounts of data. Having data without being able to read and comprehend it may be a critical cause of failure, especially in a competitive market like the current one, which is going through a worldwide crisis. The Pentaho suite provides an efficient and effective open source solution to BI problems. Many Altran Italia customers have already adopted it and are benefiting from its analysis functionalities, supported by Altran Italia specialists. Pentaho is applied across many market sectors, e.g., Public Administration, Telecommunications, Space and Defense.

1. Introduction
The Pentaho Open BI Suite platform [1, 2, 3] was developed in 2004 by a team of Business Intelligence (BI) professionals.
It is a complete BI platform, since it consists of reporting, analysis (On-Line Analytical Processing, OLAP), dashboard, data mining and data integration (Extract, Transform, Load, ETL) modules. The system may be extended with custom code and builds on open source projects such as Mondrian (a relational OLAP server), JFreeReport (for reporting), Data Integration (for ETL, formerly known as Kettle) and Weka (for data mining), which make the platform complete, scalable and high-performing. The platform is compatible with common databases (e.g., MS SQL Server, Oracle, Teradata, MySQL, AS/400, IBM DB2, SAP R/3), it is independent of the operating system and it is built on J2EE technology, compatible with widespread J2EE application servers. Every application of the platform is available through the web or web services and may be integrated in a web portal by exploiting portlet technology.

The Pentaho suite, whose current stable version is 3.8, consists of the following components:
• a BI platform that provides a framework for the execution of developed applications and common services, e.g., logging, auditing, security, scheduling, ETL, web services;
• several functionalities that include reporting, analysis, workflow, dashboards and data mining.
The platform integrates several well-known open source projects:
    • JFreeReport [4]: a Java library for report creation;
    • Mondrian OLAP [5]: an OLAP engine that allows users to interactively analyze large data sets stored in SQL databases;
    • JPivot [6]: a custom library that displays an OLAP table and lets users read data with drill down and roll up operations¹;
    • Data Integration [7]: a powerful tool for the creation of ETL processes;
    • Weka [8]: a tool for data mining activities, offering pre-processing, classification, regression, association rule and clustering algorithms, which may be used to better understand the business and to improve service performance through predictive analysis;
    • Schema Workbench [9]: a tool for the creation of OLAP structures.
• Pentaho Design Studio, a set of administration and BI solution design tools, integrated in the Eclipse environment. This tool, offering wizards and graphic functionalities, allows developers and analysts to perform OLAP analysis and to create reports and dashboards.

The Pentaho suite is based on servers, engines and other components; therefore it can be easily integrated with tools providing the following services/functionalities: J2EE server, security, portal, workflow, rules, collaboration, content management, data integration, analysis and modeling features of the system. For example, it is possible to replace the report component JFreeReport with Eclipse BIRT [10] or JasperReports [11].

¹ Drilling down (or up) is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down). A roll-up involves computing all of the data relationships for one or more dimensions. To do this, a computational relationship or formula might be defined.

Altran Italia Technology Review No. 6 - June 2011
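The drill-down and roll-up operations defined in footnote 1 can be illustrated with a minimal Python sketch. It does not use Mondrian or JPivot: the `SALES` rows, the year/quarter/product hierarchy and the helper names `aggregate` and `drill_down` are all hypothetical, chosen only to show how the same data can be read at different levels of detail.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, product, revenue).
# The first three fields form a hierarchy, from most summarized to most detailed.
SALES = [
    (2010, "Q1", "A", 100.0),
    (2010, "Q1", "B", 50.0),
    (2010, "Q2", "A", 70.0),
    (2011, "Q1", "A", 120.0),
    (2011, "Q1", "B", 30.0),
]


def aggregate(rows, depth):
    """Roll up: sum revenue grouped by the first `depth` hierarchy levels."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[:depth]] += row[-1]
    return dict(totals)


def drill_down(rows, prefix):
    """Drill down: show the next, more detailed level below the member `prefix`."""
    depth = len(prefix) + 1
    detailed = aggregate(rows, depth)
    return {key: value for key, value in detailed.items() if key[:len(prefix)] == prefix}


by_year = aggregate(SALES, 1)             # {(2010,): 220.0, (2011,): 150.0}
detail_2010 = drill_down(SALES, (2010,))  # {(2010, 'Q1'): 150.0, (2010, 'Q2'): 70.0}
```

An OLAP engine performs the same kind of grouping through queries against the database rather than in application memory; the sketch only mirrors the navigation between levels.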
The modules discussed in the next sections refer to the community edition, which can be freely downloaded from http://sourceforge.net. Some of these modules (e.g., Data Integration) are also part of the enterprise edition, which is available from the official site http://www.pentaho.com.

2. Suite components
The components of the Pentaho BI Suite are shown in Figure 1 and are discussed in the following sub-sections.

Figure 1. Components of Pentaho Open BI Suite. Source: Adaptation from http://www.pentaho.org/

2.1. User Console
Through the User Console (completely customizable in graphics, logo, colors, etc.) users can browse the directories, access the stored solutions (reports, OLAP analysis, dashboards) and create new ones using a wizard (ad hoc report) based on business use cases and graphic formats. Solutions created by the user may be published and shared, according to the operations permitted by the user's profile. Figures 2, 3, and 4 show different functionalities offered by the User Console.

Figure 2. User Console. Main window.
Figure 4. User Console. Analysis Report (OLAP).

2.2. Integration with third party applications
In the platform, some agents make it possible to interface Pentaho with data sources of different types: ERP/CRM, legacy systems, local data in any format (text, delimited text, xls, etc.) and other applications that can be accessed through a DB or web services. Moreover, it is possible to develop ad hoc connectors in order to interface Pentaho with any DB or application.

2.3. Pentaho Data Integration
This module is an effective ETL tool and it uses an innovative approach based on metadata, which makes it possible to declare ETL processes in terms of what to do, without specifying how to do it. In this way, the user can focus on creating complex transformations or jobs², using a drag-and-drop graphic environment, and does not need specific skills in writing custom code.
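The idea of declaring what an ETL process does while leaving the how to an engine can be sketched in a few lines of Python. This is an illustration only and not the Data Integration API: the step names (`filter`, `rename`, `cast`), the `TRANSFORMATION` structure, the `run` function and the sample CSV data are all assumptions made for this example.

```python
import csv
import io

# Declarative description of a transformation: each entry states WHAT to do.
# The step vocabulary below is hypothetical, not a list of real Kettle step types.
TRANSFORMATION = [
    {"step": "filter", "field": "country", "equals": "IT"},
    {"step": "rename", "from": "amt", "to": "amount"},
    {"step": "cast", "field": "amount", "type": float},
]


def run(transformation, rows):
    """Tiny engine: decides HOW each declared step is executed on the rows."""
    for spec in transformation:
        kind = spec["step"]
        if kind == "filter":
            rows = [r for r in rows if r[spec["field"]] == spec["equals"]]
        elif kind == "rename":
            rows = [
                {(spec["to"] if key == spec["from"] else key): value for key, value in r.items()}
                for r in rows
            ]
        elif kind == "cast":
            rows = [{**r, spec["field"]: spec["type"](r[spec["field"]])} for r in rows]
        else:
            raise ValueError(f"unknown step: {kind}")
    return rows


# Extract from an in-memory CSV source, then transform; loading is left out.
SOURCE = "id,country,amt\n1,IT,10.5\n2,FR,3.0\n3,IT,4.5\n"
extracted = list(csv.DictReader(io.StringIO(SOURCE)))
loaded = run(TRANSFORMATION, extracted)
```

Because the transformation is plain data, it could be stored, edited by a graphic tool or optimized before execution without touching the engine code, which is the separation the metadata-based approach relies on.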
It is up to the Data Integration engine to optimize the transformations and jobs designed in the graphic environment, translating them into commands and instruction code that the system can execute. The module has the following main features:
• a library of more than 80 objects is available to build transformations and jobs;
• plug-in connectors are available for integration with the major commercial ERP systems;
• there are more than 30 plug-ins for database connections (both commercial and open source) as well as for file reading, e.g., flat files³, xls, XML, etc.;
• plug-ins can be extended by ad hoc development, and third party or open source packages can be included;
• the module is 100% Java with broad, cross-platform support;
• the module supports enterprise class performance and scalability, and Massively Parallel Processing (MPP)⁴ through transformation cluster execution;
• data, user interface and metadata are fully separated;
• it fully integrates with any other Pentaho Open BI tool for scheduling, workflow, reporting and analysis.

One of the most useful features is the possibility of simultaneously using heterogeneous data sources. For example, it is possible to design ETL processes that take as input different DB instances (e.g., of different technologies/vendors), a text table or a web service (e.g., a query joining an Oracle table and an Excel file); the data are cross-checked, transformed and subsequently loaded into the destination DB. Figure 5 shows an example of a transformation that merges two different data streams into a single output.

Figure 3. User Console. Ad Hoc Report.
Figure 5. Data Integration. Example of transformation.

² A transformation is a process involving the application of a set of rules or functions to the extracted data. A job is a set of transformations or other instructions.
³ Flat files are data files that contain records with no structured relationships.
⁴ Massively Parallel Processing is, essentially, a computer system with many independent processing units that run in parallel.

2.4. Metadata Editor
In every BI tool, the physical schema according to which data are stored is masked by an interface that provides users with an abstract view of information and helps them define queries for data extraction and analysis. Metadata Editor (Figure 6) is a Pentaho component specifically designed for this task. Data to be presented to final users are organized in Business Views, which contain logical groupings of objects (categories) that represent data mapped or derived from a DB.

Figure 6. Metadata Editor. Example of Metadata.

2.5. Pentaho Reporting
The most common way of publishing information to final users is by creating reports: in a BI environment, 75-80% of users use reports, while the remaining 15-20% use OLAP analysis tools. Pentaho Reporting (Figure 7) is a tool that allows developers to produce reports for final users. It can access data in relational, OLAP or XML format, and it produces output in well-known formats, e.g., Adobe PDF, HTML and Microsoft Excel. Pentaho Reporting allows the integration of any kind of document; therefore, it presents a high level of scalability. It supports both JDBC and JNDI connections to the most widespread databases, such as Oracle, DB2, MySQL, MS SQL Server, etc. If it is interfaced with metadata, it further simplifies the work of developers, since they can create data extraction queries using business objects, without knowing the physical data schema. Pentaho Reporting contains the JFreeReport library.

Figure 7. Pentaho Reporting.

2.6. Pentaho Analysis
The term OLAP (On-Line Analytical Processing) refers to a set of software techniques for the interactive and rapid analysis of huge amounts of data. Pentaho Analysis (Figure 8) is a powerful tool supporting the so-called knowledge workers, whose tasks involve developing or using knowledge: it helps them to operate effectively by acquiring the view and knowledge they need to take decisions. This tool allows interactive analysis of data by providing a rich interface where several analysis dimensions (e.g., time, product, customer) and some flags are displayed. Unlike a reporting tool, it does not need pre-defined queries to retrieve results: data are analyzed by means of drag, drop and drill operations and are then retrieved through complex analytic queries. Moreover, it is possible to swap rows and columns to analyze data from a different point of view. The Pentaho Analysis server operates on OLAP cubes whose description is contained in an XML file. Free graphic tools (e.g., Schema Workbench) allow cubes to be designed and published to the Pentaho server (Figure 9). Pentaho Analysis is based on Mondrian OLAP and JPivot.

Figure 8. Pentaho Analysis (OLAP).
Figure 9. Schema Workbench.

2.7. Pentaho Dashboard
As far as BI applications are concerned, a dashboard is a software solution that shows final users high level Key Performance Indicators (KPI) through attractive and intuitive interfaces. Most dashboards contain only graphical content: instead of numbers, parameters are represented by images, counters, dials and, sometimes, charts. The purpose is to provide an overview of a wide business area, allowing managers to visualize the business state at a glance and to analyse it.
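The way a dashboard replaces numbers with symbols can be sketched as a simple mapping from KPI values to traffic-light states. The Python example below is a hypothetical illustration, not Pentaho Dashboard code: the function `kpi_status`, the 90% warning threshold and the sample KPI figures are assumptions made for this sketch, and it treats every KPI as "higher is better".

```python
def kpi_status(actual, target, warning_ratio=0.9):
    """Map a KPI value to a traffic-light symbol (higher values are better)."""
    if target <= 0:
        raise ValueError("target must be positive")
    ratio = actual / target
    if ratio >= 1.0:
        return "green"   # target reached
    if ratio >= warning_ratio:
        return "yellow"  # close to target
    return "red"         # clearly below target


# Hypothetical KPI readings: name -> (actual value, target value).
KPIS = {
    "revenue": (1_050_000, 1_000_000),
    "new_customers": (46, 50),
    "on_time_delivery": (0.80, 0.95),
}

dashboard = {name: kpi_status(actual, target) for name, (actual, target) in KPIS.items()}
# {'revenue': 'green', 'new_customers': 'yellow', 'on_time_delivery': 'red'}
```

A real dashboard would render each state as an image or gauge and link it to the underlying report or OLAP view for further analysis.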
All the dashboard functionalities described above are offered by Pentaho Dashboard: its integration with the Pentaho Reporting and Pentaho Analysis modules allows users to perform drill operations on a single analysis object, in order to better identify the factors that have contributed to positive and/or negative results. Figure 10 and Figure 11 show dashboard examples; furthermore, it is possible to integrate data with geographical databases (e.g., Google Maps).

Figure 10. Pentaho Dashboard.
Figure 11. Pentaho Dashboard. Integration with Google maps.

2.8. Pentaho Weka
As far as Data Mining analytic functionalities are concerned, the platform is based on the tool Weka, written in Java and distributed under the GNU General Public License. This tool includes advanced preprocessing components, learning algorithms and evaluation methods that can be invoked through graphic interfaces in order to create complex function flows. Weka has several graphical interfaces which make access to basic functionalities easier. The principal interface is Explorer, shown in Figure 12, which presents different panels associated with different data mining tasks. This tool may load data from several sources such as files, URLs and databases. The supported file formats include Weka ARFF, CSV, LibSVM and C4.5. The majority of the tasks that can be addressed with Explorer can also be managed through the Knowledge Flow graphic interface, shown in Figure 13. Pentaho Weka also provides an environment for comparing the results obtained with different mining algorithms or data sets.

Figure 12. Pentaho Weka Explorer.
Figure 13. Pentaho Weka Knowledge Flow.

2.9. Pentaho Design Studio
In order to create analytical documents (called Pentaho solutions), it is possible to use Design Studio (Figure 14), a plug-in for the well-known Eclipse IDE, which Pentaho recommends as development environment.
The business logic is defined inside the Process Action environment, which lists the possible actions to execute: queries, reports, OLAP analysis, ETL processes, message printing, JavaScript code execution, charts, e-mail sending, etc.

Figure 14. Pentaho Design Studio.

2.10. Pentaho BI Platform
The Pentaho BI platform provides the architecture and infrastructure required to build BI solutions. The platform is driven by a workflow engine and can be easily integrated in business processes. The workflow engine manages the main services, such as authentication, logging, auditing, web services, security, scheduling, etc.

3. Architecture
The Pentaho engine is based on Pentaho Server. Pentaho Server can be executed in any J2EE application server, specifically Apache, JBoss AS, WebSphere, WebLogic and Oracle AS. Figure 15 provides an overview of the platform architecture. Pentaho Server contains the engine underlying the components that produce reports, analysis, business rules, workflow and email/desktop notifications. These modules are integrated in order to be able to solve any BI problem. In a BI solution, the behavior, inter-operation, and user interaction of each subsystem are defined by a collection of documents called Solution Definition. They are XML documents that:
• define business processes (in XPDL format),
• indicate the activities that have to be carried out when called by a web service; the activity definitions include: data sources, queries, report templates, delivery and notification rules, business rules, dashboards, and analytic views.

The specific features of this architecture are:
• modular components, which may be replaced or added;
• a control system using the Simple Network Management Protocol (SNMP);
• repositories that may be hosted even on an RDBMS external to the platform. It is possible to use both open source databases (e.g., MySQL, Firebird) and proprietary ones (e.g., Oracle, SQL Server, DB2);
• security functionalities including role-based security, business rules, and logging; Java Open Single Sign-On (JOSSO) is supported and LDAP may be used for external security system integration;
• integration of several well-known open source components, such as: OLAP Server and JPivot Analysis Front-End, MySQL RDBMS, Shark and JaWE Workflow, Data Integration (ETL), Tomcat Application server, Hibernate and Portal, Weka Data Mining, Eclipse Workbench and BIRT reporting components, JOSSO single sign-on and LDAP integration, Mozilla Rhino JavaScript Processor;
• usage of international standards and protocols: XML: W3C’s Extensible Markup Language, JSR-94: JCP’s Rules Engine API, JSR-168: JCP’s Portlet Specification, SVG: W3C’s Scalable Vector Graphics, XPDL: WfMC’s XML Process Definition Language, XForms: W3C’s Web Forms, MDX: Microsoft’s OLAP Query Language, WSBPEL: OASIS’s Web Services Business Process Execution Language, WSDL: W3C’s Web Services Description Language.

Figure 15. Pentaho Architecture. Source: Adaptation from http://wiki.pentaho.com/display/ServerDoc1x/Architecture

4. Solutions
Solutions in the Pentaho architecture represent the set of functionalities available to final users. Typically, they consist of reports, dashboards and OLAP analysis. Analysis solutions are analytical documents stored in a repository and accessible through a web interface. Analytical documents are based on Pentaho Action Sequences: XML documents, loaded and interpreted by the platform, that define the smallest complete task that the solution engine can perform. An Action Sequence defines input parameters and the actions to be performed in sequence. Some Action Sequences may be fed with parameters coming from different sources, and several Action Sequences may be chained so that the output of one becomes the input of another, creating flexible applications.

5. Business Applications
In order to fully exploit the potential of the open source Pentaho BI suite, the authors and their team have been engaged in the creation of both stand-alone solutions (e.g., Job Market Watches) and solutions integrated into existing applications (e.g., Security Assessment tools, Document Management tools). Customizations have been performed on Pentaho, involving code changes for the creation of report access panels, the restructuring of the graphic layout of OLAP cubes, and the personalization of home page portals with company logos, introductory and descriptive sections, images and texts, and personalized access to the toolbar page and menu bar. In the future, the implementation of dashboards integrated with Google Maps is planned, in order to provide statistical information obtained by combining Google Maps georeferencing services with Pentaho platform solutions.

6. Conclusions
Pentaho represents an effective open source alternative to the most common commercial BI platforms, and even a better solution, able to satisfy any information need, since it enriches analysis with workflow functionalities and with integrable process management. This allows advanced BI technologies to be exploited to perform business operations such as purchase and income analysis, customer analysis, HR reporting, financial reporting, KPI dashboards, Supply Chain analysis and operational reports. Furthermore, this solution presents advantages in terms of:
• lower license costs,
• integration with third party products,
• intuitive, interactive and customizable access,
• open technology sharing.
These characteristics make the product suitable for developing comprehensive solutions.
Glossary
BI      Business Intelligence
DWH     Data Warehouse
DB      Database
OLAP    On-Line Analytical Processing
ETL     Extract, Transform, Load
KPI     Key Performance Indicator
XPDL    XML Process Definition Language
SNMP    Simple Network Management Protocol
RDBMS   Relational Database Management System
JOSSO   Java Open Single Sign-On

Biography
Roberto Guglielmi is a Professional Consultant at Altran Italia, which he joined in June 2005. Always interested in technology, in 1990 he began working on ICT projects in several companies, where he gained extensive and varied knowledge of different market sectors, with a focus on Private Banking. Over the last few years, he has worked as Senior Technical Analyst and System Integrator on projects in different areas (mainly Telecommunications, Aerospace and Defense, Energy). He is currently involved in Business Intelligence solutions and projects (both proprietary platforms and open source solutions) in the Financial Services market sector. He also collaborates with Altran Italia Business Intelligence Solution, whose mission is to offer support to all BI projects.

Biography
Raffaella Parri has worked at Altran Italia since March 2006. She began her professional career as a software developer, and she was subsequently employed as functional and technical analyst. Currently, she is engaged in Business Intelligence projects, covering different operational areas and different market sectors: Risk Analysis in the field of Security Assessment for Telecommunications, Enterprise Performance Management and Enterprise Data Warehouse in the field of Public Administration, BI Application in the field of Aerospace and Defense. She specializes in the open source platform Pentaho and she is mainly involved in Multidimensional Data Modeling and Report Design, planning and design of ETL processes, Functional Analysis and DWH Design. She collaborates with Altran Italia Business Intelligence Solution, whose mission is to support all BI projects.
Bibliography
[1]. R. Bouman, J. Dongen. “Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL”. Wiley, 2009.
[2]. Pentaho Community Wiki, http://wiki.pentaho.com
[3]. R. Talamo, “White Paper Pentaho Piattaforma di Business Intelligence” (internal document, in Italian).
[4]. JFreeReport, http://wiki.pentaho.com/display/Reporting/JFreeReport+0.9
[5]. Mondrian OLAP, http://mondrian.pentaho.com/
[6]. JPivot, http://jpivot.sourceforge.net/
[7]. Data Integration (Kettle), http://kettle.pentaho.com/
[8]. Weka, http://weka.pentaho.com
[9]. Schema Workbench, http://mondrian.pentaho.com/documentation/workbench.php
[10]. BIRT, http://www.eclipse.org/birt/
[11]. JasperReports, http://jasperforge.org/projects/jasperreports