PENTAHO: AN OPEN SOURCE
BUSINESS INTELLIGENCE SOLUTION
Roberto Guglielmi
Raffaella Parri
Abstract
Business Intelligence (BI) is the capability of an
organization to understand its business (processes,
customers, sources, systems and market opportunity
areas) in order to take prompt and suitable strategic
decisions. Nowadays, BI solutions, whose purpose is to
“provide the right information to the right person at the
right moment”, are adopted by almost every company.
BI solutions are based on a Data Warehouse, i.e., a
system (with its associated processes and tools) that
converts data produced by operational systems into
useful information for analysis and elaboration tools. Data
Warehouse systems aim to integrate, in a single
Database, information coming from different sources
(both internal, i.e., from company operational systems,
and external). Data is thus made consistent and useful, so
that companies can review and manage long-term trends
by executing simple and quick queries.
Therefore, one of the most common Business Intelligence
applications is a software tool that produces reports and
performs analyses on huge amounts of data.
Having data without being able to read and comprehend it
may be a critical cause of failure, especially in a
competitive market like the current one, which is going
through a worldwide crisis.
The Pentaho suite provides an efficient and effective open
source solution to BI problems. Many Altran Italia
customers have already adopted it and are benefiting
from its analysis functionalities, supported by
Altran Italia specialists. Pentaho is applied across
many market sectors, from Public Administration
to Telecommunications, Space and Defense.
1. Introduction
The Pentaho Open BI Suite platform [1, 2, 3] was developed in
2004 by a team of Business Intelligence (BI) professionals. It is a
complete BI platform, since it consists of reporting, analysis (On-Line
Analytical Processing, OLAP), dashboards, data mining and
data integration (Extract, Transform, Load, ETL) modules. The
system may be extended with custom code and builds on open
source projects such as Mondrian (a relational OLAP server), JFreeReport
(for reporting), Data Integration (for ETL, formerly known
as Kettle) and Weka (for data mining), which make the platform
complete, scalable and high-performing.
The platform is compatible with common Databases (e.g., MS SQL
Server, ORACLE, Teradata, MySQL, AS/400, IBM DB2, SAP
R3/System, etc.), it is independent of the operating system and it is
built with J2EE technology, compatible with widespread J2EE
Application Servers. Every application of the platform is available
through the web or web services and may be integrated in a web
portal, exploiting the portlet technology.
The Pentaho suite, whose current stable version is 3.8, consists
of the following components:
• a BI platform that provides a framework for the execution of
developed applications and common services, e.g., logging,
auditing, security, scheduling, ETL, web services;
• several functionalities that include reporting, analysis, workflow,
dashboards and data mining. The platform integrates different
well-known open source projects:
  - JFreeReport [4]: a Java library for report creation;
  - Mondrian OLAP [5]: an OLAP engine, which allows users to interactively analyze large data sets contained in SQL databases;
  - JPivot [6]: a custom library that allows users to visualize an OLAP
    table and read data with drill-down and roll-up operations (1);
  - Data Integration [7]: a powerful instrument for ETL process
    creation;
  - Weka [8]: a tool for data mining activities, offering pre-processing, classification, regression, association
    rules and clustering algorithms, which may be used to
    better understand the business and to improve service performance through predictive analysis;
  - Schema Workbench [9]: an instrument for the creation of
    OLAP structures;
• Pentaho Design Studio, a set of administration and BI solution
design tools, integrated in the Eclipse environment. This tool,
offering wizard and graphic functionalities, allows developers
and analysts to perform OLAP analysis and to create reports
and dashboards.
The Pentaho suite is based on servers, engines and other components; therefore it can be easily integrated with tools providing
the following services/functionalities: J2EE server, security, portal,
workflow, rules, collaboration, content management, data integration,
analysis and modeling features of the system. For example, it is
possible to replace the report component JFreeReport with Eclipse
BIRT [10] or JasperReports [11]. The modules discussed in the
next sections refer to the community edition, which can be
freely downloaded from http://sourceforge.net. Some of these
modules are also part of the enterprise version (e.g., Data Integration), which is available from the official site http://www.pentaho.com.
(1) Drilling down (or up) is a specific analytical technique whereby the user navigates among
levels of data ranging from the most summarized (up) to the most detailed (down).
A roll-up involves computing all of the data relationships for one or more dimensions.
To do this, a computational relationship or formula might be defined.
Altran Italia Technology Review No. 6 - June 2011
2. Suite components
The components of the Pentaho BI Suite are shown in Figure 1 and
discussed in the following sub-sections.
Source: Adaptation from http://www.pentaho.org/
Figure 1. Components of Pentaho Open BI Suite.
2.1. User Console
Through the User Console (completely customizable in graphics,
logo, colors, etc.) users can browse the directories, access the
stored solutions (reports, OLAP analyses, dashboards) and create
new ones using a wizard (ad hoc report) based on business use
cases and graphic formats. Solutions created by the user may be
published and shared, according to the operations permitted by
the user's profile. Figures 2, 3, and 4 show different functionalities
offered by the User Console.
Figure 2. User Console. Main window.
Figure 4. User Console. Analysis Report (OLAP).
2.2. Integration with third party applications
In the platform, some agents allow Pentaho to interface with data
sources of different types: ERP/CRM, legacy systems, local data in
any format (text, delimited text, xls, etc.) and other applications
that can be accessed through a DB or web services. Moreover, it is
possible to develop ad hoc connectors in order to interface
Pentaho with any DB or application.
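As a minimal sketch of what integrating heterogeneous sources involves, the following Python example reads one stream from delimited text and another from a relational table, joins them on a shared key and loads the result into a destination table. It does not use Pentaho: the table names, column names and sample values are hypothetical, and SQLite merely stands in for any DB.

```python
import csv
import io
import sqlite3

# Hypothetical delimited-text source (stands in for a local file).
CUSTOMERS_CSV = "customer_id;name\n1;Alpha\n2;Beta\n"


def read_delimited(text, delimiter=";"):
    """Extract: yield one dict per row of a delimited-text source."""
    yield from csv.DictReader(io.StringIO(text), delimiter=delimiter)


def read_table(conn, query):
    """Extract: yield one dict per row returned by a SQL query."""
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


def join_streams(customers, orders):
    """Transform: look up each order's customer name and total the amounts per name."""
    names = {int(c["customer_id"]): c["name"] for c in customers}
    totals = {}
    for order in orders:
        name = names.get(order["customer_id"], "UNKNOWN")
        totals[name] = totals.get(name, 0.0) + order["amount"]
    return sorted(totals.items())


def load(conn, rows):
    """Load: write the joined rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales_by_customer (name TEXT, total REAL)")
    conn.executemany("INSERT INTO sales_by_customer VALUES (?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the whole extract-transform-load flow on in-memory sample data."""
    source = sqlite3.connect(":memory:")   # hypothetical operational DB
    source.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)",
                       [(1, 10.0), (2, 5.5), (1, 4.5), (3, 1.0)])
    target = sqlite3.connect(":memory:")   # hypothetical destination DB
    rows = join_streams(read_delimited(CUSTOMERS_CSV),
                        read_table(source, "SELECT customer_id, amount FROM orders"))
    load(target, rows)
    return target.execute(
        "SELECT name, total FROM sales_by_customer ORDER BY name").fetchall()


if __name__ == "__main__":
    print(run_etl())
```

Orders whose customer is missing from the text source are kept under an "UNKNOWN" label rather than dropped; this is only one possible cross-checking policy, chosen here for brevity.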
Figure 3. User Console. Ad Hoc Report.
2.3. Pentaho Data Integration
This module is an effective ETL tool that uses an innovative
approach based on Metadata, which allows ETL processes to be declared
in terms of what to do, without specifying how to do it. In this way,
the user can focus on creating complex transformations or jobs (2)
in a drag-and-drop graphic environment, without needing
specific skills in writing custom code. The Data Integration engine
is responsible for optimizing the transformations and
jobs designed in the graphic environment and for translating them into
commands and instructions that the system can execute.
The module presents the following main features:
• a library of more than 80 objects is available to build transformations
and jobs;
• plug-in connectors are available for integration with major commercial ERP systems;
• there are more than 30 plug-ins for Database (both commercial
and open source) connections as well as file reading, e.g., flat
files (3), xls, XML, etc.;
• plug-ins can be extended by ad hoc development
or by including third party or open source packages;
• the module is 100% Java with broad, cross-platform support;
• the module supports enterprise class performance and scalability,
and Massively Parallel Processing (MPP) (4) through transformation
cluster execution;
• data, user interface and metadata are fully separated;
• it fully integrates with any other Pentaho Open BI tool for
scheduling, workflow, reporting and analysis.
One of the most useful features is the possibility of simultaneously
using heterogeneous data sources. For example, it is possible to
design ETL processes that take as input different DB
instances (e.g., of different technologies/vendors), a text table or
a web service (e.g., a query joining an Oracle table and an Excel
file); the data are cross-checked, transformed and subsequently
loaded into the destination DB. Figure 5 shows an example of a
transformation merging two different data streams into a
single output.
(2) A transformation is a process involving the application of a set of rules or functions
to the extracted data. A job is a set of transformations or other instructions.
2.4. Metadata Editor
In every BI tool, the physical schema according to which data are
stored is masked by an interface that provides users with an abstract view of the information and helps them define queries
for data extraction and analysis. The Metadata Editor (Figure 6) is a
Pentaho component specifically designed for this task. Data to be
presented to final users are organized in Business Views, which
contain logical groupings of objects (categories) that represent
data mapped or derived from a DB.
Figure 6. Metadata Editor. Example of Metadata.
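The idea behind the Metadata Editor, i.e., masking the physical schema behind business names, can be illustrated with a short Python sketch. It is not Pentaho's metadata format: the business view, the cryptic physical table and column names, and the helper functions are all hypothetical, and SQLite is used only to make the example runnable.

```python
import sqlite3

# Hypothetical business view: each category groups business objects,
# and each business object maps to a physical column or expression.
BUSINESS_VIEW = {
    "Customer": {"Customer Name": "cust.c_nm", "Country": "cust.c_ctry"},
    "Sales": {"Revenue": "SUM(ord.o_amt)"},
}
# Physical join hidden from the business user (hypothetical schema).
FROM_CLAUSE = "t_cust AS cust JOIN t_ord AS ord ON ord.o_cust = cust.c_id"


def resolve(business_name):
    """Return the physical expression mapped to a business object name."""
    for objects in BUSINESS_VIEW.values():
        if business_name in objects:
            return objects[business_name]
    raise KeyError(f"unknown business object: {business_name}")


def build_query(selected):
    """Translate a list of business object names into a SQL statement."""
    exprs = [resolve(name) for name in selected]
    select = ", ".join(f'{e} AS "{n}"' for e, n in zip(exprs, selected))
    # Non-aggregated columns must be grouped when an aggregate is selected.
    group_by = [e for e in exprs if not e.startswith("SUM(")]
    sql = f"SELECT {select} FROM {FROM_CLAUSE}"
    if group_by and len(group_by) < len(exprs):
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


def demo():
    """Build a query from business names and run it on sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t_cust (c_id INTEGER, c_nm TEXT, c_ctry TEXT);
        CREATE TABLE t_ord (o_cust INTEGER, o_amt REAL);
        INSERT INTO t_cust VALUES (1, 'Alpha', 'IT'), (2, 'Beta', 'FR');
        INSERT INTO t_ord VALUES (1, 10.0), (1, 5.0), (2, 7.0);
    """)
    sql = build_query(["Customer Name", "Revenue"])
    return sorted(conn.execute(sql).fetchall())
```

The user of `build_query` only names "Customer Name" and "Revenue"; the join and the physical identifiers stay inside the mapping, which is the separation the section describes.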
Figure 5. Data Integration. Example of transformation.
(3) Flat files are data files that contain records with no structured relationships.
(4) Massively Parallel Processing is, essentially, a computer system with many independent
processing units that run in parallel.
2.5. Pentaho Reporting
The most common way of publishing information to final users is
by creating reports: in a BI environment, 75-80% of users use
reports, while the remaining 15-20% use OLAP analysis tools.
Pentaho Reporting (Figure 7) is a tool that allows developers to
produce reports for final users. It can access data in
relational, OLAP or XML format, and it returns
output in well-known formats, e.g., Adobe PDF, HTML and Microsoft
Excel. Pentaho Reporting allows any kind of document to be integrated,
therefore it offers a high level of scalability. It supports both
JDBC and JNDI connections with the most widespread Databases,
like Oracle, DB2, MySQL, MS-SQL Server, etc. When it is interfaced
with metadata, it further simplifies the work of developers,
since they can create data extraction queries using business
objects, without knowing the physical data schema. Pentaho
Reporting contains the JFreeReport library.
Figure 7. Pentaho Reporting.
2.6. Pentaho Analysis
The term OLAP (On-Line Analytical Processing) refers to a set of
software techniques for the interactive and rapid analysis of huge
amounts of data.
Pentaho Analysis (Figure 8) is a powerful tool supporting the so-called knowledge workers, whose tasks involve developing or using
knowledge: it helps them to operate effectively by acquiring the
view and knowledge they need to take decisions. This tool allows
interactive analysis of data, by providing a rich interface where
several analysis dimensions (e.g., time, product, customer) and
some flags are reported. Unlike a reporting tool, it does
not need pre-defined queries in order to retrieve the results:
data are analyzed by means of drag, drop and drill operations and
are then retrieved through complex analytic queries. Moreover,
it is possible to swap rows with columns to analyze data from a
different point of view. The Pentaho Analysis server operates on OLAP
cubes whose description is contained in an XML file. Free graphic
tools (e.g., Schema Workbench) allow users to design and publish cubes
into the Pentaho server (Figure 9).
Pentaho Analysis is based on Mondrian OLAP and JPivot.
Figure 9. Schema Workbench.
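The OLAP operations mentioned for Pentaho Analysis (roll-up, drill-down, and swapping rows with columns) can be sketched in a few lines of plain Python. This is only an illustration of the operations themselves, not of Mondrian or JPivot; the fact rows, dimension names and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, product, amount).
FACTS = [
    (2010, "Q1", "A", 100), (2010, "Q1", "B", 50),
    (2010, "Q2", "A", 70), (2011, "Q1", "A", 30),
    (2011, "Q1", "B", 20),
]
# Position of each dimension inside a fact row; the measure is at index 3.
DIMS = {"year": 0, "quarter": 1, "product": 2}


def aggregate(facts, dims, where=None):
    """Sum the measure grouped by `dims`, optionally filtered by `where`."""
    where = where or {}
    totals = defaultdict(int)
    for row in facts:
        if all(row[DIMS[d]] == v for d, v in where.items()):
            totals[tuple(row[DIMS[d]] for d in dims)] += row[3]
    return dict(totals)


def pivot(cells):
    """Swap the two grouping dimensions of a 2-D result (rows <-> columns)."""
    return {(b, a): value for (a, b), value in cells.items()}


# Roll-up: the most summarized level, one total per year.
by_year = aggregate(FACTS, ["year"])
# Drill-down: expand the 2010 member into its quarters.
quarters_2010 = aggregate(FACTS, ["quarter"], where={"year": 2010})
# Pivot: a year x product result becomes product x year.
by_product_year = pivot(aggregate(FACTS, ["year", "product"]))
```

A real OLAP engine answers such requests from a cube definition and generated SQL rather than by scanning rows in memory; the sketch only shows that each interactive gesture corresponds to a different grouping or filter over the same facts.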
Figure 8. Pentaho Analysis (OLAP).
2.7. Pentaho Dashboard
As far as BI applications are concerned, a dashboard is a software
solution that shows high level Key Performance
Indicators (KPI) to final users through attractive and intuitive interfaces. Most
dashboards contain only graphical content: instead of using
numbers, parameters are represented by images, counters,
dials and, sometimes, charts. The purpose is to provide an
overview of a wide business area, allowing managers to visualize
the business state at a glance and to analyse it. All these functionalities are offered by Pentaho Dashboard: its integration with
the Pentaho Reporting and Pentaho Analysis modules allows users to
perform drill operations on a single analysis object, in order to
better identify the factors that have contributed to
positive and/or negative results. Figure 10 and Figure 11 show
dashboard examples; furthermore, it is possible to integrate data
with geographical databases (e.g., Google Maps).
Figure 10. Pentaho Dashboard.
Figure 11. Pentaho Dashboard. Integration with Google Maps.
2.8. Pentaho Weka
As far as Data Mining analytic functionalities are concerned, the
platform is based on the tool Weka, written in Java and distributed
under the GNU General Public License. This tool includes advanced pre-processing components, learning algorithms and evaluation methods
that can be invoked through graphic interfaces, in order to create
complex function flows. Weka has several graphical interfaces
which make access to the basic functionalities easier.
The principal interface is Explorer, shown in Figure 12, which presents
different panels associated with different data mining tasks. This
tool may load data from several sources like files, URLs and databases. The supported file formats include Weka ARFF, CSV,
LibSVM and C4.5. The majority of the tasks that may be addressed
with Explorer may also be managed through the graphic interface
Knowledge Flow, shown in Figure 13.
Pentaho Weka is provided with an environment for comparing the
results of various mining algorithms or data sets.
Figure 12. Pentaho Weka Explorer.
Figure 13. Pentaho Weka Knowledge Flow.
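To give an idea of the ARFF format that Weka loads, the following Python sketch parses a very small ARFF document into its relation name, attribute list and data rows. It is an assumption-laden simplification written for this illustration, not Weka's own loader: it handles only numeric and nominal attributes and ignores quoting, sparse rows, missing values and date or string types. The data set itself is hypothetical.

```python
ARFF_TEXT = """\
% Hypothetical data set, for illustration only.
@relation customers
@attribute age numeric
@attribute segment {retail,corporate}
@attribute churned {yes,no}
@data
34,retail,no
51,corporate,yes
"""


def parse_arff(text):
    """Parse a minimal ARFF document (numeric and nominal attributes only)."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # blank line or comment
            continue
        lower = line.lower()
        if in_data:
            # Data section: one comma-separated instance per line.
            values = [v.strip() for v in line.split(",")]
            rows.append([float(v) if kind == "numeric" else v
                         for v, (_, kind) in zip(values, attributes)])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            kind = "numeric" if kind.lower() == "numeric" else "nominal"
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
```

The header declares the attributes once, and every data row is then interpreted against that declaration, which is what distinguishes ARFF from a plain CSV file.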
2.9. Pentaho Design Studio
In order to create analytical documents (called Pentaho solutions),
it is possible to use Design Studio (Figure 14), a plug-in for the
well-known Eclipse IDE, which Pentaho suggests as the development environment. The business logic is defined inside the Process Action
environment, which lists the possible actions to execute: queries,
reports, OLAP analyses, ETL processes, message printing,
Javascript code execution, charts, e-mail sending, etc.
Figure 14. Pentaho Design Studio.
2.10. Pentaho BI Platform
The Pentaho BI platform provides the architecture and infrastructure
required to build BI solutions. The platform is driven by a
workflow engine and it can be easily integrated in business
processes. The workflow manages the main services, such as
authentication, logging, auditing, web services, security,
scheduling, etc.
3. Architecture
The Pentaho engine is based on Pentaho Server. Pentaho Server can
be executed in any J2EE application server, for example Apache Tomcat,
JBOSS AS, Websphere, WebLogic and Oracle AS. Figure 15
provides an overview of the platform architecture.
Pentaho Server contains the engine underlying the components
that produce reports, analyses, business rules, workflow and e-mail/desktop notifications. These modules are integrated so that
they can be combined to solve BI problems. In a BI solution, the behavior,
inter-operation, and user interaction of each subsystem are defined
by a collection of documents called Solution Definition. They are
XML documents that:
• define business processes (in XPDL format),
• indicate the activities that have to be carried out when called by
a web service; the activity definitions include: data sources,
queries, report templates, delivery and notification rules, business
rules, dashboards, and analytic views.
The specific features of this architecture are:
• modular components, which may be replaced or added;
• a control system using the Simple Network Management Protocol
(SNMP);
• repositories that may be hosted even on an RDBMS external to
the platform. It is possible to use both open source databases
(e.g., MySQL, FireBird) and proprietary ones (e.g., Oracle, SQL
Server, DB2);
• security functionalities including role-based security, business
rules, and logging; Java Open Single Sign-On (JOSSO) is
supported and LDAP may be used for external security system
integration;
• integration of several well-known open source components,
such as:
  - Mondrian OLAP Server and JPivot Analysis Front-End,
  - MySQL RDBMS,
  - Shark and JaWE Workflow,
  - Data Integration (ETL),
  - Tomcat Application server, Hibernate and Portal,
  - Weka Data Mining,
  - Eclipse Workbench and BIRT reporting components,
  - JOSSO single sign-on and LDAP integration,
  - Mozilla Rhino Javascript Processor;
• usage of international standards and protocols:
  - XML: W3C’s Extensible Markup Language,
  - JSR-94: JCP’s Rules Engine API,
  - JSR-168: JCP’s Portlet Specification,
  - SVG: W3C’s Scalable Vector Graphics,
  - XPDL: WfMC’s XML Process Definition Language,
  - XForms: W3C’s Web Forms,
  - MDX: Microsoft’s OLAP Query Language,
  - WSBPEL: OASIS’s Web Services Business Process Execution
    Language,
  - WSDL: W3C’s Web Services Description Language.
Source: Adaptation from http://wiki.pentaho.com/display/ServerDoc1x/Architecture
Figure 15. Pentaho Architecture.
4. Solutions
Pentaho solutions represent the set of functionalities
available to final users. Typically, they consist of reports, dashboards
and OLAP analyses. Analysis solutions are analytical documents
stored in a repository and accessible through a web interface. Analytical documents are based on Pentaho Action Sequences: these
are XML documents, loaded and interpreted by the platform,
that define the smallest complete task that the solution engine
can perform. An Action Sequence defines input parameters and
the actions that have to be executed in sequence. An Action
Sequence may receive parameters coming from
different sources, and several Action Sequences may be
chained so that the output of one becomes the input
of another, creating flexible applications.
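The chaining of Action Sequences described above can be sketched as follows. The XML layout, the action types and the function names are hypothetical and were invented for this illustration; they do not reproduce the real Pentaho Action Sequence schema. The sketch only shows the general pattern of declared inputs, ordered actions and outputs that feed the next sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; NOT the real Pentaho Action Sequence schema.
TOTAL_SEQ = """
<sequence>
  <input name="amounts"/>
  <action type="sum" from="amounts" to="total"/>
  <output name="total"/>
</sequence>
"""
REPORT_SEQ = """
<sequence>
  <input name="total"/>
  <action type="format" from="total" to="message"/>
  <output name="message"/>
</sequence>
"""

# Hypothetical action implementations, keyed by the "type" attribute.
ACTIONS = {
    "sum": lambda value: sum(value),
    "format": lambda value: f"Total sales: {value}",
}


def run_sequence(xml_text, params):
    """Run every action of a sequence in document order and return its outputs."""
    root = ET.fromstring(xml_text)
    context = {}
    for inp in root.findall("input"):
        # A missing input parameter raises KeyError.
        context[inp.get("name")] = params[inp.get("name")]
    for action in root.findall("action"):
        result = ACTIONS[action.get("type")](context[action.get("from")])
        context[action.get("to")] = result
    return {out.get("name"): context[out.get("name")]
            for out in root.findall("output")}


def run_chain(sequences, params):
    """Feed the outputs of each sequence to the next one as input parameters."""
    for xml_text in sequences:
        params = run_sequence(xml_text, params)
    return params


result = run_chain([TOTAL_SEQ, REPORT_SEQ], {"amounts": [10, 20, 12]})
```

Because each sequence exposes only its declared outputs, the second sequence depends on the name `total` and not on how the first one computed it, which is what makes such sequences recombinable.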
5. Business Applications
In order to fully exploit the potential of the open source Pentaho BI
suite, the authors and their team have been engaged in the
creation of both stand-alone solutions (e.g., Job Market Watches)
and solutions integrated into existing applications (e.g., Security
Assessment tools, Document Management tools).
Customizations have been performed on Pentaho, involving code
changes for the creation of report access panels, the restructuring
of the graphic layout of OLAP cubes, and the personalization of portal
home pages with company logos, introductory and descriptive
sections, images and texts, and personalized access to the toolbar
page and menu bar.
In the future, the implementation of dashboards integrated with Google Maps is planned, in order to provide statistical
information obtained by combining Google Maps georeferencing
services with Pentaho platform solutions.
6. Conclusions
Pentaho represents an effective open source alternative to the most
common commercial BI platforms, and even a better solution,
able to satisfy a wide range of information needs, since it enriches the
analysis with workflow functionalities and with integrable process
management. This makes it possible to exploit advanced BI technologies to
perform business operations, such as purchase and income
analysis, customer analysis, HR reporting, financial reporting, KPI
dashboards, Supply Chain analysis and operational reports. Furthermore, this solution presents advantages in terms of:
• lower license costs,
• integration with third party products,
• intuitive, interactive and customizable access,
• open technology sharing.
These characteristics make the product suitable for developing comprehensive solutions.
Glossary
BI      Business Intelligence
DWH     Data Warehouse
DB      Database
OLAP    On-Line Analytical Processing
ETL     Extract, Transform, Load
KPI     Key Performance Indicator
XPDL    XML Process Definition Language
SNMP    Simple Network Management Protocol
RDBMS   Relational Database Management System
JOSSO   Java Open Single Sign-On
Biography
Roberto Guglielmi is a Professional
Consultant at Altran Italia, which he
joined in June 2005. Always
interested in technology, in 1990
he began working on ICT projects
in several companies, where he
gained extensive and
varied knowledge of different
market sectors, with a focus on Private Banking. Over
the last few years, he has worked as Senior Technical
Analyst and System Integrator on projects in different
areas (mainly Telecommunications, Aerospace and
Defense, Energy). He is currently involved in Business
Intelligence solutions and projects (both
proprietary platforms and open source solutions) in the
Financial Services market sector.
He also collaborates with Altran Italia Business
Intelligence Solution, whose mission is to offer support
to all BI projects.
Biography
Raffaella Parri has worked at Altran
Italia since March 2006. She
began her professional career as a
software developer, and she was
subsequently employed as a
functional and technical analyst.
Currently, she is engaged in
Business Intelligence projects
covering different operational areas and market
sectors: Risk Analysis in the field of Security
Assessment for Telecommunications, Enterprise
Performance Management and Enterprise Data
Warehouse in the field of Public Administration, and BI
Applications in the field of Aerospace and Defense. She
specializes in the open source platform Pentaho and she
is mainly involved in Multidimensional Data Modeling and
Report Design, ETL process design,
Functional Analysis and DWH Design. She collaborates
with Altran Italia Business Intelligence Solution, whose
mission is to support all BI projects.
Bibliography
[1]. R. Bouman, J. van Dongen. “Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL”. Wiley, 2009.
[2]. Pentaho Community Wiki, http://wiki.pentaho.com
[3]. R. Talamo, “White Paper Pentaho Piattaforma di Business Intelligence” (internal document, in Italian).
[4]. JFreeReport, http://wiki.pentaho.com/display/Reporting/JFreeReport+0.9
[5]. Mondrian OLAP, http://mondrian.pentaho.com/
[6]. JPivot, http://jpivot.sourceforge.net/
[7]. Data Integration (Kettle), http://kettle.pentaho.com/
[8]. Weka, http://weka.pentaho.com
[9]. Schema Workbench, http://mondrian.pentaho.com/documentation/workbench.php
[10]. BIRT, http://www.eclipse.org/birt/
[11]. JasperReports, http://jasperforge.org/projects/jasperreports