CHAPTER 2
STUDY LITERATURE
2.1 General Theory
2.1.1 Data
According to Inmon (2005: 493), data is a recording of facts,
concepts, or instructions on a storage medium for communication,
retrieval, and processing by automatic means, and presentation as
information that is understandable by human beings.
According to Connolly and Begg (2010: 70), data is the most
important component of a database management system (DBMS)
from the end user’s point of view, where the data acts as a bridge
between machines and users.
Based on the definitions above, we can conclude that data is the
most important component of a database and is recorded so that it
can be processed and presented as information that human beings
can understand.
2.1.2 Information
According to Rainer (2012: 25), information is data that have
been organized so that they have meaning and value to the recipient.
According to Vercellis (2009: 7), information is the outcome of
extraction and processing activities carried out on data, and it appears
meaningful for those who receive it in a specific domain.
Based on the definitions above, we can conclude that
information is data that has been processed to give it meaning and to
support decision making.
2.1.3 Database
According to Inmon (2005: 493), a database is a collection of
interrelated data stored (often with controlled, limited redundancy)
according to a schema. A database can serve single or multiple
applications.
According to Connolly and Begg (2010: 65), a database is a
collection of logically related data, together with its description,
designed to meet the information needs of an organization.
Based on the definitions above, we can conclude that a database
is a collection of related data that is used to fulfill the information
needs of an organization or company.
2.1.4 Database Management System (DBMS)
According to Inmon (2005: 494), a database management
system is a computer-based software system used to establish and
manage data.
According to Connolly & Begg (2010: 16), a database
management system is a software system that enables users to define,
create, maintain, and control access to the database.
Based on the definitions above, we can conclude that a database
management system is software that is used to manage data and to
control access to the database.
2.1.5 Data Warehouse
According to Inmon (2005: 495), a data warehouse is a collection
of integrated, subject-oriented databases designed to support the DSS
function, where each unit of data is relevant to some moment in time.
The data warehouse contains atomic data and lightly summarized
data.
According to Kimball (2010: 10), a data warehouse almost
always demands data expressed at the lowest possible grain of each
dimension, not because queries want to see individual records, but
because queries need to cut through the database in very precise ways.
The data warehouse must have the right data to support decision
making.
Table 2.1 Kimball vs. Inmon
Differences     Kimball          Inmon
Methodology     Bottom Up        Top Down
Drive           Business Area    Enterprise
Architecture    Star Schema      Relational Data Warehouse
Need            Immediate        Longer time scale
Requirement     Volatile         More stable and growing
Sources         Stable           Changeable
Startup cost    Lower            Higher
According to Connolly and Begg (2010: 125), a data
warehouse is an integrated view of enterprise data drawn from
different operational data sources, together with a range of end-user
access tools capable of supporting simple to highly complex queries
with the aim of supporting decision making.
According to Farhan (2011: 2), data warehousing is a complex
process comprising many tasks, such as extraction, transformation,
integration, cleaning, key management, history management, and
loading.
Based on the definitions above, we can conclude that a data
warehouse is a collection of organized data and software that supports
the decision-making process and business processes.
2.1.6 Online Transaction Processing (OLTP)
According to Inmon (2005: 500), OLTP is a high-performance
transaction-processing environment.
According to Connolly (2010: 1198), OLTP is a system
designed to maximize transaction-processing capacity.
Based on the definitions above, we can conclude that OLTP is a
high-performance system that is used for processing transactions.
2.1.7 Online Analytical Processing (OLAP)
According to Inmon (2005: 500), OLAP is departmental
processing for the data mart environment.
According to Kimball (2013: 481), OLAP is a sibling of
dimensional star schemas in the relational database, with intelligence
about relationships and calculations defined on the server that enable
faster query performance and more interesting analytics from a broad
range of query tools.
Based on the definitions above, we can conclude that OLAP is a
software technology that is used to analyze data in a data mart
environment and is closely related to the star schema in a relational
database.
Figure 2.1 OLAP vs. OLTP (Han, 2012: 130)
2.1.8 Data Warehouse Characteristics
2.1.8.1 Subject Oriented
According to Inmon (2005: 29), subject oriented is a data
warehouse characteristic where data is organized around the major
subject areas of the company, whereas operational systems are
organized around the functional applications of the company.
Figure 2.2 Subject Orientation Data (Inmon, 2005: 30)
2.1.8.2 Integrated
According to Inmon (2005: 30), integration is the most
important characteristic of the data warehouse. Data is fed from
multiple, disparate sources into the data warehouse. As data is fed, it
is converted, reformatted, re-sequenced, summarized, and so forth.
Figure 2.3 Integration (Inmon, 2005: 31)
2.1.8.3 Nonvolatile
According to Inmon (2005: 31), nonvolatile is a data warehouse
characteristic. In the operational environment, data is regularly
accessed and manipulated one record at a time, and it is updated as
a regular matter of course. Data warehouse data exhibits a very
different set of characteristics: it is loaded and accessed, but it
is not updated in the general sense.
Figure 2.4 Nonvolatile (Inmon, 2005: 33)
2.1.8.4 Time Variant
According to Inmon (2005: 31), time variant is a data warehouse
characteristic where every unit of data in the data warehouse is
accurate as of some moment in time.
Figure 2.5 Time Variant (Inmon, 2005: 33)
2.1.9 Data Warehouse Methodology
The planning method for the data warehouse is based on the method developed
by Kimball, as described in Connolly’s book. According to Connolly & Begg
(2010: 1187), there are nine steps in the methodology used to build a data
warehouse:
2.1.9.1 Choosing the process
The process (function) refers to the subject matter of a
particular data mart. The first data mart to be built should be the
one that is most likely to be delivered on time, within budget, and
to answer the most commercially important business questions.
2.1.9.2 Choosing the grain
Choosing the grain means deciding exactly what a fact table
record represents. Only when the grain for the fact table is chosen
can we identify the dimensions of the fact table. The grain
decision for the fact table also determines the grain of each
dimension table.
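As a minimal sketch of why the grain matters, the following Python example assumes a hypothetical fact table whose grain is one row per individual property sale. The column names, the sample values, and the helper roll_up are illustrative assumptions, not taken from Connolly and Begg. It shows that facts stored at a fine grain can be rolled up along any of the dimensions that the grain implies.

```python
from collections import defaultdict

# Hypothetical fact rows at the grain "one row per property sale".
# Column names and values are illustrative assumptions only.
sales_facts = [
    {"date": "2013-01-05", "branch": "B001", "property_id": "P10", "amount": 250000},
    {"date": "2013-01-05", "branch": "B001", "property_id": "P11", "amount": 180000},
    {"date": "2013-01-09", "branch": "B002", "property_id": "P12", "amount": 320000},
    {"date": "2013-02-14", "branch": "B001", "property_id": "P13", "amount": 210000},
]


def roll_up(facts, *dimensions):
    """Sum the amount over the chosen dimensions (a coarser grain)."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


# Roll up to one total per branch.
by_branch = roll_up(sales_facts, "branch")

# Derive a month attribute from the date, then roll up per month and branch.
with_month = [{**row, "month": row["date"][:7]} for row in sales_facts]
by_month_and_branch = roll_up(with_month, "month", "branch")
```

The reverse is not possible: if only monthly totals per branch were stored, the individual sales could not be recovered, which is why the grain has to be fixed before the dimensions and facts are chosen.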
2.1.9.3 Identifying and conforming the dimensions
Dimensions set the context for asking questions about the facts
in the fact table. A well-built set of dimensions makes the data
mart understandable and easy to use. We identify dimensions in
sufficient detail to describe things such as clients and properties at
the correct grain.
2.1.9.4 Choosing the facts
The grain of the fact table determines which facts can be used
in the data mart. All the facts must be expressed at the level
implied by the grain. In other words, if the grain of the fact table is
an individual property sale, then all the numerical facts must refer to
this particular sale.
2.1.9.5 Storing pre-calculations in the fact table
Once the facts have been selected, each should be re-examined
to determine whether there are opportunities to use pre-calculations.
A common example of the need to store pre-calculations occurs when
the facts comprise a profit and loss statement. This situation will
often arise when the fact table is based on invoices or sales.
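The idea can be sketched in Python as follows, assuming a hypothetical invoice-based fact row that carries revenue and cost. The class InvoiceFact, the function build_fact, and the sample figures are illustrative assumptions. The profit is derived once at load time and stored with the row, so every later query reads the same value instead of recomputing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvoiceFact:
    """One fact row per invoice (hypothetical columns)."""
    invoice_id: str
    revenue: float
    cost: float
    profit: float  # pre-calculated once at load time and stored with the row


def build_fact(invoice_id: str, revenue: float, cost: float) -> InvoiceFact:
    # Derive profit while loading so that all queries share one stored value.
    return InvoiceFact(invoice_id, revenue, cost, profit=revenue - cost)


facts = [
    build_fact("INV-1", 1200.0, 900.0),
    build_fact("INV-2", 500.0, 650.0),  # a loss-making invoice
]

# Queries can sum the stored column directly.
total_profit = sum(f.profit for f in facts)
```

Storing the derived value costs extra space, but it reduces the risk that different reports calculate the same figure in different ways.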
2.1.9.6 Rounding out the dimension tables
In this step, we return to the dimension tables and add as
many text descriptions to the dimensions as possible. The text
descriptions should be as intuitive and understandable to the users
as possible. The usefulness of a data mart is determined by the
scope and nature of the attributes of the dimension tables.
2.1.9.7 Choosing the duration of the database
The duration measures how far back in time the fact table
goes. In many enterprises, there is a requirement to look at the
same time period a year or two earlier. The older the data, the
more likely there will be problems in reading and interpreting the
old files. In addition, the old versions of the important dimensions
must be used with the old data, not the most current versions. This
is known as the ‘slowly changing dimensions’ problem.
2.1.9.8 Tracking slowly changing dimensions
The slowly changing dimension problem means, for example,
that the proper description of the old client and the old branch
must be used with the old transaction history. Often, the data
warehouse must assign a generalized key to these important
dimensions in order to distinguish multiple snapshots of clients
and branches over a period of time.
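As a minimal sketch of this idea, the following Python example assumes a hypothetical client dimension in which each snapshot of a client receives its own generalized (surrogate) key. The class ClientRow, the function apply_change, and the sample values are illustrative assumptions, not part of Connolly and Begg’s method. Keeping one row per snapshot in this way is often called a Type 2 approach in Kimball’s terminology.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClientRow:
    client_key: int                  # generalized (surrogate) key, one per snapshot
    client_id: str                   # natural key from the source system
    city: str
    valid_from: str
    valid_to: Optional[str] = None   # None marks the current snapshot


def apply_change(dimension: List[ClientRow], client_id: str,
                 city: str, change_date: str) -> int:
    """Close the current snapshot if the city changed and add a new one.

    Returns the key that new fact rows should reference.
    """
    current = next(
        (r for r in dimension if r.client_id == client_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == city:
        return current.client_key        # nothing changed, reuse the snapshot
    if current is not None:
        current.valid_to = change_date   # old facts keep the old description
    new_key = max((r.client_key for r in dimension), default=0) + 1
    dimension.append(ClientRow(new_key, client_id, city, change_date))
    return new_key


client_dim: List[ClientRow] = []
key_2012 = apply_change(client_dim, "CR76", "London", "2012-01-01")
key_2013 = apply_change(client_dim, "CR76", "Glasgow", "2013-06-01")
```

Fact rows loaded before the change keep referencing key_2012, so the old transaction history is still described by the old client attributes.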
2.1.9.9 Deciding the query priorities and the query modes
In this step we consider physical design issues. The most
critical physical design issues affecting the end-user’s perception
of the data mart are the physical sort order of the fact table on disk
and the presence of pre-stored summaries or aggregations. Beyond
these issues there are a host of additional physical design issues
affecting administration, backup, indexing performance, and
security.
2.1.10 Data Warehouse Advantages and Disadvantages
2.1.10.1 Advantages of Data Warehouse
According to Connolly & Begg (2010: 1198), the advantages of
using a data warehouse are:
2.1.10.1.1 Potential high returns on investment
An organization must commit a huge amount of
resources to ensure the successful implementation of a data
warehouse and the cost can vary enormously from £50,000
to over £10 million due to the variety of technical solutions
available. However, a study by the International Data
Corporation (IDC) in 1996 reported that average three-year
returns on investment (ROI) in data warehousing reached
401%, with over 90% of the companies surveyed
achieving over 40% ROI, half the companies achieving
over 160% ROI, and a quarter with more than 600% ROI
(IDC, 1996).
2.1.10.1.2 Competitive Advantage
The huge returns on investment for those companies
that have successfully implemented a data warehouse are
evidence of the enormous competitive advantage that
accompanies this technology. The competitive advantage
is gained by allowing decision-makers access to data that
can reveal previously unavailable, unknown, and untapped
information on, for example, customers, trends, and
demands.
2.1.10.1.3 Increased productivity of corporate decision-makers
Data warehousing improves the productivity of
corporate decision-makers by creating an integrated
database of consistent, subject-oriented, historical data. It
integrates data from multiple incompatible systems into a
form that provides one consistent view of the organization.
By transforming data into meaningful information, a data
warehouse allows corporate decision-makers to perform
more substantive, accurate, and consistent analysis.
2.1.10.2 Problems of Data Warehouse
According to Connolly & Begg (2010: 1200), the
problems of using a data warehouse are:
2.1.10.2.1 Underestimation of resources for data ETL
Many developers underestimate the time required to
extract, transform, and load (ETL) the data into the
warehouse. This process may account for a significant
proportion of the total development time, although better
ETL tools are helping to reduce the necessary time and
effort.
2.1.10.2.2 Hidden problems with source systems
Hidden problems associated with the source systems
feeding the data warehouse may be identified, possibly
after years of being undetected. The developers must
decide whether to fix the problem in the data warehouse
and/or fix the source systems.
2.1.10.2.3 Required data not captured
Warehouse projects often highlight a requirement for
data not being captured by the existing source systems. The
organization must decide whether to modify the OLTP
systems or create a system dedicated to capturing the
missing data.
2.1.10.2.4 Increased end-user demands
After end-users receive query and reporting tools,
requests for support from IS staff may increase rather than
decrease. This is caused by an increasing awareness among
the users of the capabilities and value of the data
warehouse. This problem can be partially alleviated by
investing in easier-to-use, more powerful tools, or by
providing better training for the users. A further reason for
increasing demands on IS staff is that once a data
warehouse is online, it is often the case that the number of
users and queries increases, together with requests for answers
to more and more complex queries.
2.1.10.2.5 Data homogenization
Large-scale data warehousing can become an exercise
in data homogenization that lessens the value of the data.
For example, when producing a consolidated and
integrated view of the organization’s data, the warehouse
designer may be tempted to emphasize similarities rather
than differences in the data used by different application
areas such as property sales and property renting.
2.1.10.2.6 High demand for resources
The data warehouse can use large amounts of disk
space. Many relational databases used for decision support
are designed around star, snowflake, and star-flake
schemas. These approaches result in the creation of very
large fact tables. If there are many dimensions to the
factual data, the combination of aggregate tables and
indexes to the fact tables can use up more space than the
raw data.
2.1.10.2.7 Data ownership
Data warehousing may change the attitude of end-users
to the ownership of data. Sensitive data that was originally
viewed and used only by a particular department or
business area, such as sales or marketing, may now be made
accessible to others in the organization.
2.1.10.2.8 High maintenance
Data warehouses are high-maintenance systems. Any
reorganization of the business processes and the source
systems may affect the data warehouse. To remain a
valuable resource, the data warehouse must remain
consistent with the organization that it supports.
2.1.10.2.9 Long-duration projects
A data warehouse represents a single data resource for
the organization. However, the building of a data
warehouse can take several years, which is why some
organizations are building data marts. Data marts support
only the requirements of a particular department or
functional area and can therefore be built more rapidly.
2.1.10.2.10 Complexity of integration
The most important area for the management of a data
warehouse is the integration capabilities. This means that
an organization must spend a significant amount of time
determining how well the various different data
warehousing tools can be integrated into the overall
solution that is needed. This can be a very difficult task,
as there are a number of tools for every operation of the data
warehouse, which must integrate well in order that the
warehouse works to the organization’s benefit.
2.1.11 Data Warehouse Components
According to Kimball (2002: 6), each warehouse component serves a
specific function; we need to learn the strategic significance of each
component and how to wield it effectively to win the data warehousing game.
One of the biggest threats to data warehousing success is confusing the
components’ roles and functions. There are four separate and distinct
components of a data warehouse: operational source systems, data staging area,
data presentation area, and data access tools.
Figure 2.6 Data Warehouse Component (Kimball, 2002: 7)
2.1.11.1 Operational Source Systems
The source systems should be thought of as outside the
data warehouse because presumably we have little to no control
over the content and format of the data in these operational
legacy systems. The main priorities of the source systems are
processing performance and availability. Queries against
source systems are narrow, one-record-at-a-time queries that
are part of the normal transaction flow and severely restricted
in their demands on the operational systems.
2.1.11.2 Data Staging Area
The data staging area of the data warehouse is both a
storage area and a set of processes commonly referred to as
extract-transform-load (ETL). The data staging area is
everything between the operational source systems and the
data presentation area.
2.1.11.3 Data Presentation
The data presentation area is where data is organized,
stored, and made available for direct querying by users, report
writers, and other analytical applications. Since the backroom
staging area is off-limits, the presentation area is the data
warehouse as far as the business community is concerned. It is
all the business community sees and touches via data access
tools.
2.1.11.4 Data Access Tools
The final major component of the data warehouse
environment is the data access tools. By definition, all data
access tools query the data in the data warehouse’s
presentation area. Querying, obviously, is the whole point of
using the data warehouse. A data access tool can be as simple
as an ad hoc query tool or as complex as a sophisticated data
mining or modeling application. Some of the more
sophisticated data access tools, like modeling or forecasting
tools, may actually upload their results back into the operational
source systems or the staging area of the data warehouse.
2.1.12 Data Warehouse Architecture
When designing a data warehouse, it is better to determine in
advance the architecture most suitable for its development.
Connolly and Begg (2010: 1157) give an overview of the
architecture of a typical data warehouse, which is as follows:
Figure 2.7 Data Warehouse Architecture (Connolly, 2010: 1157)
The components of this architecture are as follows:
1. Operational Data
Data sources for the data warehouse are provided from:
- Mainframe operational data held in first-generation
hierarchical and network databases. It is estimated that
the majority of the operational data of the company is
held in these systems.
- Departmental data held in diverse systems such as
VSAM file storage, RMS, and relational DBMSs like
Informix and Oracle.
- Internal data stored on workstations and private
servers.
- External systems such as the Internet, commercial
databases, or databases associated with a customer or
supplier of the organization.
2. Operational Data Store (ODS)
An operational data store is a repository of current and
integrated operational data that is used for analysis. The
ODS structures and supplies data in the same way as the
data warehouse, but may in fact act simply as a staging area
before the data is transferred to the data warehouse. An ODS
is often created when the operational systems are found to be
incapable of meeting reporting requirements. The ODS
provides the ease of use of a relational database while
remaining distant from the decision support functions of the
data warehouse.
3. Load Manager
The load manager performs all the operations associated
with the extraction and loading of data, either directly from
the data sources or, more commonly, from the ODS.
4. Warehouse Manager
The warehouse manager carries out all the operations related
to the management of data in the data warehouse. Operations
carried out by the warehouse manager include:
- Analyzing the data to ensure consistency.
- Transforming and merging source data from a
temporary storage area into the data warehouse
tables.
- Creating indexes and views on the base tables.
- Performing de-normalization.
- Performing aggregation.
- Backing up and archiving data.
5. Query Manager
The query manager performs all the operations related to
the management of user queries. Operations performed by
this component include directing queries to the appropriate
tables and scheduling query execution.
6. Detail Data
This area stores all the detailed data in the database schema,
which supplements the aggregated data in the data warehouse. In
many cases, detailed data is not stored online but is made
available by aggregating the data to the next level of detail.
7. Lightly and Highly Summarized Data
This area stores all the predefined lightly and highly
summarized data generated by the warehouse manager. The
purpose of this summarized information is to improve query
performance.
8. Archive or Backup Data
This area stores all the detailed and summarized data for
archiving and backup purposes. Although summary data is
derived from detailed data, it may be necessary to back up
the summary data online if it is kept beyond the retention
period for the detailed data. The data is transferred to
storage archives such as optical disks.
9. Metadata
Metadata is data about data that describes the data
warehouse. Metadata is used to build, maintain, manage, and
use the data warehouse. Metadata contains the location and
description of the components in the data warehouse, such as
the names, definitions, structure, and content of the data
warehouse and the end-user views. Metadata also identifies how
data sources are integrated and transformed in the data warehouse.
10. End-User Access Tools
The purpose of developing a data warehouse is to provide
consistent data that users can analyze to obtain
information that supports decision making. Tools that are
used to access the data warehouse include:
- Reporting and query tools
Reporting tools include production reporting
tools and report writers. Production reporting tools are
used to generate regular operational reports or to
support high-volume batch jobs. Report writers are
desktop tools designed for end-users.
- Application development tools
These tools are used by users to access the data
graphically and are designed specifically for the
client-server environment. Some applications are
integrated with OLAP tools and can access all major
database systems.
- Executive information system (EIS) tools
Executive information systems were originally developed to
support decision making by executives. Today, however,
EIS can support all levels of management. As a result,
the distinction between executive information systems and
decision support systems is increasingly unclear, because
EIS developers have added custom-built features to their
applications.
- Online analytical processing (OLAP) tools
OLAP tools are based on the concept of multidimensional
databases and allow users to analyze the data using
complex, multidimensional views.
- Data mining tools
Data mining is the process of discovering useful
correlations, patterns, and trends by exploring large
amounts of data using statistical, mathematical, and
artificial intelligence techniques.
2.1.13 The Concept of Data Warehouse Modeling
Generally, data warehouse modeling uses dimensional modeling
techniques. With this technique, the fact tables and dimension tables
can be created, as well as the relationships between them:
2.1.13.1 Dimensional Modeling
According to Kimball (2013: 7), dimensional modeling is a
longstanding technique for making databases simple. In case after
case, for more than five decades, IT organizations, consultants,
and business users have naturally gravitated to a simple
dimensional structure to match the fundamental human need for
simplicity.
Dimensional modeling is widely accepted as the preferred
technique for presenting analytic data because it addresses two
simultaneous requirements:
- Deliver data that’s understandable to the business users.
- Deliver fast query performance.
2.1.13.2 Star Schema
According to Kimball (2013: 40), star schemas are
dimensional structures deployed in a relational database
management system (RDBMS). Dimensional models
implemented in relational database management systems are
referred to as star schemas because of their resemblance to a
star-like structure, while dimensional models implemented in
multidimensional database environments are referred to as
online analytical processing (OLAP) cubes.
Figure 2.8 Star Schema vs. OLAP (Kimball, 2013: 9)
2.1.13.3 Fact Table
According to Kimball (2013: 10), the fact table in a
dimensional model stores the performance measurements
resulting from an organization’s business process events. The
fact table generally has its own primary key composed of a
subset of the foreign keys. This key is often called a composite
key. Every table that has a composite key is a fact table. Fact
tables express many-to-many relationships. All others
are dimension tables.
2.1.13.4 Dimension Tables
According to Kimball (2013: 13) Dimension tables are
integral companions to a fact table. The dimension tables
contain the textual context associated with a business process
measurement event. They describe the “who, what, where,
when, how, and why” associated with the event.
Figure 2.9 Dimension Table (Kimball, 2013: 13)
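The relationship between a fact table and its dimension tables can be sketched with Python’s built-in sqlite3 module. The table names, column names, and sample rows below are hypothetical and chosen only for illustration; they are not the schema of this project. The fact table’s primary key is a composite of its foreign keys, and the dimension tables carry the descriptive text used to group and label the measurements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive context (the "what" and "when").
    CREATE TABLE dim_date (
        date_key   INTEGER PRIMARY KEY,
        full_date  TEXT,
        month      TEXT
    );
    CREATE TABLE dim_commodity (
        commodity_key   INTEGER PRIMARY KEY,
        commodity_name  TEXT
    );
    -- Fact table: numeric measurements plus foreign keys to the dimensions.
    CREATE TABLE fact_transaction (
        date_key       INTEGER REFERENCES dim_date(date_key),
        commodity_key  INTEGER REFERENCES dim_commodity(commodity_key),
        quantity       REAL,
        amount         REAL,
        PRIMARY KEY (date_key, commodity_key)  -- composite key
    );
    """
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2013-01-05", "2013-01"), (2, "2013-02-10", "2013-02")],
)
conn.executemany(
    "INSERT INTO dim_commodity VALUES (?, ?)",
    [(1, "Gold"), (2, "Palm Oil")],
)
conn.executemany(
    "INSERT INTO fact_transaction VALUES (?, ?, ?, ?)",
    [(1, 1, 2.0, 100.0), (1, 2, 5.0, 40.0), (2, 1, 1.0, 55.0)],
)

# A typical star-schema query: join the fact table to a dimension
# and aggregate a measurement by a descriptive attribute.
rows = conn.execute(
    """
    SELECT c.commodity_name, SUM(f.amount)
    FROM fact_transaction AS f
    JOIN dim_commodity AS c ON c.commodity_key = f.commodity_key
    GROUP BY c.commodity_name
    ORDER BY c.commodity_name
    """
).fetchall()
```

Each dimension can be joined to the fact table in the same way, which is what gives the schema its star-like shape.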
2.1.13.5 Snowflake Schema
According to Connolly and Begg (2010: 1229), a
snowflake schema is a dimensional data model that has a fact
table in the center, surrounded by normalized dimension
tables.
2.1.13.6 Star-flake Schema
According to Connolly and Begg (2010: 1230), a
star-flake schema is a dimensional data model that has a fact
table in the center, surrounded by both normalized and
de-normalized dimension tables.
2.1.14 ETL
2.1.14.1 Extract
According to Connolly (2010: 1208), extraction is the
stage in which data is retrieved from the data sources for the
enterprise data warehouse (EDW). The sources usually consist
of OLTP databases, but may also include other sources such as
personal databases, ERP data, and web usage data.
2.1.14.2 Transform
According to Connolly (2010: 1208), transformation is
the stage in which a series of rules or functions is applied to
the extracted data to determine how the data will be used for
analysis. Transformations may include data summation, data
encoding, data merging, data splitting, data calculation, and
the creation of surrogate keys.
2.1.14.3 Load
According to Connolly (2010: 1209), loading is the
stage in which data that has undergone transformation is
entered into the data warehouse.
2.1.14.4 ETL Tools
According to Connolly (2010: 1209), the ETL process
can be carried out by writing custom-built programs or
by purchasing commercial ETL tools available on the market.
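As an illustration of a custom-built ETL program, the following Python sketch runs the three stages on a few hypothetical source rows. The functions extract, transform, and load, the table names, and the sample data are all assumptions made for this example; a real source would be an OLTP database rather than an in-memory list. The transform step shows data cleaning, data encoding, and the creation of surrogate keys.

```python
import sqlite3


def extract():
    """Extract: rows as they might arrive from a source system (sample data)."""
    return [
        {"customer": " alice ", "commodity": "gold", "amount": "100.50"},
        {"customer": "Bob", "commodity": "PALM OIL", "amount": "40"},
        {"customer": "Bob", "commodity": "palm oil", "amount": "not-a-number"},
    ]


def transform(rows):
    """Transform: clean values, reject unparsable rows, assign surrogate keys."""
    commodity_keys = {}
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine the rejected row
        name = row["commodity"].strip().title()
        # Surrogate key: the next integer for each commodity name not seen before.
        key = commodity_keys.setdefault(name, len(commodity_keys) + 1)
        cleaned.append((row["customer"].strip().title(), key, amount))
    return commodity_keys, cleaned


def load(conn, commodity_keys, cleaned):
    """Load: write the dimension rows and fact rows into the warehouse tables."""
    conn.execute(
        "CREATE TABLE dim_commodity (commodity_key INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE fact_sale (customer TEXT, commodity_key INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO dim_commodity VALUES (?, ?)",
        [(key, name) for name, key in commodity_keys.items()],
    )
    conn.executemany("INSERT INTO fact_sale VALUES (?, ?, ?)", cleaned)
    conn.commit()


conn = sqlite3.connect(":memory:")
keys, facts = transform(extract())
load(conn, keys, facts)
loaded = conn.execute(
    "SELECT customer, commodity_key, amount FROM fact_sale ORDER BY rowid"
).fetchall()
```

Commercial ETL tools cover the same three stages, typically adding scheduling, logging, and connectors for many source systems.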
2.2 Special Theory
2.2.1 Commodity Type
There are several commodities that the company uses
in its transactions. The commodities are:
2.2.1.1 Gold
According to David Evans (2013: 3), gold is old money,
the premium non-government currency, evolved in the
marketplace over 5,000 years. Gold is best viewed as a
superior form of cash that debases much more slowly than
paper currency. Of great political importance at this juncture,
banks and governments cannot print it. Gold is a potential rival
to paper currency and also has the potential to undermine the
power and profitability of banks and governments.
2.2.1.2 Palm Oil
According to Azmil (2008: 1), palm oil is one of the 17
major oils and fats produced and traded worldwide. Palm oil
accounts for almost 30% of global vegetable oil
production and has a 60% share of overall world
exports.
2.2.2 Asset
According to EFRAG (2010: 13), an asset is a resource
controlled by the entity as a result of past events and from
which future economic benefits are expected to flow to the
entity.
2.2.3 Microsoft SQL Server 2008 R2
According to Ross (2010: 3) Microsoft SQL Server
2008 R2 is the most advanced, trusted, and scalable data
platform released to date. Building on the success of the
original SQL Server 2008 release, SQL Server 2008 R2 has
made an impact on organizations worldwide with its
groundbreaking capabilities, empowering end users through
self-service business intelligence (BI), bolstering efficiency
and collaboration between database administrators (DBAs)
and application developers, and scaling to accommodate the
most demanding data workloads.
2.2.4 QlikView
According to QlikTech (2011: 11), QlikView is
software that enables all kinds of users, from beginners to
experts, to retrieve and assimilate data easily from any source:
databases like SQL Server or Oracle, as well as Excel, XML, or
text files. Enterprise applications such as SAP may also be
used as data sources for a QlikView analysis.
Mind Frame
Figure 2.10 Mind Frame
This is the mind frame that we use in making our final project. It consists of 3
phases, which describe the 3 main processes in designing a data warehouse. In the
first phase, we determine the dimensions, facts, and calculations. In the second
phase, we design the star schema and plan the ETL. In the last phase, we design the
user interface for reports. Once all the phases are complete, we obtain the final result.