CHAPTER 2 LITERATURE STUDY

2.1 General Theory

2.1.1 Data
According to Inmon (2005: 493), data is a recording of facts, concepts, or instructions on a storage medium for communication, retrieval, and processing by automatic means, and for presentation as information that is understandable by human beings. According to Connolly and Begg (2010: 70), data is the most important component of a database management system (DBMS) from the end user's point of view, acting as a bridge between the machine and the users. Based on the definitions above, we can conclude that data is the most important component of a database: recorded facts that, once processed, become information understandable to human beings.

2.1.2 Information
According to Rainer (2012: 25), information is data that have been organized so that they have meaning and value to the recipient. According to Vercellis (2009: 7), information is the outcome of extraction and processing activities carried out on data, and it appears meaningful to those who receive it in a specific domain. Based on the definitions above, we can conclude that information is data that has been processed to convey meaning and to support decision making.

2.1.3 Database
According to Inmon (2005: 493), a database is a collection of interrelated data stored (often with controlled, limited redundancy) according to a schema; a database can serve single or multiple applications. According to Connolly and Begg (2010: 65), a database is a collection of logically related data, and a description of that data, designed to meet the information needs of an organization. Based on the definitions above, we can conclude that a database is a collection of related data used to fulfill the information needs of an organization or company.

2.1.4 Database Management System (DBMS)
According to Inmon (2005: 494), a database management system is a computer-based software system used to establish and manage data. According to Connolly and Begg (2010: 16), a database management system is a software system that enables users to define, create, maintain, and control access to the database. Based on the definitions above, we can conclude that a database management system is software used to manage data and to control access to the database.

2.1.5 Data Warehouse
According to Inmon (2005: 495), a data warehouse is a collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is relevant to some moment in time; the data warehouse contains atomic data and lightly summarized data. According to Kimball (2010: 10), a data warehouse almost always demands data expressed at the lowest possible grain of each dimension, not because queries want to see individual records, but because queries need to cut through the database in very precise ways. The data warehouse must have the right data to support decision making.

Table 2.1 Kimball vs. Inmon Differences

                 Kimball                    Inmon
Methodology      Bottom-up                  Top-down
Drive            Business area              Enterprise
Architecture     Star schema                Relational data warehouse
Need             Immediate requirement      Longer time scale
Sources          Volatile, changeable       Stable, more stable and growing
Startup cost     Lower                      Higher

According to Connolly and Begg (2010: 125), a data warehouse is an integrated view of enterprise data drawn from different operational data sources, which end users can access through various tools to answer questions ranging from simple to very complex, with the aim of supporting decision making.
According to Farhan (2011: 2), data warehousing is a complex process comprising many tasks, such as extraction, transformation, integration, cleaning, key management, history management, and loading. Based on the definitions above, we can conclude that a data warehouse is a collection of organized data, together with supporting software, used to support decision-making and business processes.

2.1.6 Online Transaction Processing (OLTP)
According to Inmon (2005: 500), OLTP is a high-performance transaction-processing environment. According to Connolly (2010: 1198), OLTP is a system designed to maximize transaction-processing capacity. Based on the definitions above, we can conclude that OLTP is a high-performance system used for transaction processing.

2.1.7 Online Analytical Processing (OLAP)
According to Inmon (2005: 500), OLAP is departmental processing for the data mart environment. According to Kimball (2013: 481), OLAP is a sibling of dimensional star schemas in the relational database, with intelligence about relationships and calculations defined on the server, which enables faster query performance and more interesting analytics from a broad range of query tools. Based on the definitions above, we can conclude that OLAP is a software technology used to analyze data in the data mart environment and is closely related to the star schema in a relational database.

Figure 2.1 OLAP vs. OLTP (Han, 2012: 130)

2.1.8 Data Warehouse Characteristics

2.1.8.1 Subject Oriented
According to Inmon (2005: 29), subject orientation is the data warehouse characteristic whereby the warehouse is organized around the major subjects of the company, whereas operational systems are organized around the functional applications of the company.

Figure 2.2 Subject Orientation of Data (Inmon, 2005: 30)

2.1.8.2 Integrated
According to Inmon (2005: 30), integration is the most important characteristic of the data warehouse. Data is fed from multiple, disparate sources into the data warehouse, and as it is fed it is converted, reformatted, resequenced, summarized, and so forth.

Figure 2.3 Integration (Inmon, 2005: 31)

2.1.8.3 Non-Volatile
According to Inmon (2005: 31), nonvolatility is the data warehouse characteristic whereby warehouse data, once loaded, is not updated in place. In the operational environment data is regularly accessed and manipulated one record at a time, and updates occur as a regular matter of course, but data warehouse data exhibits a very different set of characteristics.

Figure 2.4 Nonvolatility (Inmon, 2005: 33)

2.1.8.4 Time Variant
According to Inmon (2005: 31), time variance is the data warehouse characteristic whereby every unit of data in the data warehouse is accurate as of some moment in time.

Figure 2.5 Time Variance (Inmon, 2005: 33)

2.1.9 Data Warehouse Methodology
The planning method used for this data warehouse is based on the method developed by Kimball, as described in Connolly and Begg's book. According to Connolly and Begg (2010: 1187), there are nine steps used to build a data warehouse:

2.1.9.1 Choosing the process
The process (function) refers to the subject matter of a particular data mart. The first data mart to be built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.

2.1.9.2 Choosing the grain
Choosing the grain means deciding exactly what a fact table record represents. Only when the grain for the fact table has been chosen can we identify the dimensions of the fact table. The grain decision for the fact table also determines the grain of each dimension table.
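To make the grain decision concrete, the sketch below declares a hypothetical grain of one row per commodity sales transaction line; the table and column names (FactSales, DateKey, and so on) are illustrative assumptions rather than anything prescribed by Kimball or Connolly and Begg. Once the grain is stated, both the candidate dimensions and the numeric facts follow directly from it.

```sql
-- Hypothetical sketch only: names are illustrative, not taken from the sources.
-- Declared grain: one row per commodity sales transaction line.
CREATE TABLE FactSales (
    DateKey       INT            NOT NULL,  -- will reference a date dimension
    CustomerKey   INT            NOT NULL,  -- will reference a customer dimension
    CommodityKey  INT            NOT NULL,  -- will reference a commodity dimension
    TransactionNo VARCHAR(20)    NOT NULL,  -- degenerate dimension from the source system
    Quantity      DECIMAL(18, 4) NOT NULL,  -- numeric facts measured at exactly this grain
    SalesAmount   DECIMAL(18, 2) NOT NULL
);
```

Because the grain is a single transaction line, every numeric column must be meaningful for one line; a monthly budget figure, for example, would not belong in this table.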
2.1.9.3 Identifying and conforming the dimensions
Dimensions set the context for asking questions about the facts in the fact table. A well-built set of dimensions makes the data mart understandable and easy to use. Dimensions must be identified in sufficient detail to describe things such as clients and properties at the correct grain.

2.1.9.4 Choosing the facts
The grain of the fact table determines which facts can be used in the data mart. All the facts must be expressed at the level implied by the grain. In other words, if the grain of the fact table is an individual property sale, then all the numerical facts must refer to that particular sale.

2.1.9.5 Storing pre-calculations in the fact table
Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations. A common example of the need to store pre-calculations occurs when the facts comprise a profit-and-loss statement. This situation often arises when the fact table is based on invoices or sales.

2.1.9.6 Rounding out the dimension tables
In this step we return to the dimension tables and add as many text descriptions to the dimensions as possible. The text descriptions should be as intuitive and understandable to the users as possible. The usefulness of a data mart is determined by the scope and nature of the attributes of its dimension tables.

2.1.9.7 Choosing the duration of the database
The duration measures how far back in time the fact table goes. In many enterprises there is a requirement to look at the same time period a year or two earlier. The older the data, the more likely there will be problems in reading and interpreting the old files.

2.1.9.8 Tracking slowly changing dimensions
Dimensions such as client or branch change gradually over time; this is known as the 'slowly changing dimensions' problem. It means, for example, that the proper description of the old client and the old branch must be used with the old transaction history. Often, the data warehouse must assign a generalized (surrogate) key to these important dimensions in order to distinguish multiple snapshots of clients and branches over a period of time (a minimal sketch of this technique follows step nine below).

2.1.9.9 Deciding the query priorities and the query modes
In this step we consider physical design issues. The most critical physical design issues affecting the end user's perception of the data mart are the physical sort order of the fact table on disk and the presence of pre-stored summaries or aggregations. Beyond these there are a host of additional physical design issues affecting administration, backup, indexing performance, and security.
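The following is a minimal sketch of the common "type 2" way of tracking a slowly changing dimension. All table and column names (DimCustomer, CustomerKey, CustomerID, City, EffectiveDate, ExpiryDate, IsCurrent) are illustrative assumptions, not taken from Connolly and Begg or Kimball: the current dimension row is closed off and a new row with a new surrogate key is inserted, so old facts keep pointing at the old description of the customer.

```sql
-- Hypothetical type 2 slowly-changing-dimension sketch.
-- Assumed table: DimCustomer(CustomerKey INT IDENTITY surrogate key,
--   CustomerID business key, City, EffectiveDate, ExpiryDate, IsCurrent).

-- Step 1: expire the current version of the customer whose attributes changed.
UPDATE DimCustomer
SET    ExpiryDate = '2014-06-30',
       IsCurrent  = 0
WHERE  CustomerID = 'C-1001'
  AND  IsCurrent  = 1;

-- Step 2: insert a new version. The IDENTITY column generates a new surrogate
-- key, so facts loaded from this date onward reference the new description,
-- while historical facts keep referencing the old one.
INSERT INTO DimCustomer (CustomerID, City, EffectiveDate, ExpiryDate, IsCurrent)
VALUES ('C-1001', 'Surabaya', '2014-07-01', '9999-12-31', 1);
```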
2.1.10 Data Warehouse Advantages and Disadvantages

2.1.10.1 Advantages of the Data Warehouse
According to Connolly and Begg (2010: 1198), the advantages of using a data warehouse are as follows:

2.1.10.1.1 Potential high returns on investment
An organization must commit a huge amount of resources to ensure the successful implementation of a data warehouse, and the cost can vary enormously from £50,000 to over £10 million due to the variety of technical solutions available. However, a study by the International Data Corporation (IDC) in 1996 reported that average three-year returns on investment (ROI) in data warehousing reached 401%, with over 90% of the companies surveyed achieving over 40% ROI, half the companies achieving over 160% ROI, and a quarter with more than 600% ROI (IDC, 1996).

2.1.10.1.2 Competitive advantage
The huge returns on investment for those companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision-makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.

2.1.10.1.3 Increased productivity of corporate decision-makers
Data warehousing improves the productivity of corporate decision-makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows corporate decision-makers to perform more substantive, accurate, and consistent analysis.

2.1.10.2 Problems of the Data Warehouse
According to Connolly and Begg (2010: 1200), the problems of using a data warehouse are as follows:

2.1.10.2.1 Underestimation of resources for data loading
Many developers underestimate the time required to extract, transform, and load (ETL) the data into the warehouse. This process may account for a significant proportion of the total development time, although better ETL tools are helping to reduce the necessary time and effort.

2.1.10.2.2 Hidden problems with source systems
Hidden problems associated with the source systems feeding the data warehouse may be identified, possibly after years of being undetected. The developers must decide whether to fix the problems in the data warehouse and/or fix the source systems.

2.1.10.2.3 Required data not captured
Warehouse projects often highlight a requirement for data that is not being captured by the existing source systems. The organization must decide whether to modify the OLTP systems or create a system dedicated to capturing the missing data.

2.1.10.2.4 Increased end-user demands
After end users receive query and reporting tools, requests for support from IS staff may increase rather than decrease. This is caused by an increasing awareness among users of the capabilities and value of the data warehouse. The problem can be partially alleviated by investing in easier-to-use, more powerful tools, or by providing better training for the users. A further reason for increasing demands on IS staff is that once a data warehouse is online, the number of users and queries often increases, together with requests for answers to more and more complex queries.

2.1.10.2.5 Data homogenization
Large-scale data warehousing can become an exercise in data homogenization that lessens the value of the data. For example, when producing a consolidated and integrated view of the organization's data, the warehouse designer may be tempted to emphasize similarities rather than differences in the data used by different application areas, such as property sales and property renting.

2.1.10.2.6 High demand for resources
The data warehouse can use large amounts of disk space. Many relational databases used for decision support are designed around star, snowflake, and starflake schemas. These approaches result in the creation of very large fact tables. If there are many dimensions to the factual data, the combination of aggregate tables and indexes on the fact tables can use up more space than the raw data.

2.1.10.2.7 Data ownership
Data warehousing may change the attitude of end users toward the ownership of data. Sensitive data that was originally viewed and used only by a particular department or business area, such as sales or marketing, may now be made accessible to others in the organization.
2.1.10.2.8 High maintenance
Data warehouses are high-maintenance systems. Any reorganization of the business processes or of the source systems may affect the data warehouse. To remain a valuable resource, the data warehouse must stay consistent with the organization that it supports.

2.1.10.2.9 Long-duration projects
A data warehouse represents a single data resource for the organization. However, the building of a data warehouse can take several years, which is why some organizations are building data marts instead. Data marts support only the requirements of a particular department or functional area and can therefore be built more rapidly.

2.1.10.2.10 Complexity of integration
The most important area for the management of a data warehouse is its integration capabilities. An organization must spend a significant amount of time determining how well the various data warehousing tools can be integrated into the overall solution that is needed. This can be a very difficult task, as there are a number of tools for every operation of the data warehouse, and they must integrate well in order for the warehouse to work to the organization's benefit.

2.1.11 Data Warehouse Components
According to Kimball (2002: 6), each warehouse component serves a specific function; we need to learn the strategic significance of each component and how to wield it effectively to win the data warehousing game. One of the biggest threats to data warehousing success is confusing the components' roles and functions. There are four separate and distinct components of a data warehouse: the operational source systems, the data staging area, the data presentation area, and the data access tools.

Figure 2.6 Data Warehouse Components (Kimball, 2002: 7)

2.1.11.1 Operational Source Systems
The source systems should be thought of as outside the data warehouse because, presumably, the warehouse has little to no control over the content and format of the data in these operational legacy systems. The main priorities of the source systems are processing performance and availability. Queries against source systems are narrow, one-record-at-a-time queries that are part of the normal transaction flow and are severely restricted in their demands on the operational systems.

2.1.11.2 Data Staging Area
The data staging area of the data warehouse is both a storage area and a set of processes commonly referred to as extract-transform-load (ETL). The data staging area is everything between the operational source systems and the data presentation area.

2.1.11.3 Data Presentation Area
The data presentation area is where data is organized, stored, and made available for direct querying by users, report writers, and other analytical applications. Since the back-room staging area is off-limits, the presentation area is the data warehouse as far as the business community is concerned; it is all the business community sees and touches via data access tools.

2.1.11.4 Data Access Tools
The final major component of the data warehouse environment is the data access tools. By definition, all data access tools query the data in the data warehouse's presentation area, and querying, obviously, is the whole point of using the data warehouse. A data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modeling application. Some of the more sophisticated data access tools, such as modeling or forecasting tools, may actually upload their results back into the operational source systems or the staging area of the data warehouse.
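For illustration, the kind of ad hoc query a data access tool might send to the presentation area could look like the following minimal sketch; it reuses the hypothetical FactSales, DimDate, and DimCommodity names introduced in the earlier sketches, so none of the identifiers come from Kimball's text.

```sql
-- Hypothetical ad hoc query against the presentation area.
-- Assumed columns: DimDate(DateKey, [Year]), DimCommodity(CommodityKey, CommodityName),
-- FactSales(DateKey, CommodityKey, Quantity, SalesAmount).
SELECT d.[Year],
       c.CommodityName,
       SUM(f.SalesAmount) AS TotalSales,
       SUM(f.Quantity)    AS TotalQuantity
FROM   FactSales AS f
       JOIN DimDate      AS d ON d.DateKey      = f.DateKey
       JOIN DimCommodity AS c ON c.CommodityKey = f.CommodityKey
GROUP BY d.[Year], c.CommodityName
ORDER BY d.[Year], TotalSales DESC;
```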
2.1.12 Data Warehouse Architecture
When designing a data warehouse, it is better to determine in advance the architecture most suitable for the development of the data warehouse. Connolly and Begg (2010: 1157) give an overview of the architecture of a typical data warehouse, as follows:

Figure 2.7 Data Warehouse Architecture (Connolly, 2010: 1157)

The components of this architecture are as follows:

1. Operational Data
The data sources for the data warehouse are provided from:
- Mainframe operational data held in first-generation hierarchical and network databases. It is estimated that the majority of a company's operational data is located in such systems.
- Departmental data held in diverse systems such as VSAM and RMS file storage and relational DBMSs like Informix and Oracle.
- Private data held on workstations and private servers.
- External systems such as the Internet, commercial databases, or databases associated with the organization's customers or suppliers.

2. Operational Data Store (ODS)
The operational data store is a repository of current, integrated operational data used for analysis. The ODS is structured and supplied with data in the same way as the data warehouse, but in practice it may act simply as a temporary staging area for data before it is transferred to the data warehouse. The ODS is often created when legacy operational systems are found to be incapable of meeting reporting requirements. The ODS offers the benefits of a relational database in supporting the decision support functions of the data warehouse.

3. Load Manager
The load manager performs all the operations associated with extracting data, whether directly from the data sources or, more commonly, from the ODS, and loading it into the warehouse.

4. Warehouse Manager
The warehouse manager performs all the operations associated with the management of data in the data warehouse. The operations carried out by the warehouse manager include:
- analysis of the data to ensure consistency;
- transformation and merging of source data from the temporary storage area into the data warehouse tables;
- creation of indexes and views on the base tables;
- performing denormalization;
- performing aggregation;
- backing up and archiving data.

5. Query Manager
The query manager performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.

6. Detailed Data
This area of the warehouse stores all the detailed data in the database schema for the data warehouse. In many cases the detailed data is not stored online but is made available by aggregating the data to the next level of detail.

7. Lightly and Highly Summarized Data
This area of the warehouse stores all the predefined lightly and highly summarized data generated by the warehouse manager. The purpose of summarized information is to improve query performance.

8. Archive/Backup Data
This area stores the detailed and summarized data for the purposes of archiving and backup. Even though summarized data is generated from detailed data, it may be necessary to back it up online if it is kept beyond the retention period for the detailed data. The data is transferred to storage archives such as optical disks.

9. Metadata
Metadata is data about data that describes the data warehouse. It is used to build, maintain, manage, and use the data warehouse. Metadata contains the location and description of the components of the data warehouse, such as the names, definitions, structure, and content of the data warehouse and the end-user views.
Metadata also identifies which data sources are integrated and how they are transformed into the data warehouse.

10. End-User Access Tools
A data warehouse is developed to provide consistent data that users can analyze in order to obtain information that supports decision making. The tools used to work with the data warehouse include:
- Reporting and query tools. Reporting tools include production reporting tools and report writers. Production reporting tools are used to generate regular operational reports or to support high-volume batch jobs. Report writers are desktop tools designed for end users.
- Application development tools. These are graphical data access tools designed specifically for client-server environments. Some of these applications are integrated with OLAP tools and can access all major database systems.
- Executive information system (EIS) tools. Executive information systems were developed to support decision making by executives, but today an EIS can support all levels of processing. As a result, the distinction between executive information systems and decision support systems has become increasingly unclear, because EIS developers add custom-built features to their applications.
- Online analytical processing (OLAP) tools. OLAP tools are based on the concept of multidimensional databases and allow users to analyze data using complex, multidimensional views.
- Data mining tools. Data mining is the process of finding correlations, patterns, and trends that are useful for exploring large amounts of data using statistical, mathematical, and artificial intelligence techniques.

2.1.13 The Concept of Data Warehouse Modeling
In general, data warehouses are modeled using dimensional modeling techniques. With these techniques, fact and dimension tables can be defined, together with the relationships between them:

2.1.13.1 Dimensional Modeling
According to Kimball (2013: 7), dimensional modeling is a longstanding technique for making databases simple. In case after case, for more than five decades, IT organizations, consultants, and business users have naturally gravitated to a simple dimensional structure to match the fundamental human need for simplicity. Dimensional modeling is widely accepted as the preferred technique for presenting analytic data because it addresses two simultaneous requirements: delivering data that is understandable to the business users, and delivering fast query performance.

2.1.13.2 Star Schema
According to Kimball (2013: 40), star schemas are dimensional structures deployed in a relational database management system (RDBMS). Dimensional models implemented in relational database management systems are referred to as star schemas because of their resemblance to a star-like structure, while dimensional models implemented in multidimensional database environments are referred to as online analytical processing (OLAP) cubes.

Figure 2.8 Star Schema vs. OLAP (Kimball, 2013: 9)

2.1.13.3 Fact Table
According to Kimball (2013: 10), a fact table in a dimensional model stores the performance measurements resulting from an organization's business process events. The fact table generally has its own primary key composed of a subset of the foreign keys; this key is often called a composite key. Every table that has a composite key is a fact table. Fact tables express many-to-many relationships; all other tables are dimension tables.

2.1.13.4 Dimension Tables
According to Kimball (2013: 13), dimension tables are integral companions to a fact table. The dimension tables contain the textual context associated with a business process measurement event; they describe the "who, what, where, when, how, and why" associated with the event.

Figure 2.9 Dimension Table (Kimball, 2013: 13)
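To tie the fact table and dimension tables together, the following sketch continues the hypothetical schema from the earlier examples (DimDate, DimCommodity, FactSales, and their columns are illustrative assumptions, not definitions from Kimball): the dimension tables carry surrogate keys and textual attributes, while the fact table's primary key is a composite built from its foreign keys plus the degenerate transaction number so that it is unique at the declared grain.

```sql
-- Hypothetical star schema sketch; all names are illustrative.
CREATE TABLE DimDate (
    DateKey   INT         NOT NULL PRIMARY KEY,  -- surrogate key, e.g. 20140701
    FullDate  DATE        NOT NULL,
    [Year]    INT         NOT NULL,
    [Month]   INT         NOT NULL,
    MonthName VARCHAR(15) NOT NULL               -- textual "when" context
);

CREATE TABLE DimCommodity (
    CommodityKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    CommodityCode VARCHAR(10)       NOT NULL,              -- business key
    CommodityName VARCHAR(50)       NOT NULL,              -- textual "what" context
    CommodityType VARCHAR(30)       NOT NULL               -- e.g. gold, palm oil
);

-- The fact table keeps foreign keys to every dimension; its primary key is a
-- composite of those keys (plus the degenerate TransactionNo for uniqueness).
-- A DimCustomer table, as sketched in the slowly-changing-dimension example,
-- would be referenced in the same way.
ALTER TABLE FactSales
    ADD CONSTRAINT FK_FactSales_DimDate
            FOREIGN KEY (DateKey) REFERENCES DimDate (DateKey),
        CONSTRAINT FK_FactSales_DimCommodity
            FOREIGN KEY (CommodityKey) REFERENCES DimCommodity (CommodityKey),
        CONSTRAINT PK_FactSales
            PRIMARY KEY (DateKey, CustomerKey, CommodityKey, TransactionNo);
```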
2.1.13.5 Snowflake Schema
According to Connolly and Begg (2010: 1229), a snowflake schema is a dimensional data model that has a fact table in the center, surrounded by normalized dimension tables.

2.1.13.6 Starflake Schema
According to Connolly and Begg (2010: 1230), a starflake schema is a dimensional data model that has a fact table in the center, surrounded by a mixture of normalized and denormalized dimension tables.

2.1.14 ETL

2.1.14.1 Extract
According to Connolly (2010: 1208), extraction is the stage that retrieves data from data sources for the data warehouse environment; the sources usually consist of OLTP databases, but may also include other sources such as personal databases, ERP data, and web usage data.

2.1.14.2 Transform
According to Connolly (2010: 1208), transformation is the stage that applies a series of rules or functions to the extracted data to determine how the data will be used for analysis; such transformations may include summarizing data, encoding data, merging data, splitting data, calculating derived values, and generating surrogate keys.

2.1.14.3 Load
According to Connolly (2010: 1209), loading is the stage in which the data that has undergone the transformation process is placed into the data warehouse.

2.1.14.4 ETL Tools
According to Connolly (2010: 1209), the ETL process can be implemented either by writing custom programs or by purchasing one of the ETL tools available on the market.
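As a minimal illustration of the transform and load stages, the sketch below moves rows from a hypothetical staging table into the fact table used in the earlier examples; StagingSales and every column name are assumptions for illustration only. The transformation consists of deriving the date key and looking up the surrogate keys in the dimension tables before loading.

```sql
-- Hypothetical transform-and-load step. Assumed staging table (already extracted
-- from the source systems): StagingSales(TransactionNo, TransactionDate,
-- CustomerID, CommodityCode, Quantity, SalesAmount).
INSERT INTO FactSales (DateKey, CustomerKey, CommodityKey,
                       TransactionNo, Quantity, SalesAmount)
SELECT CONVERT(INT, CONVERT(CHAR(8), s.TransactionDate, 112)) AS DateKey,  -- yyyymmdd key
       c.CustomerKey,                        -- surrogate-key lookup, current version only
       m.CommodityKey,                       -- surrogate-key lookup
       s.TransactionNo,
       s.Quantity,
       s.SalesAmount
FROM   StagingSales AS s
       JOIN DimCustomer  AS c ON c.CustomerID    = s.CustomerID
                             AND c.IsCurrent     = 1
       JOIN DimCommodity AS m ON m.CommodityCode = s.CommodityCode;
```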
2.2 Special Theory

2.2.1 Commodity Types
There are several commodities that the company uses in its transactions. The commodities are:

2.2.1.1 Gold
According to David Evans (2013: 3), gold is old money, the premium non-government currency, evolved in the marketplace over 5,000 years. Gold is best viewed as a superior form of cash that debases much more slowly than paper currency. Of great political importance at this juncture, banks and governments cannot print it. Gold is a potential rival to paper currency and also has the potential to undermine the power and profitability of banks and governments.

2.2.1.2 Palm Oil
According to Azmil (2008: 1), palm oil is one of the 17 major oils and fats produced and traded worldwide. Palm oil accounts for almost 30% of global vegetable oil production, with a roughly 60% share of overall world exports.

2.2.2 Asset
According to EFRAG (2010: 13), an asset is a resource controlled by the entity as a result of past events and from which future economic benefits are expected to flow to the entity.

2.2.3 Microsoft SQL Server 2008 R2
According to Ross (2010: 3), Microsoft SQL Server 2008 R2 is the most advanced, trusted, and scalable data platform released to date. Building on the success of the original SQL Server 2008 release, SQL Server 2008 R2 has made an impact on organizations worldwide with its groundbreaking capabilities, empowering end users through self-service business intelligence (BI), bolstering efficiency and collaboration between database administrators (DBAs) and application developers, and scaling to accommodate the most demanding data workloads.

2.2.4 QlikView
According to QlikTech (2011: 11), QlikView is software that enables all kinds of users, from beginners to experts, to retrieve and assimilate data easily from any source: databases like SQL Server or Oracle as well as Excel, XML, or text files. Enterprise applications such as SAP may also be used as data sources for a QlikView analysis.

Mind Frame

Figure 2.10 Mind Frame

This is the mind frame that we use in carrying out our final project. It consists of three phases, which describe the three main processes in designing a data warehouse. First, we determine the dimensions, the facts, and the calculations. In the second phase, we design the star schema and plan the ETL. In the last phase, we design the user interface for the reports. Once all the phases are complete, we obtain the final result.