Download Chapter 1 - azharunisel

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse Chapter 1 Introduction to Data Warehouse 1.1 What is Data Warehouse Figure 1: Data warehouse There is a lot of confusion concerning the terms data mining and data warehousing (also referred to as business intelligence in the marketplace today). Many IT professionals use the two terms interchangeably, with little hesitation or regard for the differences between the two types of applications. While the goals of both are related, and often overlap; data mining and data warehousing are dedicated to furnishing different types of analytics, for different types of users and therefore merit their own space. Data warehouse focus primarily on the storage of data to be used as the information basis for tactical and strategic decision making. Such decision making typically requires extensive data manipulation to extract information from data. In short, the data warehouse data are the source for decision support systems. Data warehouse databases are far less time-critical than transaction databases. Data warehouse databases typically store complex data derived from many sources. To make it easier to store and retrieve such complex data, the data warehouse database structure ends to be quite different from Page 1 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse that of the transaction database. Some of the applications data warehousing can be used for are:         Decision support Trend analysis Financial forecasting Churn Prediction for Telecom subscribers, Credit Card users etc. Insurance fraud analysis Call record analysis Logistics and Inventory management Agriculture Definition of Data Warehouse Two of the pioneers in the field were Bill Inmon and Ralph Kimball i) Bill Inmon, the acknowledged father of the data warehouse, defines it as an integrated, subject oriented, time-variant, non-volatile database that provides support for decision making. Bill Inmon is universally recognized as the "father of the data warehouse." He has over 26 years of database technology management experience and data warehouse design expertise, and has published 36 books and more than 350 articles in major computer journals. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for every major computing association. Before founding Pine Cone Systems, Bill was a co-founder of Prism Solutions, Inc. ii) A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting – Ralph Kimball Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and founder and CEO of Red Brick Systems. He has a Ph.D. from Stanford in electrical engineering, specializing in man-machine systems. Ralph is a leading proponent of the dimensional approach to designing large data warehouses. He currently teaches data warehousing design skills to IT groups, and helps selected clients with specific data warehouse designs. Ralph is a columnist for Intelligent Enterprise magazine and has a relationship with Sagent Technology, Inc., a data warehouse tool vendor. His book "The Data Warehouse Toolkit" is widely recognized as the seminal work on the subject. Components of data warehouse A data warehouse has three main components: (Source : Data warehouse design, http://www.essentialstrategies.com/services/warehouse.htm) - "central data warehouse" - is a database organized according to the corporate data model. Page 2 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse - "data marts" — extracts from the central data warehouse that are organized according to the particular retrieval requirements of individual users. - "legacy systems" - where an enterprise's data are currently kept consist of ODS of OLTP Legacy Systems Finance Data Mart Operational Data Store Sales Data Mart Marketing Data Mart Accounting Data Mart Operational Data Store Operational Data Store Organizational Data Warehouse Operational Data Store Figure 2: Data warehouse components. 1.2 The needs for Data Warehouse Today’s business needs data warehouse because: i) Decisions need to be made quickly and correctly, using all available data. ii) Users are business domain experts, not computer professionals. iii) The amount of data doubles every 11 months (source SAS forum 2008), which affects response time and the sheer ability to comprehend its content. iv) Competition is heating up in the areas of business intelligence and added information value. v) Unlocking data resides. - Data warehouses can: provide immediate information delivery, data integration from across the organization, provide a clear view of historical trends, provide new tools for looking at data, and free users from IS department capacity constraints. - The value from data warehouses comes from the reduced cost of system maintenance (assuming that the data warehouse is replacing another system(s)) and in the added revenue from the new business processes resulting from the additional information supplied by the warehouse. Page 3 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse SAMPLE OF DATA WAREHOUSE SOLUTION Figure 3: How data warehouse solve problems. (Source : Database Systems: Design, Implementation, and Management, Rob and Coronel, 2007) Page 4 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse 1.3 Characteristic of Data Warehouse Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations. Figure 4: Application Oriention vs Subject Orientation. (Source : Modern Data Warehousing, Mining and Visualization, Marakas, 2002) Subject is equivalent to entity such as employees, customers, products, vendors etc. Application orientation consists of data group/keep by database connected to company’s system@applications. In application orientation, data of products is scattered (in finance, inventory, asset tracking, sales and marketing also in materials planning). While in subject orientation data is centralize under one database/storage. A single-subject data warehouse is typically referred to as a data mart, while data warehouses are generally enterprise in scope. Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole. This means there should be consistency among naming conventions, measurements of variables, encoding structures, physical attributes, and other salient data characteristics. An example of this integration is the treatment of codes such as gender codes. Within a single corporation, various applications may represent gender codes in different ways: male vs. female, m vs. f, and 1 vs. 0, etc. In the Data Warehouse, gender is always represented in a consistent way, regardless of the many ways by which it may be encoded and stored in the source data. As the data is moved to the Warehouse, it is transformed into a consistent representation as required. Page 5 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse Figure 5: Cleansing process assure integrated data in data warehouse (Source : Modern Data Warehousing, Mining and Visualization, Marakas, 2002) Time-variant: All data in the data warehouse is identified with a particular time period. All data in Data Warehouse is accurate as of some moment in time, providing an historical perspective. This differs from the operational environment in which data is intended to be accurate as of the moment of access. The data in the Data Warehouse is, in effect, a series of snapshots. Once the data is loaded into the enterprise data store and data marts, it cannot be updated. It is refreshed on a periodic basis, as determined by the business need. The operational data store, if included in the Warehouse architecture, may be updated. Non-volatile Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business. Data in the Warehouse is static, not dynamic. The only operations that occur in Data Warehouse applications are the initial loading of data, access of data, and refresh of data. For these reasons, the physical design of a Data Warehouse optimizes the access of data, rather than focusing on the requirements of data update and delete processing. (Source: "What is a Data Warehouse?" W.H. Inmon, Prism, Volume 1, Number 1, 1995). Page 6 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse Other information about data warehouse i) Data warehouse is an environment, not a product. ii) The data in a data warehouse is static, not dynamic. iii) The data warehouse is setup to be the central repository of all organizational data deemed useful for reporting or exploration iv) An organization that spends more on their data warehouse will not, necessarily, be happier with their data ware house. It is dependent upon the design. v) Data warehouses will have users who have very set requirements for the system (farmers) and users who have various needs that they cannot define for the system (explorers). vi) It is a database designed for analytical tasks, using data from multiple applications – data sources. vii) It supports a relative small number of users with relatively long interactions. Figure 6: Data flow in Data warehouse viii) Its usage is read-intensive – summarized. ix) Its content is periodically uploaded (only additions, no deletion and editing) x) Its contains current and historical data to provide a historical perspective of information xi) It contains a few large tables – large volume. xii) Each query frequently results in a large result set and involves frequent full table scan and multiple joins. Page 7 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse xiii) Among Data warehouse and data mining software  Alacrity, Inc. : Integrated Data Intelligence Software  AlphaBlox : Data Analysis Software  AltaPlan : OLAP Links  Aonix : Object Oriented Modeling Tool and Cleansing Software  Attar Software : Data Mining / Neural Nets  Brio Technologies : Brio Web Warehouse and Decision Support Suite  Bull : Data Warehousing Solutions  Business Objects, Incorporated : WebIntelligence for enterprise decision support  Cognos, Incorporated : Data Warehousing Software Tool Suite  DataFlux : Data Quality and Integration Software  DataMirror : Data Integration, Data Protection, Data Audit Solutions  Decision Point Applications : Packaged Data Warehouse Solutions  Decision Technology : DecisionCentric® Server  Dimensional Insight, Inc. : Reporting and analysis software  Evolutionary Technologies, Inc. : ETI*EXTRACT(r) Tool Suite for Data Warehousing and Data Migration  FileTek : Software for managing massive amounts of atomic data  First Logic : Customer Data Management Software  Hyperion : ESSBASE - High Speed OLAP Processor  IBM : DataGuide  Information Builders : Data management software Data Warehousing, Decision Support, Middleware, Data Access, ...  Intelligent Solutions, Inc. : Claudia Imhof / Data Warehousing and Data Modeling  IRI : CoSORT ETL Software  Kalido : Software for adaptive enterprise data warehousing and master data management  Kenan Systems Corporation : Market Analysis Software  Megaputer Intelligence : Data Mining and Warehousing  Micro Strategy : Relational OLAP (ROLAP) Software and Services  Nautilus Systems, Inc. : Data Warehousing, Data Mining, and Data Visualization software  netcarve Technologies GmbH : Data Warehousing and Data Mining Solutions  NetScheme Solutions Inc.  Pilot Software : Customer and Market Data Analysis Software  Poinpoint Solutions Inc. : Data Warehousing solutions for the insurance industry  Princeton Softech : Database Management and Data Warehouse Software  Query Object Systems Corp : Business Solution Components  Ralph Kimball Associates : A pioneer and leader in the Data Warehouse field  Redbrick Systems : Multidimensional database software  Retek : Data Warehousing for Retail Industry  Rocket Software : Business Intelligence  Rulequest Research : Data Mining Tools  Salford Systems : CART software for tree-structure, non-parametric data analysis  SAS Institute : Data Warehouse and Data Mining Software  Seagate Software : Crystal Reports  Silvon Software : Supply Chain Data Warehousing  SolutionsIQ : Data Warehousing Solutions  Speedware Corp :Business Intelligence Software  Sybase, Inc. : Data Warehousing Database Software  Teleran : Data Warehousing and eCommerce Solutions  Teradata / NCR : Database machine  Thinking Machines Corporation : Data mining software for loyalty management systems  Trillium Software : Data Cleansing and Data Reengineering  Universal Data Solutions, LLC : Len Silverston - Data Modeling and Data Warehouse Page 8 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse 1.4 Data Mining Figure 7: Data mining in Data warehouse environment i) The development of data mining functions and products is the result of years from many disciplines including databases, information retrieval, statistics, algorithms, machine learning and multimedia. ii) Data warehouse contains useful source of data for explorer (data miner). iii) Data mining can take place without a data warehouse and same goes to data warehouse, but a data warehouse offers data miners many advantages. A combination of data warehouse & data mining offers optimum analysis for business. iv) Data mining is one of data warehouse end user tools besides reporting, query, application development, Executive Information System (EIS) and Online Analytical Processing (OLAP) tools. v) There is no one data mining approach but rather a set of techniques that often can be used in combination with each other to extract the most insight from a set of data. vi) Literature of data mining also uses Knowledge Discovery in Databases (KDD), information discovery, information harvesting, data archeology, data exploration and data pattern processing to represent the knowledge discovery. Page 9 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse Figure 8: Data mining application screen shot Extra notes: A transaction (production) databases reflects the daily operations of an organization. As the name implies, a transaction database records transactions such as the sale of a product, the enrolment of a student, or the opening of a checking account. Such transactions are time-critical and must be recorded accurately and immediately. Given their critical role in the day-to-day operations – and given the fact that transaction database data are usually the data source for data warehouse databases – much of the focus on the design of transaction databases. Data warehouse databases focus primarily on the storage of data to be used as the information basis for tactical and strategic decision making. Such decision making typically requires extensive data manipulation to extract information from data. In short, the data warehouse data are the source for decision support systems (DSS) or executive information system (EIS). Data warehouse databases are far less time-critical than transaction databases; the former record the historical data record that will be used to formulate pricing decisions, sales forecasts, market positioning, and so on. Data warehouse databases typically store complex data derived from many sources. To make it easier to store and retrieve such complex data, the data warehouse database structure ends to be quite different from that of the transaction database. Page 10 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse Table 1. Transaction Processing vs Data Warehouse Characteristics Transaction Processing Data Warehouse Volatility Dynamic Static Currentness Current Historical Time dimension Implicity “now” Explicit, visible Granularity Primitive, detailed Detailed and derived summaries Updates Continues, random Periodic, scheduled Tasks Repetitive Unpredictable Flexibility Low High Performance High performance mandatory Lower performance often acceptable Source : Decision support and data warehouse systems by Efrem G Mallach – Mc Graw Hill, 2000 An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data. A data mart is a lower-cost, scaled-down version of a data warehouse, usually designed to support a small group of users (rather than the entire firm). A single-subject data warehouse is typically referred to as a data mart, while data warehouses are generally enterprise in scope. Also, data warehouses can be volatile. Due to the large amount of storage required for a data warehouse, (multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept in the warehouse. For instance, if three years of data are decided on and loaded into the warehouse, every month the oldest month will be "rolled off" the database, and the newest month added. Data warehouse vs datamart Data Marts Organizational Data Warehouse Corporate Highly granular data Normalized design Robust historical data Large data volume Data Model driven data Versatile General purpose DBMS technologies Departmentalized Summarized, aggregated data Star join design Limited historical data Limited data volume Requirements driven data Focused on departmental needs Multi-dimensional DBMS technologies DW users Every corporation has two types of DW users – Farmers and Explorers Page 11 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse Farmers know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information. Explorers are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless nuggets. Cost justification for the DW is usually done on the basis of the results obtained by farmers since explorers are unpredictable. Source System Identification : In order to build the data warehouse, the appropriate data must be located. Typically, this will involve both the current OLTP (On-Line Transaction Processing) system where the "day-to-day" information about the business resides, and historical data for prior periods, which may be contained in some form of "legacy" system. Often these legacy systems are not relational databases, so much effort is required to extract the appropriate data. Data Warehouse Design and Creation: This describes the process of designing the warehouse, with care taken to ensure that the design supports the types of queries the warehouse will be used for. This is an involved effort that requires both an understanding of the database schema to be created, and a great deal of interaction with the user community. The design is often an iterative process and it must be modified a number of times before the model can be stabilized. Great care must be taken at this stage, because once the model is populated with large amounts of data, some of which may be very difficult to recreate, the model can not easily be changed. Data Acquisition: This is the process of moving company data from the source systems into the warehouse. It is often the most time-consuming and costly effort in the data warehousing project, and is performed with software products known as ETL (Extract/Transform/Load) tools. There are currently over 50 ETL tools on the market. The data acquisition phase can cost millions of dollars and take months or even years to complete. Data acquisition is then an ongoing, scheduled process, which is executed to keep the warehouse current to a pre-determined period in time, (i.e. the warehouse is refreshed monthly). Changed Data Capture: The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. This effort is referred to as "changed data capture". Changed data capture is a field of endeavor in itself, and many products are on the market to address it. Some of the technologies that are used in this area are Replication servers, Publish/Subscribe, Triggers and Stored Procedures, and Database Log Analysis. Data Cleansing: This is typically performed in conjunction with data acquisition (it can be part of the "T" in "ETL"). A data warehouse that contains incorrect data is not only useless, but also very dangerous. The whole idea behind a data warehouse is to enable decision-making. If a high level decision is made based on incorrect data in the warehouse, the company could suffer severe consequences, or even complete failure. Data cleansing is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these are all the same customer. Someone in the organization must make a decision as to which is correct, and then the data cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data quality assurance". It can be an extremely complex process, especially if some of the warehouse inputs are from older mainframe file systems (commonly referred to as "flat files" or "sequential files"). Data Aggregation: is process is often performed during the "T" phase of ETL, if it is performed at all. Data warehouses can be designed to store data at the detail level (each individual transaction), at some aggregate level (summary data), or a combination of both. The advantage of summarized data is that typical queries against the warehouse run faster. The disadvantage is that information, which may be needed to answer a query, is lost during aggregation. The tradeoff must be carefully weighed, because the decision can not be undone without rebuilding and repopulating the warehouse. The Page 12 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse safest decision is to build the warehouse with a high level of detail, but the cost in storage can be extreme. Now that the warehouse has been built and populated, it becomes possible to extract meaningful information from it that will provide a competitive advantage and a return on investment. This is done with tools that fall within the general rubric of "Business Intelligence". Business Intelligence (BI): A very broad field indeed, it contains technologies such as Decision Support Systems (DSS), Executive Information Systems (EIS), On-Line Analytical Processing (OLAP), Relational OLAP (ROLAP), Multi-Dimensional OLAP (MOLAP), Hybrid OLAP (HOLAP, a combination of MOLAP and ROLAP), and more. BI can be broken down into four broad fields: Multi-dimensional Analysis Tools: Tools that allow the user to look at the data from a number of different "angles". These tools often use a multi-dimensional database referred to as a "cube". Query tools: Tools that allow the user to issue SQL (Structured Query Language) queries against the warehouse and get a result set back. Data Mining Tools: Tools that automatically search for patterns in data. These tools are usually driven by complex statistical formulas. The easiest way to distinguish data mining from the various forms of OLAP is that OLAP can only answer questions you know to ask, data mining answers questions you didn't necessarily know to ask. Data Visualization Tools: Tools that show graphical representations of data, including complex three-dimensional data pictures. The theory is that the user can "see" trends more effectively in this manner than when looking at complex statistical graphs. Some vendors are making progress in this area using the Virtual Reality Modeling Language (VRML). Metadata Management: Throughout the entire process of identifying, acquiring, and querying the data, metadata management takes place. Metadata is defined as "data about data". An example is a column in a table. The datatype (for instance a string or integer) of the column is one piece of metadata. The name of the column is another. The actual value in the column for a particular row is not metadata - it is data. Metadata is stored in a Metadata Repository and provides extremely useful information to all of the tools mentioned previously. Metadata management has developed into an exacting science that can provide huge returns to an organization. It can assist companies in analyzing the impact of changes to database tables, tracking owners of individual data elements ("data stewards"), and much more. It is also required to build the warehouse, since the ETL tool needs to know the metadata attributes of the sources and targets in order to "map" the data properly. The BI tools need the metadata for similar reasons. Summary: Data Warehousing is a complex field, with many vendors vying for market awareness. The complexity of the technology and the interactions between the various tools, and the high price points for the products require companies to perform careful technology evaluation before embarking on a warehousing project. However, the potential for enormous returns on investment and competitive advantage make data warehousing difficult to ignore. Page 13 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse Exercise: A. Multiple choice questions. 1. Which of the following statements is NOT TRUE about datamarts? 2. A. It is a smaller, more content version of data warehouse. B. A data mart is a subset of a data warehouse. C. It is a representation of an object. D. It holds integrated data of an organization. The followings are the characteristics of the operational data store, EXCEPT: A. Data are from the day-to-day business operations. B. It holds historical data. C. Data are highly and immediately available. D. None of the above 3. Which of the following activities would not normally be associated with a data warehouse? A. Loading B. Updating C. Accessing D. None of the above. 4. Which of the following is not a characteristic of a data warehouse? A. Data integrated B. Volatile C. Subject oriented D. Time variant 5. Which of the following is a synonym for data mining? A. Knowledge data discovery B. Information analysis C. Coal mining D. None of the above. Page 14 FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY DATA WAREHOUSE & DATA MINING Chapter 1. Introduction to Data Warehouse B. True or False questions. 1. Once data enter the data warehouse, it is never removed. 2. According to Inmon, the data warehouse and operational environments are separated. 3. Data marts can serve as a test vehicles for companies exploring the potential benefits of data warehouses. C. Short questions. Question 1 a) Main characteristics of data warehouse are Subject Oriented, Nonvolatile, Time variant and Integrated. Explain each of them. (4 marks) Question 2 a) What is Data Warehousing? (3 marks) b) List FOUR (4) benefits of having data warehouse in an organization? (4 marks) c) List and explain any FOUR (4) characteristic of data warehouse. (8 marks) Question 3 Discuss the differences between transaction databases and data warehouse databases. Page 15

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Chapter 1 - azharunisel