FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
DATA WAREHOUSE & DATA MINING
CHAPTER 2 DATA WAREHOUSE PROCESS

Chapter 2 Data Warehouse process

2.1 Initiation of Data Warehouse

The concept of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow, mainly its high cost. In the absence of a data warehousing architecture, an enormous amount of redundancy was required to support multiple decision support environments. In larger corporations it was typical for multiple decision support environments to operate independently. Though each environment served different users, they often required much of the same stored data. The process of gathering, cleaning and integrating data from various sources, usually long-established operational systems (commonly referred to as legacy systems), was typically replicated in part for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data marts" that were tailored for ready access by users. (Source: Basil Soufi)

In a decision support environment, decision support systems (DSS) and executive information systems (EIS) are very similar in that both present information for decision making; however, EIS applications typically allow greater flexibility in slicing and dicing data in whatever style the user finds most acceptable. A data warehouse is not the same as a DSS. Rather, a data warehouse is a platform with integrated data of improved quality that supports many DSS and EIS applications and processes within an enterprise.
i) An EIS is a special type of DSS designed to support decision making at the top level of an organization.
ii) An EIS may help a CEO to get an accurate picture of overall operations, and a summary of what competitors are doing.
iii) These systems are generally easy to operate and present information in forms that are quick to absorb (graphs, charts, etc.).
iv) It is not a substitute for other computer-based systems. The EIS actually feeds off these systems.
v) It does not turn the executive suite into a haven for computer “techies”.
vi) It should be viewed by senior management as a trusted assistant who can be called on when and where necessary.

Figure 1. Revolution of data warehouse

Problem with the current EIS: the data-processing department was not able to handle the huge backlog of requests for data analysis. Application data was hidden behind mainframe files and databases, and it was periodically written to tape for specific information manipulation.

ENTERPRISE SYSTEM TYPE COMPARISON

Figure 2: System type comparison (Source: Database Systems: Design, Implementation and Management, P. Rob and C. Coronel, 2007)

2.2 Data Warehouse Architecture

A data warehouse architecture is a description of the elements and services of the warehouse, with details showing how the components will fit together and how the system will grow over time. There is always an architecture, either ad hoc or planned, but experience shows that planned architectures have a better chance of succeeding (Laura Hadley).
Figure 3: Data warehouse architecture categories

There are four categories of data warehouse architecture: 1) data architecture, 2) information architecture, 3) technical architecture and 4) product architecture. The deliverables of each category are listed below.

Architecture: Data
Deliverables:
- Define what data is needed to meet business user needs.
- Examine the completeness and correctness of the source systems that are needed to obtain data.
- Identify the data facts and dimensions.
- Define the logical data models.
- Establish a preliminary aggregation plan.

Architecture: Information
Deliverables:
- Define the framework for the transformation of data from the source systems into information used by the business users.
- Recommend the data stages necessary for data transformation and information access.
- Develop source-to-target data mapping for each data stage.
- Review data quality procedures and reconciliation techniques.
- Define the physical data models.

Architecture: Technology
Deliverables:
- Define the technical functionality used to build a data warehousing and business intelligence environment.
- Identify available technologies and review the tradeoffs between any overlapping or competing technologies.
- Review the current technical environment and the company's strategic technical directions.
- Recommend technologies to be used to meet your business requirements and implementation plan.

Architecture: Product
Deliverables:
- List the product categories needed to implement the technology architecture.
- Review tradeoffs between overlapping or competing product categories.
- Outline the implementation of the product architecture in stages.
- Identify a short list of products in each of these categories.
- Recommend products and an implementation schedule.
Figure 4: Descriptions of data warehouse architecture categories (Source: http://www.athena-solutions.com/services-design-planning.shtml)

Data warehouse technical architecture (by components)

Figure 5: Data warehouse with staging and data marts (Source: http://docs.oracle.com/cd/B10500_01/server.920/a96520/concept.htm)

Figure 5 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales. The architecture consists of:
- Data Sources (operational systems and flat files)
- Staging Area (where data from the sources goes before the warehouse)
- Warehouse (metadata, summary data, and raw data)
- Data Marts (purchasing, sales, and inventory)
- Users (analysis, reporting, and mining)

Note: This architecture is also well known as the federated data warehouse architecture.

Data warehouse technical architecture (by layer)

Figure 6: Data warehouse internal technical architecture (Source: Modern Data Warehousing, Mining and Visualization, Marakas, 2002)

The technical architecture consists of various interconnected elements:
- Operational and external database layer – the source data for the DW
- Information access layer – the tools the end user accesses to extract and analyze the data
- Data access layer – the interface between the operational and information access layers
- Metadata layer – the data directory or repository of metadata information

Additional layers are:
- Process management layer – the scheduler or job controller
- Application messaging layer – the “middleware” that transports information around the firm
- Physical data warehouse layer – where the actual data used in the DSS are located
- Data staging layer – all of the processes necessary to select, edit, summarize and load warehouse data from the operational and external databases

Data warehouse configuration:
- The virtual data warehouse – the end users have direct access to the data stores, using tools enabled at the data access layer
- The central data warehouse – a single physical database contains all of the data for a specific functional area
- The distributed data warehouse – the components are distributed across several physical databases

Developing an Architecture

When you develop the technical architecture model, draft the architecture requirements document first. Next to each business requirement, write down its architecture implications. Group these implications according to architecture areas (remote access, staging, data access tools, etc.). Understand how each area fits in with the other areas. Capture the definition of the area and its contents. Then refine and document the model.

Thornthwaite recognizes that developing a data warehouse architecture is difficult, and thus warns against using a “just do it” approach, which he also calls “architecture lite.” But the Zachman framework is more than most organizations need for data warehousing, so he recommends a reasonable compromise consisting of a four-layer process: business requirements, technical architecture, standards, and products.

Business requirements essentially drive the architecture, so talk to business managers, analysts, and power users. From your interviews, look for major business issues, as well as indicators of business strategy, direction, frustrations, business processes, timing, availability, and performance expectations. Document everything well.

From an IT perspective, talk to existing data warehouse/DSS support staff, OLTP application groups, and DBAs, as well as networking, OS, and desktop support staff. Also speak with architecture and planning professionals.
Here you want to get their opinions on data warehousing considerations from the IT viewpoint. Learn whether there are existing architecture documents, IT principles, standards statements, organizational power centers, etc.

2.3 Metadata

While metadata is not new, the role of metadata and its importance in the data warehouse certainly are new. For years the information technology professional has worked in the same environment as metadata, but in many ways has paid little attention to it. The information professional has spent a career dedicated to process and functional analysis, user requirements, maintenance, architectures, and the like. The role of metadata has been passive at best in this milieu. But metadata plays a very different role in the data warehouse. Relegating metadata to a passive, backwater role in the data warehouse environment defeats the purpose of the data warehouse. Metadata plays a very active and important part in the data warehouse environment. The reason becomes apparent when the operational environment is contrasted with the data warehouse environment insofar as the user community is concerned.

Figure 7: Data flow involving metadata (Source: http://www.dwreview.com/Articles/Metadata.html)

Simply from the standpoint of who needs the most help in finding their way around data and systems, it is assumed that the DSS analysis community requires a much more formal and intensive level of support than the information technology community. For this reason alone, the formal establishment and ongoing support of metadata becomes important in the data warehouse environment.
But there is a secondary, yet important, reason why metadata plays an important role in the data warehouse environment. The first thing the DSS analyst needs to know in order to do his/her job is what data is available and where it is in the data warehouse. In other words, when the DSS analyst receives an assignment, the first question is what data exists that might be useful in fulfilling the assignment. To this end, the metadata for the warehouse is vital to the preparatory work done by the DSS analyst.

Figure 8. Metadata layer throughout data warehouse architecture (Source: BI 360)

Metadata management takes place throughout the entire process of identifying, acquiring, and querying the data. Metadata is defined as "data about data". An example is a column in a table. The datatype of the column (for instance a string or an integer) is one piece of metadata. The name of the column is another. The actual value in the column for a particular row is not metadata; it is data. Metadata is stored in a Metadata Repository and provides extremely useful information to all of the tools mentioned previously. Metadata management has developed into an exacting science that can provide huge returns to an organization. It can assist companies in analyzing the impact of changes to database tables, tracking the owners of individual data elements ("data stewards"), and much more. It is also required to build the warehouse, since the ETL tool needs to know the metadata attributes of the sources and targets in order to "map" the data properly. The BI tools need the metadata for similar reasons.
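The distinction between metadata and data, and the way an ETL step relies on metadata to map sources to targets, can be illustrated with a minimal Python sketch. The column names, the mapping table and the `map_row` helper are all hypothetical and chosen only for illustration; a real ETL tool would read this information from its metadata repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Metadata for one column: its name and datatype, not its values."""
    name: str
    datatype: type


# Hypothetical source and target column descriptions (illustrative only).
SOURCE_COLUMNS = [ColumnMeta("cust_id", str), ColumnMeta("amount", str)]
TARGET_COLUMNS = [ColumnMeta("customer_id", int), ColumnMeta("sale_amount", float)]

# Hypothetical source-to-target mapping, keyed by source column name.
COLUMN_MAP = {"cust_id": "customer_id", "amount": "sale_amount"}


def map_row(source_row: dict) -> dict:
    """Rename each source value and convert it to the target column's datatype."""
    target_types = {col.name: col.datatype for col in TARGET_COLUMNS}
    target_row = {}
    for col in SOURCE_COLUMNS:
        target_name = COLUMN_MAP[col.name]
        # The metadata (target datatype) decides how the data value is converted.
        target_row[target_name] = target_types[target_name](source_row[col.name])
    return target_row


# The dictionary passed in is data; everything defined above it is metadata.
row = map_row({"cust_id": "42", "amount": "19.90"})
print(row)
```

Here the values "42" and "19.90" are data, while the column names, datatypes and the mapping between them are metadata. Changing a target datatype in `TARGET_COLUMNS` changes how every row is converted, which hints at why impact analysis on table changes depends on well-kept metadata.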
i) The name suggests some high-level technological concept, but it really is fairly simple. Metadata is “data about data”.
ii) With the emergence of the data warehouse as a decision support structure, the metadata are considered as much a resource as the business data they describe.
iii) Metadata are abstractions: they are high-level data that provide concise descriptions of lower-level data.

Two basic types of metadata:

1) Technical Metadata
Technical metadata provides the technical descriptions of data and operations. This information is used by data modellers, application programmers, system administrators, database administrators and software tools. Technical metadata includes information about data definitions, data formats, processes, source data, target data, and the rules and processes that are used to extract, filter, enhance, cleanse, and transform source data into target data.

2) Business Metadata
Business metadata (data and process) is used by business analysts and end users, and provides a business description of informational objects. It assists end users in locating, understanding, and accessing information in applications, data marts, a data warehouse, or other informational sources.

2.3.1 The Metadata in Action

The metadata are essential ingredients in the transformation of raw data into knowledge. They are the “keys” that allow us to handle the raw data. For example, a line in a sales database may contain:

1023 K596 111.21

This is mostly meaningless until we consult the metadata (in the data directory), which tells us that it refers to store number 1023, product K596 and sales of $111.21.
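As a minimal sketch of this idea, the Python snippet below decodes the sample line with a small "data directory". The field names, the use of `Decimal` for the amount, and the assumption that fields are separated by whitespace are illustrative choices, not something the source specifies.

```python
from decimal import Decimal

# Hypothetical data directory entry: field names and types in positional order.
# The names are assumptions; the source only gives the meaning of each field.
SALES_RECORD_LAYOUT = [
    ("store_number", int),
    ("product_code", str),
    ("sales_amount", Decimal),
]


def decode_record(line: str, layout=SALES_RECORD_LAYOUT) -> dict:
    """Split a whitespace-separated record and label each field via the layout."""
    values = line.split()
    if len(values) != len(layout):
        raise ValueError(f"expected {len(layout)} fields, got {len(values)}")
    return {name: convert(value) for (name, convert), value in zip(layout, values)}


record = decode_record("1023 K596 111.21")
print(record)
```

Without `SALES_RECORD_LAYOUT` the three tokens are just strings; with it, each value gains a name and a type that downstream tools can rely on.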
Metadata can be managed through individual tools:
- Metadata manager / repository
- Metadata extract tools
- Data modeling
- ETL
- BI Reporting

2.3.2 The Need for Consistency in the Metadata

i) The data warehouse is set up for the benefit of business analysts and executives across all functional areas.
ii) In their individual databases, the different areas may define and store data according to their own version of the “truth”.
iii) When data are retrieved from these different areas and placed in the warehouse, the transformation and cleansing process ensures that there is a single, integrated “truth” at the organizational level.

2.3.3 Interviewing the Data: Metadata Extraction

Regardless of the nature of a query, certain aspects of the metadata are important to all decision-makers. Some of these are:
- What tables, attributes and keys does the DW contain?
- Where did each set of data come from?
- What transformations were applied during cleansing?
- How have the metadata changed over time?
- How often do the data get reloaded?
- Are there so many data elements that you need to be careful what you ask for?
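The first of these questions (which tables, attributes and keys exist) can often be answered from the database catalog itself. The sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for a warehouse; the `sales_fact` table and its columns are hypothetical. The other questions (origin, transformations, reload frequency) need metadata that a catalog does not hold and are not covered here.

```python
import sqlite3

# In-memory database standing in for a warehouse; the table and column
# names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact ("
    " store_number INTEGER NOT NULL,"
    " product_code TEXT NOT NULL,"
    " sales_amount REAL,"
    " PRIMARY KEY (store_number, product_code))"
)


def list_tables(connection):
    """Return the names of the tables recorded in SQLite's catalog."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def describe_table(connection, table):
    """Return (column name, declared type, part of primary key) per column."""
    # PRAGMA table_info yields: cid, name, type, notnull, dflt_value, pk.
    # The table name is interpolated directly, so pass trusted names only.
    rows = connection.execute(f"PRAGMA table_info({table})")
    return [(name, col_type, pk > 0) for _cid, name, col_type, _nn, _dflt, pk in rows]


tables = list_tables(conn)
columns = describe_table(conn, "sales_fact")
print(tables)
print(columns)
```

Other database systems expose similar catalog views under different names, so the queries here should be read as SQLite-specific.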
Sample metadata extraction

Figure 9 & 10: Metadata extraction

2.3.4 Components of the Metadata

i) Transformation maps – records that show what transformations were applied
ii) Extraction history – records that show what data was analyzed
iii) Algorithms for summarization – methods available for aggregating and summarizing
iv) Data ownership – records that show origin
v) Access patterns – records that show what data are accessed and how often

2.3.5 Typical Mapping Metadata

Transformation mapping records include:
- Identification of original source
- Attribute conversions
- Physical characteristic conversions
- Encoding/reference table conversions
- Naming changes
- Key changes
- Values of default attributes
- Logic to choose from multiple sources
- Algorithmic changes

2.4 Data Warehouse execution

Data warehouse development

Figure 11. Data warehouse development steps (Source: http://www.baseline-consulting.com/images/Service_Data_Warehouse_Art.jpg)

Kozar assembled a list of “seven deadly sins” of data warehouse implementation:
- “If you build it, they will come” – the DW needs to be designed to meet people’s needs
- Underestimating the importance of documenting assumptions – the assumptions and potential conflicts must be included in the framework
- Failure to use the right tool – a DW project needs different tools than those used to develop an application
- Life cycle abuse – in a DW, the life cycle really never ends
- Failure to learn from mistakes – since one DW project tends to be the cause of another, learning from the early mistakes will yield higher quality later
- Omission of an architectural framework – you need to consider the number of users, volume of data, update cycle, etc.
- Ignorance about data conflicts – resolving these takes a lot more effort than most people realize

Exercise:

1. Why does the data warehouse exist?
A. Needs of bigger relational database
B. Needs of DSS
C. Weakness of relational database
D. Weakness of DSS

2. Choose the correct development of data warehouse:
A. OLTP, EIS, DSS, DW
B. OLTP, DSS, EIS, DW
C. EIS, DSS, OLTP, DW
D. DSS, EIS, OLTP, DW

3. Which is NOT included in the architecture of data warehouse?
A. Data access
B. Information access
C. Knowledge access
D. Wisdom access

4. Which of the following is a valid data warehouse configuration?
A. Centralized data warehouse
B. Virtual data warehouse
C. Distributed data warehouse
D. All of the above.

5. The process that records how data from operational data stores and external sources are transformed on the way into the warehouse is referred to as:
A. summarization algorithms.
B. transformation mapping.
C. back propagation.
D. extraction history.

6. Which of these is NOT TRUE about metadata?
A. Important in transforming data into information
B. Key that allows handling of the raw data
C. Data of metadata
D. Information about data warehouse

7. Which of the following would not be a good example of metadata?
A. The directory of where the data is stored.
B. The rules used for summarization and scrubbing.
C. Where the operational data came from.
D. All of the above are examples of metadata.

8. Which layer of the data warehouse architecture does the end user deal directly with?
A. Data access layer
B. Application messaging layer
C. Information access layer
D. None of the above.

9. What are the 7 deadly sins by Kozar?
A. Myths of developing data warehouse
B. Curse of data warehouse
C. Rules of data warehouse
D. Tips of developing data warehouse

10. Which of Kozar’s seven deadly sins of DW implementation concerns the importance of focusing on the users of the data warehouse?
A. Sin 1
B. Sin 3
C. Sin 6
D. Sin 7

11.
a) List FIVE (5) of the “seven deadly sins” in data warehouse implementation suggested by Kozar. (5 marks)
b) Explain any THREE (3) of your answers in a). (6 marks)
c) Illustrate the different layers in the data warehouse architecture. (10 marks)