Data Warehousing

1. Data Warehouse

A data warehouse is a technology that provides the enterprise with access to historical data for analysis and decision making. It collects and stores integrated sets of historical data from multiple operational systems and feeds them to one or more data marts. It may also provide end-user access to support enterprise-wide views of data. In short, it is a single, integrated source of data for information processing. A data warehouse has the following characteristics:

Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply as computer files.

De-Normalized: A data warehouse typically uses a de-normalized architecture, favoring query performance over update efficiency.

Integrated: A single source of information for and about multiple areas of interest. The data warehouse provides one-stop shopping and contains information about a variety of subjects.

Non-Volatile: Stable information that does not change each time an operational process is executed. Information is consistent regardless of when the warehouse is accessed.

Time-Variant: Contains a history of the subject as well as current information. Historical information is an important component of a data warehouse.

Data Cleansing: The process of converting incoming data to consistent, standard types and formats.

Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of data. It typically supports a single analytic application used by a distinct set of workers.

Staging Area: Any data store that is designed primarily to receive data into a warehousing environment.

OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs.

Multidimensional Analysis: The ability to manipulate information by a variety of relevant categories, or "dimensions", to facilitate analysis and understanding of the underlying data.
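As a toy illustration of multidimensional analysis, the sketch below slices a small in-memory dataset by arbitrary dimensions. All names and figures here are hypothetical, invented only for the example:

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, year, sales_amount)
facts = [
    ("East", "Widget", 2003, 100),
    ("East", "Gadget", 2003, 150),
    ("West", "Widget", 2003, 200),
    ("West", "Widget", 2004, 250),
]

def slice_by(rows, *dims):
    """Aggregate the sales measure by any combination of dimensions."""
    index = {"region": 0, "product": 1, "year": 2}
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[index[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

# "Slicing and dicing": the same data viewed along different dimensions.
by_region = slice_by(facts, "region")
by_region_year = slice_by(facts, "region", "year")
```

The same `slice_by` call serves any dimension combination, which is the essence of manipulating information "by a variety of relevant categories".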
It is also sometimes referred to as "drilling down", "drilling across", and "slicing and dicing".

Multidimensional Database: Also known as MDDB or MDDBS. A class of proprietary, non-relational database management tools that store and manage data in a multidimensional manner, as opposed to the two dimensions associated with traditional relational database management systems.

Schema: A data warehouse commonly uses one of two schema types, the star schema and the snowflake schema.

Star Schema: A means of aggregating data based on a set of known dimensions. It stores data multidimensionally in a two-dimensional Relational Database Management System (RDBMS), such as Oracle.

Snowflake Schema: An extension of the star schema that applies additional dimensions to the dimensions of a star schema in a relational environment.

Hypercube: A means of visually representing multidimensional data.

OLAP Tools: A set of software products that attempt to facilitate multidimensional analysis. They can incorporate data acquisition, data access, data manipulation, or any combination thereof.

2. Difference between a Data Warehouse and an Operational Database

The data warehouse is distinctly different from an operational database in its day-to-day usage and operational maintenance:

Operational Database                                         | Warehouse Database
Application oriented                                         | Subject oriented
Detailed                                                     | Summarized
Accurate as of the moment of access                          | Represents values over time (snapshots)
Can be updated                                               | Cannot be updated
Serves the clerical community                                | Serves the managerial community
Processing requirements understood before initial development | Processing requirements not completely understood before development
Performance sensitive (immediate response required when entering a transaction) | Not performance sensitive (immediacy not required)
Transaction driven                                           | Analysis driven
Control of update a major concern                            | Control of update not an issue
Current information                                          | More historical information
Non-redundant                                                | Supports redundancy
Small amount of data used in a process                       | Large amount of data used in a process
Static structure                                             | Flexible structure
Limited number of data elements for a single record          | Many records of many data elements

3.1 The Data Warehousing Process

3.1.1 Determine Informational Requirements

Identify and analyze existing informational capabilities. Identify from key users the significant business questions and key metrics that the target users need. Decompose these metrics into their component parts with specific definitions. Map the component parts to the informational model and to the systems of record.

3.1.2 Evolutionary and Iterative Development Process

When you begin to develop your first data warehouse increment, the architecture is new and fresh. With the second and subsequent increments, the following applies:
Start with one subject area (or a subset or superset) and one target user group.
Continue to add subject areas, user groups, and informational capabilities to the architecture based on the organization's requirements for information, not on technology.
Improvements are made from what was learned in previous increments.
Improvements are made from what was learned about warehouse operation and support.
The technical environment may have changed.
Results are seen very quickly after each iteration.
End-user requirements are refined after each iteration.

A data warehouse is populated through a series of steps that:
1) Move data from the source environment (extract).
2) Change the data to have the desired warehouse characteristics, such as subject-orientation and time-variance (transform).
3) Place the data into a target environment (load).
This process is represented by the acronym ETL (Extract, Transform and Load).
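The three ETL steps above, loading into a star schema of the kind described earlier, can be sketched as a minimal in-memory example with SQLite. The table and column names here are illustrative assumptions, not from any real warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A minimal star schema: one fact table referencing two dimension tables.
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
""")

# Extract: rows as they might arrive from an operational source (all strings).
source_rows = [("widget", "2004", "99.50"), ("gadget", "2004", "10.00")]

# Transform: cleanse types and resolve dimension surrogate keys.
cur.execute("INSERT INTO dim_date VALUES (1, 2004)")
product_ids = {}
for name, _, _ in source_rows:
    if name not in product_ids:
        product_ids[name] = len(product_ids) + 1
        cur.execute("INSERT INTO dim_product VALUES (?, ?)", (product_ids[name], name))

# Load: place the transformed rows into the fact table.
for name, _, amount in source_rows:
    cur.execute("INSERT INTO fact_sales VALUES (?, 1, ?)", (product_ids[name], float(amount)))

total = cur.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
```

Real ETL tools perform these same three steps at scale, with far richer cleansing and scheduling around them.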
3.1.3 Complexity of Transformation and Integration

1. The extraction of data from the operational environment to the data warehouse environment requires a change in technology.
2. The selection of data from the operational environment may be very complex.
3. Data is reformatted and cleansed. Multiple input sources of data exist. Default values need to be supplied. Summarization of data often needs to be done. The input records that must be read may have "exotic" or nonstandard formats. Data format conversion must be done. Massive volumes of input must be accounted for. Perhaps worst of all, data relationships that have been built into old legacy program logic must be understood and unraveled before those files can be used as input.

3.2 How is data transferred through ETL tools, and what is their functionality?

ETL stands for Extraction, Transformation and Loading. ETL technology can be used in developing a complete data warehouse. It involves getting the data from different sources, manipulating the flow of data through various transformations, and finally loading the data into the specified target (the data warehouse).

ETL Tool Functionality

While the selection of a database and hardware platform is a must, the selection of an ETL tool is highly recommended but not mandatory. When you evaluate ETL tools, it pays to look for the following characteristics:

Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, typical ETL tools are geared either towards strong transformation capabilities or towards strong cleansing capabilities, but they are seldom very strong in both. As a result, if you know your incoming data is going to be dirty, make sure your ETL tool has a strong cleansing capability. If you know there are going to be many different data transformations, it makes sense to pick a tool that is strong in transformation.
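Several of the transformation issues listed above (supplying default values, data format conversion, reformatting) can be sketched in a few lines. The field names and default values here are purely hypothetical:

```python
def cleanse(record, defaults):
    """Supply defaults for missing fields and normalize types/formats."""
    out = dict(defaults)  # start from the default values
    # Keep only source fields that actually carry a value.
    out.update({k: v for k, v in record.items() if v not in (None, "")})
    # Data format conversion: coerce the amount field to a number.
    out["amount"] = float(out["amount"])
    # Reformatting: normalize free-text country codes to upper case.
    out["country"] = str(out["country"]).strip().upper()
    return out

raw = {"amount": "12.5", "country": " us ", "channel": ""}
clean = cleanse(raw, defaults={"channel": "unknown", "country": "US", "amount": "0"})
```

A commercial cleansing engine adds rule libraries, matching, and auditing on top of this basic pattern.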
Ability to read directly from your data source: Each organization has a different set of data sources. Make sure the ETL tool you select can connect directly to your source data.

Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool that works with your overall metadata strategy.

Popular Tools: Data Junction, Ascential DataStage, Ab Initio, Informatica

3.3 How is OLAP used in the process of analysis, and what is its functionality?

OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer as well as front-end flexibility. Those are typically difficult features for any home-built system to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than create one from scratch.

OLAP Tool Functionality

Before we discuss OLAP tool selection criteria, we must first distinguish between the two types of OLAP tools, MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).

MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (the data warehouse). When a user generates a report request, the MOLAP tool can produce the report quickly because all data is already pre-aggregated within the cube.

ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationships between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.
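The "smart SQL generator" role of a ROLAP engine described above can be sketched as a function that turns a choice of dimensions and measures into a GROUP BY statement. The table and column names are assumptions for the example:

```python
def generate_sql(fact_table, dimensions, measures):
    """Build an aggregate query the way a ROLAP engine might."""
    select_cols = dimensions + [f"SUM({m}) AS total_{m}" for m in measures]
    sql = f"SELECT {', '.join(select_cols)} FROM {fact_table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql

query = generate_sql("fact_sales", ["region", "year"], ["amount"])
```

A real ROLAP designer layer adds joins to the dimension tables, hierarchy navigation, and aggregate-table awareness, but the core idea is the same: the user picks dimensions and measures, and the engine writes the SQL.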
Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionality in their tools; MOLAP vendors recognize that it is often necessary to drill down to the most detailed level of information, levels that traditional cubes do not reach for performance and size reasons.

So what are the criteria for evaluating OLAP vendors? Here they are:

Ability to leverage parallelism supplied by the RDBMS and hardware: This greatly increases the tool's performance and helps load the data into the cubes as quickly as possible.

Performance: In addition to leveraging parallelism, the tool itself should be quick, both in terms of loading the data into the cube and reading the data from the cube.

Customization efforts: More and more, OLAP tools are used as advanced reporting tools. This is because in many cases, especially for ROLAP implementations, OLAP tools can serve as reporting tools. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.

Security features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor's team come in and verify that the two are compatible.

Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.
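The pre-aggregation that lets MOLAP tools answer report requests quickly can be illustrated with a toy cube: every combination of dimensions is totaled once up front, so a report request becomes a dictionary lookup rather than a scan. The data below is invented for illustration:

```python
from itertools import combinations
from collections import defaultdict

# Tiny fact set: (region, year, amount)
rows = [("East", "2003", 100), ("East", "2004", 150), ("West", "2003", 200)]
dims = ("region", "year")

# Pre-aggregate every subset of dimensions into one "cube" of totals.
cube = defaultdict(int)
for r in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), r):
        for row in rows:
            key = (subset, tuple(row[i] for i in subset))
            cube[key] += row[2]

# A "report request" is now a cheap lookup, not a table scan.
grand_total = cube[((), ())]          # total over all rows
east_total = cube[((0,), ("East",))]  # total for region = East
```

The trade-off is exactly the one the vendors face: precomputing all combinations makes reads fast but the cube large, which is why traditional cubes stop short of the most detailed level.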
Popular Tools: Business Objects, Cognos, Hyperion, Microsoft Analysis Services, MicroStrategy

3.4 How important are reporting tools, how are they used as part of analysis, and what is their functionality?

There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs depends heavily on the type of requirements. Typically, the determination is based on the following:

Number of reports: The higher the number of reports, the more likely it is that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering reusable components), but also because they already have report management systems that make maintenance and support easier.

Desired report distribution mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it makes sense to invest in a third-party reporting tool that already comes packaged with these distribution modes.

Ad hoc report creation: Will users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know which features are important to users who create ad hoc reports. A second reason is that the ability to allow ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch.

Reporting Tool Functionalities

Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.
Most OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined reports or create ad hoc reports, and there are also several dedicated reporting tool vendors. Either way, pay attention to the following points when evaluating reporting tools:

Data source connection capabilities: In general there are two types of data sources: one is the relational database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you will want both. Many tool vendors will tell you that they offer both options, but upon closer inspection it may turn out that the tool is especially good for one type, while connecting to the other type becomes a difficult exercise in programming.

Scheduling and distribution capabilities: In a realistic data warehousing usage scenario, all that senior executives have time for is to come in on Monday morning and look at the most important numbers from the previous week (say, the sales numbers); that is how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they never touch those features. Based on this scenario, the reporting tool must have scheduling and distribution capabilities: weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or by web publishing. Various vendors claim they can distribute reports through many interfaces, but in my experience the only ones that really matter are delivery via email and publishing over the intranet.

Security features: Because reporting tools, like OLAP tools, are geared towards a number of users, making sure people see only what they are supposed to see is important. Security can reside at the report level, folder level, column level, row level, or even the individual cell level.
By and large, all established reporting tools have these capabilities. Furthermore, they have a security layer that can interact with common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor's team come in and verify that the two are compatible.

Customization: Every one of us has experienced the frustration of spending an inordinate amount of time tinkering with some office productivity tool just to make a report or presentation look good. This is a waste of time, but unfortunately it is a necessary evil. In fact, analysts will often want to take a report directly out of the reporting tool and place it in presentations or reports for their bosses. If the reporting tool offers an easy way to pre-set reports to look exactly the way the corporate standard dictates, it makes the analysts' jobs much easier, and the time savings are tremendous.

Export capabilities: The most common export needs are to Excel, to a flat file, and to PDF, and a good reporting tool must be able to export to all three formats. For Excel, if the situation warrants it, you will want to verify that the report formatting, not just the data itself, is exported to Excel. This can often be a time-saver.

Popular Tools: Business Objects (Crystal Reports), Cognos, Actuate

Informatica is one of several tools that support ETL technology. Business Objects and Cognos support both OLAP and reporting features.

Informatica

Informatica supports client/server technology. The Informatica product comes in two editions, Desktop and Enterprise: PowerMart is the Desktop edition, and PowerCenter is the Enterprise edition.