Data warehousing
1. Data warehouse – A technology that provides the enterprise with access to historical data for analysis
and decision making.
It collects and stores integrated sets of historical data from multiple operational systems and feeds them
to one or more data marts. It may also provide end-user access to support enterprise views of data. It is
also said to be a single integrated source of data for processing information.
The data warehouse supports specific characteristics that include the following:
Subject-Oriented: Information is presented according to specific subjects or areas of interest, not simply
as computer files.
De-Normalized: A data warehouse should support a de-normalized architecture.
Integrated: A single source of information for and about understanding multiple areas of interest. The
data warehouse provides one-stop shopping and contains information about a variety of subjects.
Non-Volatile: Stable information that doesn’t change each time an operational process is executed.
Information is consistent regardless of when the warehouse is accessed.
Time-Variant: Containing a history of the subject, as well as current information. Historical information
is an important component of a data warehouse.
Data Cleansing: Converting data types and values from the various source systems into a general, consistent form.
Data Mart: A data structure that is optimized for access. It is designed to facilitate end-user analysis of
data. It typically supports a single, analytic application used by a distinct set of workers.
Staging Area: Any data store that is designed primarily to receive data into a warehousing environment.
OLAP (On-Line Analytical Processing): A method by which multidimensional analysis occurs; the data
warehouse supports this conceptual model.
Multidimensional Analysis: The ability to manipulate information by a variety of relevant categories or
“dimensions” to facilitate analysis and understanding of the underlying data. It is also sometimes referred
to as “drilling-down”, “drilling-across” and “slicing and dicing”.
Multidimensional Database: Also known as MDDB or MDDBS. A class of proprietary, non-relational
database management tools that store and manage data in a multidimensional manner, as opposed to the
two dimensions associated with traditional relational database management systems.
Schema: A data warehouse basically supports two types of schema: the star schema and the snowflake schema.
Star Schema: A means of aggregating data based on a set of known dimensions. It stores data
multidimensionally in a two-dimensional Relational Database Management System (RDBMS),
such as Oracle. (A small sketch at the end of this list makes the star and snowflake layouts concrete.)
Snowflake Schema: An extension of the star schema by means of applying additional dimensions
to the dimensions of a star schema in a relational environment.
Hypercube: A means of visually representing multidimensional data.
OLAP Tools: A set of software products that attempt to facilitate multidimensional analysis. Can
incorporate data acquisition, data access, data manipulation, or any combination thereof.
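To make the star schema above concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_date, dim_product) are invented for this example, not taken from any particular warehouse.

```python
import sqlite3

# A minimal star schema: one fact table referencing de-normalized dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    month     INTEGER,
    year      INTEGER
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT,
    category    TEXT  -- de-normalized: category stored directly on the dimension
);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date (date_key),
    product_key INTEGER REFERENCES dim_product (product_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")
```

A snowflake version of the same design would pull category out of dim_product into its own dim_category table referenced by key, i.e. apply an additional dimension to a dimension of the star.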
2. Difference between Data Warehouse and Operational Database
The data warehouse is distinctly different from the operational database in its day-to-day usage and
operational maintenance:
Operational Database                                        | Warehouse Database
------------------------------------------------------------|-----------------------------------------------------
Application oriented                                        | Subject oriented
Detailed                                                    | Summarized
Accurate, as of the moment of access                        | Represents values over time (snapshots)
Can be updated                                              | Cannot be updated
Serves the clerical community                               | Serves the managerial community
Requirements for processing understood before initial development | Requirements for processing not completely understood before development
Performance sensitive (immediate response required when entering a transaction) | Not performance sensitive (immediacy not required)
Transaction driven                                          | Analysis driven
Control of update a major concern                           | Control of update is not an issue
Current information                                         | More historical information
Non-redundant                                               | Supports redundancy
Small amount of data used in a process                      | Large amount of data used in a process
Static structure                                            | Flexible structure
Limited number of data elements for a single record         | Many records of many data elements
3.1 The Data Warehousing Process
3.1.1 Determine Informational Requirements
 Identify and analyze existing informational capabilities.
 Identify from key users the significant business questions and key metrics that the target users require.
 Decompose these metrics into their component parts with specific definitions.
 Map the component parts to the informational model and systems of record.
3.1.2 Evolutionary and Iterative Development Process
When you begin to develop your first data warehouse increment, the architecture is new and
fresh. With the second and subsequent increments, the following is true:
 Start with one subject area (or a subset or superset) and one target user group.
 Continue to add subject areas, user groups and informational capabilities to the architecture
based on the organization’s requirements for information, not technology.
 Improvements are made from what was learned from previous increments.
 Improvements are made from what was learned about warehouse operation and support.
 The technical environment may have changed.
 Results are seen very quickly after each iteration.
 The end-user requirements are refined after each iteration.
A data warehouse is populated through a series of steps that:
1) Move data from the source environment (extract).
2) Change the data to have desired warehouse characteristics like subject-orientation and time-variance
(transform).
3) Place the data into a target environment (load).
This process is represented by the acronym ETL (for Extract, Transform and Load).
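A minimal sketch of the three steps in Python, assuming a hypothetical CSV extract file (daily_sales.csv with region and amount columns) and a SQLite target; every name here is a placeholder for illustration.

```python
import csv
import sqlite3
from datetime import date

def extract(path):
    # 1) Extract: read raw rows from the operational source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # 2) Transform: add warehouse characteristics, e.g. time-variance
    #    (a load date) and a standardized subject-area code.
    for row in rows:
        row["load_date"] = date.today().isoformat()
        row["region"] = row["region"].strip().upper()
    return rows

def load(rows, conn):
    # 3) Load: place the data into the target environment.
    conn.execute("CREATE TABLE IF NOT EXISTS sales_stage "
                 "(region TEXT, amount REAL, load_date TEXT)")
    conn.executemany("INSERT INTO sales_stage VALUES (:region, :amount, :load_date)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("daily_sales.csv")), conn)
```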
3.1.3 Complexity of Transformation and Integration
1. The extraction of data from the operational environment to the data warehouse environment requires
a change in technology.
2. The selection of data from the operational environment may be very complex.
3. Data is reformatted.
4. Data is cleansed.
5. Multiple input sources of data exist.
6. Default values need to be supplied.
7. Summarization of data often needs to be done.
8. The input records that must be read have “exotic” or nonstandard formats.
9. Data format conversion must be done.
10. Massive volumes of input must be accounted for.
11. Perhaps the worst of all: data relationships that have been built into old legacy program logic
must be understood and unraveled before those files can be used as input.
3.2 How is the transfer of data done through ETL tools, and what is their functionality?
ETL stands for Extraction, Transformation and Loading.
ETL technology can be used in developing a complete data warehousing database. It involves getting
the data from different sources, manipulating the flow of data through various transformations, and
finally loading or sending the data to the specified target (the data warehouse).
ETL Tool Functionality
While the selection of a database and hardware platform is a must, the selection of an ETL tool is highly
recommended, but it's not a must. When you evaluate ETL tools, it pays to look for the following characteristics:
Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general, the
typical ETL tools are either geared towards having strong transformation capabilities or having strong cleansing
capabilities, but they are seldom very strong in both. As a result, if you know your data is going to be dirty coming
in, make sure your ETL tool has a strong cleansing capability. If you know there are going to be a lot of different
data transformations, it then makes sense to pick a tool that is strong in transformation. (A small sketch after this
list illustrates the distinction.)
Ability to read directly from your data source: For each organization, there is a different set of data sources.
Make sure the ETL tool you select can connect directly to your source data.
Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to the
destination, which is an important piece of the metadata. In fact, some organizations have come to rely on the
documentation of their ETL tool as their metadata source. As a result, it is very important to select an ETL tool
that works with your overall metadata strategy.
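As promised above, here is a small hand-rolled sketch of cleansing versus transformation; every field name and rule in it is hypothetical.

```python
# Cleansing: repairing dirty incoming values (hypothetical rules).
def cleanse(row):
    row["phone"] = "".join(ch for ch in row["phone"] if ch.isdigit())  # strip punctuation
    row["state"] = row["state"].strip().upper() or "UNKNOWN"           # supply a default value
    return row

# Transformation: reshaping already-clean values for the warehouse.
def transform(row):
    row["full_name"] = f'{row["first_name"]} {row["last_name"]}'       # merge fields
    row["revenue_band"] = "HIGH" if float(row["revenue"]) > 1e6 else "LOW"
    return row

dirty = {"phone": "(555) 123-4567", "state": " ca ", "first_name": "Ada",
         "last_name": "Lovelace", "revenue": "2500000"}
print(transform(cleanse(dirty)))
```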
Popular Tools
 Data Junction
 Ascential DataStage
 Ab Initio
 Informatica
3.3 How is OLAP used in the process of analysis, and what is its functionality?
OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as
well as front-end flexibility. Those are typically difficult features for any home-built systems to achieve. Therefore,
my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to
purchase an existing OLAP tool rather than creating one from scratch.
OLAP Tool Functionality
Before we speak about OLAP tool selection criteria, we must first distinguish between the two types of OLAP
tools: MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).
MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user
generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated
within the cube.
ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially
acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data
warehouse administrator can specify the relationship between the relational tables, as well as how dimensions,
attributes, and hierarchies map to the underlying database tables.
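The difference can be sketched in a few lines of Python: a MOLAP-style engine answers requests from a pre-aggregated cube, while a ROLAP-style engine generates SQL against the relational tables at request time. The dimension, measure, and table names are invented for illustration.

```python
# Hypothetical detail rows: (region, year, revenue).
rows = [("EAST", 2003, 120.0), ("EAST", 2004, 150.0), ("WEST", 2003, 90.0)]

# MOLAP style: pre-aggregate every (region, year) cell into a cube up front,
# so a report request becomes a simple lookup.
cube = {}
for region, year, revenue in rows:
    cube[(region, year)] = cube.get((region, year), 0.0) + revenue
print(cube[("EAST", 2003)])  # answered instantly from the pre-aggregated cube

# ROLAP style: no cube; act as a "smart SQL generator" and let the
# relational database do the aggregation at query time.
def rolap_query(dimensions, measure, fact_table):
    dims = ", ".join(dimensions)
    return f"SELECT {dims}, SUM({measure}) FROM {fact_table} GROUP BY {dims}"

print(rolap_query(["region", "year"], "revenue", "fact_sales"))
```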
Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize
that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors
recognize that it is often necessary to drill down to the most detailed level of information, levels that traditional
cubes do not reach for performance and size reasons.
So what are the criteria for evaluating OLAP vendors? Here they are:
Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool's
performance, and helps load the data into the cubes as quickly as possible.
Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data
into the cube and reading the data from the cube.
Customization efforts: More and more, OLAP tools are used as advanced reporting tools, especially in ROLAP
implementations. In such cases, the ease of front-end customization becomes an important factor in the tool
selection process.
Security Features: Because OLAP tools are geared towards a number of users, making sure people see only
what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can
interact with the common corporate login protocols. There are, however, cases where large corporations have
developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a
seamless integration between the tool and the in-house authentication can require some work. I would recommend
that you have the tool vendor team come in and make sure that the two are compatible.
Metadata support: Because the OLAP tool aggregates the data into the cube and sometimes serves as the
front-end tool, it is essential that it works with the metadata strategy/tool you have selected.
Popular Tools
 Business Objects
 Cognos
 Hyperion
 Microsoft Analysis Services
 MicroStrategy
3.4 How important are reporting tools, how are they used as part of analysis, and what is their
functionality?
There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business
intelligence needs is also heavily dependent on the type of requirements. Typically, the determination is based on
the following:
 Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good
idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable
components), but also because they already have report management systems to make maintenance and
support functions easier.
 Desired Report Distribution Mode: If the reports will only be distributed in a single mode (for example,
email only, or over the browser only), we should then strongly consider the possibility of building the
reporting tool from scratch. However, if users will access the reports through a variety of different
channels, it would make sense to invest in a third-party reporting tool that already comes packaged with
these distribution modes.
 Ad Hoc Report Creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea
to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the
features that are important to users who are creating ad hoc reports. A second reason is that the ability to
allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to
come up with a metadata model when building a reporting tool from scratch.
Reporting Tool Functionalities
Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high
importance.
Most of the OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined
reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the
following points when evaluating reporting tools:
Data source connection capabilities
In general there are two types of data sources: one is the relational database, the other is the OLAP
multidimensional data source. Nowadays, chances are good that you will want to have both. Many tool vendors
will tell you that they offer both options, but upon closer inspection, it is possible that the tool vendor is especially
good for one type, while connecting to the other type of data source becomes a difficult exercise in programming.
Scheduling and distribution capabilities
In a realistic data warehousing usage scenario, all that senior executives have time for is to come in on Monday
morning, look at the most important weekly numbers from the previous week (say, the sales numbers), and that's
how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest
them, because they never touch those features.
Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports
are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either
by email or web publishing. There are claims by various vendors that they can distribute reports through various
interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over
the intranet.
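As a sketch of what the scheduled email delivery amounts to under the hood (the SMTP host, addresses, report path, and the cron line in the comment are all placeholders):

```python
import smtplib
from email.message import EmailMessage

def distribute_weekly_report(report_path, recipients):
    # Scheduled for Monday morning, e.g. via cron: 0 7 * * 1
    msg = EmailMessage()
    msg["Subject"] = "Weekly sales report"
    msg["From"] = "warehouse@example.com"            # placeholder address
    msg["To"] = ", ".join(recipients)
    msg.set_content("Attached: the weekly numbers for the previous week.")
    with open(report_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="application",
                           subtype="pdf", filename="weekly_sales.pdf")
    with smtplib.SMTP("mail.example.com") as smtp:   # placeholder SMTP host
        smtp.send_message(msg)
```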
Security Features: Because reporting tools, similar to OLAP tools, are geared towards a number of users, making
sure people see only what they are supposed to see is important. Security can reside at the report level, folder
level, column level, row level, or even individual cell level. By and large, all established reporting tools have these
capabilities. Furthermore, they have a security layer that can interact with the common corporate login protocols.
There are, however, cases where large corporations have developed their own user authentication mechanism
and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the
in-house authentication can require some work. I would recommend that you have the tool vendor team come in and
make sure that the two are compatible.
Customization
Every one of us has experienced the frustration of spending an inordinate amount of time tinkering with some
office productivity tool just to make a report or presentation look good. This is definitely a waste of time, but
unfortunately it is a necessary evil. In fact, analysts will often wish to take a report directly out of the reporting
tool and place it in their presentations or reports to their bosses. If the reporting tool offers an easy way to pre-set
reports to look exactly the way that adheres to the corporate standard, it makes the analysts' jobs much easier,
and the time savings are tremendous.
Export Capabilities
The most common export needs are to Excel, to a flat file, and to PDF, and a good report tool must be able to
export to all three formats. For Excel, if the situation warrants it, you will want to verify that the reporting format, not
just the data itself, will be exported out to Excel. This can often be a time-saver.
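A quick illustration of the flat-file and Excel exports using pandas (an assumption; any report tool's own export facility would replace this). Note that DataFrame.to_excel writes the data only, not the report formatting, which is exactly the distinction this paragraph warns about.

```python
import pandas as pd

# Hypothetical finished report as a DataFrame.
report = pd.DataFrame({"region": ["EAST", "WEST"], "revenue": [270.0, 90.0]})

report.to_csv("weekly_sales.csv", index=False)     # flat file export
report.to_excel("weekly_sales.xlsx", index=False)  # Excel export: data only,
                                                   # formatting is not preserved
```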
Popular Tools
 Business Objects (Crystal Reports)
 Cognos
 Actuate
Informatica is one of several tools that support ETL technology.
Business Objects and Cognos support both OLAP and reporting features.
Informatica
Informatica supports client/server technology. The Informatica product comes in two editions, Desktop and Enterprise:
Power Mart is the Desktop Edition.
Power Center is the Enterprise Edition.