ANJUMAN-I-ISLAM'S
ALLANA INSTITUTE OF MANAGEMENT STUDIES

Management Information System

Data Mining and Data Warehousing
Data mining is the extraction of useful patterns from data sources such as databases, texts, the web and images. Data warehousing is a single, complete and consistent store of data, obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context.
Data Warehousing & Data Mining

Submitted by:

Name             Roll No
Mohsin Sayed     43
Shafat Ali       44
Arshad Shaikh    46
Maqsud Shaikh    47
Saif Shaikh      48
Academic Year: 2012-2013
Under the guidance of Prof. Awesh Bhornya
Date of Submission: 16th February 2013

Anjuman-I-Islam's
Allana Institute of Management Studies
Badruddin Tyabji Marg, Off. 92,
Dr. D. N. Road, Opp. CST,
Mumbai – 400 001
CERTIFICATE

This is to certify that students from the 'A' division of Anjuman-I-Islam's Allana Institute of Management Studies (AIAIMS), pursuing the first year of MMS, have completed the dissertation project on "Data Warehousing and Data Mining" in the Academic Year 2012-2013.

Date: ______________
Place: ______________

Dr. Lukman Patel                      Prof. Awesh Bhornya
Director – AIAIMS                     Project Guide
ACKNOWLEDGEMENT

A project cannot be said to be the work of an individual. A project is a combination of the views, ideas, suggestions and contributions of many people. We are extremely thankful to our project guide, Prof. Awesh Bhornya, for giving us valuable guidance, helping us throughout this project and giving it his special attention.

We wish to thank all the people who helped and assisted us whenever and wherever we needed it, by giving us their precious time and valuable suggestions.

We also wish to thank all the respondents who gave some of their valuable time to fill up the questionnaires, without which the project study would not have been a success.
Index

Data Warehousing
 • History
 • What is a Data Warehouse?
 • Subject Oriented
 • Integrated
 • Non-volatile
 • Time Variant
 • Benefits of a Data Warehouse
 • Key developments in early years of data warehousing
 • Dimensional v/s Normalized
 • Data warehouses v/s operational systems
 • Operational Systems v/s Data Warehousing Systems
 • Evolution in organization use
 • Data Warehouse Architecture
 • Data Warehouse Architecture components
 • Types of Data Warehouse Architectures

Data Mining
 • Overview
 • The Foundations of Data Mining
 • The Scope of Data Mining
 • Databases can be larger in depth and breadth
 • How Data Mining Works
 • Architecture of Data Mining
 • Components of Data Mining
 • Integration of a data mining system with a database or data warehouse system

Conclusion
Bibliography
Data Warehousing
History
The concept of data warehousing dates back to the late 1980s when IBM researchers
Barry Devlin and Paul Murphy developed the "business data warehouse". In essence, the data
warehousing concept was intended to provide an architectural model for the flow of data
from operational systems to decision support environments. The concept attempted to address
the various problems associated with this flow, mainly the high costs associated with it. In the
absence of a data warehousing architecture, an enormous amount of redundancy was required
to support multiple decision support environments.
In larger corporations it was typical for multiple decision support environments to
operate independently. Though each environment served different users, they often required
much of the same stored data. The process of gathering, cleaning and integrating data from
various sources, usually from long-term existing operational systems (usually referred to as
legacy systems), was typically in part replicated for each environment. Moreover, the
operational systems were frequently reexamined as new decision support requirements
emerged. Often new requirements necessitated gathering, cleaning and integrating new data
from "data marts" that were tailored for ready access by users.
What is a Data Warehouse?
In computing, a data warehouse or enterprise data warehouse (DW, DWH, or EDW)
is a database used for reporting and data analysis. It is a central repository of data which is
created by integrating data from one or more disparate sources. Data warehouses store current
as well as historical data and are used for creating trending reports for senior management
reporting such as annual and quarterly comparisons.
The data stored in the warehouse are uploaded from the operational systems (such as marketing and sales). The data may pass through an operational data store for additional operations before they are used in the DW for reporting.
The typical ETL-based data warehouse uses staging, data integration, and access
layers to house its key functions. The staging layer or staging database stores raw data
extracted from each of the disparate source data systems. The integration layer integrates the
disparate data sets by transforming the data from the staging layer often storing this
transformed data in an operational data store (ODS) database. The integrated data are then
moved to yet another database, often called the data warehouse database, where the data is
arranged into hierarchical groups often called dimensions and into facts and aggregate facts.
The combination of facts and dimensions is sometimes called a star schema. The access layer
helps users retrieve data.
A data warehouse constructed from an integrated data source system does not require
ETL, staging databases, or operational data store databases. The integrated data source
systems may be considered to be a part of a distributed operational data store layer. Data
federation methods or data virtualization methods may be used to access the distributed
integrated source data systems to consolidate and aggregate data directly into the data
warehouse database tables. Unlike the ETL-based data warehouse, the integrated source data
systems and the data warehouse are all integrated since there is no transformation of
dimensional or reference data. This integrated data warehouse architecture supports the drill
down from the aggregate data of the data warehouse to the transactional data of the integrated
source data systems.
Data warehouses can be subdivided into data marts. Data marts store subsets of data
from a warehouse. This definition of the data warehouse focuses on data storage. The main
source of the data is cleaned, transformed, cataloged and made available for use by managers
and other business professionals for data mining, online analytical processing, market
research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and
analyze data, to extract, transform and load data, and to manage the data dictionary are also
considered essential components of a data warehousing system. Many references to data
warehousing use this broader context. Thus, an expanded definition for data warehousing
includes business intelligence tools, tools to extract, transform and load data into the
repository, and tools to manage and retrieve metadata.
Data warehousing provides a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context. Getting a good high-level understanding of what it takes to implement a successful data warehouse project is valuable to both clients and vendors of business intelligence.
A data warehouse is a relational database that is designed for query and analysis
rather than for transaction processing. It usually contains historical data derived from
transaction data, but it can include data from other sources. It separates analysis workload
from transaction workload and enables an organization to consolidate data from several
sources.
In addition to a relational database, a data warehouse environment includes an
extraction, transportation, transformation, and loading (ETL) solution, an online analytical
processing (OLAP) engine, client analysis tools, and other applications that manage the
process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
• Subject Oriented
• Integrated
• Non-volatile
• Time Variant
Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
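As a minimal illustration of such a subject-oriented query, the sketch below builds a tiny sales table with Python's built-in sqlite3 module and asks who the best customer for a given item was last year; the table, column names and figures are hypothetical examples, not part of any particular warehouse.

import sqlite3

# Hypothetical, minimal "sales" subject area; all names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, sale_year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Acme Ltd", "Widget", 2012, 500.0),
     ("Bharat Traders", "Widget", 2012, 750.0),
     ("Acme Ltd", "Gadget", 2012, 300.0)])

# "Who was our best customer for this item last year?"
best = conn.execute(
    """SELECT customer, SUM(amount) AS total
       FROM sales
       WHERE item = 'Widget' AND sale_year = 2012
       GROUP BY customer
       ORDER BY total DESC
       LIMIT 1""").fetchone()
print(best)   # ('Bharat Traders', 750.0)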
Integrated: Integration is closely related to subject orientation. Data warehouses must put
data from disparate sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve this, they
are said to be integrated.
Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not
change. This is logical because the purpose of a warehouse is to enable you to analyze what
has occurred.
Time Variant: In order to discover trends in business, analysts need large amounts of
data. This is very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A data
warehouse's focus on change over time is what is meant by the term time variant.
Benefits of a Data warehouse:
A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:
• Maintain data history, even if the source transaction systems do not.
• Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.
• Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.
• Present the organization's information consistently.
• Provide a single common data model for all data of interest regardless of the data's source.
• Restructure the data so that it makes sense to the business users.
• Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.
• Add value to operational business applications, notably customer relationship management (CRM) systems.
Generic data warehouse environment:
The environment for data warehouses and marts includes the following:
• Source systems that provide data to the warehouse or mart;
• Data integration technology and processes that are needed to prepare the data for use;
• Different architectures for storing data in an organization's data warehouse or data marts;
• Different tools and applications for the variety of users;
• Metadata, data quality, and governance processes that must be in place to ensure that the warehouse or mart meets its purposes.
In regard to the source systems listed above, Rainer states, "A common source for the data in data warehouses is the company's operational databases, which can be relational databases."

Regarding data integration, Rainer states, "It is necessary to extract data from source systems, transform them, and load them into a data mart or warehouse."

Rainer discusses storing data in an organization's data warehouse or data marts: "There are a variety of possible architectures to store decision-support data." Metadata are data about data: "IT personnel need information about data sources; database, table, and column names; refresh schedules; and data usage measures." Today, the most successful companies are those that can respond quickly and flexibly to market changes and opportunities. A key to this response is the effective and efficient use of data and information by analysts and managers. A "data warehouse" is a repository of historical data that are organized by subject to support decision makers in the organization. Once data are stored in a data mart or warehouse, they can be accessed.
Key developments in the early years of data warehousing were:
• 1960s — General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.
• 1970s — ACNielsen and IRI provide dimensional data marts for retail sales.
• 1970s — Bill Inmon begins to define and discuss the term Data Warehouse.
• 1975 — Sperry Univac introduces MAPPER (Maintain, Prepare, and Produce Executive Reports), a database management and reporting system that includes the world's first 4GL. It was the first platform specifically designed for building Information Centers (a forerunner of contemporary enterprise data warehousing platforms).
• 1983 — Teradata introduces a database management system specifically designed for decision support.
• 1983 — Martyn Richard Jones of Sperry Corporation defines the Sperry Information Center approach, which, while not a true DW in the Inmon sense, did contain many of the characteristics of DW structures and processes as defined previously by Inmon, and later by Devlin. It was first used at the TSB England & Wales.
• 1984 — Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases Data Interpretation System (DIS). DIS was a hardware/software package and GUI for business users to create a database management and analytic system.
• 1988 — Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" in IBM Systems Journal, where they introduce the term "business data warehouse".
• 1990 — Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system specifically for data warehousing.
• 1991 — Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data warehouse.
• 1992 — Bill Inmon publishes the book Building the Data Warehouse.
• 1995 — The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
• 1996 — Ralph Kimball publishes the book The Data Warehouse Toolkit.
• 2000 — Daniel Linstedt releases the Data Vault, enabling real-time, auditable data warehouses.
Dimensional V/S Normalized Approach for Storage of Data
There are two leading approaches to storing data in a data warehouse — the
dimensional approach and the normalized approach.
Supporters of the dimensional approach, referred to as "Kimballites", believe in Ralph Kimball's approach, in which it is stated that the data warehouse should be modeled using a dimensional model/star schema. Supporters of the normalized approach, also called the 3NF model, referred to as "Inmonites", believe in Bill Inmon's approach, in which it is stated that the data warehouse should be modeled using an E-R model/normalized model.
In a dimensional approach, transaction data are partitioned into "facts", which are
generally numeric transaction data, and "dimensions", which are the reference information
that gives context to the facts. For example, a sales transaction can be broken up into facts
such as the number of products ordered and the price paid for the products, and into
dimensions such as order date, customer name, product number, order ship-to and bill-to
locations, and salesperson responsible for receiving the order.
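To make the split concrete, the tiny Python sketch below partitions one hypothetical sales transaction record into numeric facts and descriptive dimensions; every field name here is invented for illustration.

# Hypothetical sales transaction record (illustrative field names only).
transaction = {
    "products_ordered": 3,          # numeric measurement (fact)
    "price_paid": 450.0,            # numeric measurement (fact)
    "order_date": "2012-11-05",     # context (dimension)
    "customer_name": "Acme Ltd",    # context (dimension)
    "product_number": "P-1001",     # context (dimension)
    "ship_to": "Mumbai",            # context (dimension)
    "salesperson": "S. Ali",        # context (dimension)
}

FACT_FIELDS = {"products_ordered", "price_paid"}

# Partition the record into facts (measurements) and dimensions (context).
facts = {k: v for k, v in transaction.items() if k in FACT_FIELDS}
dimensions = {k: v for k, v in transaction.items() if k not in FACT_FIELDS}
print("facts:", facts)
print("dimensions:", dimensions)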
A key advantage of a dimensional approach is that the data warehouse is easier for the
user to understand and to use. Also, the retrieval of data from the data warehouse tends to
operate very quickly. Dimensional structures are easy to understand for business users,
because the structure is divided into measurements/facts and context/dimensions. Facts are
related to the organization's business processes and operational system whereas the
dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008).
The main disadvantages of the dimensional approach are:
1. In order to maintain the integrity of facts and dimensions, loading the data warehouse
with data from different operational systems is complicated, and
2. It is difficult to modify the data warehouse structure if the organization adopting the
dimensional approach changes the way in which it does business.
In the normalized approach, the data in the data warehouse are stored following, to a
degree, database normalization rules. Tables are grouped together by subject areas that
reflect general data categories (e.g., data on customers, products, finance, etc.). The
normalized structure divides data into entities, which creates several tables in a relational
database. When applied in large enterprises the result is dozens of tables that are linked
together by a web of joins. Furthermore, each of the created entities is converted into separate
physical tables when the database is implemented (Kimball, Ralph 2008). The main
advantage of this approach is that it is straightforward to add information into the database. A
disadvantage of this approach is that, because of the number of tables involved, it can be
difficult for users both to:
1. Join data from different sources into meaningful information and then
2. Access the information without a precise understanding of the sources of data and of
the data structure of the data warehouse.
It should be noted that both normalized and dimensional models can be represented in
entity-relationship diagrams as both contain joined relational tables. The difference between
the two models is the degree of normalization.
These approaches are not mutually exclusive, and there are other approaches.
Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).
In Information-Driven Business (Wiley 2010), Robert Hillard proposes an approach to
comparing the two approaches based on the information needs of the business problem. The
technique shows that normalized models hold far more information than their dimensional
equivalents (even when the same fields are used in both models) but this extra information
comes at the cost of usability. The technique measures information quantity in terms of
Information Entropy and usability in terms of the Small Worlds data transformation measure.
Advantages of Data Warehousing
• Potential high return on investment
• Competitive advantage
• Increased productivity of corporate decision makers
Data warehouses versus operational systems
The major task of on-line operational database systems is to perform on-line
transaction and query processing. These systems are called on-line transaction processing
(OLTP) systems.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of the different users. These systems are known as on-line analytical processing (OLAP) systems. A major reason for the separation between the operational database and the data warehouse is to help promote the high performance of both systems.
Comparison between OLTP and OLAP

Feature           OLTP                                 OLAP
Characteristic    Operational processing               Informational processing
Orientation       Transaction                          Analysis
Function          Day-to-day operations                Long-term informational requirements
Design            Application oriented                 Subject oriented
Access            Read and write                       Mostly read
Data accessed     Tens of records                      Millions of records
View              Detailed                             Summarized
Priority          High performance and availability    High flexibility
User              Clerk, DBA                           Knowledge worker
Size              100 MB to GB                         100 GB to TB
Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow the Codd rules of database normalization in order to ensure data integrity. Codd defined five increasingly stringent rules
of normalization. Fully normalized database designs (that is, those satisfying all five Codd
rules) often result in information from a business transaction being stored in dozens to
hundreds of tables. Relational databases are efficient at managing the relationships between
these tables. The databases have very fast insert/update performance because only a small
amount of data in those tables is affected each time a transaction is processed. Finally, in
order to improve performance, older data are usually periodically purged from operational
systems.
Operational Systems v/s Data Warehousing Systems

Operational                                         Data Warehouse
Holds current data                                  Holds historic data
Data is dynamic                                     Data is largely static
Read/write accesses                                 Read-only accesses
Repetitive processing                               Ad hoc, complex queries
Transaction driven                                  Analysis driven
Application oriented                                Subject oriented
Used by clerical staff for day-to-day operations    Used by top managers for analysis
Normalized data model (ER model)                    Denormalized data model (dimensional model)
Must be optimized for writes and small queries      Must be optimized for queries involving a large portion of the warehouse
Evolution in organization use
These terms refer to the level of sophistication of a data warehouse:
• Offline operational data warehouse: Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems, and the data is stored in an integrated, reporting-oriented data structure.
• Offline data warehouse: Data warehouses at this stage are updated from data in the operational systems on a regular basis, and the data warehouse data are stored in a data structure designed to facilitate reporting.
• On-time (online integrated) data warehouse: These represent the real-time data warehouse stage, in which data in the warehouse is updated for every transaction performed on the source data.
• Integrated data warehouse: These data warehouses assemble data from different areas of the business, so users can look up the information they need across other systems.
Sample applications

Some of the applications of data warehousing include:
• Agriculture
• Biological data analysis
• Call record analysis
• Churn prediction for telecom subscribers, credit card users, etc.
• Decision support
• Financial forecasting
• Insurance fraud analysis
• Logistics and inventory management
• Trend analysis
Problems with Data Warehousing
• Underestimation of resources for data loading
• Hidden problems with source systems
• Required data not captured
• Increased end-user demands
• High maintenance
• Long-duration projects
• Complexity of integration
Data Warehouse Architecture

A typical data warehousing architecture is described below.

Data Warehouse Components & Architecture

The data in a data warehouse comes from operational systems of the organization as well as from other external sources. These are collectively referred to as source systems. The data extracted from source systems is stored in an area called the data staging area, where the data is cleaned, transformed, combined and deduplicated to prepare it for use in the data warehouse. The data staging area is generally a collection of machines where simple activities like sorting and sequential processing take place.

The data staging area does not provide any query or presentation services. As soon as a system provides query or presentation services, it is categorized as a presentation server. A presentation server is the target machine on which the data is loaded from the data staging area, organized and stored for direct querying by end users, report writers and other applications. The three different kinds of systems that are required for a data warehouse are:

1. Source systems
2. Data staging area
3. Presentation servers
The data travels from source systems to presentation servers via the data staging area. The
entire process is popularly known as ETL (extract, transform, and load) or ETT (extract,
transform, and transfer). Oracle's ETL tool is called Oracle Warehouse Builder (OWB) and
MS SQL Server's ETL tool is called Data Transformation Services (DTS).
Each component and the tasks performed by it are explained below:
OPERATIONAL DATA

The sources of data for the data warehouse are:
  o Data from mainframe systems in the traditional network and hierarchical formats.
  o Data from relational DBMSs such as Oracle and Informix.
  o In addition to these internal data, operational data also includes external data obtained from commercial databases and from databases associated with suppliers and customers.
LOAD MANAGER

The load manager performs all the operations associated with the extraction of data and its loading into the data warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse. The size and complexity of this component will vary between data warehouses, and it may be constructed using a combination of vendor data loading tools and custom-built programs.
WAREHOUSE MANAGER

The warehouse manager performs all the operations associated with the management of data in the warehouse. This component is built using vendor data management tools and custom-built programs. The operations performed by the warehouse manager include:
  o Analysis of data to ensure consistency
  o Transformation and merging of source data from temporary storage into data warehouse tables
  o Creation of indexes and views on the base tables
  o Denormalization
  o Generation of aggregations
  o Backing up and archiving of data
In certain situations, the warehouse manager also generates query profiles to determine which indexes and aggregations are appropriate.
QUERY MANAGER
The query manager performs all operations associated with the management of user queries. This component is usually constructed using vendor end-user access tools, data warehousing monitoring tools, database facilities and custom-built programs. The complexity of a query manager is determined by the facilities provided by the end-user access tools and the database.

DETAILED DATA

This area of the warehouse stores all the detailed data in the database schema. In most cases, detailed data is not stored online but is aggregated to the next level of detail. However, detailed data is added regularly to the warehouse to supplement the aggregated data.
LIGHTLY AND HIGHLY SUMMARIZED DATA

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager. This area of the warehouse is transient, as it is subject to change on an ongoing basis in order to respond to changing query profiles. The purpose of the summarized information is to speed up query performance. The summarized data is updated continuously as new data is loaded into the warehouse.
ARCHIVE AND BACKUP DATA

This area of the warehouse stores detailed and summarized data for the purposes of archiving and backup. The data is transferred to storage archives such as magnetic tapes or optical disks.
META DATA

The data warehouse also stores all the metadata (data about data) definitions used by all processes in the warehouse. It is used for a variety of purposes, including:
  o The extraction and loading process – metadata is used to map data sources to a common view of information within the warehouse.
  o The warehouse management process – metadata is used to automate the production of summary tables.
  o The query management process – metadata is used to direct a query to the most appropriate data source.
The structure of the metadata differs for each process, because the purpose is different.
END-USER ACCESS TOOLS

The principal purpose of a data warehouse is to provide information to business managers for strategic decision-making. These users interact with the warehouse using end-user access tools. Examples of end-user access tools include:
  o Reporting and query tools
  o Application development tools
  o Executive information systems tools
  o Online analytical processing tools
  o Data mining tools
THE ETL (EXTRACT, TRANSFORM, LOAD) PROCESS

In this section we discuss the four major processes of the data warehouse: extract (take data from the operational systems and bring it to the data warehouse), transform (convert the data into the internal format and structure of the data warehouse), cleanse (make sure the data is of sufficient quality to be used for decision making) and load (put the cleansed data into the data warehouse).

The four processes from extraction through loading are often referred to collectively as data staging.
EXTRACT

Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason, it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such commercial product. The user of one of these tools typically has an easy-to-use windowed interface by which to specify the following (a small configuration sketch follows the list):
  o Which files and tables are to be accessed in the source database?
  o Which fields are to be extracted from them? This is often done internally by an SQL SELECT statement.
  o What are those fields to be called in the resulting database?
  o What is the target machine and database format of the output?
  o On what schedule should the extraction process be repeated?
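As a minimal sketch of such an extraction specification, the following Python fragment uses only the standard library; every table, column and file name is a hypothetical example, not a reference to any particular extraction tool.

import sqlite3

# Hypothetical extraction specification (illustrative names only).
extract_spec = {
    "source_table": "orders",                                     # which table to access
    "fields": ["order_id", "customer", "amount"],                 # which fields to extract
    "renamed_as": ["order_id", "customer_name", "sale_amount"],   # names in the resulting database
    "target_file": "staging_orders.csv",                          # target format/location
    "schedule": "daily",                                          # how often to repeat
}

def extract(conn, spec):
    """Pull the selected fields from the source and return renamed rows."""
    cols = ", ".join(spec["fields"])
    rows = conn.execute(f"SELECT {cols} FROM {spec['source_table']}").fetchall()
    return [dict(zip(spec["renamed_as"], row)) for row in rows]

# Tiny in-memory source database standing in for an operational system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
src.execute("INSERT INTO orders VALUES (1, 'Acme Ltd', 120.0)")
print(extract(src, extract_spec))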
TRANSFORM

The operational databases were developed based on whatever priorities applied at the time, and these keep changing with requirements. Therefore, those who develop a data warehouse based on these databases are typically faced with inconsistency among their data sources. The transformation process deals with rectifying any such inconsistency.

One of the most common transformation issues is attribute naming inconsistency. It is common for a given data element to be referred to by different names in different databases; Employee Name may be EMP_NAME in one database and ENAME in another. Thus one set of data names is picked and used consistently in the data warehouse. Once all the data elements have the right names, they must be converted to common formats. The conversion may encompass the following:
  • Characters must be converted from ASCII to EBCDIC or vice versa.
  • Mixed-case text may be converted to all uppercase for consistency.
  • Numerical data must be converted into a common format.
  • Data formats have to be standardized.
  • Units of measurement may have to be converted.
  • Coded data (Male/Female, M/F) must be converted into a common format.
All these transformation activities can be automated, and many commercial products are available to perform the tasks. DataMAPPER from Applied Database Technologies is one such comprehensive tool.
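The plain-Python sketch below illustrates the kinds of transformations described above; the field names, renaming map and coding scheme are hypothetical and stand in for whatever a real tool would be configured with.

# Minimal sketch of common transformations; names and codes are hypothetical.
RENAME = {"EMP_NAME": "employee_name", "ENAME": "employee_name"}   # naming inconsistency
GENDER = {"M": "M", "F": "F", "MALE": "M", "FEMALE": "F"}          # coded data

def transform(record):
    out = {}
    for key, value in record.items():
        name = RENAME.get(key, key).lower()        # pick one consistent data name
        if isinstance(value, str):
            value = value.strip().upper()          # mixed text to uppercase
        out[name] = value
    if "gender" in out:
        out["gender"] = GENDER.get(out["gender"], out["gender"])
    return out

# Two source systems naming the same data element differently:
print(transform({"EMP_NAME": "Mohsin Sayed", "gender": "male"}))
print(transform({"ENAME": "Shafat Ali", "GENDER": "F"}))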
CLEANSING

Information quality is the key consideration in determining the value of the information. The developer of the data warehouse is not usually in a position to change the quality of the underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is, therefore, usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing.

Data cleansing must deal with many types of possible errors. These include missing data and incorrect data at one source, and inconsistent or conflicting data when two or more sources are involved. Several algorithms are used to clean the data.
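A minimal sketch of the cleansing step in plain Python follows; the rules (flag missing names, reject negative amounts, resolve a conflict between two sources) and the field names are illustrative assumptions only.

# Minimal data cleansing sketch; rules and field names are hypothetical.
def cleanse(record, other_source=None):
    errors = []
    cleaned = dict(record)
    if not cleaned.get("customer_name"):                  # missing data
        errors.append("missing customer_name")
    if cleaned.get("sale_amount", 0) < 0:                 # incorrect data
        errors.append("negative sale_amount")
        cleaned["sale_amount"] = None
    if other_source and cleaned.get("city") != other_source.get("city"):
        # conflicting data between two sources: keep the primary source value
        errors.append("conflicting city; kept primary source value")
    return cleaned, errors

record = {"customer_name": "", "sale_amount": -10, "city": "Mumbai"}
print(cleanse(record, other_source={"city": "Pune"}))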
LOADING

Loading often implies the physical movement of the data from the computer(s) storing the source database(s) to the one that will store the data warehouse database, assuming they are different. This takes place immediately after the extraction phase. The most common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is Oracle's tool for performing the ETL task on an Oracle data warehouse.
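A small sketch of the load step, assuming the cleansed and transformed rows from the previous steps and a hypothetical warehouse table in SQLite:

import sqlite3

# Minimal sketch of loading cleansed, transformed rows into a warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_sales (order_id INTEGER, customer_name TEXT, sale_amount REAL)")

cleansed_rows = [
    (1, "ACME LTD", 120.0),
    (2, "BHARAT TRADERS", 75.5),
]
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", cleansed_rows)
warehouse.commit()
print(warehouse.execute("SELECT COUNT(*), SUM(sale_amount) FROM fact_sales").fetchone())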
Types of Data Warehouse Architectures
Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:
• Data Warehouse Architecture (Basic)
• Data Warehouse Architecture (with a Staging Area)
• Data Warehouse Architecture (with a Staging Area and Data Marts)
Data Warehouse Architecture (Basic)
Figure 1-2 shows a simple architecture for a data warehouse. End users directly access
data derived from several source systems through the data warehouse.
Figure 1-2 Architecture of a Data Warehouse
In Figure 1-2, the metadata and raw data of a traditional OLTP system is present, as is
an additional type of data, summary data. Summaries are very valuable in data warehouses
because they pre-compute long operations in advance. For example, a typical data warehouse
query is to retrieve something like August sales. A summary in Oracle is called a materialized
view.
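As a small illustration of such a precomputed summary (the idea behind what Oracle calls a materialized view, shown here only as a plain summary table in SQLite), the sketch below aggregates a hypothetical detail table once so that the "August sales" question becomes a cheap lookup; all names and values are invented.

import sqlite3

# Hypothetical detail table plus a precomputed monthly summary table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (sale_month TEXT, product TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2012-08", "Widget", 100.0),
    ("2012-08", "Widget", 150.0),
    ("2012-08", "Gadget", 80.0),
])

# The long aggregation is computed once in advance...
db.execute("""CREATE TABLE sales_by_month AS
              SELECT sale_month, SUM(amount) AS total_sales
              FROM sales GROUP BY sale_month""")

# ...so retrieving "August sales" is a simple lookup against the summary.
print(db.execute(
    "SELECT total_sales FROM sales_by_month WHERE sale_month = '2012-08'").fetchone())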
Data Warehouse Architecture (with a Staging Area)
In Figure 1-2, you need to clean and process your operational data before putting it
into the warehouse. You can do this programmatically, although most data warehouses use a
staging area instead. A staging area simplifies building summaries and general warehouse
management. Figure 1-3 illustrates this typical architecture.
Figure 1-3 Architecture of a Data Warehouse with a Staging Area
Data Warehouse Architecture (with a Staging Area and Data Marts)
Although the architecture in Figure 1-3 is quite common, you may want to customize
your warehouse's architecture for different groups within your organization. You can do this
by adding data marts, which are systems designed for a particular line of business. Figure 1-4
illustrates an example where purchasing, sales, and inventories are separated. In this example,
a financial analyst might want to analyze historical data for purchases and sales.
Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts
Data Warehousing Systems
A data warehousing system can perform advanced analyses of operational data
without impacting operational systems. OLTP is very fast and efficient at recording the
business transactions - not so good at providing answers to high-level strategic questions.
Component Systems
Legacy Systems
Any information system currently in use that was built using previous technology generations. Most legacy systems are operational in nature, largely because the automation of transaction-oriented business processes had long been the priority of IT projects.
Source Systems
Any system from which data is taken for a data warehouse. A source system is often
called a legacy system in a mainframe environment.
Operational Data Stores (ODS)
An ODS is a collection of integrated databases designed to support the monitoring of
operations. Unlike the databases of OLTP applications (that are function oriented), the ODS
contains subject oriented, volatile, and current enterprise-wide detailed information. It serves
as a system of record that provides comprehensive views of data in operational sources.
Like data warehouses, ODSs are integrated and subject-oriented. However, an ODS is
always current and is constantly updated. The ODS is an ideal data source for a data
warehouse, since it already contains integrated operational data as of a given point in time. In
short, ODS is an integrated collection of clean data destined for the data warehouse.
Data Warehouse Design
An introduction to Dimensional Modeling
Data warehouses are not easy to build. Their design requires a way of thinking that is just the opposite of the manner in which traditional computer systems are developed. Their construction requires radical restructuring of vast amounts of data, often of dubious or inconsistent quality, drawn from numerous heterogeneous sources. Their implementation strains the limits of today's IT. Not surprisingly, a large number of data warehouse projects fail. Successful data warehouses are built for just one reason: to answer business questions. The type of questions to be addressed will vary, but the intention is always the same. Projects that deliver new and relevant information succeed. Projects that do not, fail.
To deliver answers to businesspeople, one must understand their questions. The DW design
fuses business knowledge and technology know-how. The design of the data warehouse will
mean the difference between success and failure.
The design of the data warehouse requires a deep understanding of the business. Yet the task of design is undertaken by IT professionals, not business decision makers. Is it reasonable to expect the project to succeed? The answer is yes. The key is learning to apply technology toward business objectives.

Most computer systems are designed to capture data; data warehouses are designed for getting data out. This fundamental difference suggests that the data warehouse should be designed according to a different set of principles.
Dimensional Modeling is the name of a logical design technique often used for data
warehouses. It is different from entity-relationship modeling. ER modeling is very useful for
transaction capture in OLTP systems.
Dimensional Modeling is the only viable technique for delivering data to the end users in a
data warehouse.
Comparison between ER and Dimensional Modeling
The characteristics of ER Model are well understood; its ability to support operational
processes is its underlying characteristic. The conventional ER models are constituted to:
• Remove redundancy in the data model,
• Facilitate retrieval of individual records having certain critical identifiers, and
• Therefore, optimize online transaction processing (OLTP) performance.
In contrast, the dimensional model is designed to support the reporting and analytical needs
of a data warehouse system.
Why is ER not suitable for Data Warehouses?
• End users cannot understand or remember an ER model, and cannot navigate one. There is no graphical user interface (GUI) that takes a general ER diagram and makes it usable by end users.
• ER models are not optimized for complex, ad-hoc queries; they are optimized for repetitive, narrow queries.
• Use of the ER modeling technique defeats the basic allure of data warehousing, namely intuitive and high-performance retrieval of data, because it leads to highly normalized relational tables.
Introduction to Dimensional Modeling Concepts
The objective of dimensional modeling is to represent a set of business measurements in a standard framework that is easily understandable by end users. A dimensional model contains the same information as an ER model but packages the data in a symmetric format whose design goals are:
• User understandability
• Query performance
• Resilience to change
The main components of a Dimensional Model are Fact Tables and Dimension Tables. A fact
table is the primary table in each dimensional model that is meant to contain measurements of
the business. The most useful facts are numeric and additive. Every fact table represents a
many to many relationship and every fact table contains a set of two or more foreign keys
that join to their respective dimension tables.
A fact depends on many factors. For example, sale amount, a fact, depends on product,
location and time. These factors are known as dimensions. Dimensions are factors on which a
given fact depends. The sale amount fact can also be thought of as a function of three
variables.
sales amount = f(product, location, time)
Likewise in a sales fact table we may include other facts like sales unit and cost.
Dimension tables are companion tables to a fact table in a star schema. Each dimension table is defined by its primary key, which serves as the basis for referential integrity with any given fact table to which it is joined. Most dimension tables contain textual information. To understand the concepts of facts, dimensions, and star schemas, let us consider the following scenario:
Imagine standing in the marketplace, watching the products being sold, and writing down the quantity sold and the sales amount each day for each product in each store. Note that a measurement needs to be taken at every intersection of all dimensions (day, product, and store). The information gathered can be stored in a fact table (a small sketch of such a table follows below).
The facts are Sale Unit, Sale Amount, and Cost (note that all are numeric and additive), which depend on the dimensions Date, Product, and Store. The details of the dimensions are stored in dimension tables.
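A minimal sketch of the star schema implied by this marketplace scenario follows, built with Python's sqlite3 module; all table names, column names and figures are hypothetical, and the final query shows a simple summarization across the fact table and one dimension.

import sqlite3

# Hypothetical star schema for the marketplace example; names are illustrative.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE fact_sales (              -- one row per day/product/store intersection
    date_key INTEGER, product_key INTEGER, store_key INTEGER,
    sale_unit INTEGER, sale_amount REAL, cost REAL);
INSERT INTO dim_date    VALUES (1, '2012-08-01'), (2, '2012-08-02');
INSERT INTO dim_product VALUES (1, 'Soap'), (2, 'Tea');
INSERT INTO dim_store   VALUES (1, 'Fort'), (2, 'Bandra');
INSERT INTO fact_sales  VALUES (1, 1, 1, 10, 250.0, 180.0),
                               (1, 2, 1,  5, 400.0, 300.0),
                               (2, 1, 2,  8, 200.0, 150.0);
""")

# Summarize sale units and sale amount by product across all days and stores.
for row in db.execute("""
        SELECT p.product_name, SUM(f.sale_unit), SUM(f.sale_amount)
        FROM fact_sales f JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY p.product_name"""):
    print(row)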
Data Mining
Overview
Data mining, the extraction of hidden predictive information from large databases, is
a powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. Data mining tools predict future trends and
behaviours, allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of decision support systems. Data mining tools
can answer business questions that traditionally were too time consuming to resolve. They
scour databases for hidden patterns, finding predictive information that experts may miss
because it lies outside their expectations.
Most companies already collect and refine massive quantities of data. Data mining
techniques can be implemented rapidly on existing software and hardware platforms to
enhance the value of existing information resources, and can be integrated with new products
and systems as they are brought on-line. When implemented on high performance
client/server or parallel processing computers, data mining tools can analyze massive
databases to deliver answers to questions such as, "Which clients are most likely to respond
to my next promotional mailing, and why?"
This section provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment, along with a basic description of how data warehouse architectures can evolve to deliver the value of data mining to end users.
The Foundations of Data Mining
Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies that
allow users to navigate through their data in real time. Data mining takes this evolutionary
process beyond retrospective data access and navigation to prospective and proactive
information delivery. Data mining is ready for application in the business community because
it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group survey
of data warehouse projects found that 19% of respondents are beyond the 50 gigabyte level,
while 59% expect to be there by second quarter of 1996.1 In some industries, such as retail,
these numbers can be much larger. The accompanying need for improved computational
engines can now be met in a cost-effective manner with parallel multiprocessor computer
technology. Data mining algorithms embody techniques that have existed for at least 10
years, but have only recently been implemented as mature, reliable, understandable tools that
consistently outperform older statistical methods.In the evolution from business data to
business information, each new step has built upon the previous one. For example, dynamic
data access is critical for drill-through in data navigation applications, and the ability to store
large databases is critical to data mining. From the user‘s point of view, the four steps listed
in Table 1 were revolutionary because they allowed new business questions to be answered
accurately and quickly.The core components of data mining technology have been under
development for decades, in research areas such as statistics, artificial intelligence, and
machine learning. Today, the maturity of these techniques, coupled with high-performance
relational database engines and broad data integration efforts, make these technologies
practical for current data warehouse environments.
Evolutionary Step: Data Collection (1960s)
  Business Question: "What was my total revenue in the last five years?"
  Enabling Technologies: Computers, tapes, disks
  Product Providers: IBM, CDC
  Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
  Business Question: "What were unit sales in New England last March?"
  Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
  Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
  Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
  Business Question: "What were unit sales in New England last March? Drill down to Boston."
  Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
  Product Providers: Pilot, Comshare, Arbor, Cognos, MicroStrategy
  Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
  Business Question: "What's likely to happen to Boston unit sales next month? Why?"
  Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
  Product Providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
  Characteristics: Prospective, proactive information delivery

Table 1. Steps in the Evolution of Data Mining.
The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable
business information in a large database — for example, finding linked products in gigabytes
of store scanner data — and mining a mountain for a vein of valuable ore. Both processes
require either sifting through an immense amount of material, or intelligently probing it to
find exactly where the value resides. Given databases of sufficient size and quality, data
mining technology can generate new business opportunities by providing these capabilities:
• Automated prediction of trends and behaviours. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
• Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
Data mining techniques can yield the benefits of automation on existing software and
hardware platforms, and can be implemented on new systems as existing platforms are
upgraded and new products developed. When data mining tools are implemented on high
performance parallel processing systems, they can analyze massive databases in minutes.
Faster processing means that users can automatically experiment with more models to
understand complex data. High speed makes it practical for users to analyze huge quantities
of data. Larger databases, in turn, yield improved predictions.
Databases can be larger in both depth and breadth:
• More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High-performance data mining allows users to explore the full depth of a database without preselecting a subset of variables.
• More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
A recent Gartner Group Advanced Technology Research Note listed data mining and
artificial intelligence at the top of the five key technology areas that "will clearly have a
major impact across a wide range of industries within the next 3 to 5 years."2 Gartner also
listed parallel architectures and data mining as two of the top 10 new technologies in which
companies will invest during the next 5 years. According to a recent Gartner HPC Research
Note, "With the rapid advance in data capture, transmission and storage, large-systems users
will increasingly need to implement new and innovative ways to mine the after-market value
of their vast stores of detail data, employing MPP [massively parallel processing] systems to
create new sources of business advantage (0.9 probability)."3
The most commonly used techniques in data mining are (a small illustration follows the list):
• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbour method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbour technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
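As a toy illustration of one of these techniques, the sketch below implements a 1-nearest-neighbour classifier in plain Python; the customer attributes, labels and values are invented purely for illustration.

# Toy 1-nearest-neighbour classifier; data and labels are hypothetical.
historical = [
    # (age, monthly_long_distance_spend) -> responded to mailing?
    ((25,  20.0), "no"),
    ((45,  95.0), "yes"),
    ((52, 110.0), "yes"),
    ((30,  15.0), "no"),
]

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(record):
    # Label the new record with the class of its single most similar neighbour.
    nearest = min(historical, key=lambda item: distance(item[0], record))
    return nearest[1]

print(classify((48, 100.0)))   # expected: 'yes'
print(classify((22, 10.0)))    # expected: 'no'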
Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms.
How Data Mining Works:
How exactly is data mining able to tell you important things that you didn't know or
what is going to happen next? The technique that is used to perform these feats in data
mining is called modelling. Modelling is simply the act of building a model in one situation
where you know the answer and then applying it to another situation that you don't. For
instance, if you were looking for a sunken Spanish galleon on the high seas the first thing you
might do is to research the times when Spanish treasure had been found by others in the past.
You might note that these ships often tend to be found off the coast of Bermuda and that there
are certain characteristics to the ocean currents, and certain routes that have likely been taken
by the ship‘s captains in that era. You note these similarities and build a model that includes
the characteristics that are common to the locations of these sunken treasures. With these
models in hand you sail off looking for treasure where your model indicates it most likely
might be given a similar situation in the past. Hopefully, if you've got a good model, you find
your treasure.
This act of model building is thus something that people have been doing for a long
time, certainly before the advent of computers or data mining technology. What happens on
computers, however, is not much different than the way people build models. Computers are
loaded up with lots of information about a variety of situations where an answer is known and
then the data mining software on the computer must run through that data and distill the
characteristics of the data that should go into the model. Once the model is built it can then be
used in similar situations where you don't know the answer. For example, say that you are the
director of marketing for a telecommunications company and you'd like to acquire some new
long distance phone customers. You could just randomly go out and mail coupons to the
general population - just as you could randomly sail the seas looking for sunken treasure. In
neither case would you achieve the results you desired; of course, you have the
opportunity to do much better than random - you could use your business experience stored
in your database to build a model.
As the marketing director you have access to a lot of information about all of your
customers: their age, sex, credit history and long distance calling usage. The good news is
that you also have a lot of information about your prospective customers: their age, sex,
credit history etc. Your problem is that you don't know the long distance calling usage of
these prospects (since they are most likely now customers of your competition). You'd like to
concentrate on those prospects who have large amounts of long distance usage. You can
accomplish this by building a model. Table 2 illustrates the data used for building a model for
new customer prospecting in a data warehouse.
                                                         Customers   Prospects
General information (e.g. demographic data)             Known       Known
Proprietary information (e.g. customer transactions)    Known       Target

Table 2 - Data Mining for Prospecting
The goal in prospecting is to make some calculated guesses about the information in
the lower right hand quadrant based on the model that we build going from Customer General
Information to Customer Proprietary Information. For instance, a simple model for a
telecommunications company might be:
98% of my customers who make more than $60,000/year spend more than $80/month on
long distance
This model could then be applied to the prospect data to try to tell something about
the proprietary information that this telecommunications company does not currently have
access to. With this model in hand new customers can be selectively targeted.
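A minimal sketch of applying such a rule-like model to prospect data follows, in plain Python; the income threshold, field names and records are hypothetical stand-ins for the proprietary model a real company would have mined.

# Hypothetical prospects with known general (demographic) information only.
prospects = [
    {"name": "Prospect A", "annual_income": 75000},
    {"name": "Prospect B", "annual_income": 40000},
    {"name": "Prospect C", "annual_income": 62000},
]

def likely_heavy_user(prospect, income_threshold=60000):
    # Rule learned from existing customers: high income -> likely >$80/month long distance.
    return prospect["annual_income"] > income_threshold

targets = [p["name"] for p in prospects if likely_heavy_user(p)]
print(targets)   # ['Prospect A', 'Prospect C']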
Test marketing is an excellent source of data for this kind of modelling. Mining the
results of a test market representing a broad but relatively small sample of prospects can
provide a foundation for identifying good prospects in the overall market. Table 3 shows
another common scenario for building models: predict what is going to happen in the future.
                                                         Yesterday   Today   Tomorrow
Static information and current plans
(e.g. demographic data, marketing plans)                 Known       Known   Known
Dynamic information (e.g. customer transactions)         Known       Known   Target

Table 3 - Data Mining for Predictions
If someone told you that he had a model that could predict customer usage, how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base, where you already knew the answer. With data
mining, the best way to accomplish this is by setting aside some of your data in a vault to
isolate it from the mining process. Once the mining is complete, the results can be tested
against the data held in the vault to confirm the model‘s validity. If the model works, its
observations should hold for the vaulted data.
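A small sketch of this "vault" idea using a simple holdout split in plain Python follows; the data, the threshold rule being mined and the accuracy measure are all illustrative assumptions rather than a real mining algorithm.

import random

random.seed(1)

# Hypothetical labelled customer records: (annual_income, heavy_long_distance_user).
# Labels roughly follow an income rule, with some noise so validation is not trivial.
customers = []
for _ in range(200):
    income = random.randint(20000, 120000)
    label = (income > 60000) if random.random() > 0.1 else (income <= 60000)
    customers.append((income, label))

random.shuffle(customers)
vault = customers[:50]        # set aside in the "vault", never used for mining
training = customers[50:]     # used to build the model

# "Mine" a simple threshold rule from the training data.
best_threshold = max(
    range(30000, 100000, 5000),
    key=lambda t: sum((inc > t) == lab for inc, lab in training))

# Test the model against the vaulted data to confirm its validity.
correct = sum((inc > best_threshold) == lab for inc, lab in vault)
print(best_threshold, f"accuracy on vaulted data: {correct / len(vault):.0%}")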
Architecture for Data Mining
To best apply these advanced techniques, they must be fully integrated with a data
warehouse as well as flexible interactive business analysis tools. Many data mining tools
currently operate outside of the warehouse, requiring extra steps for extracting, importing,
and analyzing the data. Furthermore, when new insights require operational implementation,
integration with the warehouse simplifies the application of results from data mining. The
resulting analytic data warehouse can be applied to improve business processes throughout
the organization, in areas such as promotional campaign management, fraud detection, new
product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.
Figure 1 - Integrated Data Mining Architecture
The ideal starting point is a data warehouse containing a combination of internal data
tracking all customer contact coupled with external market data about competitor activity.
Background information on potential customers also provides an excellent basis for
prospecting. This warehouse can be implemented in a variety of relational database systems:
Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data
access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional
structures allow the user to analyze the data as they want to view their business –
summarizing by product line, region, and other key perspectives of their business. The Data
Mining Server must be integrated with the data warehouse and the OLAP server to embed
ROI-focused business analysis directly into this infrastructure. An advanced, process-centric
metadata template defines the data mining objectives for specific business issues like
campaign management, prospecting, and promotion optimization. Integration with the data
warehouse enables operational decisions to be directly implemented and tracked. As the
warehouse grows with new decisions and results, the organization can continually mine the
best practices and apply them to future decisions.
This design represents a fundamental shift from conventional decision support
systems. Rather than simply delivering data to the end user through query and reporting
software, the Advanced Analysis Server applies users' business models directly to the
warehouse and returns a proactive analysis of the most relevant information. These results
enhance the metadata in the OLAP Server by providing a dynamic metadata layer that
represents a distilled view of the data. Reporting, visualization, and other analysis tools can
then be applied to plan future actions and confirm the impact of those plans.
Components of data mining
Integration of a Data Mining System with a Database or Data Warehouse
System
When a data mining (DM) system is combined with a database (DB) or data warehouse (DW) system, possible integration schemes include the following (a small sketch of loose coupling follows the list):
• No coupling: A DM system will not utilize any function of a DB or DW system.
• Loose coupling: A DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
• Semitight coupling: Besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives (identified by the analysis of frequently encountered data mining functions) can be provided in the DB/DW system.
• Tight coupling: A DM system is smoothly integrated into the DB/DW system.
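A minimal sketch of loose coupling, assuming SQLite as the DB system: data is fetched from the database, a trivial mining step (co-occurring product pairs) runs outside it, and the result is written back to a designated results table. All table names and the mining logic are hypothetical illustrations.

import sqlite3
from collections import Counter

# Hypothetical operational database managed by the DB system.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transactions (customer TEXT, product TEXT)")
db.executemany("INSERT INTO transactions VALUES (?, ?)", [
    ("c1", "bread"), ("c1", "butter"), ("c2", "bread"),
    ("c2", "butter"), ("c3", "tea"),
])

# 1. Fetch data from the repository managed by the DB system.
rows = db.execute("SELECT customer, product FROM transactions").fetchall()

# 2. Perform the mining step outside the DB: count product pairs bought by the same customer.
baskets = {}
for customer, product in rows:
    baskets.setdefault(customer, set()).add(product)
pair_counts = Counter()
for items in baskets.values():
    for pair in ((a, b) for a in items for b in items if a < b):
        pair_counts[pair] += 1

# 3. Store the mining results in a designated place in the database.
db.execute("CREATE TABLE mining_results (item_a TEXT, item_b TEXT, support INTEGER)")
db.executemany("INSERT INTO mining_results VALUES (?, ?, ?)",
               [(a, b, n) for (a, b), n in pair_counts.items()])
print(db.execute("SELECT * FROM mining_results").fetchall())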
Some issues we encounter in data mining:
• Mining methodology and user interaction issues
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation — the interestingness problem

The performance of a data mining system is measured on the following issues:
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms
• Issues relating to the diversity of database types
• Handling of relational and complex types of data
Conclusion:

Data warehousing and data mining are two important components of business intelligence. Data warehousing is necessary to analyze business needs (analysis), integrate data from several sources (integration), and model the data in an appropriate manner (data modeling) in order to present business information in the form of dashboards and reports (reporting).
Bibliography:
• Google
• Wikipedia
• Slideshare.com
• Authorstream.com
• Yahoo.com
• Google Images