A Paper Presentation on
- Information repository with knowledge discovery
Abstract
Organisations today suffer from a malaise of data overflow.
Developments in transaction processing technology have given rise to a situation
where the amount and rate of data capture is very high, but the processing of this data
into information that can be used for decision making is not developing at the same
pace. Data warehousing and data mining (of both data and text) provide a technology that
enables decision-makers in the corporate sector and government to process this huge
amount of data in a reasonable amount of time and to extract intelligence and knowledge
in near real time.
A data warehouse is a repository of integrated information, available for
queries and analysis. Data and information are extracted from non-homogeneous sources
as they are generated and are processed using process managers (load/warehouse/query).
This makes it easier and more efficient to run queries over data that originally came from
different sources. It also enables people to take informed decisions.
Various technologies for extracting new insight from the data warehouse have
come up, which we classify loosely as "Data Mining Techniques". Data mining systems
improve an organization's effectiveness, efficiency and value by increasing the
usefulness of the knowledge the organization possesses. Our paper focuses on the need for
information repositories and the discovery of knowledge, and then gives an overview of
the much-hyped fields of Data Warehousing and Data Mining.
CONTENTS
Introduction
What is Data-Warehousing?
Warehousing Functions
Architecture Of Data Warehouse
What is Data Mining?
Data Mining as a part of Knowledge Discovery
Goals of Data Mining and Knowledge Discovery
Bibliography
Introduction
This paper presents an overview of how data warehouses serve as a data source
for data mining. Data warehousing is one of the most important strategic initiatives
in the information systems field. Since the early 1990s, data warehouses have
been at the forefront of information technology applications as a way for
organizations to effectively use digital information for business planning and
decision making. Hence, an understanding of data warehouse system architecture
is or will be important in our roles and responsibilities in information
management.
Data warehousing has emerged as an increasingly popular and powerful concept
of applying information technology to turn these huge islands of data into
meaningful information for better business decisions.
Most fundamentally, a data warehouse is created to provide a dedicated source of
data to support decision-making applications. Rather than having data
scattered across a variety of systems, a data warehouse integrates the data into a
single repository. It is for this reason that a data warehouse provides "a single
version of the truth." All users and applications access the same data. Because
users access better data, their ability to analyze data and make decisions
improves.



Most simply, a data warehouse is a
collection of data created to support
decision making. Users and applications
access the warehouse for the data that
they need. A warehouse provides a data
infrastructure. It eliminates a reason for
the failure of many decision support
applications – the lack of quality data.
A data warehouse has the following four
characteristics:
Subject-oriented: All relevant data about a
subject is gathered and stored as a single
set in a useful format.
Integrated: Data is collected from multiple
systems and integrated around subjects. It
is stored in a globally accepted fashion
with consistent naming conventions,
measurements, encoding structures, and
physical attributes, even when the
underlying operational systems store the
data differently.
Non-volatile: Users cannot change or
update the data; the data warehouse is
read-only. Non-volatility ensures that
all users are working with the same data.
The warehouse is updated, but through
IT-controlled load processes rather than
by users.
Time-variant: A warehouse maintains
historical data (i.e., it includes
time as a variable). Unlike transactional
systems, where only recent data, such as
for the last day, week, or month, is
maintained, a warehouse may store years
of data. Historical data is needed to
detect deviations, trends, and long-term
relationships.
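The four characteristics can be made concrete with a small sketch. The example below is illustrative only: the record layout, the field names, the two source systems and their encodings are all assumptions, not part of any particular product. It gathers data about one subject (a customer), maps two differently encoded source records to one consistent form (integrated), stores them as immutable rows (non-volatile), and stamps each row with a snapshot date (time-variant).

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical encodings used by two source systems for the same attribute.
GENDER_CODES = {"M": "male", "F": "female", "1": "male", "2": "female"}


@dataclass(frozen=True)  # frozen: a row cannot be modified once created
class CustomerFact:
    customer_id: int
    gender: str
    balance_usd: float
    snapshot_date: date  # time is part of every row


def integrate(raw: dict, snapshot_date: date) -> CustomerFact:
    """Map a source record with system-specific encodings to the warehouse form."""
    # Assumed convention: balances arrive either in dollars or in cents.
    if "balance_cents" in raw:
        balance = raw["balance_cents"] / 100
    else:
        balance = float(raw["balance_usd"])
    return CustomerFact(
        customer_id=int(raw["id"]),
        gender=GENDER_CODES[str(raw["gender"])],
        balance_usd=balance,
        snapshot_date=snapshot_date,
    )


# The same customer as seen by two hypothetical operational systems.
billing_row = {"id": "17", "gender": "F", "balance_usd": "120.50"}
crm_row = {"id": 17, "gender": 2, "balance_cents": 9900}

# History is kept by adding new dated rows, never by overwriting old ones.
history = [
    integrate(billing_row, date(2023, 1, 31)),
    integrate(crm_row, date(2023, 2, 28)),
]
```

Because the rows are frozen, an attempt to assign to a field raises an error, which mirrors the read-only nature of the warehouse from the user's point of view.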

Whereas a data warehouse is a
repository of data, data warehousing is
the entire process. As shown in Figure 1,
data warehousing encompasses a broad
range of activities, all the way from
extracting data from source systems to
the use of the data for decision-making
purposes. Specifically, it includes data
extraction, transformation, and loading,
and access to the data by end users and
applications.
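The extraction, transformation, loading and access steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the source rows, the table name `sales_fact` and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might be extracted from an operational system.
source_orders = [
    {"order_id": 1, "region": " north ", "amount": "100.00"},
    {"order_id": 2, "region": "South", "amount": "250.50"},
    {"order_id": 3, "region": "NORTH", "amount": "49.50"},
]


def extract():
    """Extraction: read rows from the source system (here, an in-memory list)."""
    return list(source_orders)


def transform(rows):
    """Transformation: normalise region names and convert amounts to numbers."""
    return [
        (r["order_id"], r["region"].strip().lower(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Loading: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))

# End-user access: a decision-support query over the integrated data.
totals = dict(
    warehouse.execute(
        "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"
    ).fetchall()
)
```

The query at the end only works because the transformation step has already reconciled the three spellings of the region name, which is the point of doing the cleaning before the load.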
Conceptually a data warehouse looks
like this
Information Sources always include the
core operational systems which form the
backbone of day-to-day activities. It is
these systems which have traditionally
provided management information to
support decision making.
Decision Support Tools are used to
analyze the information stored in the
warehouse, typically to identify trends
and new business opportunities.
The Data Warehouse itself is the bridge
between the operational systems and the
decision support tools. It holds a copy of
much of the operational system data in a
logical structure which is more
conducive to analysis. The Data
Warehouse, which will be refreshed in
scheduled bursts from operational
systems and from relevant external data
sources, provides a single, consistent
view of corporate data, leaving
operational systems unaffected.
Data Warehouse Functions
The main function of a data
warehouse is to get enterprise-wide
data into a format that is most useful to
end-users, regardless of their locations.
Data warehousing is used for:
1. Increasing the speed and flexibility
of analysis.
2. Providing a foundation for
enterprise-wide integration and
access.
3. Improving or re-inventing business
processes.
4. Gaining a clear understanding of
customer behavior.
Data Warehouse Goals:
The fundamental goal is to
give users appropriate access to a
homogenized and comprehensive view
of the organization. The warehouse also supports
the forecasting, planning and decision-making
processes. An additional goal is to
achieve information consistency and to provide
security and adaptability.
ARCHITECTURE OF DATA WAREHOUSE:
Data Warehouse Architecture (DWA) is
a way of representing the overall
structure of data, communication,
processing and presentation that exists
for end-user computing within the
enterprise.







The architecture is made up of a number
of interconnected parts:
Operational Database / External Database
Layer: Operational
systems process data to support critical
operational needs. To do that,
operational databases have been
historically created to provide an
efficient processing structure for a
relatively small number of well-defined
business transactions.
Information Access Layer: This
is the layer that the end-user deals with
directly. In particular, it represents the
tools that the end-user normally uses day
to day, e.g., Excel, Lotus 1-2-3, etc.
Data Access Layer: The Data
Access Layer of the Data Warehouse
Architecture is involved with allowing
the Information Access Layer to talk to
the Operational Layer.
Data Directory (Meta-data)
Layer: Meta-data is the data about data
within the enterprise. Record
descriptions in a COBOL program are
an example of meta-data.
Process Management Layer:
The Process Management Layer is
involved in scheduling the various tasks
that must be accomplished to build and
maintain the data warehouse and data
directory.
Application Messaging Layer:
The Application Messaging Layer has to
do with transporting information around
the enterprise computing network.
Application Messaging is also referred
to as "Middleware", but it can involve
more than just networking protocols.
Data Warehouse (Physical)
Layer: The (core) Data Warehouse is
where the actual data used primarily for
informational purposes resides. In a Physical
Data Warehouse, copies (in some cases
many copies) of operational and/or
external data are actually stored in a
form that is easy to access.
Data Staging Layer: Data
Staging is also called copy management
or replication management, but in fact it
includes all of the processes necessary to
select, edit, summarize, combine and
load data warehouse and information
access data from operational and/or
external databases.
Classification of data warehouses
Data warehouses can be classified into
three types:
Enterprise data warehouse: An
enterprise data warehouse provides a
central database for decision support
throughout the enterprise.
Operational data store (ODS): This
has a broad enterprise-wide scope, but
unlike the real enterprise data
warehouse, data is refreshed in near real
time and used for routine business
activity.
Data mart: A data mart is a subset of
a data warehouse that supports a
particular region, business unit or
business function.
Data Marts
A data mart is typically defined as a
subset of the contents of a data
warehouse, stored within its own
database. A data mart tends to contain
data focused at the department level, or
on a specific business area. The data can
exist at both the detail and summary
levels. The data mart can be populated
with data taken directly from operational
sources, similar to a data warehouse, or
data taken from the data warehouse
itself. Because the volume of data in a
data mart is less than that in a data
warehouse, query processing is often
faster.
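A minimal sketch of a data mart populated from the warehouse itself is given below. The table and column names are hypothetical, and for brevity the mart lives in the same in-memory SQLite database as the warehouse table, whereas the text describes a mart stored in its own database. The sketch shows both a detail-level and a summary-level mart table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table covering several departments.
conn.execute(
    "CREATE TABLE warehouse_sales "
    "(sale_id INTEGER, department TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [
        (1, "marketing", "banner", 300.0),
        (2, "finance", "audit", 1200.0),
        (3, "marketing", "flyer", 150.0),
        (4, "hr", "training", 800.0),
    ],
)

# Detail-level mart: only the rows relevant to one department.
conn.execute(
    "CREATE TABLE marketing_mart AS "
    "SELECT sale_id, product, amount FROM warehouse_sales "
    "WHERE department = 'marketing'"
)

# Summary-level mart table derived from the same subset.
conn.execute(
    "CREATE TABLE marketing_mart_summary AS "
    "SELECT COUNT(*) AS n_sales, SUM(amount) AS total FROM marketing_mart"
)

detail_rows = conn.execute(
    "SELECT sale_id, product, amount FROM marketing_mart ORDER BY sale_id"
).fetchall()
summary = conn.execute(
    "SELECT n_sales, total FROM marketing_mart_summary"
).fetchone()
```

Queries against `marketing_mart` scan two rows instead of four, which is a toy version of the faster query processing that a smaller data volume allows.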
Characteristics of a data mart include:
1) Quicker and simpler implementation.
2) Lower implementation cost.
3) Needs of a specific business unit or
function met.
4) Protection of sensitive information
stored elsewhere in the data warehouse.
5) Faster response times due to lower
volumes of data.
6) Distribution of data marts to user
organizations.
7) Built from the bottom upward.
What is Data Mining?
Data mining has been defined as "the
nontrivial extraction of implicit,
previously unknown, and potentially
useful information from data" and as "the
science of extracting useful information
from large data sets or databases".
It applies search techniques from
artificial intelligence to these problems
and analyzes large data sets to
discover patterns of interest. Many of
the early data mining software packages
were based on one algorithm.
Database mining or data mining (DM),
formally termed Knowledge Discovery
in Databases or KDD, is a process that
aims to use existing data to derive new
facts and to uncover new relationships
previously unknown even to experts
thoroughly familiar with the data. It is
like extracting precious metals (say, gold)
or gems, hence the term
"mining". It is based on filtering and
assaying a mountain of data "ore" in
order to get "nuggets" of knowledge.
The data mining process is not a simple
function. It often involves a variety of
feedback loops: while applying a
particular technique, the user may
determine that the selected data is of
poor quality or that the applied
techniques did not produce results of
the expected quality. In such cases, the
user has to repeat and refine earlier
steps, possibly even restarting the entire
process from the beginning.
Data mining is a capability consisting of
the hardware, software, "warm ware"
(skilled labor) and data to support the
recognition of previously unknown but
potentially useful relationships. It
supports the transformation of data to
information, knowledge and wisdom, a
cycle that should take place in every
organization. Companies are now using
this capability to understand more about
their customers, to design targeted sales
and marketing campaigns, to predict
what and how frequently customers will
buy products, and to spot trends in
customer preferences that lead to new
product development.
Data Mining as a Part of the Knowledge
Discovery Process
· Knowledge Discovery in Databases,
frequently abbreviated as KDD,
typically encompasses more than data
mining.
· The knowledge discovery process
comprises six phases: data selection,
data cleansing, enrichment, data
transformation or encoding, data mining,
and the reporting and display of the
discovered information. The first four
phases prepare the data:
Data selection: Data about specific items
or categories of items, or from stores in a
specific region or area of the country,
may be selected.
Data cleansing: Invalid zip codes may be
corrected, or records with incorrect phone
prefixes eliminated.
Enrichment: The data is typically enhanced
with additional sources of information.
Data transformation and encoding: This may
be done to reduce the amount of data.
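The four data preparation phases can be sketched as a small pipeline. Everything in this example is an assumption made for illustration: the record fields, the zip-to-city lookup table and the rule that a valid zip code has exactly five digits. The cleansing step here eliminates bad records; correcting them instead would be an equally valid choice.

```python
import re

# Hypothetical raw sales records; field names are illustrative only.
raw_records = [
    {"item": "milk", "region": "west", "zip": "94105", "amount": 3.5},
    {"item": "bread", "region": "east", "zip": "1002", "amount": 2.0},
    {"item": "milk", "region": "west", "zip": "9410A", "amount": 4.0},
    {"item": "eggs", "region": "west", "zip": "94110", "amount": 5.0},
]

# Hypothetical external lookup used for enrichment.
ZIP_TO_CITY = {"94105": "San Francisco", "94110": "San Francisco"}


def select(records, region):
    """Data selection: keep only records from one region."""
    return [r for r in records if r["region"] == region]


def cleanse(records):
    """Data cleansing: drop records whose zip code is not five digits."""
    return [r for r in records if re.fullmatch(r"\d{5}", r["zip"])]


def enrich(records):
    """Enrichment: add a city looked up from an additional source."""
    return [{**r, "city": ZIP_TO_CITY.get(r["zip"], "unknown")} for r in records]


def transform(records):
    """Transformation: reduce the data to one total per (city, item)."""
    totals = {}
    for r in records:
        key = (r["city"], r["item"])
        totals[key] = totals.get(key, 0.0) + r["amount"]
    return totals


prepared = transform(enrich(cleanse(select(raw_records, "west"))))
```

The result, `prepared`, is the reduced data set that a mining algorithm would then receive as input.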
Goals of Data Mining and Knowledge
Discovery
The goals of data mining fall into the
following classes:
Prediction: Data mining can show how
certain attributes within the data will
behave in the future.
Identification: Data patterns can be used
to identify the existence of an item, an
event, or an activity.
Classification: Data mining can
partition the data so that different classes
or categories can be identified based on
combinations of parameters.
Optimization: One eventual goal of data
mining may be to optimize the use of
limited resources such as time, space,
money, or materials and to maximize
output variables such as sales or profits
under a given set of constraints.
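Two of these goals, prediction and classification, can be illustrated with a deliberately simple sketch. The sales figures, customer attributes and thresholds are hypothetical. The prediction uses an ordinary least-squares line, and the classification uses hand-written rules as a stand-in for classes that a real data mining tool would learn from the data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def classify(customer):
    """Partition customers into classes from a combination of parameters."""
    # The thresholds below are arbitrary illustrative values.
    if customer["orders_per_year"] >= 12 and customer["avg_order"] >= 50:
        return "high value"
    if customer["orders_per_year"] >= 12:
        return "frequent"
    return "occasional"


# Prediction: extrapolate monthly sales (hypothetical figures) one month ahead.
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]
slope, intercept = fit_line(months, sales)
forecast_month_5 = slope * 5 + intercept

# Classification: assign each customer to a class.
customers = [
    {"orders_per_year": 20, "avg_order": 80},
    {"orders_per_year": 15, "avg_order": 20},
    {"orders_per_year": 2, "avg_order": 200},
]
labels = [classify(c) for c in customers]
```

With these figures the fitted line has a slope of 2 and an intercept of 8, so the forecast for month 5 is 18.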
Conclusion
Data warehousing provides the means
to change raw data into information for
making effective business decisions; the
emphasis is on information, not data. The
data warehouse is the hub for decision
support data. A good data warehouse
will provide the RIGHT data to the
RIGHT people at the RIGHT time:
RIGHT NOW! While the data warehouse
organizes data for business analysis,
the Internet has emerged as the standard for
information sharing. Data warehousing and
data mining play an important role in
storing data and sorting out the particular
data of interest. Mining makes it much
easier for users to get the information
they want.
Quantifiable
business benefits have been proven
through the integration of data mining
with current information systems, and
new products are on the horizon that will
bring this integration to an even wider
audience of users.
Bibliography
Eckerson, W.W. (1998) "Post-Chasm Warehousing," Journal of Data Warehousing.
Watson, H.J. Recent Developments in Data Warehousing.
Han, J. and Kamber, M. Data Mining: Concepts and Techniques.
WEBSITES
www.datawarehousingonline.com
www.pcc.ac.uk.com
www.dsstechniques.com