Data Warehousing to Data Mining
Abstract: This presentation explores the concepts and techniques of data warehousing and data mining, a promising and flourishing frontier in database systems and new database applications. Data mining is also popularly referred to as knowledge discovery in databases.
Introduction
A data warehouse provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today’s competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars building enterprise-wide data warehouses.
“What exactly is a data warehouse?”
The definition of a data warehouse presents four key words: subject-oriented, integrated, time-variant, and nonvolatile.
1.1 Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier, product, and sales. Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers. Hence data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
1.2 Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as relational databases, flat files, and online transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, and attribute measures.
1.3 Time-variant: Data are stored to provide information from a historical perspective. Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time.
1.4 Nonvolatile: A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. Due to this separation, a data warehouse does not require transaction processing, recovery, and concurrency control mechanisms.
A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and ad hoc queries, analytical reporting, and decision making.
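The integration of heterogeneous sources described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the two source layouts, the field names, and the gender and currency encodings are invented only to show how naming conventions, encoding structures, and attribute measures might be made consistent before loading.

```python
# Hypothetical records from two sources that describe customers with
# different column names, gender encodings, and currency units.
crm_rows = [{"cust_id": 1, "sex": "M", "revenue_usd": 120.0}]
billing_rows = [{"CustomerNo": 2, "gender": "female", "revenue_cents": 4550}]

# One agreed encoding for the warehouse (an assumption of this sketch).
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def from_crm(row):
    """Map a CRM record onto the warehouse naming and encoding conventions."""
    return {
        "customer_id": row["cust_id"],
        "gender": GENDER_CODES[row["sex"].lower()],
        "revenue_usd": float(row["revenue_usd"]),
    }


def from_billing(row):
    """Map a billing record; cents are converted to dollars."""
    return {
        "customer_id": row["CustomerNo"],
        "gender": GENDER_CODES[row["gender"].lower()],
        "revenue_usd": row["revenue_cents"] / 100,
    }


warehouse_rows = [from_crm(r) for r in crm_rows] + [from_billing(r) for r in billing_rows]
```

After the mapping, every row in `warehouse_rows` uses the same keys, codes, and units regardless of where it came from.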
2.0 Differences between operational database systems and data warehouses
The major task of online operational database systems is to perform online transaction processing and query processing. These systems are called online transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users. These systems are known as online analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as follows.
2.1 Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.
2.2 Data contents: An OLTP system manages current data that are typically too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.
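Summarization at different levels of granularity can be sketched in a few lines of Python. The sales records and the `summarize` helper are hypothetical and stand in for the aggregation facilities mentioned above; a real system would compute such summaries inside the database.

```python
from collections import defaultdict

# Hypothetical detailed sales records: (date as YYYY-MM-DD, item, amount).
sales = [
    ("2023-01-05", "pen", 10.0),
    ("2023-01-20", "ink", 5.0),
    ("2023-02-03", "pen", 7.5),
    ("2024-01-11", "pen", 2.5),
]


def summarize(records, key_length):
    """Sum amounts by a date prefix: 7 characters = month, 4 = year."""
    totals = defaultdict(float)
    for date, _item, amount in records:
        totals[date[:key_length]] += amount
    return dict(totals)


by_month = summarize(sales, 7)  # finer granularity
by_year = summarize(sales, 4)   # coarser granularity
```

The same detailed records yield a monthly view and a yearly view; the analyst chooses the level that suits the decision at hand.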
2.3 Database design: An OLTP system usually adopts an entity-relationship data model and an application-oriented database design. An OLAP system adopts either a star or a snowflake model.
2.4 View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP data are stored on multiple storage media.
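2.5 Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. Accesses to OLAP systems, by contrast, are mostly read-only operations, although many are complex queries.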
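A short atomic transaction can be sketched with Python's standard `sqlite3` module. The `account` table and the `transfer` function are hypothetical, and SQLite is used only because it needs no setup. The aggregate query at the end is included for contrast, as an example of the read-only access an analytical system would run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()


def transfer(connection, source, target, amount):
    """OLTP-style unit of work: both updates commit together or not at all."""
    with connection:  # commits on success, rolls back on an exception
        connection.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, source)
        )
        connection.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, target)
        )


transfer(conn, 1, 2, 30.0)

# Analysis-style access: a read-only aggregate over all rows.
total, average = conn.execute(
    "SELECT SUM(balance), AVG(balance) FROM account"
).fetchone()
```

The transfer touches two rows and finishes quickly, which is why many such transactions can run concurrently; the aggregate scans the whole table but changes nothing.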
3.0 The process of data warehouse design:
The warehouse design process consists of the following steps:
3.1 Choose a business process to model, for example orders, invoices, shipments, inventory, account administration, sales, and the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
3.2 Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example individual transactions or individual daily snapshots.
3.3 Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
3.4 Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold.
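The four design steps can be made concrete with a small star schema. The following is only a sketch using Python's standard `sqlite3` module: the table names (`dim_time`, `dim_item`, `fact_sales`), their columns, and the sample rows are all assumptions chosen for illustration, with sales as the business process and one row per individual transaction as the grain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables (step 3.3); names and columns are illustrative.
    CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
    CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, name TEXT);

    -- Fact table: one row per individual transaction (the grain, step 3.2),
    -- holding an additive measure (step 3.4).
    CREATE TABLE fact_sales (
        time_key INTEGER REFERENCES dim_time(time_key),
        item_key INTEGER REFERENCES dim_item(item_key),
        dollars_sold REAL
    );

    INSERT INTO dim_time VALUES (1, '2024-01-05', '2024-01'), (2, '2024-02-01', '2024-02');
    INSERT INTO dim_item VALUES (1, 'pen'), (2, 'ink');
    INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.5);
    """
)

# Because the measure is additive, it can be summed along any dimension.
sales_by_month = conn.execute(
    """
    SELECT t.month, SUM(f.dollars_sold)
    FROM fact_sales AS f JOIN dim_time AS t ON f.time_key = t.time_key
    GROUP BY t.month ORDER BY t.month
    """
).fetchall()
```

Choosing a finer grain (individual transactions rather than daily snapshots) keeps more detail at the cost of a larger fact table; the query shows that coarser summaries can still be derived from it.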
4.0 Data warehouse models: enterprise warehouse, data mart, and virtual warehouse
4.1 Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers. It typically contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes.
4.2 Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales.
Data marts are usually implemented on low-cost departmental servers that are UNIX or Windows based. Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
4.3 Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
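The idea of a warehouse that is only a set of views can be sketched with Python's standard `sqlite3` module. The `orders` table and the `sales_by_region` view are hypothetical names; an in-memory SQLite database stands in for an operational server.

```python
import sqlite3

# A stand-in for an operational database; the table and view names are
# made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'north', 20.0), (2, 'south', 5.0), (3, 'north', 10.0);

    -- The "warehouse" is only a view: nothing is copied, and every query
    -- against it runs on the operational server.
    CREATE VIEW sales_by_region AS
        SELECT region, SUM(amount) AS total_amount
        FROM orders GROUP BY region;
    """
)

summary = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (4, 'south', 2.5)")
refreshed = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
```

Because the view is recomputed on each query, it is cheap to define but adds load to the operational database, which is the excess capacity the text refers to. Materializing a view would trade that load for storage and refresh work.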
5.0 Types of OLAP servers: ROLAP versus MOLAP versus HOLAP
5.1 Relational OLAP (ROLAP) servers: These are the intermediate servers that stand between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services.
5.2 Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may be low if the data set is sparse.
Many MOLAP servers adopt a two-level storage representation to handle sparse and dense data sets: the dense subcubes are identified and stored as array structures, while the sparse subcubes employ compression technology for efficient storage utilization.
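The two-level storage idea can be illustrated with a small Python sketch. It is not how any particular MOLAP server works: a nested list stands in for a dense array structure, and a coordinate-keyed dictionary stands in for the compression technology used on sparse subcubes. All values and dimensions are invented.

```python
# Hypothetical 3 x 4 dense subcube (item x month) of precomputed totals.
dense_subcube = [
    [10.0, 12.0, 9.0, 11.0],
    [7.0, 8.0, 6.5, 9.5],
    [3.0, 4.0, 2.0, 5.0],
]

# A sparse region of a 1000 x 1000 cube: only the non-empty cells are stored,
# keyed by their coordinates, instead of a million mostly empty array slots.
sparse_subcube = {(17, 3): 4.0, (512, 940): 1.5}


def lookup_dense(cube, item, month):
    """Direct array indexing: the coordinates are the address."""
    return cube[item][month]


def lookup_sparse(cube, item, month):
    """Coordinate lookup; cells that were never stored count as empty (0.0)."""
    return cube.get((item, month), 0.0)


cells_allocated_if_dense = 1000 * 1000
cells_actually_stored = len(sparse_subcube)
```

Comparing `cells_allocated_if_dense` with `cells_actually_stored` shows why storing a sparse region as a plain array wastes space, while the dense subcube keeps the fast positional indexing that makes a data cube attractive.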
5.3 Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store.
From online analytical processing to online analytical mining
Among the many different paradigms and architectures of data mining systems, online analytical mining, which integrates online analytical processing with data mining and mining knowledge in multidimensional databases, is particularly important for the following reasons:
1. High quality of data in data warehouses
2. Available information processing infrastructure surrounding data warehouses
3. OLAP-based exploratory data analysis
4. Online selection of data mining functions
What is data mining?
Data mining refers to extracting or “mining” knowledge from large amounts of data. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named knowledge mining from data. Many other terms carry a similar meaning to data mining, such as knowledge mining from databases, knowledge extraction, pattern analysis, data archaeology, and data dredging.
Knowledge discovery consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge presentation techniques are used to present the mined knowledge to the user)
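The seven steps above can be sketched as a toy Python pipeline. Every detail is an assumption made for illustration: the two transaction sources, the choice of co-occurring item pairs as the pattern type, and the use of support (relative frequency) with a 0.5 threshold as the interestingness measure. Real systems use far richer methods at each step.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources; None marks a noisy record.
store_a = [["bread", "milk"], ["bread", "butter", "milk"], None]
store_b = [["milk", "bread"], ["butter"], ["bread", "milk", "jam"]]


def clean(transactions):
    """Step 1: drop missing records."""
    return [t for t in transactions if t]


def integrate(*sources):
    """Step 2: combine several sources into one collection."""
    return [t for source in sources for t in source]


def select(transactions, min_items=2):
    """Step 3: keep only transactions relevant to pair analysis."""
    return [t for t in transactions if len(t) >= min_items]


def transform(transactions):
    """Step 4: consolidate each transaction into a sorted tuple of unique items."""
    return [tuple(sorted(set(t))) for t in transactions]


def mine(transactions):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(t, 2))
    return counts


def evaluate(pair_counts, total, min_support=0.5):
    """Step 6: keep pairs whose support (relative frequency) is high enough."""
    return {p: c / total for p, c in pair_counts.items() if c / total >= min_support}


def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]


prepared = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(prepared), total=len(prepared))
report = present(patterns)
```

Each function corresponds to one step, so the order of the calls in the last three lines mirrors the order of the list.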
The architecture of a typical data mining system may have the following major components:
1.0 Database, data warehouse, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
1.1 Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.
Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included.
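A concept hierarchy can be pictured as a mapping from low-level attribute values to higher-level concepts. The Python sketch below uses a hypothetical city-to-country hierarchy for a location attribute; the city list and the `generalize` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical concept hierarchy for a location attribute: city -> country.
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}


def generalize(cities, hierarchy):
    """Replace each low-level value with its parent concept."""
    return [hierarchy[city] for city in cities]


sales_cities = ["Vancouver", "Chicago", "Toronto", "Vancouver"]
sales_by_country = Counter(generalize(sales_cities, CITY_TO_COUNTRY))
```

Counting at the country level rather than the city level is one way domain knowledge lets a mining system look for patterns at a higher level of abstraction.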
1.2 Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
1.3 Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
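Filtering by interestingness thresholds can be sketched as follows. The rules, their support and confidence values, and the thresholds are all hypothetical; support and confidence are used here only as familiar examples of interestingness measures, not as the measures any particular system employs.

```python
# Hypothetical mined rules with precomputed support and confidence values.
rules = [
    {"rule": "bread -> milk", "support": 0.40, "confidence": 0.80},
    {"rule": "jam -> butter", "support": 0.02, "confidence": 0.90},
    {"rule": "milk -> eggs", "support": 0.30, "confidence": 0.35},
]


def filter_interesting(candidates, min_support, min_confidence):
    """Keep only rules that meet both interestingness thresholds."""
    return [
        r for r in candidates
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]


interesting = filter_interesting(rules, min_support=0.10, min_confidence=0.50)
```

Applying the thresholds as early as possible, ideally inside the mining process itself, confines the search to patterns that have a chance of being reported.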
1.4 Graphical user interface: This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
2.0 Classification of data mining systems:
Data mining systems can be categorized according to various criteria, as follows.
2.1 Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria, each of which may require its own data mining technique.
2.2 Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities such as characterization, discrimination, association, classification, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple integrated data mining functionalities.
2.3 Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved or the methods of data analysis employed.
2.4 Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they are adapted to. For example, there could be data mining systems tailored specifically for finance, telecommunications, DNA, stock markets, and so on.
Conclusion:
A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. Data warehouse systems provide some data analysis capabilities, collectively referred to as OLAP. Data mining systems are mainly used for data analysis and decision making.
Bibliography:
1) Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
2) Sam Anahory and Dennis Murray, Data Warehousing in the Real World, Pearson Education.