Chapter #25 – Data Warehousing
Online Transaction Processing (OLTP) applications
• DBMSs are widely used by organisations to maintain the data that documents
their everyday operations (operational data)
• Applications typically run transactions that each make small changes, and
a large number of transactions must be processed reliably and efficiently
• DBMSs have been optimised extensively to perform well in such applications
Decision support applications
• Current and historical data is comprehensively analysed and explored to
identify useful trends and to create summaries of the data that support
high-level decision making
• DBMS vendors are adding features to their products to support the complex
queries this requires:
o New constructs
o Novel indexing
o Query optimisation techniques
• Views are used extensively in applications involving complex data
analysis
Use of views
• Precomputing (materialising) the result of a view definition can make
queries on that view run much faster
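As a rough illustration of why precomputation helps, the sketch below uses Python's built-in sqlite3 module. SQLite has no materialised views, so the precomputed result is emulated here with CREATE TABLE ... AS SELECT. The table and column names (sales, sales_by_product_v, sales_by_product_m) and the data are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (pid INTEGER, locid INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 10, 25), (1, 20, 8), (2, 10, 30), (2, 20, 15)],
)

# An ordinary view: the aggregation is re-run every time the view is queried.
conn.execute(
    "CREATE VIEW sales_by_product_v AS "
    "SELECT pid, SUM(amount) AS total FROM sales GROUP BY pid"
)

# A precomputed copy of the view: the aggregation runs once and later queries
# read the stored rows. It has to be rebuilt whenever sales changes.
conn.execute(
    "CREATE TABLE sales_by_product_m AS SELECT * FROM sales_by_product_v"
)

view_rows = conn.execute(
    "SELECT pid, total FROM sales_by_product_v ORDER BY pid"
).fetchall()
stored_rows = conn.execute(
    "SELECT pid, total FROM sales_by_product_m ORDER BY pid"
).fetchall()
print(view_rows)
print(stored_rows)
```

Both queries return the same rows. The difference is that the second one reads stored results instead of recomputing the aggregate, which is where the speed-up on large tables would come from.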
Data warehouse
• An organisation can consolidate information from several databases into a
data warehouse by copying tables from many sources into one location
• The warehouse holds data drawn from several databases maintained by
different business units, together with historical and summary information
• Gives a comprehensive view of all aspects of an enterprise
Three classes of analysis tools are available
1. Online Analytic Processing (OLAP)
o Supports a class of stylised queries that typically involve group-by and
aggregation operators
o Provides excellent support for complex Boolean conditions,
statistical functions and features for time-series analysis
o Applications dominated by such queries are called OLAP applications
o Supports a querying style in which the data is best thought of as a
multidimensional array
2. DBMSs optimised for decision support applications
o DBMSs that support traditional SQL-style queries but are also designed
to support OLAP queries efficiently
o Vendors of relational DBMSs are enhancing their products in this direction
3. Exploratory data analysis (data mining)
o Motivated by the desire to find interesting or unexpected trends and
patterns in large data sets, rather than by the complex query
characteristics of the previous two classes
o The amount of data in many applications is too large to permit manual
analysis or even traditional statistical analysis
o The goal of data mining is to support exploratory analysis over very
large data sets
Data warehousing
• OLAP or data mining queries over distributed data are likely to be slow
• Such complex analysis is often statistical in nature, so it is not
essential that the most current version of the data be used
• Data warehousing is the creation of a centralised repository of all the data
• The availability of a data warehouse facilitates the application of OLAP
and data mining tools (analysis tools)
OLAP: Multidimensional data model
• OLAP applications are dominated by complex queries involving group-by
and aggregation operators
• OLAP queries use a multidimensional data model
• The focus is on a collection of numeric measures
• Each measure depends on a set of dimensions
• E.g.:
o The measure attribute is sales
o The dimensions are Product (pid), Location (locid) and Time (timeid)
o Given a product, a location and a time, we have one associated sales value
o Think of the sales information as being arranged in a 3-D array Sales
• In OLAP applications the bulk of the data can be represented in such a
multidimensional array
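The Sales array described above can be sketched in plain Python as a nested list indexed by product, location and time. The dimension values and sales figures are invented for illustration, and the helper names cell and total_by_product are hypothetical.

```python
# Hypothetical dimension values; the position in each list is the array index.
pids = [11, 12, 13]
locids = [1, 2]
timeids = [1, 2, 3]

# sales[p][l][t] holds the sales value for product p, location l, time t.
sales = [
    [[25, 8, 15], [35, 22, 10]],   # pid 11
    [[30, 20, 50], [26, 45, 20]],  # pid 12
    [[8, 10, 10], [20, 40, 5]],    # pid 13
]

def cell(pid, locid, timeid):
    """Look up the single sales value for one (product, location, time)."""
    return sales[pids.index(pid)][locids.index(locid)][timeids.index(timeid)]

def total_by_product():
    """Aggregate over location and time, i.e. a group-by on pid."""
    return {
        pid: sum(sum(per_time) for per_time in sales[p])
        for p, pid in enumerate(pids)
    }

print(cell(12, 2, 3))
print(total_by_product())
```

Fixing all three dimensions selects one cell, while summing over some of them gives the group-by and aggregation style of query that dominates OLAP workloads.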
Multidimensional OLAP (MOLAP)
• OLAP systems that use arrays to store multidimensional datasets are called
multidimensional OLAP (MOLAP) systems
Representation using relations (fact tables)
• A multidimensional array can also be represented by a relation
• This relation, which relates the dimensions (product, location and time)
to the measure of interest (sales), is called a fact table
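To make the correspondence between an array and a fact table concrete, here is a small sketch that flattens a (sparse) array into rows and rebuilds it again. The data and the function names to_fact_table and to_array are assumptions for illustration only.

```python
# A small multidimensional "array", keyed by (pid, locid, timeid).
# The values are made up for illustration.
sales_array = {
    (11, 1, 1): 25,
    (11, 2, 1): 35,
    (12, 1, 1): 30,
    (12, 1, 2): 20,
}

def to_fact_table(array):
    """Flatten the array into rows of the form (pid, locid, timeid, sales)."""
    return sorted(
        (pid, locid, timeid, value)
        for (pid, locid, timeid), value in array.items()
    )

def to_array(rows):
    """Rebuild the array from fact-table rows; the dimensions form the key."""
    return {(pid, locid, timeid): value for pid, locid, timeid, value in rows}

fact_table = to_fact_table(sales_array)
for row in fact_table:
    print(row)
```

Each row carries one value per dimension plus the measure, so the dimension columns together act as the key of the fact table.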
Dimensions
• Each dimension can have a set of associated attributes
o E.g., the location dimension is identified by locid and has attributes
country, state and city
• Each dimension can be structured as a hierarchy (e.g., for location: city
within state within country)
• Information about dimensions can also be represented by relations:
locations(locid: integer, city: string, state: string, country: string)
o These relations are much smaller than the fact table
o They are called dimension tables
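A dimension table such as locations lets a query aggregate the measure at any level of the dimension's hierarchy. The sketch below uses Python's sqlite3 module with the locations schema given above. The fact table layout sales(pid, locid, timeid, sales) and all rows are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locations (locid INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE sales (pid INTEGER, locid INTEGER, timeid INTEGER, sales INTEGER);
""")
# Illustrative rows only.
conn.executemany("INSERT INTO locations VALUES (?, ?, ?, ?)", [
    (1, "Madison", "WI", "USA"),
    (2, "Fresno", "CA", "USA"),
    (3, "Chennai", "TN", "India"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (11, 1, 1, 25), (11, 2, 1, 35), (12, 3, 1, 30), (12, 1, 2, 20),
])

def sales_by(level):
    """Total sales grouped by one attribute of the location dimension."""
    if level not in ("city", "state", "country"):
        raise ValueError("unknown level: " + level)
    query = (
        "SELECT l.{0}, SUM(s.sales) FROM sales s "
        "JOIN locations l ON s.locid = l.locid "
        "GROUP BY l.{0} ORDER BY l.{0}"
    ).format(level)
    return conn.execute(query).fetchall()

by_state = sales_by("state")
by_country = sales_by("country")
print(by_state)
print(by_country)
```

Moving from state to country only changes the grouping attribute. The fact table itself is untouched, which is why the dimension attributes live in their own small table.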
Relational OLAP (ROLAP)
• OLAP systems that store all information, including fact tables, as
relations are called relational OLAP (ROLAP) systems
Multidimensional Database Design
• The tables in a ROLAP system suggest a star schema
o Centred at the fact table (Sales)
o Combination of fact table and dimension tables
• The star schema pattern is very common in databases designed for OLAP
• The bulk of the data is typically in the fact table
o It has no redundancy (usually in BCNF)
• Information about dimension values is maintained in the dimension tables
• The size of a database used for OLAP is dominated by the fact table (Sales)
• Short response times for interactive querying are important in OLAP
• New storage structures and indexing techniques have been developed to
support OLAP
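A star schema along these lines might be declared as follows, again using sqlite3 as a stand-in for a ROLAP system. Only the locations schema comes from these notes. The products and times tables, their attributes (pname, category, year, quarter) and all rows are assumptions added to complete the picture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, one row per dimension value.
CREATE TABLE products  (pid INTEGER PRIMARY KEY, pname TEXT, category TEXT);
CREATE TABLE locations (locid INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE times     (timeid INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

-- Fact table at the centre of the star: one row per combination of dimensions.
CREATE TABLE sales (
    pid    INTEGER REFERENCES products(pid),
    locid  INTEGER REFERENCES locations(locid),
    timeid INTEGER REFERENCES times(timeid),
    sales  INTEGER,
    PRIMARY KEY (pid, locid, timeid)
);

INSERT INTO products  VALUES (11, 'Jeans', 'Apparel'), (12, 'Shirt', 'Apparel'), (13, 'Toy car', 'Toys');
INSERT INTO locations VALUES (1, 'Madison', 'WI', 'USA'), (2, 'Fresno', 'CA', 'USA');
INSERT INTO times     VALUES (1, 2023, 1), (2, 2023, 2), (3, 2024, 1);
INSERT INTO sales     VALUES (11, 1, 1, 25), (12, 1, 2, 30), (13, 2, 3, 8), (11, 2, 3, 10);
""")

# A "star join": the fact table joined to two of its dimension tables,
# then grouped by attributes taken from those dimensions.
star_rows = conn.execute("""
    SELECT p.category, t.year, SUM(s.sales)
    FROM sales s
    JOIN products p ON s.pid = p.pid
    JOIN times t    ON s.timeid = t.timeid
    GROUP BY p.category, t.year
    ORDER BY p.category, t.year
""").fetchall()
print(star_rows)
```

In a realistic warehouse the sales table would hold nearly all of the rows, which is why storage and indexing work for OLAP concentrates on the fact table.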
Creating and maintaining a warehouse
• Since source databases are often created and maintained by different
groups, there are a number of semantic mismatches across these databases
o Different names for the same attributes, and differences in how tables
are normalised and structured
o These differences must be reconciled when data is brought into the
warehouse

• Extracted:
o Data is extracted from operational databases and external sources
• Cleaned:
o Data is cleaned to minimise errors and to fill in missing information
where possible
• Transformed:
o Data is transformed to reconcile semantic mismatches
o Typically accomplished by defining a relational view over the tables in
the data sources
• Loaded:
o Loading the data consists of materialising such views and storing them
in the warehouse
o Includes sorting and generation of summary information
o Data is partitioned and indexes are built for efficiency
o This is a very slow process
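The extract, clean, transform and load steps can be sketched end to end with sqlite3. This is a toy under stated assumptions, not a real ETL pipeline: the two source tables (src_a_orders, src_b_sales), their mismatched column names, the rule of replacing a missing amount with 0, and the warehouse table names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Extract: two source tables copied in from different business units.
-- They name the same attributes differently (a semantic mismatch).
CREATE TABLE src_a_orders (prod_id INTEGER, store INTEGER, amt INTEGER);
CREATE TABLE src_b_sales  (pid INTEGER, locid INTEGER, sales INTEGER);
INSERT INTO src_a_orders VALUES (11, 1, 25), (12, 1, NULL), (13, 2, 8);
INSERT INTO src_b_sales  VALUES (11, 3, 35), (12, 3, 30);

-- Clean: fill in a missing amount (0 is used here as a placeholder rule).
UPDATE src_a_orders SET amt = 0 WHERE amt IS NULL;

-- Transform: a relational view that reconciles the two naming schemes.
CREATE VIEW unified_sales AS
    SELECT prod_id AS pid, store AS locid, amt AS sales FROM src_a_orders
    UNION ALL
    SELECT pid, locid, sales FROM src_b_sales;

-- Load: materialise the view into a warehouse table and index it.
CREATE TABLE wh_sales AS SELECT * FROM unified_sales;
CREATE INDEX wh_sales_pid ON wh_sales (pid);

-- Summary information generated at load time.
CREATE TABLE wh_sales_by_product AS
    SELECT pid, SUM(sales) AS total FROM wh_sales GROUP BY pid;
""")

loaded = conn.execute("SELECT COUNT(*) FROM wh_sales").fetchone()[0]
summary = conn.execute(
    "SELECT pid, total FROM wh_sales_by_product ORDER BY pid"
).fetchall()
print(loaded)
print(summary)
```

The view does the reconciliation and the CREATE TABLE ... AS SELECT does the materialising, mirroring the split between the transform and load steps.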
• Refresh:
o After data is loaded into a warehouse, the data in the warehouse must be
periodically refreshed to
▪ reflect updates to the data sources
▪ purge old data
• Metadata repository:
o An important task in maintaining a warehouse is keeping track of the data
currently stored in it (bookkeeping)
o This is done by storing information about the warehouse data in the
system catalogs
▪ The catalogs are typically very large and are often stored and managed
in a separate database called a metadata repository
o The size and complexity of the catalogs is due to
▪ the size and complexity of the warehouse itself
▪ the amount of administrative information that must be maintained
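A periodic refresh with simple bookkeeping might look like the sketch below. It is deliberately minimal: it only picks up new source rows (not changes to existing ones), and the wh_metadata table is a hypothetical stand-in for a metadata repository. All table names, columns and the refresh function are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_sales (pid INTEGER, timeid INTEGER, sales INTEGER);
CREATE TABLE wh_sales  (pid INTEGER, timeid INTEGER, sales INTEGER);
-- Hypothetical bookkeeping table standing in for a metadata repository.
CREATE TABLE wh_metadata (
    table_name TEXT, source TEXT, refreshed_upto INTEGER, row_count INTEGER
);
INSERT INTO src_sales VALUES (11, 1, 25), (11, 2, 8), (12, 3, 30), (12, 4, 20);
INSERT INTO wh_sales  VALUES (11, 1, 25), (11, 2, 8);
""")

def refresh(conn, last_timeid, purge_before):
    """Copy source rows newer than last_timeid, purge old rows, log the result."""
    conn.execute(
        "INSERT INTO wh_sales "
        "SELECT pid, timeid, sales FROM src_sales WHERE timeid > ?",
        (last_timeid,),
    )
    conn.execute("DELETE FROM wh_sales WHERE timeid < ?", (purge_before,))
    upto, count = conn.execute(
        "SELECT MAX(timeid), COUNT(*) FROM wh_sales"
    ).fetchone()
    conn.execute(
        "INSERT INTO wh_metadata VALUES ('wh_sales', 'src_sales', ?, ?)",
        (upto, count),
    )
    return upto, count

result = refresh(conn, last_timeid=2, purge_before=2)
metadata_rows = conn.execute("SELECT * FROM wh_metadata").fetchall()
print(result)
print(metadata_rows)
```

Recording what was refreshed, from which source and up to which point is the kind of administrative information that makes warehouse catalogs grow large.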
Data Warehousing Architecture