Download An overview of Data Warehousing and OLAP Technology

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia, lookup

Extensible Storage Engine wikipedia, lookup

Entity–attribute–value model wikipedia, lookup

Clusterpoint wikipedia, lookup

Relational model wikipedia, lookup

Object-relational impedance mismatch wikipedia, lookup

Functional Database Model wikipedia, lookup

Database model wikipedia, lookup

Transcript
An overview of Data Warehousing and
OLAP Technology
Original slides by: Manish Desai
Modified and presented by Alice Leung
Introduction
• Essential elements of decision support
• Enables The Knowledge Worker to make better
and faster decisions
• Used in many industries like:
– Manufacturing (for order shipment)
– Retail (for inventory management)
– Financial Services (claims and risk analysis)
• Every major database vendor offers product in this
area
2
What is Data Warehouse ?
• A data warehouse is a “subject-oriented,
integrated, time-varying, non-volatile collection of
data that is used primarily in organizational
decision making”
• Typically maintained separately from operational
databases
3
Explanation of definition
• Subject-Oriented:
– Designed around subject such as customer, vendor,
product and activity
– Does not includes data that are not needed for Decision
support system (DSS)
• Integrated:
– Most important feature
– Consistent naming convention, measurement of
variables and so forth
– The data should be stored in single globally acceptable
fashion
4
Explanation (continues…)
• Time Varying:
– All data in the warehouse should be accurate as of some
moment in time
– Data stored over a long time horizon (5 –10 years)
– Key structure contains element of time (implicitly or
explicitly)
– Data once correctly recorded cant be updated
• Non Volatile:
– No Update of data allowed
– only loading and access of data operations
5
Why Separate Data Warehouse?
• High performance for both systems
– DBMS— tuned for OLTP: access methods, indexing, concurrency
control, recovery
– Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
• Different functions and different data:
– missing data: Decision support requires historical data which
operational DBs do not typically maintain
– data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
– data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
April 29, 2017
Data Mining: Concepts and Techniques
6
6
Data Warehouse Vs. Operational
Database
Data Warehouse
Operational Database
User
Knowledge worker
(Executive, manager,
analyst)
Clerk, IT professional
Function
Decision support
Day to day operations
Data
Historical,summarized,
multidimensional,
integrated
Current, up-to-date, detailed
Unit of work
Complex query
Short, simple transaction
DB Design
Subject-oriented
Application-oriented
Metric
Query throughout,
response
Transaction throughput
7
Tiered Architecture
External
Sources
Operational
Databases
Tier1:
Data Warehouse
Server
Extract
Transform
Load
Refresh
Data
Warehouse
Tier2:
OLAP
Server
OLAP Server
Serve
Tier3:
Client
s
Analysis
Query/Reports
Data mining
Data Marts
Data Sources
Data Storage
OLAP Engine Front-End Tools
8
Architecture (continues…)
• Distributed Data warehouse
– Load balancing, scalability,higher availability
– Meta data replicated and centrally administrated
– Too expansive
• Data marts
– Departmental subset focused on selected subjects
– example: marketing department includes customer,
sales and product tables
– Has own repository and administration
– May lead to complex integration problems if not
designed properly
9
Back end tools and Utilities
• Data cleaning, loading, refreshing tools
• Cleaning
– Multiple source, possibility of errors
– Example: replace string sex by gender
• Loading
– Building indices, sorting and making access paths
– Large amount of data
• Incremental loading
• Only updated tuples are inserted ,Process hard to manage
• Refresh
– Propagating updates
– When to refresh ?
– Set by administrator depending on user needs and traffic
10
Conceptual Model and front end tools
• Multi dimensional view
–
–
–
–
Dimensions together uniquely determine the measure
Example: Sales can be represented as city,product, data
Each dimension is described by set of attribute
Example: product consist of
• Category of product
• Industry of product
• Year of introduction
• Front end tools
– Multi dimensional spreadsheet
• Supports Pivoting-reorientation
• Roll_up - summarized data
• Drill_down - go from high level to low level summary
11
Conceptual Model
Date
1
2
3
4
sum
U.S.A
Canada
Mexico
Country
TV
PC
PVR
sum
Total annual sales
of TV in U.S.A.
sum
ALL
12
Database design
• Two ways to represent Multi dimensional model
– Star schema
• Database consist of single fact table and single table for each
dimension
• Each tuples in fact table consist of pointer to each of dimension
– Snowflake schema
• Refinement over star schema
• Dimensional hierarchy is explicitly represented by normalizing
dimension tables
13
Star Schema
Time
T_key
T_day
T_day_week
T_month
T_quarter
T_year
Sales Fact Table
Time_key
Item_key
Branch
Branch_key
B_key
B_name
B_type
Measures
Location_key
Units_sold
Dollars_sold
Avg_sales
item
I_key
I_name
I_brand
I_type
I_supplier_type
location
location_key
street
city
province
country
Star Schema
14
Snowflake Schema
Time
T_key
T_day
T_day_week
T_month
T_quarter
T_year
Sales Fact Table
time_key
item_key
Branch
branch_key
B_key
B_name
B_type
Measures
location_key
Item
I_key
I_name
I_brand
I_type
I_supplier_type
units_sold
Location
dollars_sold
location_key
street
city
avg_sales
Supplier
S_key
S_type
City
C_key
C_city
C_province
C_country
Snowflake Schema
15
Warehouse Servers
• Specialized SQL servers
– Provides advanced query language and query
processing support for SQL queries over star and
snowflake schemas
– Example: Redbrick
• ROLAP
– Between relational back end and client front end tools
– Extend traditional relational servers to support
multidimensional queries
– Example: Microstratergy
• MOLAP
– Multidimensional storage engine
– Direct mapping
– Example: Essbase from Arbor Inc.
16
Index structures
• Bit map indices
– Use single bit to indicate specific value of attribute
– Example:
instead of storing eight characters to record “engineer” as skill
of employee use single bit
id# Name Skill
1000 John
1
• Join indices
– Maintains the relationship between foreign key with its
matching primary keys
17
Meta data and warehouse management
• Its data about data
• Used for building, maintain, managing and using
data warehouse
• Administrative meta data
– Information about setting up and using warehouse
• Business meta data
– Business terms and definition
• Operational meta data
– Information collected during operation of warehouse
18