CS 338
Data Warehousing and Business Analytics
Bojana Bislimovska
Winter 2016
Major research
Outline
• Terminology
• Data Warehouse Characteristics
• Comparison with Relational DBs
• Data Models for Data Warehouses
• Data Warehouse Functionality
Data Warehouse Terminology
• Data Warehouse
 Collection of information originating from multiple databases
 Allows complex analysis, knowledge discovery, and decision making
based on historical data
• Supported applications
 OLAP (Online Analytical Processing) – analysis of complex data from the
data warehouse
• Enables quick and straightforward querying of analytical data
stored in data warehouses
 DSS (Decision-Support Systems)
• Also known as EIS (Executive Information Systems)
• Provides data and tools for complex decision-making
 Data mining
• Knowledge discovery: searching data for unanticipated new
knowledge
Data Warehouse Terminology
• Data Warehousing – collection of decision-support technologies
aimed at enabling the knowledge worker (executive manager,
analyst) to make better and faster decisions
• Online Transaction Processing (OLTP) – supported by traditional DBs
 Includes data modifications (insertions, updates, and deletions)
 Also serves query requirements
• A DB optimized for OLTP cannot also be optimized for OLAP, DSS, or data mining
Data Warehouse Characteristics
• Information in a data warehouse is typically not subject to modification
 Periodic updates – data refreshed incrementally
• Warehouse insertions are handled by the ETL (extract, transform, load)
process
 Reformatting of data before loading it into the warehouse
• Warehouses encompass large volumes of data – generally an order of magnitude larger than the
source DBs
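As an illustration of the ETL idea above, here is a minimal sketch in Python. The two source layouts (`crm_rows`, `pos_rows`), the field names, and the target record shape are all hypothetical; the slides do not prescribe any particular format. The point is only that each source is reformatted into one common layout before being appended to the warehouse.

```python
from datetime import date

# Hypothetical raw rows from two operational systems with different formats.
crm_rows = [{"cust": " Alice ", "amount": "12.50", "day": "2016-01-05"}]
pos_rows = [("BOB", 700, "05/01/2016")]  # amount in cents, day as DD/MM/YYYY


def extract():
    """Pull raw records from each source, tagged with their origin."""
    for row in crm_rows:
        yield "crm", row
    for row in pos_rows:
        yield "pos", row


def transform(source, row):
    """Reformat a raw record into the warehouse's common layout."""
    if source == "crm":
        y, m, d = (int(p) for p in row["day"].split("-"))
        return {"customer": row["cust"].strip().lower(),
                "amount": float(row["amount"]),
                "day": date(y, m, d)}
    name, cents, day = row
    d, m, y = (int(p) for p in day.split("/"))
    return {"customer": name.strip().lower(),
            "amount": cents / 100,
            "day": date(y, m, d)}


def load(warehouse, records):
    """Append only: rows already in the warehouse are never modified."""
    warehouse.extend(records)


warehouse = []
load(warehouse, [transform(source, row) for source, row in extract()])
```

A periodic refresh would call `load` again with only the newly extracted records, which matches the incremental update policy described on the slide.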
Data Warehouses vs Relational DBs
• Operations
 Data warehouses optimized to find data correlations and to support trend
analyses
 Traditional databases are transactional: optimized for access, update, and
integrity assurance
 Data warehouses are less volatile than relational DBs.
• Data currency
 Relational DBs required to maintain up-to-date, detailed data
 Data warehouses characterized by historical data
 Information in the data warehouse is relatively coarse-grained (a high-level view), and the refresh policy is carefully chosen, usually incremental
• Data volume
 Data warehouses may be exceptionally large (e.g., 7 years of records)
• Data warehouse can be interpreted as a (special) view of the data
Classification of Data Warehouses
• The sheer volume of data is an issue; based on it, data warehouses can be
classified as follows:
 Enterprise-wide data warehouses
• Huge projects requiring massive investment of time and resources
 Virtual data warehouses
• Provide views of operational databases that are materialized for
efficient access
 Logical data warehouses
• Use data federation, virtualization and distribution techniques
 Data marts
• Generally targeted to a subset of the organization, such as a
department, and are more tightly focused
Data Modeling for Data Warehouses
• Traditional DBs generally represent data in two dimensions
 Rows and columns of a relational model
 Spreadsheets
• Data warehouses are usually multidimensional
 Data are stored in data cubes (hypercubes for more than three dimensions)
 Query performance can be much better than in the relational model
 Allows direct querying of data in any combination of dimensions
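To make "querying in any combination of dimensions" concrete, the sketch below stores a tiny cube as a Python dictionary keyed by one value per dimension. The dimension names, the cell values, and the `query` helper are illustrative assumptions, not part of the course material; real OLAP engines use far more elaborate storage.

```python
from collections import defaultdict

DIMENSIONS = ("product", "quarter", "region")

# Each cube cell is addressed by one value per dimension; the value is the measure.
cube = {
    ("P123", "Qtr1", "Reg1"): 10,
    ("P123", "Qtr2", "Reg1"): 15,
    ("P124", "Qtr1", "Reg2"): 7,
    ("P124", "Qtr1", "Reg1"): 3,
}


def query(cube, group_by):
    """Sum the measure, keeping only the dimensions named in group_by."""
    positions = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


by_product = query(cube, ["product"])
by_region_quarter = query(cube, ["region", "quarter"])
```

Any subset of the dimensions, in any order, can be passed to `query`, which is the property the slide highlights.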
Data Modeling for Data Warehouses
The multidimensional model involves two types of tables:
• Dimension table – consists of tuples of attributes of the dimension
• Fact table – contains tuples, one per recorded fact
 Each fact contains some measured (observed) variables and identifies them with
pointers to dimension tables
Data Modeling for Data Warehouses
Two common multidimensional schemas
• Star schema – consists of a fact table with a single table for each
dimension
• Fact constellation – set of fact tables that share some dimension tables
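A star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_region`, `amount`) and the sample rows are assumptions chosen for illustration. The fact table holds the measure and points at one table per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);

-- The fact table: one row per recorded fact, with pointers to the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL
);

INSERT INTO dim_product VALUES (1, 'P123'), (2, 'P124');
INSERT INTO dim_region  VALUES (1, 'Reg1'), (2, 'Reg2');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 5.0), (2, 1, 7.5);
""")

# Total sales by region: join the fact table to a single dimension table.
rows = conn.execute("""
    SELECT r.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_region AS r ON f.region_id = r.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
```

A fact constellation would add a second fact table (for example, a hypothetical `fact_returns`) that references the same `dim_product` and `dim_region` tables.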
Data Modeling for Data Warehouses
Two common multidimensional schemas
• Snowflake schema
 Variation on the star schema
 Dimension tables from a star schema are organized into a hierarchy by
normalizing them
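The normalization step can be illustrated with a small Python sketch. The location dimension, its column names, and the `snowflake` helper are hypothetical; the idea is only that a repeated attribute (here, country) is moved into its own table, turning a flat dimension into a two-level hierarchy.

```python
# Star-schema style: one flat dimension table, with the country name
# repeated on every city row.
dim_location_star = [
    {"city_id": 1, "city": "Waterloo", "country": "Canada"},
    {"city_id": 2, "city": "Toronto", "country": "Canada"},
    {"city_id": 3, "city": "Boston", "country": "USA"},
]


def snowflake(rows):
    """Split the flat dimension into a city table and a country table."""
    country_ids = {}
    dim_city = []
    for row in rows:
        # Assign each distinct country the next free id on first sight.
        country_id = country_ids.setdefault(row["country"], len(country_ids) + 1)
        dim_city.append({"city_id": row["city_id"],
                         "city": row["city"],
                         "country_id": country_id})
    dim_country = [{"country_id": cid, "country": name}
                   for name, cid in country_ids.items()]
    return dim_city, dim_country


dim_city, dim_country = snowflake(dim_location_star)
```

The fact table would still reference `city_id`; reaching the country now takes one extra join through `dim_city`.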
Data Warehouse Functionality
• Influenced by SQL and spreadsheets
• Aggregate a measure over one or more dimensions
 Examples: find total sales, find total sales by region, find the top 5
best-selling products
• Roll-up: summarizes data with increasing generalization
 Given total sales by city, can roll up to get the total sales by country
• Drill-down: reveals increasing levels of detail (the inverse of roll-up)
 Given total sales by country, can drill down to get the total sales by city
 Can also drill down on a different dimension to get total sales by
product for each country
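The sales examples above can be sketched in a few lines of Python. The sample facts and the `total_sales` helper are made up for illustration; roll-up and drill-down are expressed simply as grouping the same facts at a coarser or finer level.

```python
from collections import defaultdict

# Facts at the finest level: (country, city, product, amount).
FIELDS = ("country", "city", "product")
sales = [
    ("Canada", "Waterloo", "P123", 10),
    ("Canada", "Toronto", "P123", 20),
    ("Canada", "Toronto", "P124", 5),
    ("USA", "Boston", "P124", 8),
]


def total_sales(facts, by):
    """Total the amount, grouped by the fields listed in `by`."""
    positions = [FIELDS.index(name) for name in by]
    totals = defaultdict(int)
    for *dims, amount in facts:
        totals[tuple(dims[i] for i in positions)] += amount
    return dict(totals)


by_city = total_sales(sales, ["country", "city"])
# Roll-up: city -> country (more general).
by_country = total_sales(sales, ["country"])
# Drill-down on a different dimension: country -> country and product.
by_country_product = total_sales(sales, ["country", "product"])
```

Drilling down from `by_country` back to `by_city` is only possible because the detailed facts are still available; the rolled-up totals alone do not contain that information.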
Roll-up vs Drill-down
[Figure: a three-dimensional data cube with axes Product (P123, P124, P125, P126, ...), Fiscal quarter (Qtr 1 to Qtr 4), and Region (Reg 1 to Reg 3). A "roll up" arrow leads from the cube to a two-dimensional model of Product by Region; a "drill down" arrow leads back.]
Data Warehouse Functionality
• Pivoting (rotation): changing from one dimensional hierarchy to another
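One simple reading of pivoting is a rotation of the axes of a two-dimensional view. The sketch below, with made-up values and a hypothetical `pivot` helper, turns a Product by Region table into a Region by Product table without changing any cell.

```python
# A two-dimensional view: rows are products, columns are regions.
by_product = {
    "P123": {"Reg1": 10, "Reg2": 5},
    "P124": {"Reg1": 7, "Reg2": 0},
}


def pivot(table):
    """Swap the row and column dimensions of a nested-dict table."""
    rotated = {}
    for row_key, row in table.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated


by_region = pivot(by_product)
```

Pivoting twice returns the original orientation, which is a quick sanity check that no data is lost by the rotation.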
Data Warehouse Functionality
• Slice and dice: reduction of data into smaller chunks so that information
is made visible from multiple points of view
• Sorting: data are sorted by ordinal value
• Selection: data are filtered by value or range
• Derived (computed) attributes: computed by operations on stored and
derived values
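The operations listed above can be sketched over a small list of cube cells. The records and field names are invented for illustration, and the slice/dice split follows the common convention (slice fixes one dimension value, dice restricts several dimensions), which is an assumption beyond what the slide states.

```python
cells = [
    {"product": "P123", "quarter": "Qtr1", "region": "Reg1", "units": 4, "price": 2.5},
    {"product": "P123", "quarter": "Qtr2", "region": "Reg1", "units": 12, "price": 2.5},
    {"product": "P124", "quarter": "Qtr1", "region": "Reg2", "units": 3, "price": 4.0},
    {"product": "P124", "quarter": "Qtr3", "region": "Reg1", "units": 6, "price": 4.0},
]

# Slice: fix a single value on one dimension.
qtr1_slice = [c for c in cells if c["quarter"] == "Qtr1"]

# Dice: restrict several dimensions at once to get a smaller sub-cube.
diced = [c for c in cells
         if c["quarter"] in {"Qtr1", "Qtr2"} and c["region"] == "Reg1"]

# Selection: filter by a value range on a measure.
selected = [c for c in cells if 3 <= c["units"] <= 10]

# Derived attribute: revenue is computed from stored values.
with_revenue = [dict(c, revenue=c["units"] * c["price"]) for c in cells]

# Sorting: order by the derived value, highest first.
ranked = sorted(with_revenue, key=lambda c: c["revenue"], reverse=True)
```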