Download Newer Database Topics - University of Manitoba

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
University of Manitoba
Asper School of Business
3500 DBMS
Bob Travica
Chapter 8
Newer Database Topics
Based on G. Post, DBMS: Designing & Building Business Applications
Updated 2010
D
B
S
Y
S
T
E
M
S
OLAP & Data Warehouse
MIS 3500
Predefined
reports
Interactive
data analysis
Operations’
data
Periodical
transfers
Online Transaction
Processing (OLTP):
Querying Databases
with 3NF tables
Flat files
Online Analytical
Processing (OLAP);
Data warehousing;
Data Mining.
Usually denormalized data.
2 of 20
D
B
S
Y
S
T
E
M
S
OLTP vs. OLAP
Category
Data storage
Indexes
Joins
Duplicated data
Updates
Queries
OLTP
3NF tables
Few
Many
Normalized,
limited duplication
Constant,
small data sets
Specific
OLAP
Multidimensional cubes
Many
Minimal
Denormalized DBMS
Overnight,
Large data sets
Ad hoc
3 of 20
D
B
S
Y
S
T
E
M
S
Warehousing Goals
 Integrate data from different sources to get a larger picture of
business
 Data aggregations (summaries on different dimensions)
 Ad hoc queries (support non-routine decision making)
 Statistical analysis (test hypotheses on relationships between
pieces of data)
 Discover new relationships (data mining)
4 of 20
D
B
Extraction, Transformation, and
Transportation
• Preparations performed on data
Transform
S
Y
S
T
E
M
S
Transport
Customers
Extract
Convert “Client” to
“Customer”
Apply standard product
numbers
Convert currencies
Fix region codes
Transaction data
from diverse
systems.
Data warehouse:
All data must be
consistent.
5 of 20
D
B
S
Y
S
T
E
M
S
Three-Dimensional View of Data: Cube
Customer
Location
Similar ideas used
in crosstab query and
pivot table.
6 of 20
D
B
Data Hierarchy
Year
S
Y
S
T
E
M
S
Levels
Quarter
Roll-up
To get higher-level totals
Month
Week
Drill-down
To get lower-level details
Day
7 of 20
D
B
Star Design
Dimension Table
Dimension Table
S
Y
S
T
E
M
S
Product
Category
Hierarchical: Dimension tables
can link only via fact table.
Fact Table
Sale
SaleDate
SalePrice
Quantity
Amount=SalePrice*Quantity
Dimension Table
Measures
Customer
Location
Amounts broken down by
product category, period, and
customer location.
8 of 20
D
B
Snowflake Design
Merchandise
S
Y
S
T
E
M
S
ItemID
Description
QuantityOnHand
ListPrice
Category
OLAPItems
SaleID
ItemID
Quantity
SalePrice
Amount
Network-like design:
Dimension tables can link
directly.
Sale
SaleID
SaleDate
EmployeeID
CustomerID
SalesTax
City
CityID
ZipCode
City
State
Customer
CustomerID
Phone
FirstName
LastName
Address
ZipCode
CityID
9 of 20
D
B
S
Y
S
T
E
M
S
Excel Pivot Table Reports
Quarter
Month
Quarter 1 Quarter 2 Quarter 3 Quarter 4 Grand Total
LastName EmployeeIDData
Carpenter
8 Sum of Animal
1,668.91
Sum of Merchandise
324.90
Eaton
6 Sum of Animal
522.37
Sum of Merchandise
30.60
Farris
7 Sum of Animal
5,043.36
Sum of Merchandise
826.92
Gibson
2 Sum of Animal
4,983.51
Sum of Merchandise
668.25
Hopkins
4 Sum of Animal
3,747.96
Sum of Merchandise
476.91
James
5 Sum of Animal
3,282.77
Sum of Merchandise
505.89
O'Connor
9 Sum of Animal
2,643.69
Sum of Merchandise
263.70
Reasoner
3 Sum of Animal
4,577.43
Sum of Merchandise
762.30
Reeves
1 Sum of Animal
1,120.93
Sum of Merchandise
263.88
Shields
10 Sum of Animal
1,008.76
Sum of Merchandise
62.10
Total Sum of Animal
28,599.69
Total Sum of Merchandise
4,185.45
606.97
78.30
426.39
99.00
341.85
54.90
1,059.70
188.10
1,549.83
238.50
1,194.88
252.90
2,373.08
693.45
180.91
83.70
625.74
89.10
372.65
121.50
437.88
99.00
510.12
55.80
589.68
116.80
7,591.11
1,624.05
162.15
22.50
2,840.72
569.50
7.20
128.70
562.50
107.10
796.47
306.00
2,556.10
450.90
128.41
7.20
150.11
99.00
2,500.24
396.90
6,701.03
1,495.80
2,709.47
630.90
1,426.72
192.60
6,899.53
1,321.02
9,089.44
1,357.65
5,443.90
858.51
6,243.84
1,397.34
3,334.72
403.20
8,293.09
1,365.10
1,120.93
263.88
1,170.91
84.60
45,732.55
7,874.80
Can place data in rows or columns.
By grouping months, can instantly get quarterly or monthly totals.
10 of 20
D
B
S
Y
S
T
E
M
S
CUBE Option (SQL 99)
SELECT Category, Month, Sum, GROUPING (Category) AS Gc,
GROUPING (Month) AS Gm
FROM …
GROUP BY CUBE (Category, Month...)
Category
Bird
Bird
…
Bird
Bird
Cat
Cat
…
Cat
(null)
(null)
(null)
…
(null)
Month
Amount
Gc
Gm
1
2
135.00
45.00
0
0
0
0
(null)
(null)
1
2
32.00
607.50
396.00
113.85
0
1
0
0
0
0
0
0
(null)
1
2
3
1293.30
1358.8
1508.94
2362.68
1
0
0
0
0
1
1
1
(null)
8451.79
1
1
11 of 20
D
B
S
Y
S
T
E
M
S
GROUPING SETS: Hiding Details
SELECT Category, Month, Sum
FROM …
GROUP BY GROUPING SETS
( ROLLUP (Category),
ROLLUP (Month),
() )
Category Month
Bird
(null)
Cat
(null)
…
(null)
1
(null)
2
(null)
3
…
(null) (null)
Amount
607.50
1293.30
1358.8
1508.94
2362.68
8451.79
12 of 20
D
B
S
Y
S
T
E
M
S
SQL RANK Functions
SELECT Employee, SalesValue
RANK() OVER (ORDER BY SalesValue DESC) AS rank
DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
FROM Sales
ORDER BY SalesValue DESC, Employee;
Employee
SalesValue
rank
dense
Jones
18,000
1
1
Smith
16,000
2
2
Black
16,000
2
2
White
14,000
4
3
DENSE_RANK
does not skip
numbers
• Therefore, advances in SQL motivate DBMS vendors to support
OLAP and data warehousing.
13 of 20
D
B
S
Y
S
T
E
M
S
Data Mining
 Goal: To discover unknown relationships in the data that can be
used to make better decisions.
 Exploratory analysis.
 A bottom-up approach that scans the data to find relationships
 Some statistical routines, but they are not sufficient
 Statistics relies on averages
 Sometimes the important data lies in more detailed pairs
 Supervised by developer vs. unsupervised (self-organizing
artificial neural networks)
14 of 20
D
B
Common Techniques
 1. Classification/Prediction
S
Y
S
T
E
M
S
 2. Association Rules/Market Basket Analysis
 3. Clustering
15 of 20
D
B
S
Y
S
T
E
M
S
1. Classification
(Prediction)
 Purpose: “Classify” things that are causes and those that are
effects.
 Examples
 Which borrowers/loans are most likely to be successful?
 Which customers are most likely to want a new item?
 Which companies are likely to file bankruptcy?
 Which workers are likely to quit in the next six months?
 Which startup companies are likely to succeed?
 Which tax returns are fraudulent?
16 of 20
D
B
S
Y
S
T
E
M
S
Classification Process




Clearly identify the outcome/dependent variable.
Identify potential variables that might affect the outcome.
Use sample data to test and validate the model.
Regression/correlation analysis, decision tables and trees,
etc.
Income Credit History Job Stability
Credit
Success
50000
Good
Good
Yes
75000
Mixed
Bad
No
17 of 20
D
B
2. Association/Market Basket
 Purpose: Determine what events or items go together/co-occur.
S
Y
S
T
E
M
S
 Examples:
 What items are customers likely to buy together?
(Business use: Consider putting the two together to
increase cross-selling.)
18 of 20
D
B
S
Y
S
T
E
M
S
Association Challenges
 If an item is rarely purchased, any other item bought with it
seems important. So combine items into categories.
 Some relationships are obvious.
 Burger and fries.
 Some relationships are puzzling/meaningless.
 Hardware store found that toilet rings sell well only when a new
store first opened. But what does it mean?
19 of 20
D
B
S
Y
S
T
E
M
S
3. Cluster Analysis
 Purpose: Determine groups of people or some entities.
 Examples
 Are there groups of customers? (If so, we could target
them; market segmentation)
 Do the locations for our stores have elements in common?
(If so, we can search for similar clusters for new locations.)
 Do employees have common characteristics?
(If so, we can hire similar, or dissimilar, people.)
Large
intercluster
distance
Small
intracluster
distance
20 of 20