DataWarehousing vs DataMining (another 4 algorithms)

tutorial based on the book:
Data Mining
Concepts and Techniques
by Jiawei Han and Micheline Kamber
This material was developed with financial help of the WUSA fund of Austria.
made by Radmilo Pesic & Branko Golubovic
1/74
Introduction
What motivated data mining?
Necessity is the mother of invention.
Data Collection and Database Creation
(1960s and earlier)
Database Management Systems
(1970s-early 1980s)
Advanced Databases Systems
(mid-1980s-present)
Web-based Databases Systems
(1990s-present)
Data Warehousing and Data Mining
(mid-1980s-present)
New Generation of Integrated Information Systems
(2000-…)
What Is Data Mining?
Extracting or “mining” knowledge from large amounts of data.
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
Figure (knowledge discovery process): Databases and flat files -> Cleaning and Integration -> Data warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Evaluation and Presentation -> Knowledge
Components of a typical data mining system:
• Database, data warehouse,
or other information repository
• Database
or data warehouse server
• Knowledge base
• Data mining engine
• Pattern evaluation module
• Graphical user interface
Figure (layers, top to bottom): Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Database and Data warehouse. The Knowledge base is drawn beside the upper layers.
Data mining – On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems and Advanced Database Applications
(object-oriented, object-relational, spatial, temporal, time-series, text, multimedia, heterogeneous, legacy databases and the World Wide Web)
Relational Databases

customer
cust_ID | name | address | age | income | credit_info | …
C1 | Smith, Sandy | 5463 E Hastings, Burnaby, BC V5A 4S9, Canada | 21 | $27000 | 1 | …
… | … | … | … | … | … | …

item
item_ID | name | brand | category | type | price | place_made | supplier | cost
I3 | high_res_TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
I8 | multidiscCDplay | Sanyo | multidisc | CD player | $369.00 | Japan | Music Front | $120.00
… | … | … | … | … | … | … | … | …

employee
empl_ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%
… | … | … | … | … | …

branch
branch_ID | name | address
B1 | City Square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
… | … | …

purchases
trans_ID | cust_ID | empl_ID | date | time | method_paid | amount
T100 | C1 | E55 | 09/21/98 | 15:45 | Visa | $1357.00
… | … | … | … | … | … | …

item_sold
trans_ID | item_ID | qty
T100 | I3 | 1
T100 | I8 | 2

works_at
empl_ID | branch_ID
E55 | B1
… | …
Data Warehouses
Figure: Data sources in Chicago, New York, Toronto, and Vancouver -> Clean, Transform, Integrate, Load -> Data warehouse -> Query and analysis tools -> Clients
Typical architecture of a data warehouse for AllElectronics
Figure: a 3-D data cube of sales with dimensions location (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: computer, home entertainment, security, phone). The cell <Vancouver,Q1,security> holds 400.
Drill-down on time data for Q1: time (months) Jan 150, Feb 100, March 150.
Roll-up on address: location becomes USA 2000, Canada 1000.
Text Databases and Multimedia Databases
• Text databases can be highly unstructured, semistructured or well structured
• Multimedia databases store image, audio, and video data
• Such data require a lot of storage space; they are continuous-media data
Heterogeneous Databases and Legacy Databases
The World Wide Web
• mining path traversal patterns
Data Mining Functionalities
What Kinds of Patterns Can Be Mined?
• Concept/Class Description:
Characterization and Discrimination
• Association Analysis
• Classification and Prediction
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Are All of the Patterns Interesting?
A pattern is interesting if it is:
• easily understood
• valid
• (potentially) useful
• novel
or if it
• confirms a user’s hypothesis
An interesting pattern represents knowledge!
Objective measures of pattern interestingness:
• support
• confidence
Subjective measures of pattern interestingness:
• the pattern is unexpected
• the pattern is actionable
• the pattern is expected (it confirms a hypothesis)
Can a data mining system generate all of the interesting patterns?
Can a data mining system generate only interesting patterns?
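The two objective measures named on this slide, support and confidence, can be sketched for a rule of the form A => B. This is a minimal illustration: the transactions and item names below are made up, and the function names are assumptions, not part of the tutorial.

```python
# Hypothetical transactions; item names are invented for illustration.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"phone"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: computer => software
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
```

Here 2 of 4 transactions contain both items, and 2 of the 3 transactions with a computer also contain software.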
Classification of
Data Mining Systems
Figure: data mining at the confluence of multiple disciplines: database technology, statistics, information science, visualization, machine learning, and other disciplines.
• according to kinds of databases mined
(relational, data warehouse, object-oriented…)
• according to kinds of knowledge mined
(association, classification, clustering…; generalized, primitive-level or
knowledge at multiple levels; regularities or irregularities)
• according to the kinds of techniques utilized
(autonomous, interactive exploratory or query-driven systems; data
warehouse oriented, statistics…)
• according to the applications adapted
(for finance, DNA, etc.)
Major Issues in Data Mining
Mining methodology and user interaction issues:
• Mining different kinds of knowledge in databases
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation – the interestingness problem
Performance issues:
• Efficiency and scalability of data mining algorithms
• Parallel, distributed, and incremental mining algorithms
Issues relating to the diversity of database types:
• Handling of relational and complex types of data
• Mining information from heterogeneous databases and global information systems
Data Warehouse and OLAP
Technology for Data Mining
What Is a Data Warehouse?
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process.”
W.H. Inmon
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile
How are organizations using the information from
data warehouse?
• Increasing customer focus
• Repositioning products and managing product portfolios
• Analyzing operations and looking for sources of profit
• Managing the customer relationships,
making environmental corrections, and
managing the cost of corporate assets
Different approaches to heterogeneous database integration:
• Query-driven approach (wrappers and integrators)
• Update-driven approach
Differences Between Operational
Database Systems and Data Warehouse
• Users and system orientation
• Data contents
• Database design
• View
• Access patterns
Why have a separate data warehouse?
A Multidimensional Data Model
From Tables and Spreadsheets
to Data Cubes
• A data cube is defined by dimensions and facts
• Dimension table
• Fact table
location = “Chicago”
time | home ent. | comp. | phone | sec.
Q1 | 854 | 882 | 89 | 623
Q2 | 943 | 890 | 64 | 698
Q3 | 1032 | 924 | 59 | 789
Q4 | 1129 | 992 | 63 | 870

location = “New York”
time | home ent. | comp. | phone | sec.
Q1 | 1087 | 968 | 38 | 872
Q2 | 1130 | 1024 | 41 | 925
Q3 | 1034 | 1048 | 45 | 1002
Q4 | 1142 | 1091 | 54 | 984

location = “Toronto”
time | home ent. | comp. | phone | sec.
Q1 | 818 | 746 | 43 | 591
Q2 | 894 | 769 | 52 | 682
Q3 | 940 | 795 | 58 | 728
Q4 | 978 | 864 | 59 | 784

location = “Vancouver”
time | home ent. | comp. | phone | sec.
Q1 | 605 | 825 | 14 | 400
Q2 | 680 | 952 | 31 | 512
Q3 | 812 | 1023 | 30 | 501
Q4 | 927 | 1038 | 38 | 580

Figure: the same data drawn as a 3-D data cube with dimensions location (cities), time (quarters), and item (types).
A 2-D view of sales data for AllElectronics, and its 3-D data cube representation
Figure: a series of 3-D cubes with dimensions location (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1 to Q4), and item (types: computer, home entertainment, security, phone), one cube per value of the fourth dimension, supplier (“SUP1”, “SUP2”, ...).
A 4-D data cube representation of sales data for AllElectronics
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
Lattice of cuboids, making up a 4-D data cube
Stars, Snowflakes, and Fact Constellations:
Schemas for Multidimensional Databases
Star schema:
• a large central table (fact table)
• a set of smaller attendant tables (dimension tables),
one for each dimension
time dimension table: time_key, day, day_of_week, month, quarter, year
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
branch dimension table: branch_key, branch_name, branch_type
item dimension table: item_key, item_name, brand, type, supplier_type
location dimension table: location_key, street, city, province_or_state, country
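A star schema query typically joins the fact table to a few dimension tables and aggregates a measure. The sketch below uses Python's built-in sqlite3 module with the sales, item, and location tables from the slide; the inserted rows are invented for illustration, and the time and branch dimensions are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
CREATE TABLE sales (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                    location_key INTEGER, dollars_sold REAL, units_sold INTEGER);
""")

# Hypothetical rows, only to make the query runnable.
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)", [
    (1, "high_res_TV", "Toshiba", "home entertainment", "wholesale"),
    (2, "desktop", "IBM", "computer", "wholesale"),
])
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", [
    (1, "Main Street", "Vancouver", "British Columbia", "Canada"),
    (2, "Main Street", "Chicago", "Illinois", "USA"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 988.0, 1),
    (1, 2, 1, 1, 1200.0, 1),
    (1, 1, 1, 2, 1976.0, 2),
    (2, 1, 1, 2, 988.0, 1),
])

# Join the central fact table to two dimension tables and aggregate a measure.
rows = conn.execute("""
    SELECT l.country, i.type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY l.country, i.type
    ORDER BY l.country, i.type
""").fetchall()
```

Each dimension is reached with a single join, which is the browsing advantage of the star layout over a normalized one.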
Snowflake schema:
• a variant of the star schema, where some dimension tables are normalized
• reduces redundancies, but also reduces the effectiveness of browsing
time dimension table: time_key, day, day_of_week, month, quarter, year
branch dimension table: branch_key, branch_name, branch_type
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, province_or_state, country
Fact constellation:
• multiple fact tables share dimension tables
time dimension table: time_key, day, day_of_week, month, quarter, year
branch dimension table: branch_key, branch_name, branch_type
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
item dimension table: item_key, item_name, brand, type, supplier_type
shipping fact table: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type
location dimension table: location_key, street, city, province_or_state, country
Defining multidimensional schema
• DMQL – data mining query language
• Syntax:
cube definition:
define cube <cube_name> [<dimension_list>]: <measure_list>
dimension definition:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Example:
• Constellation schema defined in DMQL:
define cube sales [time, item, branch, location]:
dollars_sold=sum(sales_in_dollars), units_sold=count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost=sum(cost_in_dollars), units_shipped=count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name,
location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures:
Their Categorization and Computation
Measures, based on the aggregate function:
• Distributive
• Algebraic
• Holistic
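The three categories can be illustrated by asking whether an aggregate over the whole data can be rebuilt from results computed on partitions. The numbers and the two-partition split below are made up for illustration.

```python
import statistics

# Hypothetical values split into two partitions (e.g. two chunks of a cube).
part_a = [4, 8, 15]
part_b = [21, 21, 24, 25]
all_values = part_a + part_b

# Distributive: sum() over the partial sums equals sum() over all the data.
total = sum([sum(part_a), sum(part_b)])

# Algebraic: avg() is derived from a bounded number of distributive
# results (here sum and count), not from the partial averages themselves.
average = (sum(part_a) + sum(part_b)) / (len(part_a) + len(part_b))

# Holistic: median() has no constant-size summary per partition, so it is
# computed from the full data; combining partial medians gives a different value.
median = statistics.median(all_values)
median_of_medians = statistics.median(
    [statistics.median(part_a), statistics.median(part_b)]
)
```

In this data the true median is 21, while the median of the two partial medians is 15.25.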
Introducing Concept Hierarchies
• A concept hierarchy defines a sequence of mappings
from a set of low-level to higher-level concepts.
Concept hierarchy for location (levels: all > country > province_or_state > city):
all
  Canada
    British Columbia: Vancouver, Victoria
    Ontario: Toronto, Ottawa
  USA
    New York: New York, Buffalo
    Illinois: Chicago
• Hierarchical and lattice structures of attributes in warehouse dimensions:
Hierarchy for location: street < city < province_or_state < country
Lattice for time: day < {month < quarter; week} < year
OLAP Operations in the
Multidimensional Data Model
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
• Other (drill-across, drill-through)
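The first four operations can be sketched on a cube stored as a dictionary keyed by (city, quarter, item). The sales values are taken from the AllElectronics sales data in this tutorial; the dictionary layout, the function names, and the choice to load only some quarters are assumptions made for this illustration.

```python
from collections import defaultdict

ITEMS = ["home entertainment", "computer", "phone", "security"]
# Sales per (city, quarter), listed in ITEMS order.
SALES = {
    ("Chicago", "Q1"): [854, 882, 89, 623],
    ("New York", "Q1"): [1087, 968, 38, 872],
    ("Toronto", "Q1"): [818, 746, 43, 591],
    ("Vancouver", "Q1"): [605, 825, 14, 400],
    ("Toronto", "Q2"): [894, 769, 52, 682],
    ("Vancouver", "Q2"): [680, 952, 31, 512],
}
COUNTRY = {"Chicago": "USA", "New York": "USA",
           "Toronto": "Canada", "Vancouver": "Canada"}

# Base cuboid: one cell per (city, quarter, item).
cube = {(city, q, item): v
        for (city, q), row in SALES.items()
        for item, v in zip(ITEMS, row)}

def roll_up_location(cube):
    """Roll-up on location: climb the hierarchy from city to country."""
    out = defaultdict(int)
    for (city, q, item), v in cube.items():
        out[(COUNTRY[city], q, item)] += v
    return dict(out)

def slice_time(cube, quarter):
    """Slice: fix the time dimension to one value, leaving a 2-D view."""
    return {(city, item): v for (city, q, item), v in cube.items() if q == quarter}

def dice(cube, cities, quarters, items):
    """Dice: select a sub-cube by restricting several dimensions at once."""
    return {k: v for k, v in cube.items()
            if k[0] in cities and k[1] in quarters and k[2] in items}

def pivot(view):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in view.items()}

by_country = roll_up_location(cube)
q1 = slice_time(cube, "Q1")
sub = dice(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"},
           {"home entertainment", "computer"})
rotated = pivot(q1)
```

Drill-down is the inverse of roll-up and needs the finer-grained data (for example monthly figures), which is why it is not derived from this cube.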
Figure: starting from the cube of location (cities), time (quarters), and item (types), roll-up on location (from cities to countries) replaces Chicago, New York, Toronto, and Vancouver with USA and Canada; drill-down on time (from quarters to months) replaces Q1 to Q4 with January to December.
Figure: slice for time=“Q1” gives a 2-D view of location (cities) by item (types); dice for (location=“Toronto” or “Vancouver”) and (time=“Q1” or “Q2”) and (item=“home entertainment” or “computer”) gives a sub-cube; pivot rotates the slice, swapping the item (types) and location (cities) axes.
A Starnet Query Model for Querying
Multidimensional Databases
Figure: a starnet with one radial line per dimension, each listing its abstraction levels:
location: continent, country, province_or_state, city, street
customer: group, category, name
item: name, brand, category, type
time: day, month, quarter, year
Data Warehouse Architecture
Steps for the Design and Construction of
Data Warehouse
The Design of a Data Warehouse: A Business Analysis Framework
• top-down view
• data source view
• data warehouse view
• business query view
The Process of Data Warehouse Design
• top-down approach
• bottom-up approach
• combined approach
• waterfall method
• spiral method
Steps of the warehouse design:
1) Choosing a business process to model;
2) Choosing the grain of the business process;
3) Choosing the dimensions;
4) Choosing the measures.
A Three-Tier
Data Warehouse Architecture
Top tier: front-end tools (Query/report, Analysis, Data mining)
Middle tier: OLAP server
Bottom tier: data warehouse server (Data warehouse, Data marts, Metadata repository, Monitoring, Administration)
Data: Operational databases and External sources, fed in through Extract, Clean, Transform, Load, Refresh
There are three data warehouse models:
• Enterprise warehouse
• Data mart
• Virtual warehouse
Types of OLAP Servers:
ROLAP versus MOLAP versus HOLAP
Relational OLAP (ROLAP) servers:
• use of relational or extended-relational DBMS
• greater scalability
Multidimensional OLAP (MOLAP) servers:
• use of data cube – fast indexing
• possible low storage utilization – use of compression
Hybrid OLAP (HOLAP) servers:
• scalability of ROLAP and faster computation of MOLAP
• Microsoft SQL Server 7.0 OLAP Services
supports HOLAP server
Data Warehouse Implementation
• SQL: group by
• Data cube computation extends SQL with compute cube
• Example:
  - “Compute the sum of sales, grouping by item and city.”
  - “Compute the sum of sales, grouping by item.”
  - “Compute the sum of sales, grouping by city.”
• The possible group by’s are the following:
{(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}
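The set of possible group-bys is the set of all subsets of the dimensions. A minimal sketch using the standard itertools module; the function name is an assumption made for this illustration.

```python
from itertools import combinations

def all_group_bys(dimensions):
    """Return every subset of the dimensions, from the full set down to ()."""
    return [combo
            for k in range(len(dimensions), -1, -1)
            for combo in combinations(dimensions, k)]

group_bys = all_group_bys(("city", "item", "year"))
```

For three dimensions this yields 8 group-bys, starting with the base cuboid (city, item, year) and ending with the apex cuboid ().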
0-D (apex) cuboid: ()
1-D cuboids: (city), (item), (year)
2-D cuboids: (city,item), (city,year), (item,year)
3-D (base) cuboid: (city,item,year)
Lattice of cuboids
define cube sales [item, city, year]: sum(sales_in_dollars)
compute cube sales
• Number of cuboids in an n-dimensional data cube is $2^n$
• Number of cuboids in an n-dimensional data cube where each dimension has a concept hierarchy (e.g. day<week<month<quarter<year) is:
$$T = \prod_{i=1}^{n} (L_i + 1)$$
where $L_i$ is the number of levels of dimension $i$; the extra 1 counts the virtual top level, all.
• Example:
if the cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids that can be generated will be $5^{10} \approx 9.8 \times 10^6$
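The count can be checked numerically. A small sketch of the product formula, where each entry of the input list is the number of hierarchy levels of one dimension; the function name is an assumption.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """T = product of (L_i + 1); the +1 is the virtual top level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)

# Without hierarchies every dimension has one level, which gives 2**n.
three_dims = total_cuboids([1, 1, 1])

# The slide's example: 10 dimensions with 4 levels each.
ten_dims = total_cuboids([4] * 10)
```

With one level per dimension the formula reduces to 2^n, and the slide's example gives 5^10 = 9,765,625 cuboids.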
Partial Materialization:
Selected Computation of Cuboids
There are three choices for data cube materialization given a base cuboid:
(1) do not precompute any of the “nonbase” cuboids (no materialization)
(2) precompute all of the cuboids (full materialization)
(3) selectively compute a proper subset
of the whole set of possible cuboids (partial materialization);
the partial materialization of cuboids should consider three factors:
• identify the subset of cuboids to materialize,
• exploit the materialized cuboids during query processing, and
• efficiently update the materialized cuboids during load and refresh.
Multiway Array Aggregation
in the Computation of Data Cubes
ROLAP:
• Sorting, hashing, and grouping operations are applied
to the dimension attributes in order to reorder and cluster related tuples.
• Grouping is performed on some subaggregates as a “partial grouping step”.
These “partial groupings” may be used
to speed up the computation of other subaggregates.
• Aggregates may be computed from previously computed aggregates,
rather than from the base fact tables.
MOLAP:
• Partition the array into chunks.
• Compute aggregates by visiting cube cells.
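The MOLAP idea of visiting each chunk once and updating several aggregates in the same pass can be sketched as follows. This is a simplified illustration: the array size, chunk size, and cell values are invented, and the sketch does not model the chunk-ordering choice that limits memory use in practice.

```python
from collections import defaultdict
from itertools import product

SIZE, CHUNK = 4, 2   # a 4x4x4 array split into 2x2x2 chunks (illustrative sizes)

# Hypothetical base cuboid: cell (a, b, c) holds the value a + 10*b + 100*c.
array = {(a, b, c): a + 10 * b + 100 * c
         for a, b, c in product(range(SIZE), repeat=3)}

ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)

# Visit the array chunk by chunk; every cell is read once and contributes
# to all three 2-D aggregates (AB, AC, BC) during the same pass.
for a0, b0, c0 in product(range(0, SIZE, CHUNK), repeat=3):
    for da, db, dc in product(range(CHUNK), repeat=3):
        a, b, c = a0 + da, b0 + db, c0 + dc
        v = array[(a, b, c)]
        ab[(a, b)] += v
        ac[(a, c)] += v
        bc[(b, c)] += v
```

Each 2-D aggregate ends up with 16 cells, and all three carry the same grand total as the base array.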
Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), with its chunks numbered 1 to 64.
A 3-D array for the dimensions A, B, and C, organized into 64 chunks
Indexing OLAP Data
• Bitmap indexing
• Join indexing
Bitmap Indexing
Base table
RID | item | city
R1 | H | V
R2 | C | V
R3 | P | V
R4 | S | V
R5 | H | T
R6 | C | T
R7 | P | T
R8 | S | T

Item bitmap index table
RID | H | C | P | S
R1 | 1 | 0 | 0 | 0
R2 | 0 | 1 | 0 | 0
R3 | 0 | 0 | 1 | 0
R4 | 0 | 0 | 0 | 1
R5 | 1 | 0 | 0 | 0
R6 | 0 | 1 | 0 | 0
R7 | 0 | 0 | 1 | 0
R8 | 0 | 0 | 0 | 1

City bitmap index table
RID | V | T
R1 | 1 | 0
R2 | 1 | 0
R3 | 1 | 0
R4 | 1 | 0
R5 | 0 | 1
R6 | 0 | 1
R7 | 0 | 1
R8 | 0 | 1

Indexing OLAP data using bitmap indices
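A bitmap index keeps one bit vector per distinct attribute value, so a conjunctive selection becomes a bitwise AND. The sketch below rebuilds the two index tables from the slide's base table; the function and variable names are assumptions.

```python
# Base table from the slide: RID -> (item, city).
base_table = {
    "R1": ("H", "V"), "R2": ("C", "V"), "R3": ("P", "V"), "R4": ("S", "V"),
    "R5": ("H", "T"), "R6": ("C", "T"), "R7": ("P", "T"), "R8": ("S", "T"),
}
rids = list(base_table)

def build_bitmap_index(column):
    """Map each distinct value of a column to a bit vector over the rows."""
    index = {}
    for pos, rid in enumerate(rids):
        value = base_table[rid][column]
        index.setdefault(value, [0] * len(rids))[pos] = 1
    return index

item_index = build_bitmap_index(0)
city_index = build_bitmap_index(1)

# "item = H and city = T" becomes a bitwise AND of two bit vectors.
matches = [a & b for a, b in zip(item_index["H"], city_index["T"])]
matching_rids = [rid for rid, bit in zip(rids, matches) if bit]
```

Only R5 has item H and city T, so the AND leaves a single bit set.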
Join Indexing
Figure: sales tuples T57, T238, T459, and T884 linked to the location value Main Street and the item value Sony-TV.
Linkages between a sales fact table and dimension tables for location and item

Join index table for location/sales
location | sales_key
… | …
Main Street | T57
Main Street | T238
Main Street | T884
… | …

Join index table for item/sales
item | sales_key
… | …
Sony-TV | T57
Sony-TV | T459
… | …

Join index table linking two dimensions location/item/sales
location | item | sales_key
… | … | …
Main Street | Sony-TV | T57
… | … | …

Join index tables based on the linkages between the sales fact table and dimension tables for location and item
Efficient Processing of OLAP Queries
1. Determine which operations should be performed
on the available cuboids
2. Determine to which materialized cuboid(s)
the relevant operations should be applied
Metadata Repository
• A description of the structure
of the data warehouse
• Operational metadata
• The algorithms used for summarization
• The mapping from the operational environment
to the data warehouse
• Data related to system performance
• Business metadata
Data Warehouse
Back-End Tools and Utilities
• Data extraction
• Data cleaning
• Data transformation
• Load
• Refresh
Further Development of
Data Cube Technology
Discovery-Driven Exploration of Data Cubes
• SelfExp
• InExp
• PathExp
Sum of sales
Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
Total | | 1% | -1% | 0% | 1% | 3% | -1% | -9% | -1% | 2% | -4% | 3%
Change in sales over time
Avg. sales
Item | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
Sony b/w printer | | 9% | -8% | 2% | -5% | 14% | -4% | 0% | 41% | -13% | -15% | -11%
Sony color printer | | 0% | 0% | 3% | 2% | 4% | -10% | -13% | 0% | 4% | -6% | 4%
HP b/w printer | | -2% | 1% | 2% | 3% | 8% | 0% | -12% | -9% | 3% | -3% | 6%
HP color printer | | 0% | 0% | -2% | 1% | 0% | -1% | -7% | -2% | 1% | -4% | 1%
IBM desktop computer | | 1% | -2% | -1% | -1% | 3% | 3% | -10% | 4% | 1% | -4% | -1%
IBM laptop computer | | 0% | 0% | -1% | 3% | 4% | 2% | -10% | -2% | 0% | -9% | 3%
Toshiba desktop comp. | | -2% | -5% | 1% | 1% | -1% | 1% | 5% | -3% | -5% | -1% | -1%
Toshiba laptop comp. | | 1% | 0% | 3% | 0% | -2% | -2% | -5% | 3% | 2% | -1% | 0%
Logitech mouse | | 3% | -2% | -1% | 0% | 4% | 6% | -11% | 2% | 1% | -4% | 0%
Ergo-way mouse | | 0% | 0% | 2% | 3% | 1% | -2% | -2% | -5% | 0% | -5% | 8%
Change in sales for each item-time combination
Avg. sales
Region | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec
North | | -1% | -3% | -1% | 0% | 3% | 4% | -7% | 1% | 0% | -3% | -3%
South | | -1% | 1% | -9% | 6% | -1% | -39% | 9% | -34% | 4% | 1% | 7%
East | | -1% | -2% | 2% | -3% | 1% | 18% | -2% | 11% | -3% | -2% | -1%
West | | 4% | 0% | -1% | -3% | 5% | 1% | -18% | 8% | 5% | -8% | 1%
Change in sales for the item IBM desktop computer per region
Complex Aggregation at Multiple Granularities:
Multifeature Cubes
• Example 1:
Query 1: A simple data cube query. Find the total sales in 2000,
broken down by item, region, and month, with subtotals for each dimension.
• Example 2:
Query 2: A complex query. Grouping by all subsets of {item, region, month},
find the maximum price in 2000 for each group,
and the total sales among all maximum price tuples.
select item, region, month, MAX(price), SUM(R.sales)
from Purchases
where year=2000
cube by item, region, month: R
such that R.price=MAX(price)
• Example 3:
Query 3: An even more complex query. Grouping by all subsets of
{item,region,month}, find the maximum price in 2000 for each group. Among the
maximum price tuples, find the minimum and maximum item shelf life. Also find
the fraction of the total sales due to tuples that have minimum shelf life within the
set of all maximum price tuples, and the fraction of the total sales due to tuples
that have maximum shelf life within the set of all maximum price tuples.
select item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf), SUM(R1.sales), SUM(R2.sales), SUM(R3.sales)
from Purchases
where year=2000
cube by item, region, month: R1, R2, R3
such that R1.price=MAX(price) and
R2 in R1 and R2.shelf=MIN(R1.shelf) and
R3 in R1 and R3.shelf=MAX(R1.shelf)
From Data Warehousing
to Data Mining
Data Warehouse Usage
• Information processing
• Analytical processing
• Data mining
From On-Line Analytical Processing to
On-Line Analytical Mining
• High quality of data in data warehouses
• Available information processing infrastructure
surrounding data warehouses
• OLAP-based exploratory data analysis
• On-line selection of data mining functions
Architecture for On-Line Analytical Mining
User side: constraint-based mining query in, mining result out
Layer 4, user interface: Graphical user interface API
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, over the Cube API
Layer 2, multidimensional database: MDDB and meta data, over the Database API
Layer 1, data repository: Databases and Data warehouse, with filtering, data cleaning, and data integration
An integrated OLAM and OLAP architecture
Data Preprocessing
• Data cleaning
• Data integration
• Data transformation: -2, 32, 100, 59, 48 -> -0.02, 0.32, 1.00, 0.59, 0.48
• Data reduction: a table of transactions T1, T2, T3, T4, …, T2000 by attributes A1, A2, A3, …, A126 is reduced to transactions T1, T4, …, T1456 by attributes A1, A3, …, A115
Format of data preprocessing
Data Cleaning
Missing values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
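Options 4 and 5 above can be sketched with the standard statistics module. The tuples, the class labels, and the attribute (income) are invented for illustration; None marks a missing value.

```python
from statistics import mean

# Hypothetical tuples: (class label, income); None marks a missing value.
rows = [("low", 20), ("low", None), ("low", 30),
        ("high", 80), ("high", None), ("high", 100)]

# Option 4: the attribute mean over all tuples with a known value.
attribute_mean = mean(v for _, v in rows if v is not None)

# Option 5: the attribute mean per class, over that class's known values.
class_means = {
    label: mean(v for l, v in rows if l == label and v is not None)
    for label in {l for l, _ in rows}
}

filled_global = [(l, attribute_mean if v is None else v) for l, v in rows]
filled_by_class = [(l, class_means[l] if v is None else v) for l, v in rows]
```

The class-based fill keeps the two groups apart (25 for low, 90 for high), whereas the global mean of 57.5 sits between them.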
Inconsistent data
Noisy data
• Binning
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equidepth) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
• Clustering
• Combined computer and human inspection
• Regression
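The binning example above can be reproduced in a few lines. The function names are assumptions; rounding the bin mean and sending ties to the lower boundary are choices made for this sketch.

```python
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # sorted data from the slide

def equidepth_bins(sorted_values, depth):
    """Split sorted values into consecutive bins of equal size."""
    return [sorted_values[i:i + depth]
            for i in range(0, len(sorted_values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]

bins = equidepth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

Both smoothed results agree with the values listed on the slide.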
Data Integration and Transformation
Data Integration
Data Transformation
• Smoothing
• Aggregation
• Generalization
• Normalization
• Attribute construction
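Normalization is the most mechanical of these steps, so three common variants are sketched here: min-max, z-score, and decimal scaling. The input reuses the five values shown earlier in the tutorial; the function names are assumptions. Decimal scaling as written here requires the largest absolute result to stay below 1, so it divides these values by 1000 rather than by 100.

```python
from statistics import mean, pstdev

values = [-2, 32, 100, 59, 48]

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |v| less than 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

scaled_min_max = min_max(values)
scaled_z = z_score(values)
scaled_decimal = decimal_scaling(values)
```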
Data Reduction
• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation
Dimensionality reduction
1. Stepwise forward selection
2. Stepwise backward elimination
3. Combination of forward selection and backward elimination
4. Decision tree induction
• Example:
Forward selection
Initial attribute set: {A1,A2,A3,A4,A5,A6}
Initial reduced set: {} -> {A1} -> {A1,A4}
Reduced attribute set: {A1,A4,A6}
Backward elimination
Initial attribute set: {A1,A2,A3,A4,A5,A6}
-> {A1,A3,A4,A5,A6} -> {A1,A4,A5,A6}
Reduced attribute set: {A1,A4,A6}
Decision tree induction
Initial attribute set: {A1,A2,A3,A4,A5,A6}
A4?
  Y: A1?
    Y: Class1
    N: Class2
  N: A6?
    Y: Class1
    N: Class2
Reduced attribute set: {A1,A4,A6}
Greedy (heuristic) methods for attribute subset selection.
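Stepwise forward selection can be sketched as a greedy loop around a scoring function. The scoring function here is a stand-in: the weights are invented so that the result mirrors the reduced set in the example, and a real system would use a statistical test or a model's accuracy instead.

```python
def forward_selection(attributes, score, k):
    """Greedily add the attribute that gives the best score until k are chosen."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical relevance weights; not taken from any real data set.
WEIGHTS = {"A1": 0.9, "A2": 0.1, "A3": 0.2, "A4": 0.8, "A5": 0.1, "A6": 0.7}

def toy_score(subset):
    """Stand-in for an attribute evaluation measure."""
    return sum(WEIGHTS[a] for a in subset)

reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score, k=3)
```

Backward elimination follows the same pattern in reverse, starting from the full set and removing the attribute whose removal hurts the score least.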
Data Compression
• Wavelet transforms
• Principal components analysis
Numerosity Reduction
• Regression and log-linear models
• Histograms
• Clustering
• Sampling
Histogram Examples
Figure: count (1 to 10) against price ($) (5 to 30).
A histogram for price using singleton buckets – each bucket represents one price-value/frequency pair.
Figure: count (5 to 25) against price ($) in buckets 1-10, 11-20, 21-30.
An equiwidth histogram for price, where values are aggregated so that each bucket has a uniform width of $10.
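Both histogram types can be built with a Counter. The price list below is invented for illustration (the slide does not list the underlying data), and the bucket labels follow the 1-10, 11-20, 21-30 ranges of the equiwidth example.

```python
from collections import Counter

# Hypothetical integer prices in dollars.
prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 14, 15, 18, 20, 21, 21, 25, 25, 28, 30]

# Singleton buckets: one bucket per distinct price value.
singleton = Counter(prices)

def equiwidth_histogram(values, width):
    """Count integer values per bucket k*width+1 .. (k+1)*width."""
    counts = Counter((v - 1) // width for v in values)
    return {f"{k * width + 1}-{(k + 1) * width}": counts[k]
            for k in sorted(counts)}

buckets = equiwidth_histogram(prices, 10)
```

The equiwidth version stores three counts instead of one count per distinct price, which is the point of using it for numerosity reduction.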
Discretization
And Concept Hierarchy Generation
($0…$1000]
  ($0…$200]: ($0…$100], ($100…$200]
  ($200…$400]: ($200…$300], ($300…$400]
  ($400…$600]: ($400…$500], ($500…$600]
  ($600…$800]: ($600…$700], ($700…$800]
  ($800…$1000]: ($800…$900], ($900…$1000]
A concept hierarchy for the attribute price.
Discretization And
Concept Hierarchy Generation
for Numeric Data
• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based Discretization
• Segmentation by natural partitioning
Concept Hierarchy Generation
for Categorical Data
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
Automatic generation of a schema concept hierarchy based on the number of distinct attribute values.
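The heuristic behind this automatic generation is that an attribute with fewer distinct values sits higher in the hierarchy. A minimal sketch using the counts from the slide; the function name is an assumption.

```python
# Distinct-value counts from the slide, deliberately given out of order.
distinct_counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}

def generate_hierarchy(distinct_counts):
    """Order attributes from most general (fewest distinct values) to most specific."""
    return sorted(distinct_counts, key=distinct_counts.get)

hierarchy = generate_hierarchy(distinct_counts)
```

The heuristic is only a starting point: some hierarchies do not follow it (a time dimension may have 20 distinct years but only 12 months), so the generated order may need adjusting by hand.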
Credits:
Radmilo Pešić
Branko Golubović
Veljko Milutinović