Data Mining and Data Warehousing:
Concepts and Techniques
Course outline
Motivation
Evolution of Database Technology - overview
Why Data Mining? — Potential Applications
What Is Data Mining? Data Mining: A KDD Process
Data Mining: On What Kind of Data?
What is a Data Warehouse?
Data Warehouse vs. other systems, OLTP vs. OLAP
Conceptual Modeling of Data Warehouses
Defining a Snowflake Schema in Data Mining Query Language DMQL
Multi-Tiered Architecture - Approaches to Building OLAP Server
Indexing OLAP Data: Bitmap Index
Data Warehouse Back-End Tools and Utilities
From OLAP to On Line Analytical Mining OLAM, An OLAM Architecture
Data Mining Functionalities
Are All the “Discovered” Patterns Interesting?
Market-Basket Data; typical case,
Frequent Pairs in SQL, A-Priori Algorithm
Motivation
Data explosion problem:
Automated data collection tools and database technology lead to tremendous
amounts of data stored in databases, data warehouses and other information
repositories.
Data are collected from everywhere and in huge amounts
 We are Data Rich but Information Poor
 How to make good use of your data?
Data warehousing and data mining
 On-line analytical processing
 Extraction of interesting knowledge (rules, patterns,…) from data in large
databases.
 Bring together scattered information from multiple sources as to provide a
consistent database source for decision support queries.
 Provide architectures and tools for business executives to systematically organize,
understand, and use their data to make strategic decisions.
Evolution of Database Technology - overview
1960s: Data collection, database creation, IMS and network
DBMS.
1970s: Relational data model, relational DBMS
implementation.
1980s: RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented DBMS
(spatial, scientific, engineering, etc.).
1990s: Data mining and data warehousing, multimedia
databases, and Web technology.
Why Data Mining? — Potential Applications
 Database analysis and decision support
 Market analysis and management
 target marketing,
 customer relationship management, customer profiling, identifying customer requirements,
market basket analysis, cross selling, market segmentation.
 Risk analysis and management
 Finance planning and asset evaluation:
 Forecasting, customer retention, improved underwriting, quality control, competitive
analysis.
 Fraud detection and management
 Applications- widely used in health care, retail, credit card services, telecommunications
(phone card fraud), etc.
 Approach- use historical data to build models of fraudulent behavior and use data mining
to help identify similar instances.
More examples:
 Detecting inappropriate medical treatment
 Detecting telephone fraud
 money laundering - detect suspicious money transactions (US Treasury's Financial
Crimes Enforcement Network)
 Other Applications:
 Text mining and Web analysis.
 Etc.
What Is Data Mining?
 Data mining (part of knowledge discovery in databases):
 Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) information from data in large databases
 Alternative names and their “inside stories”:
 Data mining: a misnomer?
 Knowledge Discovery in Databases (KDD: SIGKDD), knowledge extraction,
data archeology, data dredging, information harvesting, business
intelligence, etc.
 What is not data mining?
 (Deductive) query processing.
 Expert systems or small statistical programs
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation -> Knowledge]
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of the application
Data warehousing
 Creating a target data set: data selection
 Data cleaning and preprocessing: (may take 60% of effort!)
 Data reduction and projection – Find useful features, dimensionality/variable
reduction, invariant representation.
Data mining
 Choosing functions of data mining - summarization, classification, regression,
association, clustering.
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Interpretation: analysis of results - visualization, transformation, removing
redundant patterns, etc.
 Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers from top to bottom, with the typical role in parentheses:]
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
Data Mining: On What Kind of Data?
Data mining is performed on data coming from:
Relational databases
Transactional databases
Advanced DB systems and information repositories
 Object-oriented and object-relational databases
 Spatial databases
 Time-series data and temporal data
 Text databases and multimedia databases
 Heterogeneous and legacy databases
 WWW
… and accumulated in a data warehouse for long periods of time
(several months or sometimes years)
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
 A decision support database that is maintained separately from the
organization’s operational database
 Supports information processing by providing a solid platform of consolidated,
historical data for analysis.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making process.” (W. H. Inmon)
Data warehousing: The process of constructing and using
data warehouses
What is a data warehouse?
Data Warehouse (1/2)
Subject Oriented
 Organized around major subjects, such as customer, product, sales.
 Focusing on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
 Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
Integrated
 Integrate multiple, heterogeneous data sources - relational databases, flat files,
on-line transaction records
 Data cleaning and data integration techniques are applied.
 Ensure consistency in naming conventions, encoding structures, attribute
measures, etc. among different data sources (E.g., Hotel price: currency, tax,
breakfast covered, etc.)
 When data is moved to the warehouse, it is converted.
What is a data warehouse?
Data Warehouse (2/2)
Time Variant
 The time horizon for the data warehouse is significantly longer than that of
operational systems.
 Operational database: current value data.
 Data Warehouse data: provide information from a historical perspective (e.g., past 5-10
years)
 Every key structure in the data warehouse
 Contains an element of time, explicitly or implicitly
 But the key of operational data may or may not contain “time element”.
Non-Volatile
 A physically separate store of data transformed from the operational environment.
 Operational update of data does not occur in the data warehouse environment.
 Does not require transaction processing, recovery, and concurrency control
mechanisms
 Requires only two operations in data accessing: initial loading of data and access
of data.
What is a data warehouse?
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
Build wrappers/mediators on top of heterogeneous databases
Query driven approach:
When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results
are integrated into a global answer set.
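This requires complex information filtering, and queries compete for resources at the local sources.
Data warehouse: update-driven approach. Information from the heterogeneous sources is
integrated in advance and stored in the warehouse for direct querying and analysis, which gives high performance.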
What is a data warehouse?
Data Warehouse vs. Operational DB Systems
OLTP (on-line transaction processing)
Major task of traditional relational DBMS - Most database operations
are of a type called OLTP.
Day-to-day operations: purchasing, inventory, banking, manufacturing,
payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
What is a data warehouse?
OLTP vs. OLAP: Common architecture
 Local databases, say one per branch store, handle OLTP,
 while a warehouse integrating information from all branches
handles OLAP.
The most complex OLAP queries are often referred to as data mining.

                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day to day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad-hoc
access              read/write, index/hash on prim. key   lots of scans
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response
What is a data warehouse?
Why Separate Data Warehouse?
 High performance for both systems:
 DBMS — tuned for OLTP:
access methods, indexing, concurrency control, recovery
 Warehouse — tuned for OLAP:
complex OLAP queries, multidimensional view, consolidation.
 Different functions and different data:
missing data:
Decision support requires historical data which operational DBs do not typically
maintain
data consolidation:
Decision support requires consolidation (aggregation, summarization) of data
from heterogeneous sources
data quality:
different sources typically use inconsistent data representations, codes and
formats which have to be reconciled.
A multi-dimensional data model
Conceptual Modeling of Data Warehouses
Modeling data warehouses:
dimensions & measurements
Star schema:
A single object (fact table) in the middle connected to a number of objects
(dimension tables)
Snowflake schema:
A refinement of star schema where the dimensional hierarchy is represented
explicitly by normalizing the dimension tables.
Fact constellations:
Multiple fact tables share dimension tables.
Conceptual Modeling of Data Warehouses
Example of Star Schema
Sales Fact Table: Date, Product, Store, Customer (dimension keys); unit_sales, dollar_sales, Yen_sales (measurements)
Date dimension: Day, Month, Year
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City, State, Country, Region
Cust dimension: CustId, CustName, CustCity, CustCountry
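A star schema of this shape can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: the table names (sales_fact, date_dim, store_dim, product_dim), the surrogate keys, and all sample rows are assumptions, and the customer dimension and some product attributes are left out to keep it short.

```python
import sqlite3

# In-memory database; names loosely follow the star schema example,
# and the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim    (date_id INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE store_dim   (store_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT, region TEXT);
CREATE TABLE product_dim (product_no INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);
-- Fact table: one foreign key per dimension plus the measurements.
CREATE TABLE sales_fact (
    date_id      INTEGER REFERENCES date_dim(date_id),
    store_id     INTEGER REFERENCES store_dim(store_id),
    product_no   INTEGER REFERENCES product_dim(product_no),
    unit_sales   INTEGER,
    dollar_sales REAL
);
""")
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)",
                 [(1, 15, 1, 1999), (2, 20, 2, 1999)])
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?, ?, ?)",
                 [(1, "Chicago", "IL", "USA", "North America"),
                  (2, "Toronto", "ON", "Canada", "North America")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "TV", "Electronics"), (2, "PC", "Electronics")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
                 [(1, 1, 1, 2, 800.0), (1, 2, 1, 1, 400.0),
                  (2, 1, 2, 1, 1200.0), (2, 1, 1, 3, 1200.0)])

# Typical star join: the fact table joined to two dimensions, then aggregated.
sales_by_country_product = conn.execute("""
    SELECT s.country, p.prod_name, SUM(f.dollar_sales)
    FROM sales_fact f
    JOIN store_dim s   ON f.store_id = s.store_id
    JOIN product_dim p ON f.product_no = p.product_no
    GROUP BY s.country, p.prod_name
    ORDER BY s.country, p.prod_name
""").fetchall()
print(sales_by_country_product)
```

Every analytical query follows the same pattern: start from the fact table, join only the dimensions the question needs, and aggregate the measurements.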
Conceptual Modeling of Data Warehouses
Example of Snowflake Schema
Sales Fact Table: Date, Product, Store, Customer (dimension keys); unit_sales, dollar_sales, Yen_sales (measurements)
Date dimension: Day, Month; normalized into Month (Month, Year) and Year (Year)
Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
Store dimension: StoreID, City; normalized into City (City, State), State (State, Country), and Country (Country, Region)
Cust dimension: CustId, CustName, CustCity, CustCountry
Conceptual Modeling of Data Warehouses
A Data Mining Query Language:
DMQL Language Primitives
Cube Definition ( Fact Table )
define cube <cube_name> [<dimension_list>]:
<measure_list>
Dimension Definition ( Dimension Table )
define dimension <dimension_name> as
(<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
First time as “cube definition”
define dimension <dimension_name> as
<dimension_name_first_time> in cube <cube_name_first_time>
Conceptual Modeling of Data Warehouses
Defining a Snowflake Schema in DMQL
define cube sales [date, product, store, customer]:
dollar_sales = sum(sales_in_dollars),
yen_sales = sum(sales_in_yens), unit_sales = count(*)
define dimension product as
(product_no, prod_name,prod_desc, category, QOH)
define dimension cust as (custID, cust_name, cust_city, cust_country)
define dimension date as (day, month (month_key, year (year_key) ) )
define dimension store as ( storeID, city ( city_key,
state( state_key,
country(country_key, region)
)))
A multi-dimensional data model
A Sample Data Cube
[Figure: a 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum). The highlighted cell is the total annual sales of TV in U.S.A.]
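The "sum" entries along each edge of the sample cube can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the sales figures are invented, and build_cube is a hypothetical helper that materializes every cell, which is only practical for small cubes.

```python
from itertools import product as cartesian

# Hypothetical sales figures keyed by (quarter, product, country).
sales = {
    ("1Qtr", "TV", "U.S.A"): 100,
    ("2Qtr", "TV", "U.S.A"): 120,
    ("3Qtr", "TV", "U.S.A"): 90,
    ("4Qtr", "TV", "U.S.A"): 150,
    ("1Qtr", "PC", "U.S.A"): 200,
    ("1Qtr", "TV", "Canada"): 40,
    ("2Qtr", "VCR", "Mexico"): 30,
}

def build_cube(facts):
    """Aggregate every cell, including the 'sum' margin along each dimension."""
    cube = {}
    for key, value in facts.items():
        # Each fact contributes to 2^n cells: every coordinate is either
        # kept as-is or replaced by the 'sum' marker.
        for mask in cartesian((False, True), repeat=len(key)):
            cell = tuple("sum" if m else k for k, m in zip(key, mask))
            cube[cell] = cube.get(cell, 0) + value
    return cube

cube = build_cube(sales)
# Total annual sales of TV in U.S.A.: the cell with 'sum' on the date axis.
print(cube[("sum", "TV", "U.S.A")])
```

The cell ("sum", "sum", "sum") holds the grand total, and any cell with one "sum" coordinate is a roll-up along that dimension.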
Data warehousing
Typical OLAP Operations
 Roll up (drill-up): summarize data; by climbing up hierarchy or by
dimension reduction
 Drill down (roll down): reverse of roll-up; from higher level
summary to lower level summary or detailed data, or introducing
new dimensions
 Slice and dice: project and select
 Pivot (rotate): reorient the cube, visualization, 3D to series of 2D
planes.
 Other operations
 drill across: involving (across) more than one fact table.
 drill through: through the bottom level to its back-end relational tables.
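The operations listed above can be sketched on a tiny in-memory fact list. The rows, the column layout, and the aggregate helper below are illustrative assumptions, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quarter, product, dollars).
facts = [
    ("Chicago", "USA", "1Qtr", "TV", 100),
    ("Chicago", "USA", "2Qtr", "TV", 120),
    ("Boston", "USA", "1Qtr", "PC", 200),
    ("Toronto", "Canada", "1Qtr", "TV", 40),
    ("Toronto", "Canada", "2Qtr", "PC", 60),
]
CITY, COUNTRY, QUARTER, PRODUCT, DOLLARS = range(5)

def aggregate(rows, dims):
    """Sum dollars grouped by the given column positions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[DOLLARS]
    return dict(totals)

# Roll up: climb the location hierarchy from city to country.
# Going from by_country back to by_city is the matching drill down.
by_city = aggregate(facts, (CITY, QUARTER))
by_country = aggregate(facts, (COUNTRY, QUARTER))

# Slice: select on one dimension (quarter = 1Qtr).
slice_q1 = [r for r in facts if r[QUARTER] == "1Qtr"]

# Dice: select on two or more dimensions.
dice = [r for r in facts if r[COUNTRY] == "USA" and r[PRODUCT] == "TV"]

# Pivot: present the (country, quarter) totals as a country x quarter grid.
quarters = sorted({r[QUARTER] for r in facts})
pivot = {
    country: [by_country.get((country, q), 0) for q in quarters]
    for country in sorted({r[COUNTRY] for r in facts})
}
print(by_country)
print(pivot)
```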
Data warehouse software architecture
Multi-Tiered Architecture
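[Figure: multi-tiered architecture, left to right. Data Sources: operational DBs and other sources. Extract, Transform, Load (ETL) and Refresh feed Data Storage: the data warehouse, data marts, and metadata, with a monitor & integrator. Data Storage serves the OLAP Engine (OLAP server), which in turn serves the Front-End Tools: analysis, query, reports, data mining.]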
Data warehouse software architecture
Three-Tier Data Warehouse Architecture
 Enterprise warehouse:
collects all of the information about subjects
spanning the entire organization.
 Data Mart:
a subset of corporate-wide data that is of value to a specific group of
users.
 Its scope is confined to specific, selected groups, such as marketing data mart.
 Independent vs. dependent (directly from warehouse) data mart
 Virtual warehouse:
 A set of views over operational databases.
 Only some of the possible summary views may be materialized.
Building Data Warehouses
Approaches to Building Warehouses
… OLAP Server Architectures
Relational OLAP (ROLAP):
relational database system tuned for star schemas, e.g., using special index structures such as:
 “Bitmap indexes” (for each key of a dimension table, e.g., bar name, a bit-vector telling which tuples of the fact table
have that value).
 Materialized views = answers to general queries from which more specific queries can be answered with less work
than if we had to work from the raw data.
Use relational or extended-relational DBMS to store and manage warehouse data and OLAP
middleware to support missing pieces.
Includes optimization of DBMS backend, implementation of aggregation navigation logic, and
additional tools and services
Multidimensional OLAP (MOLAP) - A specialized model based on a “cube”
view of data.
Array-based multidimensional storage engine (sparse matrix techniques)
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP) - User flexibility,
e.g., low level: relational, high-level: array.
Specialized SQL servers - specialized support for SQL queries over star, snowflake
schemas
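The idea behind materialized views, answering a more specific query from a precomputed general answer, can be sketched with sqlite3. SQLite has no materialized views, so an ordinary summary table stands in for one here; the table names and rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, country TEXT, product TEXT, dollars REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Chicago", "USA", "TV", 800.0),
    ("Chicago", "USA", "PC", 1200.0),
    ("Boston", "USA", "TV", 500.0),
    ("Toronto", "Canada", "TV", 400.0),
])

# Stand-in for a materialized view: totals per (city, country, product).
conn.execute("""
    CREATE TABLE sales_by_city_product AS
    SELECT city, country, product, SUM(dollars) AS dollars
    FROM sales
    GROUP BY city, country, product
""")

# The coarser per-country question can be answered from either table,
# but the summary table usually has far fewer rows to scan.
query = "SELECT country, SUM(dollars) FROM {} GROUP BY country ORDER BY country"
from_raw = conn.execute(query.format("sales")).fetchall()
from_view = conn.execute(query.format("sales_by_city_product")).fetchall()
print(from_view)
```

This works because SUM can be re-aggregated from partial sums; an average, for example, would need the counts stored alongside.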
Building Data Warehouses
Indexing OLAP Data: Bitmap Index
Index on a particular column
Each value in the column has a bit vector: bit-op is fast
The length of the bit vector: # of records in the base table
The i-th bit is set if the i-th row of the base table has the value for the indexed
column
not suitable for high cardinality domains
Base table
Cust   Region    Type
C1     Asia      Retail
C2     Europe    Dealer
C3     Asia      Dealer
C4     America   Retail
C5     Europe    Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1

Join Indices
 Join index: JI(R-id, S-id) where R (R-id, …) ⋈ S (S-id, …)
 Traditional indices map the values to a list of record ids
 In data warehouses, a join index relates the values of the
dimensions of a star schema to rows in the fact table.
 E.g. fact table: Sales and two dimensions city and product;
a join index on city maintains for each distinct city a list of RIDs of the tuples recording the Sales in the city
 Join indices can span multiple dimensions
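A bitmap index over the base table can be sketched with Python integers serving as bit vectors. The helper names build_bitmap_index and matching_rows are hypothetical; bit i (counting from 0) stands for row i + 1 of the base table.

```python
# Base table from the slide: one row per customer.
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Map each distinct value to an int whose i-th bit marks row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(rows, bits):
    """Decode a bit vector back into the rows it selects."""
    return [row for i, row in enumerate(rows) if (bits >> i) & 1]

region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# Region = Europe AND Type = Dealer becomes a single bitwise AND.
europe_dealers = matching_rows(base_table, region_index["Europe"] & type_index["Dealer"])
print([row[0] for row in europe_dealers])
```

Each distinct value costs one bit per row, which is why the technique suits low-cardinality columns such as Region and Type but not high-cardinality domains.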
Building Data Warehouses
Metadata Repository
Metadata are the data defining warehouse objects
 Description of the structure of the warehouse - schema, view, dimensions,
hierarchies, derived data definitions, data mart locations and contents
 Operational meta-data - data lineage (history of migrated data and
transformation path), currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
 The algorithms used for summarization
 The mapping from operational environment to the data warehouse
 Data related to system performance - warehouse schema, view and derived data
definitions
 Business data - business terms and definitions, ownership of data, charging
policies
Building Data Warehouses
Data Warehouse Back-End Tools and Utilities
Data Extraction: get data from multiple, heterogeneous, and
external sources
Data cleaning: detect errors in the data and rectify them when
possible
Data Transformation: convert data from legacy or host format to
warehouse format
Load: sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
Refresh: propagate the updates from the data sources to the
warehouse
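The back-end steps above can be sketched as a small pipeline of plain Python functions. Everything here is an assumption made for illustration: the two source formats, the field names, the fixed EUR_TO_USD rate, and the dictionary that plays the role of the warehouse.

```python
# Hypothetical source records in two different formats.
source_a = [{"cust": "C1", "amount": "100.50", "currency": "USD"},
            {"cust": "C2", "amount": "n/a", "currency": "USD"}]
source_b = [{"customer_id": "C3", "amount_eur": 80.0}]

EUR_TO_USD = 1.10  # Assumed fixed rate, for illustration only.

def extract():
    """Gather raw records from every source, tagged with their origin."""
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]

def clean(records):
    """Drop records whose amount cannot be parsed as a number."""
    kept = []
    for origin, rec in records:
        raw = rec["amount"] if origin == "a" else rec["amount_eur"]
        try:
            float(raw)
        except (TypeError, ValueError):
            continue
        kept.append((origin, rec))
    return kept

def transform(records):
    """Convert every record to the warehouse format (cust_id, dollars)."""
    rows = []
    for origin, rec in records:
        if origin == "a":
            rows.append((rec["cust"], float(rec["amount"])))
        else:
            rows.append((rec["customer_id"], round(rec["amount_eur"] * EUR_TO_USD, 2)))
    return rows

def load(rows, warehouse):
    """Sort, then consolidate into per-customer totals."""
    for cust_id, dollars in sorted(rows):
        warehouse[cust_id] = warehouse.get(cust_id, 0.0) + dollars
    return warehouse

warehouse = load(transform(clean(extract())), {})
print(warehouse)
```

A refresh step would rerun the same pipeline on only the records that changed at the sources since the last load.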
From data warehousing to data mining
Data Warehouse Usage
Three kinds of data warehouse applications
Information processing - supports querying, basic statistical
analysis, and reporting using crosstabs, tables, charts and
graphs
Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
From data warehousing to data mining
From On-Line Analytical Processing OLAP
to On Line Analytical Mining OLAM
 Why online analytical mining?
 High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data
warehouses
ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP
tools
OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
integration and swapping of multiple mining functions, algorithms, and
tasks.
 Architecture of OLAM
From data warehousing to data mining
An OLAM Architecture
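[Figure: OLAM architecture in four layers, bottom to top. Layer 1, Data Repository: databases and the data warehouse, populated through data cleaning, data integration, and filtering. A Database API with filtering & integration connects it to Layer 2, MDDB: the multidimensional database with its metadata. A Data Cube API connects that to Layer 3, OLAP/OLAM: the OLAP engine and the OLAM engine (on-line analytical mining). A User GUI API connects that to Layer 4, User Interface, where mining queries go in and mining results come out.]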
Data Mining


Data mining is a popular term for queries that
summarize big data sets in useful ways.
Large-scale queries designed to extract patterns
from data.
Examples:
Clustering all Web pages by topic.
Finding characteristics of fraudulent credit-card use.
Data Mining Functionalities
 Concept description: Characterization and Comparison - generalize, summarize, and possibly contrast data characteristics.
 Association
 From association, correlation, to causality.
 Finding rules like “inside(x, city) → near(x, highway)”.
 Classification and Prediction
 Classify data based on the values in a classifying attribute, e.g., classify countries
based on climate, or classify cars based on gas mileage.
 Predict some unknown or missing attribute values based on other information.
 Cluster analysis – Group data to form new classes, e.g., cluster houses to find
distribution patterns.
 Outlier and exception data analysis
 Time-series analysis (trend and deviation)
 Trend and deviation analysis: regression, sequential pattern, similar sequences,
trend and deviation, e.g., stock analysis.
 Similarity-based pattern-directed analysis
 Full vs. partial periodicity analysis
 Other pattern-directed or statistical analysis
Are All the “Discovered” Patterns Interesting?
 A data mining system/query may generate thousands of patterns,
not all of them are interesting - Suggested approach: Human-centered, query-based,
focused mining
 Interestingness measures: A pattern is interesting if it is
 easily understood by humans
 valid on new or test data with some degree of certainty
 potentially useful
 novel, or validates some hypothesis that a user seeks to confirm
 Objective vs. subjective interestingness measures:
 Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
 Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.
Can It Find All and Only Interesting Patterns?
 Find all the interesting patterns: Completeness - Can a data mining system find all
the interesting patterns?
 Search for only interesting patterns: Optimization.
 Can a data mining system find only the interesting patterns?
 Approaches
 First generate all the patterns and then filter out the uninteresting ones.
 Generate only the interesting patterns --- mining query optimization
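Support and confidence, the two objective interestingness measures named above, can be computed directly for a rule of the form lhs → rhs. The sketch uses the common definitions (support as a fraction of all baskets, confidence as a conditional fraction); the basket contents and function names are invented for illustration.

```python
# Hypothetical baskets, each a set of purchased items.
baskets = [
    {"hamburger", "ketchup", "chips"},
    {"hamburger", "ketchup"},
    {"hamburger", "beer"},
    {"ketchup", "chips"},
    {"beer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(lhs, rhs, baskets):
    """Among baskets containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

print(support({"hamburger", "ketchup"}, baskets))
print(confidence({"hamburger"}, {"ketchup"}, baskets))
```

A rule is typically reported only when both values clear user-chosen thresholds, which is one way to filter out uninteresting patterns.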
Market-Basket Data; typical case
An important form of mining from relational data involves market
baskets = sets of “items” that are purchased together as a
customer leaves a store.
Summary of basket data is frequent itemsets
itemsets = sets of items that often appear together in baskets.
Example: Market Baskets
If people often buy hamburger and ketchup together, the store can:
1. Put hamburger and ketchup near each other and put potato chips between.
2. Run a sale on hamburger and raise the price of ketchup .
Finding Frequent Pairs
 The simplest case is when we only want to find “frequent pairs” of items.
 Assume data is in a relation Baskets(basket, item).
 The support threshold s is the minimum number of baskets in which a pair
appears before we are interested.
Frequent Pairs in SQL
SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket
  AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;
Look for two Basket tuples with the same basket and
different items. First item must precede second, so we don’t
count the same pair twice.
Create a group for each pair of items that appears in at
least one basket.
Throw away pairs of items that do not appear at least s times.
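The frequent-pairs query can be run as-is against a small in-memory SQLite database. In this sketch the basket contents are invented, and the threshold s is passed as a bound parameter rather than written into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets (basket INTEGER, item TEXT)")
conn.executemany("INSERT INTO Baskets VALUES (?, ?)", [
    (1, "hamburger"), (1, "ketchup"), (1, "chips"),
    (2, "hamburger"), (2, "ketchup"),
    (3, "hamburger"), (3, "beer"),
    (4, "ketchup"), (4, "chips"),
])

s = 2  # support threshold: minimum number of baskets containing the pair

frequent_pairs = conn.execute("""
    SELECT b1.item, b2.item
    FROM Baskets b1, Baskets b2
    WHERE b1.basket = b2.basket
      AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?
    ORDER BY b1.item, b2.item
""", (s,)).fetchall()
print(frequent_pairs)
```

The ORDER BY clause is added only to make the output deterministic; it does not change which pairs qualify.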
A-Priori Trick
Above query is prohibitively expensive for large data.

 A-priori algorithm uses the fact that a pair (i, j) cannot have support s unless i and j both have
support s by themselves.
 More efficient implementation uses an intermediate relation Baskets1(a materialized view to
hold only information about frequent items).
INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
WHERE item IN (
    SELECT item
    FROM Baskets
    GROUP BY item
    HAVING COUNT(*) >= s
);
The subquery selects the items that appear in at least s baskets.
Then run the query for pairs on Baskets1 instead of Baskets.
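The two-pass idea can be checked end to end with sqlite3: build Baskets1 from the frequent items, then confirm that the pairs query returns the same answer on the smaller relation. The sample data, the PAIRS_QUERY template, and the choice to name the Baskets1 columns (basket, item) so the pairs query runs unchanged are all assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Baskets (basket INTEGER, item TEXT)")
conn.execute("CREATE TABLE Baskets1 (basket INTEGER, item TEXT)")
conn.executemany("INSERT INTO Baskets VALUES (?, ?)", [
    (1, "hamburger"), (1, "ketchup"), (1, "chips"),
    (2, "hamburger"), (2, "ketchup"), (2, "soap"),
    (3, "hamburger"), (3, "beer"),
    (4, "ketchup"), (4, "chips"), (4, "wine"),
])
s = 2  # support threshold

# Pass 1: keep only the tuples whose item is frequent on its own.
conn.execute("""
    INSERT INTO Baskets1(basket, item)
    SELECT * FROM Baskets
    WHERE item IN (
        SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= ?
    )
""", (s,))

# Pass 2: the same pairs query, parameterized by table name.
PAIRS_QUERY = """
    SELECT b1.item, b2.item
    FROM {table} b1, {table} b2
    WHERE b1.basket = b2.basket AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?
    ORDER BY b1.item, b2.item
"""
pairs_full = conn.execute(PAIRS_QUERY.format(table="Baskets"), (s,)).fetchall()
pairs_filtered = conn.execute(PAIRS_QUERY.format(table="Baskets1"), (s,)).fetchall()

tuples_before = conn.execute("SELECT COUNT(*) FROM Baskets").fetchone()[0]
tuples_after = conn.execute("SELECT COUNT(*) FROM Baskets1").fetchone()[0]
print(pairs_filtered, tuples_before, tuples_after)
```

Filtering never loses a frequent pair, because a pair can only reach the threshold if both of its items do.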
A-Priori Algorithm
Materialize the view Baskets1.
Run the obvious query, but on Baskets1 instead of Baskets.
Baskets1 is cheap, since it doesn’t involve a join.
Baskets1 probably has many fewer tuples than Baskets.
Running time shrinks with the square of the number of tuples involved in the
join.
Example: A-Priori
Suppose:
1. A supermarket sells 10,000 items.
2. The average basket has 10 items.
3. The support threshold is 1% of the baskets.
 At most 1/10 of the items can be frequent.
 Probably, only a minority of the items in any one basket are frequent, so Baskets1 has fewer than half the tuples of Baskets -> at least a factor 4 speedup.
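The two figures in this example follow from a short counting argument. The symbols are introduced here for illustration only: B is the number of baskets, n_f the number of frequent items, and f the fraction of Baskets tuples that survive into Baskets1. The speedup line assumes, as stated earlier, that running time grows with the square of the number of tuples in the join.

```latex
\begin{align*}
\text{tuples in Baskets} &= 10B
  && \text{(average basket has 10 items)} \\
\text{occurrences of each frequent item} &\ge 0.01B
  && \text{(support threshold of 1\%)} \\
n_f \cdot 0.01B \le 10B \;&\Longrightarrow\; n_f \le 1000 = \tfrac{1}{10} \cdot 10{,}000
  && \text{(at most 1/10 of the items)} \\
\text{speedup} = \frac{1}{f^{2}} &\ge 4
  && \text{when } f \le \tfrac{1}{2}
\end{align*}
```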