Data Warehousing and Mining
This session

0. Introduction
- Evolution of database technology
- What is a data warehouse?
- Motivation: Why data mining?
- What is data mining?
- Data mining: on what kind of data?
- Data mining functionality
- Are all the patterns interesting?
- Classification of data mining systems
- Major issues in data mining

I. Data Preprocessing
- Need for preprocessing the data
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
Evolution of Database Technology

- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s: Data mining and data warehousing, multimedia databases, and Web databases
Short History of Data Mining

- 1989: the term KDD (Knowledge Discovery in Databases) appears (IJCAI Workshop)
- 1991: a collection of research papers edited by Piatetsky-Shapiro and Frawley
- 1993: the association rule mining algorithm Apriori proposed by Agrawal, Imielinski and Swami
- 1996-present: KDD evolves as a conjunction of different knowledge areas (databases, machine learning, statistics, artificial intelligence), and the term "data mining" becomes popular
Of "Laws", Monsters, and Giants...

- Moore's law: processing "capacity" doubles every 18 months: CPU, cache, memory
- Its more aggressive cousin: disk storage "capacity" doubles every 9 months
- What do the two "laws" combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it.

[Chart: Disk TB shipped per year, 1988-2000, log scale (1E+3 to 1E+7 TB, past the exabyte mark) -- disk TB growth of 112%/yr vs. Moore's law at 58.7%/yr. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
Data, Data Everywhere, Yet...

- I can't find the data I need: data is scattered over the network; many versions, subtle differences
- I can't get the data I need: need an expert to get the data
- I can't understand the data I found: available data poorly documented
- I can't use the data I found: results are unexpected; data needs to be transformed from one form to another
[Figure: From Data to Knowledge -- a series of steps. Data (1970s: statistics & reporting) -> warehousing (1980s: DWH, OLAP/ROLAP) -> pattern (1990s: data mining) -> knowledge (2000: refinement)]
What motivated data mining? Why is it so important?

- The major reason data mining has attracted so much attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
- Data mining can be viewed as a result of the natural evolution of information technology.
- That evolution covers the following functionalities: data collection and database creation; data management (including data storage and retrieval, and database transaction processing); and data analysis and understanding (involving data warehousing and data mining).
Evolution of Sciences

- Before 1600: empirical science
- 1600-1950s: theoretical science
  - Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
- 1950s-1990s: computational science
  - Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
  - Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
- 1990-now: data science
  - The flood of data from new scientific instruments and simulations
  - The ability to economically store and manage petabytes of data online
  - The Internet and computing Grid that make all these archives universally accessible
  - Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology

- 1960s: Data collection, database creation, IMS and network DBMS
- 1970s: Relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s: Data mining, data warehousing, multimedia databases, and Web databases
- 2000s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems
What Is Data Mining?

- Data mining (knowledge discovery from data):
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names:
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems
Knowledge Discovery (KDD) Process

- Data mining -- the core of the knowledge discovery process

[Figure: KDD process -- databases -> data cleaning and data integration -> data warehouse -> selection -> task-relevant data -> data mining -> pattern evaluation]
Steps in KDD Process

1. Data cleaning: remove noise and inconsistent data
2. Data integration: multiple data sources may be combined
3. Data selection: data relevant to the analysis task are retrieved from the database
4. Data transformation: data are transformed or consolidated into forms appropriate for mining, by performing summary or aggregation operations, for instance
5. Data mining: an essential process where intelligent methods are applied in order to extract data patterns
6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on some interestingness measures
7. Knowledge presentation: visualization and knowledge representation techniques are used to present the mined knowledge to the user
Architecture of a Typical Data Mining System

[Figure: layered architecture -- databases and data warehouse (with data cleaning, data integration, and filtering) -> database or data warehouse server -> data mining engine and knowledge base -> pattern evaluation -> graphical user interface]
- Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
- Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.
- Knowledge base: the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
- Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.
- Pattern evaluation module: typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns.
- Graphical user interface: communicates between the users and the data mining system, allowing the user to interact with the system by specifying a query or task.
Data Mining: On What Kind of Data?

- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW
Data Mining Functionalities (1)

- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(X, "20..29") ^ income(X, "20..29K") => buys(X, "PC") [support = 2%, confidence = 60%]
  - The fraction of transactions in the database in which an itemset appears is called its support.
  - The confidence of a rule "B given A" measures how much more likely it is that B occurs when A has occurred. It is expressed as a percentage, with 100% meaning B always occurs if A has occurred.
  - contains(T, "computer") => contains(T, "software") [1%, 75%]
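To make the support and confidence measures concrete, here is a minimal Python sketch that computes both for a one-item rule like contains(T, "computer") => contains(T, "software"); the transactions and variable names are made up for illustration, not part of the original slides.

```python
# Minimal sketch: support and confidence for the rule
# contains(T, "computer") => contains(T, "software").
# The transactions below are made up for illustration.

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer", "software"},
]

antecedent, consequent = {"computer"}, {"software"}
n = len(transactions)

# support(A => B): fraction of transactions containing both A and B
both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
support = both / n

# confidence(A => B) = P(B | A) = support(A and B) / support(A)
has_a = sum(1 for t in transactions if antecedent <= t)
confidence = both / has_a

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 50%, confidence = 67%
```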
Data Mining Functionalities (2)

- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rules, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
Data Mining Functionalities (3)

- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered noise or an exception, but is quite useful in fraud detection and rare-events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
Are All the "Discovered" Patterns Interesting?

- A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?

- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
  - Association vs. classification vs. clustering
- Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches:
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns -- mining query optimization
Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, information science, and other disciplines]
Classification of Data Mining Systems

- Classification according to the kinds of databases mined: data models (relational, transactional, object-relational) and types of data
- Classification according to the kinds of knowledge mined: association, classification, clustering, ...
- Classification according to the kinds of techniques utilized: techniques can be described according to the degree of user interaction involved
- Classification according to the applications adapted: finance, telecommunications, DNA, stock markets, e-mail, and so on
Major Issues in Data Mining

- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining & invisible data mining
  - Protection of data security, integrity, and privacy
What is Data Warehousing?

"A process of transforming data into information and making it available to users in a timely enough manner to make a difference." [Forrester Research, April 1996]
Very Large Databases

- Terabytes (10^12 bytes): Walmart -- 24 terabytes
- Petabytes (10^15 bytes): geographic information systems
- Exabytes (10^18 bytes): national medical records
- Zettabytes (10^21 bytes): weather images
- Yottabytes (10^24 bytes): intelligence agency videos
What is a Data Warehouse?

"A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context." [Barry Devlin]
Data Warehousing -- It is a Process

- A technique for assembling and managing data from various sources for the purpose of answering business questions -- thus making decisions that were not previously possible
- A decision support database maintained separately from the organization's operational database
What is a Data Warehouse?

- Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." -- W. H. Inmon
- Data warehousing: the process of constructing and using data warehouses
Data Warehouse -- Subject-Oriented

- Organized around major subjects, such as customer, product, sales
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse -- Integrated

- Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
  - E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  - When data is moved to the warehouse, it is converted.
Data Warehouse -- Time Variant

- The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
- Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  - But the key of operational data may or may not contain a "time element"
Data Warehouse -- Non-Volatile

- A physically separate store of data transformed from the operational environment
- Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, or concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse vs. Heterogeneous DBMS

- Traditional heterogeneous DB integration:
  - Build wrappers/mediators on top of heterogeneous databases
  - Query-driven approach: when a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
  - Complex information filtering; queries compete for resources
- Data warehouse: update-driven, high performance
  - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational DBMS

- OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Major task of data warehouse systems
  - Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                     OLTP                                  OLAP
users                clerk, IT professional                knowledge worker
function             day-to-day operations                 decision support
DB design            application-oriented                  subject-oriented
data                 current, up-to-date, detailed,        historical, summarized,
                     flat relational, isolated             multidimensional, integrated,
                                                           consolidated
usage                repetitive                            ad-hoc
access               read/write, index/hash on prim. key   lots of scans
unit of work         short, simple transaction             complex query
# records accessed   tens                                  millions
# users              thousands                             hundreds
DB size              100MB-GB                              100GB-TB
metric               transaction throughput                query throughput, response
Why Separate Data Warehouse?

- High performance for both systems
  - DBMS -- tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse -- tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
  - Missing data: decision support requires historical data, which operational DBs do not typically maintain
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled
Typical Process Flow Within a Data Warehouse

[Figure: process flow within a data warehouse -- source -> (extract and load) -> warehouse -> (query) -> users, with data transformation and movement, plus archived data]

1. Extract and load the data.
2. Clean and transform the data into a form that can cope with large data volumes and provide good query performance.
3. Back up and archive data.
4. Manage queries and direct them to the appropriate data sources.
Extract and Load Process

1. Controlling the process
   - Determine when to start extracting the data
2. When to initiate the extract
   - Data should be in a consistent state
   - Start extracting data from a data source only when it represents the same snapshot of time as all the other data sources
3. Loading the data
   - Do not execute consistency checks until all the data sources have been loaded into the temporary data store
4. Copy management tools and data cleanup
Clean and Transform Data

1. Clean and transform the data
   Data needs to be cleaned and checked in the following ways:
   - Make sure data is consistent within itself
   - Make sure data is consistent with other data within the same source
   - Make sure data is consistent with data in the other source systems
   - Make sure data is consistent with the information already in the warehouse
2. Transforming into an effective structure
   - Once the data has been cleaned, convert the source data in the temporary data store into a structure that is designed to balance query performance and operational cost
Backup and Archive Process

- The data within the data warehouse is backed up regularly to ensure that the data warehouse can always be recovered from data loss, software failure or hardware failure.
Query Management Process

- A system process that manages the queries and speeds them up by directing queries to the most effective data source
- Directing queries to the suitable tables
- Maximizing system resources
- Query capture
  - Query profiles change on a regular basis
  - In order to accurately monitor and understand what the new query profiles are, it can be very effective to capture the physical queries that are being executed
Design of a Data Warehouse: A Business Analysis Framework

- Four views regarding the design of a data warehouse
  - Top-down view: allows selection of the relevant information necessary for the data warehouse
  - Data source view: exposes the information being captured, stored, and managed by operational systems
  - Data warehouse view: consists of fact tables and dimension tables
  - Business query view: sees the perspectives of data in the warehouse from the view of the end user
Data Warehouse Design Process

- Top-down, bottom-up approaches or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
- From a software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, short turnaround time, quick turnaround
- Typical data warehouse design process
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measures that will populate each fact table record
Multi-Tiered Architecture

[Figure: data sources (operational DBs, other sources) -> extract/transform/load/refresh via a monitor & integrator -> data storage (data warehouse, data marts, metadata) -> OLAP engine (OLAP server) -> front-end tools (analysis, query, reports, data mining)]
Three Data Warehouse Models

- Enterprise warehouse
  - collects all of the information about subjects spanning the entire organization
- Data mart
  - a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  - independent vs. dependent (directly from warehouse) data marts
- Virtual warehouse
  - a set of views over operational databases
  - only some of the possible summary views may be materialized
Data Warehouse Development: A Recommended Approach

[Figure: define a high-level corporate data model; develop data marts and an enterprise data warehouse in parallel, with model refinement, leading to distributed data marts and a multi-tier data warehouse]
OLAP Server Architectures

- Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
  - Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
- Multidimensional OLAP (MOLAP)
  - Array-based multidimensional storage engine (sparse matrix techniques)
  - Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - User flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers
  - Specialized support for SQL queries over star/snowflake schemas
Why Data Mining?

- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
What Is Data Mining?

- Data mining (knowledge discovery in databases):
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Alternative names and their "inside stories":
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining?
  - (Deductive) query processing
  - Expert systems or small ML/statistical programs
Why Data Mining? -- Potential Applications

- Database analysis and decision support
  - Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (newsgroups, email, documents) and Web analysis
  - Intelligent query answering
Market Analysis and Management (1)

- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
Market Analysis and Management (2)

- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identify the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provision of summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)
Corporate Analysis and Risk Management

- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time-series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and set a class-based pricing procedure
  - Set pricing strategy in a highly competitive market
Fraud Detection and Management (1)

- Applications
  - Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
  - Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
  - Auto insurance: detect groups of people who stage accidents to collect on insurance
  - Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - Medical insurance: detect professional patients and rings of doctors and rings of referrals
Fraud Detection and Management (2)

- Detecting inappropriate medical treatment
  - The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving AU$1m/yr)
- Detecting telephone fraud
  - Telephone call model: destination of the call, duration, time of day or week; analyze patterns that deviate from an expected norm
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud
- Retail
  - Analysts estimate that 38% of retail shrink is due to dishonest employees
Other Applications

- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Data Mining: A KDD Process

- Data mining: the core of the knowledge discovery process

[Figure: KDD process -- databases -> data cleaning and data integration -> data warehouse -> selection -> task-relevant data -> data mining -> pattern evaluation]
Steps of a KDD Process

- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
Data Mining and Business Intelligence

[Figure: pyramid of increasing potential to support business decisions -- data sources (paper, files, information providers, database systems, OLTP; DBA) -> data warehouses / data marts (OLAP, MDA; data analyst) -> data exploration (statistical analysis, querying and reporting) -> data mining (information discovery) -> data presentation (visualization techniques; business analyst) -> decision making (end user)]
Architecture of a Typical Data Mining System

[Figure: layered architecture as before -- databases and data warehouse (with data cleaning, data integration, and filtering) -> database or data warehouse server -> data mining engine and knowledge base -> pattern evaluation -> graphical user interface]
Data Mining: On What Kind of Data?

- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW
Data Mining Functionalities (1)

- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(X, "20..29") ^ income(X, "20..29K") => buys(X, "PC") [support = 2%, confidence = 60%]
  - contains(T, "computer") => contains(T, "software") [1%, 75%]
Data Mining Functionalities (2)

- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rules, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
Data Mining Functionalities (3)

- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered noise or an exception, but is quite useful in fraud detection and rare-events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
Are All the "Discovered" Patterns Interesting?

- A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?

- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
  - Association vs. classification vs. clustering
- Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
  - Approaches:
    - First generate all the patterns and then filter out the uninteresting ones.
    - Generate only the interesting patterns -- mining query optimization
Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning, visualization, information science, and other disciplines]
Data Mining: Classification Schemes

- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views, different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted
MULTIDIMENSIONAL DATA

- Analyze data by representing facts and dimensions within a multidimensional cube.
- The purpose of viewing information in a cube is that it lends itself to statistical operations/aggregations, by applying functions against the planes of the cube.
For example: in a retail sales analysis data warehouse, products by store by day can be represented as a three-dimensional cube.

[Figure: product by store by day cube, with axes time, location, and product]

The point of intersection of all axes represents the actual number of sales for a specific product, in a specific store, on a specific day.
Some Operations in the Multidimensional Data Model

- Roll-up (drill-up): performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
- Drill-down: the reverse of roll-up; navigates from less detailed data to more detailed data.
- Slice: performs a selection on one dimension of the given cube, resulting in a sub-cube.
- Dice: defines a sub-cube by performing a selection on two or more dimensions.
- Pivot (rotate): a visualization operation that rotates the data axes in a view, in order to provide an alternative presentation of the data.
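The following sketch mimics these operations with pandas on a tiny made-up sales table; the column names and figures are illustrative assumptions, not taken from the cube figures below.

```python
# Sketch of OLAP-style operations with pandas on made-up sales data.
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["comp", "phone", "comp", "phone"],
    "amount":  [605, 14, 825, 400],
})

# Roll-up: aggregate away the item dimension (dimension reduction)
rollup = sales.groupby(["city", "quarter"])["amount"].sum()

# Slice: select a single value on one dimension (time = "Q1")
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions
dice = sales[(sales["city"] == "Toronto")
             & (sales["quarter"].isin(["Q1", "Q2"]))]

# Pivot: rotate the axes for an alternative presentation
pivot = sales.pivot_table(index="city", columns="item",
                          values="amount", aggfunc="sum")
print(pivot)
```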
[Figure: OLAP operations on a sales cube with dimensions location (Chicago, NY, Toronto, Vancouver), time (Q1-Q4), and item (home entertainment, comp, phone, security) -- a dice for (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "H.E." or "comp"); a slice for time = "Q1" (e.g., H.E. = 605, comp = 825, phone = 14, security = 400 for one city); and a pivot of the sliced result]
[Figure: on the same cube, a drill-down on time (from quarters to months, Jan-Dec) and a roll-up on location (from cities to countries, USA and Canada)]
A Multi-Dimensional View of Data Mining Classification

- Databases to be mined
  - Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
OLAP Mining: An Integration of Data Mining and Data Warehousing

- Coupling of data mining systems, DBMS, and data warehouse systems
  - No coupling, loose coupling, semi-tight coupling, tight coupling
- On-line analytical mining of data
  - Integration of mining and OLAP technologies
- Interactive mining of multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
- Integration of multiple mining functions
  - E.g., characterized classification, or first clustering and then association
Data Warehouse Usage

- Three kinds of data warehouse applications
  - Information processing
    - supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
  - Analytical processing
    - multidimensional analysis of data warehouse data
    - supports basic OLAP operations: slice-dice, drilling, pivoting
  - Data mining
    - knowledge discovery from hidden patterns
    - supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
- Differences among the three tasks
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

- Why online analytical mining?
  - High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data
  - Available information processing infrastructure surrounding data warehouses: ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools
  - OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.
  - On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks
- Architecture of OLAM
An OLAM Architecture

[Figure: four layers -- Layer 1: data repository (databases and data warehouse, with data cleaning, data integration, and filtering & integration behind a database API) -> Layer 2: multidimensional database (MDDB) with metadata -> Layer 3: OLAM and OLAP engines over a data cube API -> Layer 4: user interface (user GUI API), with the mining query going in and the mining result coming out]
Major Issues in Data Mining (1)

- Mining methodology and user interaction
  - Mining different kinds of knowledge in databases
  - Interactive mining of knowledge at multiple levels of abstraction
  - Incorporation of background knowledge
  - Data mining query languages and ad-hoc data mining
  - Expression and visualization of data mining results
  - Handling noise and incomplete data
  - Pattern evaluation: the interestingness problem
- Performance and scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed and incremental mining methods
Major Issues in Data Mining (2)

- Issues relating to the diversity of data types
  - Handling relational and complex types of data
  - Mining information from heterogeneous databases and global information systems (WWW)
- Issues related to applications and social impacts
  - Application of discovered knowledge
    - Domain-specific data mining tools
    - Intelligent query answering
    - Process control and decision making
  - Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  - Protection of data security, integrity, and privacy
Summary

- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Classification of data mining systems
- Major issues in data mining
Why Data Preprocessing?

- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality

- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing

- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, but with particular importance, especially for numerical data
[Figure: forms of data preprocessing -- cleaning, integration, transformation, reduction]
Data Cleaning

- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
Missing Data

- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, and thus deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register history or changes of the data
- Missing data may need to be inferred.
How to Handle Missing Data?

- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible
- Use a global constant to fill in the missing value: e.g., "unknown" -- a new class?!
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
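As a concrete illustration of several of these strategies, here is a small pandas sketch; the table, column names, and values are made up for the example.

```python
# Sketch of missing-value handling strategies with pandas (made-up data).
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30000, None, 52000, None, 48000],
})

# Ignore the tuple: drop rows with a missing value
dropped = df.dropna(subset=["income"])

# Fill with a global constant (in effect, a new "unknown" class)
constant_filled = df["income"].fillna(-1)

# Fill with the overall attribute mean
mean_filled = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples in the same class (smarter)
class_mean_filled = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
print(class_mean_filled)
```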
Noisy Data

- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data
How to Handle Noisy Data?

- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin medians, or by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and check by human
- Regression
  - smooth by fitting the data into regression functions
Simple Discretization Methods: Binning

- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N
  - The most straightforward method
  - But outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
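The same example can be reproduced in a few lines of Python; this is a sketch under the slide's assumptions (data already sorted, bin depth of 4, ties snapped to the lower boundary):

```python
# Sketch: equi-depth binning and smoothing, reproducing the slide's numbers.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearest boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```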
Data Integration

- Data integration: combines data from multiple sources into a coherent store
- Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration

- Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
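As a sketch of what correlation analysis looks like in practice, the following made-up example flags a derived attribute by its correlation coefficient (the table and column names are illustrative):

```python
# Sketch: detecting a redundant (derived) attribute via correlation analysis.
# The data is made up: annual_revenue is exactly 12 * monthly_revenue.
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10, 12, 9, 15, 11],
    "annual_revenue":  [120, 144, 108, 180, 132],
})

# A coefficient near +1 or -1 suggests one attribute is redundant
r = df["monthly_revenue"].corr(df["annual_revenue"])  # Pearson's r
print(r)  # 1.0 here, since annual_revenue is derived from monthly_revenue
```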
Data Transformation

- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - New attributes constructed from the given ones
Data Transformation: Normalization

- min-max normalization:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
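A short sketch of the three normalizations on a toy attribute (the values, the [0, 1] target range, and the digit-count shortcut for j are illustrative assumptions; the shortcut presumes positive integer values):

```python
# Sketch: min-max, z-score, and decimal-scaling normalization (toy data).
values = [200, 300, 400, 600, 1000]
min_a, max_a = min(values), max(values)
mean_a = sum(values) / len(values)
std_a = (sum((v - mean_a) ** 2 for v in values) / len(values)) ** 0.5

# min-max normalization to [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
minmax = [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
          for v in values]

# z-score normalization
zscore = [(v - mean_a) / std_a for v in values]

# decimal scaling: j is the smallest integer such that max(|v'|) < 1
# (counting digits works here because all values are positive integers)
j = len(str(max(values)))
decimal = [v / 10 ** j for v in values]

print(minmax)   # [0.0, 0.125, 0.25, 0.5, 1.0]
print(decimal)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```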
Data Reduction Strategies

- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation
Discretization and Concept Hierarchy

- Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior)
Discretization

- Three types of attributes:
  - Nominal -- values from an unordered set
  - Ordinal -- values from an ordered set
  - Continuous -- real numbers
- Discretization:
  - divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis
Discretization and Concept Hierarchy Generation for Numeric Data

- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Segmentation by natural partitioning
Concept Hierarchy Generation for Categorical Data

- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes
Specification of a Set of Attributes

A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

  country              (15 distinct values)
  province_or_state    (65 distinct values)
  city                 (3,567 distinct values)
  street               (674,339 distinct values)
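A minimal sketch of this heuristic (the counts mirror the slide's example; the dictionary name is illustrative):

```python
# Sketch: order attributes by distinct-value counts to form a hierarchy;
# the attribute with the most distinct values goes to the lowest level.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}

# Sort ascending: fewest distinct values on top, most at the bottom
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" > ".join(hierarchy))
# country > province_or_state > city > street
```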
Summary

- Data preparation is a big issue for both warehousing and mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research