Part IV
New Database Applications and Technologies
From the mid-to-late 1960s through the late 1990s, database technology developed at a speed, and found a breadth of application, that few other technologies can match.
Driving forces:
- Advances in computing environments
- Integration with other related technologies
- Demands of new applications
- ...
Database Technology: Development and Research Directions
- Parallel databases
- Distributed databases
- Mobile and embedded databases
- Object-oriented and object-relational databases
- Deductive databases, knowledge bases, active databases
- Multimedia databases
- Spatial databases
- Data warehouses and OLAP, data mining
- Information integration
- Database technology in the Web environment
- ...
Chapter 11: Knowledge Discovery in Databases
Slides are basically from:
Data Mining:
Concepts and Techniques
— Slides for Textbook —
Jiawei Han and Micheline Kamber
Simon Fraser University, Canada
http://www.cs.sfu.ca

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
Motivation: "Necessity Is the Mother of Invention"
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
  - We are drowning in data, but starving for knowledge!
- Solution: data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Evolution of Database Technology
- 1960s: data collection, database creation, IMS and network DBMS
- 1970s: relational data model, relational DBMS implementation
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s-2000s: data mining and data warehousing, multimedia databases, and Web databases

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
What Is a Data Warehouse?
- Defined in many different ways, but not rigorously.
  - A decision support database that is maintained separately from the organization's operational database
  - Supports information processing by providing a solid platform of consolidated, historical data for analysis
- "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." — W. H. Inmon
- Data warehousing: the process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
- Organized around major subjects, such as customer, product, and sales
- Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
- Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Data Warehouse—Integrated
- Constructed by integrating multiple, heterogeneous data sources
  - relational databases, flat files, on-line transaction records
- Data cleaning and data integration techniques are applied
  - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources
    - E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  - When data is moved to the warehouse, it is converted
Data Warehouse—Time-Variant
- The time horizon for the data warehouse is significantly longer than that of operational systems
  - Operational database: current value data
  - Data warehouse data: provide information from a historical perspective (e.g., the past 5-10 years)
- Every key structure in the data warehouse
  - contains an element of time, explicitly or implicitly
  - but the key of operational data may or may not contain a "time element"
Data Warehouse—Non-Volatile
- A physically separate store of data transformed from the operational environment
- Operational update of data does not occur in the data warehouse environment
  - Does not require transaction processing, recovery, and concurrency control mechanisms
  - Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse vs. Heterogeneous DBMS
- Traditional heterogeneous DB integration:
  - Build wrappers/mediators on top of the heterogeneous databases
  - Query-driven approach:
    - When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
    - Complex information filtering; sites compete for resources
- Data warehouse: update-driven, high performance
Data Warehouse vs. Operational DBMS
- OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Major task of data warehouse systems
  - Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP

                    OLTP                              OLAP
users               clerk, IT professional            knowledge worker
function            day-to-day operations             decision support
DB design           application-oriented              subject-oriented
data                current, up-to-date, detailed,    historical, summarized,
                    flat relational, isolated         multidimensional,
                                                      integrated, consolidated
usage               repetitive                        ad-hoc
access              read/write, index/hash            lots of scans
                    on primary key
unit of work        short, simple transaction         complex query
# records accessed  tens                              millions
# users             thousands                         hundreds
DB size             100MB-GB                          100GB-TB
metric              transaction throughput            query throughput, response
Why a Separate Data Warehouse?
- High performance for both systems
  - DBMS—tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
  - missing data: decision support requires historical data which operational DBs do not typically maintain
  - data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
Multidimensional Data Model
- A data warehouse is based on a multidimensional data model which views data in the form of a data cube
  - Measure attributes: attributes which measure some value and can be aggregated upon, such as dollars_sold
  - Dimension attributes: attributes on which measure attributes and summaries of measure attributes are viewed, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  - Data cube: a multidimensional view of data, such as sales
A Concept Hierarchy for the Dimension location
[Figure: hierarchy all > region (Europe, North_America) > country (Germany, Spain; Canada, Mexico) > city (Frankfurt, ...; Vancouver, Toronto, ...) > office (L. Chan, M. Young, ...).]
Multidimensional Data
- Sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month (or Week) > Day
A Sample Data Cube
[Figure: a 3-D cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum); one cell of the sum plane gives the total annual sales of TV in the U.S.A.]
Cuboids and Data Cubes
- In the data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids for dimensions product, date, country —
  0-D (apex) cuboid: all
  1-D cuboids: product; date; country
  2-D cuboids: product,date; product,country; date,country
  3-D (base) cuboid: product,date,country]
More Examples of Cuboids in a Cube
[Figure: lattice of cuboids for dimensions time, item, location, supplier —
  0-D (apex) cuboid: all
  1-D cuboids: time; item; location; supplier
  2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid: time,item,location,supplier]
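The lattice above can be enumerated mechanically: an n-dimensional cube has 2^n cuboids, one per subset of the dimensions. A minimal sketch (helper name is ours, not from the slides):

```python
from itertools import combinations

def cuboids(dimensions):
    """Enumerate all 2^n cuboids of a data cube: one per subset of dimensions."""
    n = len(dimensions)
    lattice = {}
    for k in range(n + 1):  # 0-D (apex) up to n-D (base) cuboids
        lattice[k] = list(combinations(dimensions, k))
    return lattice

lattice = cuboids(["time", "item", "location", "supplier"])
total = sum(len(v) for v in lattice.values())
# 4 dimensions -> 2**4 = 16 cuboids: 1 apex, 4 one-dim, 6 two-dim, 4 three-dim, 1 base
```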
Conceptual Modeling of Data Warehouses
- Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema
Example of a Star Schema
- Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Dimension tables:
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city, province_or_state, country
Example of a Snowflake Schema
- Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Dimension tables (partially normalized):
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_key
  - supplier: supplier_key, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city_key
  - city: city_key, city, province_or_state, country
Example of a Fact Constellation
- Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
- Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
- Shared dimension tables:
  - time: time_key, day, day_of_the_week, month, quarter, year
  - item: item_key, item_name, brand, type, supplier_type
  - branch: branch_key, branch_name, branch_type
  - location: location_key, street, city, province_or_state, country
  - shipper: shipper_key, shipper_name, location_key, shipper_type
Typical OLAP Operations
- Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up
  - from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
- Slice and dice:
  - project and select
- Pivot (rotate):
  - reorient the cube; visualization; 3D to a series of 2D planes
- Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
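Roll-up and slice can be shown on a tiny in-memory fact table. A minimal sketch (data and function names are ours, not from the slides): rolling up the location dimension from city to country just re-aggregates the measure at the coarser level, and a slice fixes one dimension value.

```python
from collections import defaultdict

# Tiny fact table: (city, quarter) -> units_sold (made-up sample data)
fact = {
    ("Vancouver", "Q1"): 100, ("Vancouver", "Q2"): 120,
    ("Toronto",   "Q1"):  80, ("Frankfurt", "Q1"):  60,
}
# One level of the location concept hierarchy: city -> country
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Frankfurt": "Germany"}

def roll_up(fact, mapping):
    """Roll up the location dimension from city to country by re-aggregating."""
    out = defaultdict(int)
    for (city, quarter), units in fact.items():
        out[(mapping[city], quarter)] += units
    return dict(out)

def slice_quarter(fact, quarter):
    """Slice: select the sub-cube for one fixed quarter."""
    return {k: v for k, v in fact.items() if k[1] == quarter}

rolled = roll_up(fact, city_to_country)
# rolled[("Canada", "Q1")] == 180  (Vancouver 100 + Toronto 80)
```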

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
The Process of Data Warehouse Design
- Top-down, bottom-up approaches, or a combination of both
  - Top-down: starts with overall design and planning (mature)
  - Bottom-up: starts with experiments and prototypes (rapid)
- From a software engineering point of view
  - Waterfall: structured and systematic analysis at each step before proceeding to the next
  - Spiral: rapid generation of increasingly functional systems, with short turnaround times
Multi-Tiered Architecture
[Figure: data sources (operational DBs and other sources) feed, via extract/transform/load/refresh and a monitor & integrator, into the data warehouse and data marts (data storage tier, with metadata); OLAP servers serve the front-end tools: analysis, query, reports, and data mining.]
Three Data Warehouse Models
- Enterprise warehouse
  - collects all of the information about subjects spanning the entire organization
- Data mart
  - a subset of corporate-wide data that is of value to a specific group of users; its scope is confined to specific, selected groups, such as a marketing data mart
  - independent vs. dependent (fed directly from the warehouse) data marts
- Virtual warehouse
  - a set of views over operational databases
  - only some of the possible summary views may be materialized
Data Warehouse Development: A Recommended Approach
[Figure: define a high-level corporate data model; build distributed data marts and an enterprise data warehouse in parallel, refining the model at each stage; eventually integrate them into a multi-tier data warehouse.]
OLAP Server Architectures
- Relational OLAP (ROLAP)
  - uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support the missing pieces
  - includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
  - greater scalability
- Multidimensional OLAP (MOLAP)
  - array-based multidimensional storage engine (sparse-matrix techniques)
  - fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP)
  - user flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers
  - specialized support for SQL queries over star/snowflake schemas

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
  - What is a data warehouse?
  - A multi-dimensional data model
  - Data warehouse architecture
  - Data warehouse application
- Data mining
Data Warehouse Usage
- Three kinds of data warehouse applications
  - Information processing
    - supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, and graphs
  - Analytical processing
    - multidimensional analysis of data warehouse data
    - supports basic OLAP operations: slice-dice, drilling, pivoting
  - Data mining
    - knowledge discovery from hidden patterns
    - supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
- The differences among the three tasks
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
- Why on-line analytical mining?
  - High quality of data in data warehouses
    - the DW contains integrated, consistent, cleaned data
  - Available information-processing infrastructure surrounding data warehouses
    - ODBC, OLE DB, Web access, service facilities, reporting and OLAP tools
  - OLAP-based exploratory data analysis
    - mining with drilling, dicing, pivoting, etc.
  - On-line selection of data mining functions
    - integration and swapping of multiple mining functions, algorithms, and tasks
- Architecture of OLAM
An OLAM Architecture
[Figure: Layer 4 — user interface (user GUI API; mining query in, mining result out); Layer 3 — OLAM engine and OLAP engine side by side (OLAP/OLAM); Layer 2 — data cube API over MDDBs with metadata; Layer 1 — database API with filtering & integration over the databases and the data warehouse, fed by data cleaning and data integration from the data repository.]

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
What Is Data Mining?
- Data mining (knowledge discovery in databases):
  - extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
- Alternative names and their "inside stories":
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archaeology, data dredging, information harvesting, business intelligence, etc.
Why Data Mining? — Potential Applications
- Database analysis and decision support
  - Market analysis and management
    - target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis
  - Intelligent query answering
Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process
[Figure: databases → data cleaning and data integration → data warehouse → selection → task-relevant data → data mining → pattern evaluation → knowledge.]
Architecture of a Typical Data Mining System
[Figure, top to bottom: graphical user interface; pattern evaluation; data mining engine (consulting a knowledge base); database or data warehouse server; data cleaning, data integration, and filtering over the databases and the data warehouse.]
Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW
Data Mining Functionalities (1)
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association
  - contains(x, "computer") ⇒ contains(x, "software") [support = 1%, confidence = 75%]
  - age(X, "20..29") ^ income(X, "20..29K") ⇒ buys(X, "PC") [2%, 60%]
  - Multi-dimensional vs. single-dimensional association
Data Mining Functionalities (2)
- Classification and prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision tree, classification rules, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering principle: maximize the intra-class similarity and minimize the inter-class similarity
Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Can be considered noise or an exception, but is quite useful in fraud detection and rare-event analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
Are All the "Discovered" Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates a hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the intersection of database technology, machine learning, statistics, information science, visualization, and other disciplines.]

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
Data Generalization and Summarization-Based Characterization
- Data generalization
  - A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones
  - [Figure: conceptual levels 1 (highest) through 5 (lowest); generalization moves data up the levels]
- Approaches:
  - Data cube approach (OLAP approach)
  - Attribute-oriented induction approach
Characterization: Data Cube Approach (without using AO-Induction)
- Perform computations and store the results in data cubes
- Strengths
  - An efficient implementation of data generalization
  - Computation of various kinds of measures, e.g., count(), sum(), average(), max()
  - Generalization and specialization can be performed on a data cube by roll-up and drill-down
- Limitations
  - Handles only dimensions of simple non-numeric data and measures of simple aggregated numeric values
  - Lacks intelligent analysis: cannot tell which dimensions should be used or what level the generalization should reach
Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data nor to particular measures
- How is it done?
  - Collect the task-relevant data (the initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
  - Present the results to users interactively
Basic Principles of Attribute-Oriented Induction
- Data focusing: task-relevant data, including dimensions; the result is the initial relation
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
- Attribute-threshold control: typically 2-8; specified or default
- Generalized relation threshold control: controls the final relation/rule size (see example)
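The core AOI step (generalize each attribute by climbing its concept hierarchy, then merge identical tuples while accumulating counts) fits in a few lines. A minimal sketch with made-up data and hand-written hierarchies, not from the slides:

```python
from collections import Counter

# Hypothetical initial relation: (gender, major, birth city); names already removed
initial = [
    ("M", "CS",      "Vancouver"), ("F", "CS",      "Montreal"),
    ("M", "Physics", "Seattle"),   ("F", "Biology", "Vancouver"),
]

# One-level generalization operators (concept hierarchies) for the sketch
major_to_field = {"CS": "Science", "Physics": "Science", "Biology": "Science"}
city_to_region = {"Vancouver": "Canada", "Montreal": "Canada",
                  "Seattle": "Foreign"}

def generalize(relation):
    """Climb each attribute one level, then merge identical generalized
    tuples, accumulating their counts (the AOI aggregation step)."""
    return Counter(
        (gender, major_to_field[major], city_to_region[city])
        for gender, major, city in relation
    )

prime = generalize(initial)
# e.g. ("F", "Science", "Canada") occurs twice in the prime generalized relation
```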
Class Characterization: An Example

Initial relation (Name and Phone # removed; Gender retained; Major generalized to Sci/Eng/Bus; Birth-Place generalized city → country; Birth_date generalized to age range; Residence generalized to city; GPA generalized to Excl, VG, ...):

Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
...             ...     ...      ...                    ...         ...                       ...       ...

Prime generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
...     ...      ...           ...        ...        ...        ...

Crosstab (Gender × Birth_Region):

         Canada  Foreign  Total
M        16      14       30
F        10      22       32
Total    26      36       62
Presentation of Generalized Results
- Generalized relation:
  - relations where some or all attributes are generalized, with counts or other aggregation values accumulated
- Cross tabulation:
  - mapping results into cross-tabulation form (similar to contingency tables)
- Visualization techniques:
  - pie charts, bar charts, curves, cubes, and other visual forms
- Quantitative characteristic rules:
  - mapping generalized results into characteristic rules with quantitative information attached, e.g.,
    grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
Presentation—Generalized Relation / Presentation—Crosstab
[Two slides showing a generalized relation and a crosstab, as in the example above.]

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
What Is Association Rule Mining?
- Association rule mining:
  - finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Applications:
  - basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Examples:
  - Rule form: "antecedent ⇒ consequent [support, confidence]"
  - buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
Association Rules
- Given
  - a database of customer transactions
  - each transaction is a list of items (purchased by a customer in one visit)
- Find all rules that correlate the presence of one set of items with another set of items
  - Example: 98% of people who purchase tires and auto accessories also get automotive services done
  - Any number of items may appear in the consequent/antecedent of a rule
  - It is possible to specify constraints on the rules (e.g., find only rules involving home laundry appliances)
- Characteristics
  - Number of items: hundreds of thousands; number of transactions: millions; data set sizes: high gigabytes
  - Discovering all rules, rather than verifying whether a given rule holds
  - Performance and scalability are therefore central concerns
Association Rules — Problem Statement
- I = {i1, i2, ..., im}: a set of literals, called items
- Transaction T: a set of items such that T ⊆ I
- Database D: a set of transactions
- A transaction T contains X, a set of some items, if X ⊆ T
- An association rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅
Association Rules — Confidence and Support
- The confidence of rule X ⇒ Y is the percentage of transactions in D containing X that also contain Y:
  c = (number of transactions containing both X and Y) / (number of transactions containing X)
- The support of rule X ⇒ Y is the percentage of transactions in D that contain both X and Y:
  s = (number of transactions containing both X and Y) / (total number of transactions in D)
- Goal: discover all rules that meet the minimum user-specified confidence and support.
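The two formulas translate directly into code. A minimal sketch (function names are ours; the transactions are the ones used in the worked example that follows):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Fraction of transactions containing X that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
s = support({"A", "C"}, D)       # 2 of 4 transactions -> 0.5
c = confidence({"A"}, {"C"}, D)  # 2 of the 3 A-transactions -> 0.666...
```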
Mining Association Rules
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets (the Apriori property)
  - Iteratively find the frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate the desired rules
Mining Association Rules — Example

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With min. support 50% and min. confidence 50%:

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For the rule A ⇒ C:
- support = support({A, C}) = 50%
- confidence = support({A, C}) / support({A}) = 66.6%
The Apriori Algorithm — Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for candidate 1-itemsets C1, keep those with support >= 2 as L1:
C1: {1} 2, {2} 3, {3} 3, {4} 1, {5} 3
L1: {1} 2, {2} 3, {3} 3, {5} 3

Join L1 with itself to get C2, scan D for counts, keep frequent ones as L2:
C2: {1 2} 1, {1 3} 2, {1 5} 1, {2 3} 2, {2 5} 3, {3 5} 2
L2: {1 3} 2, {2 3} 2, {2 5} 3, {3 5} 2

Join L2 with itself (and prune) to get C3, scan D for counts:
C3: {2 3 5}
L3: {2 3 5} 2
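The level-wise search traced above can be sketched as a short function: count candidates against the database, keep the frequent ones, join survivors into the next level, and prune candidates with an infrequent subset. A minimal sketch, not production code:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: return {frequent itemset (frozenset): count}."""
    items = {frozenset([i]) for t in transactions for i in t}

    def frequent(cands):
        # Scan the database and keep candidates meeting min_support
        return {c: n for c in cands
                if (n := sum(1 for t in transactions if c <= t)) >= min_support}

    result, level, k = {}, frequent(items), 1
    while level:
        result.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates
        cands = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = frequent(cands)
    return result

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(D, min_support=2)
# e.g. freq[frozenset({2, 3, 5})] == 2, matching L3 of the example
```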
Using Frequent Itemsets to Generate Rules
- Association rules can be generated as follows:
  - For each frequent itemset l, generate all nonempty proper subsets of l
  - For each nonempty subset s of l, output the rule "s ⇒ (l − s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold
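The rule-generation step reuses the support counts already gathered during the frequent-itemset phase, so no extra database scan is needed. A minimal sketch (names ours, data from the running example):

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Emit (antecedent, consequent, confidence) for every frequent itemset l
    and every nonempty proper subset s of l meeting min_conf."""
    rules = []
    for l, l_count in freq.items():
        if len(l) < 2:
            continue  # rules need both sides nonempty
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = l_count / freq[s]  # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Support counts from the running example (4 transactions, min. support 2)
freq = {frozenset({"A"}): 3, frozenset({"B"}): 2, frozenset({"C"}): 2,
        frozenset({"A", "C"}): 2}
rules = generate_rules(freq, min_conf=0.5)
# yields A => C with confidence 2/3 and C => A with confidence 1.0
```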
Summary
- Association rule mining
  - is probably the most significant contribution from the database community to KDD
  - a large number of papers have been published
  - many interesting issues have been explored
- An interesting research direction:
  - association analysis in other types of data: spatial data, multimedia data, time-series data, etc.

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
Classification
- Classification:
  - predicts categorical class labels
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Typical applications
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment-effectiveness analysis
Classification—A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction: the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
Classification Process (1): Model Construction

Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm produces a classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Use the Model in Prediction

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) — Tenured?
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification by Decision Tree Induction
- Decision tree
  - A flow-chart-like tree structure
  - Internal nodes denote tests on attributes
  - Branches represent outcomes of the tests
  - Leaf nodes represent class labels or class distributions
- Decision tree generation consists of two phases
  - Tree construction
    - At the start, all the training examples are at the root
    - Partition the examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Use of a decision tree: classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
Training Dataset (following an example from Quinlan's ID3)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Output: A Decision Tree for "buys_computer"
[Figure: the root tests age?
  age <=30 → test student? (no → no; yes → yes)
  age 30..40 → yes
  age >40 → test credit rating? (excellent → no; fair → yes)]
Another Example (from Quinlan)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
A Sample Decision Tree
[Figure: the root tests Outlook
  sunny → test humidity (high → N; normal → P)
  overcast → P
  rain → test windy (true → N; false → P)]
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping the partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  - There are no samples left
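The attribute-selection heuristic the slides name, information gain, is the reduction in class entropy obtained by splitting on an attribute; ID3 picks the attribute with the largest gain at each node. A minimal sketch (rows and names are ours, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label):
    """Entropy reduction from splitting `rows` (list of dicts) on `attr`."""
    base = entropy([r[label] for r in rows])
    split = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        split += len(subset) / len(rows) * entropy(subset)
    return base - split

# Four made-up weather rows, just to exercise the measure
rows = [
    {"outlook": "sunny", "play": "N"}, {"outlook": "sunny", "play": "N"},
    {"outlook": "overcast", "play": "P"}, {"outlook": "rain", "play": "P"},
]
gain = information_gain(rows, "outlook", "play")  # a perfect split: gain = 1.0
```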
Summary
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications; combining classification with database techniques should therefore be a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, and multimedia data

- Motivation: Why knowledge discovery in databases?
- Data warehouse and OLAP
- Data mining
  - Introduction to data mining
  - Characterization
  - Association rule mining
  - Classification
  - Cluster analysis
Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth-observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continental faults
What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given models
The K-Means Clustering Method
- Example
[Figure: four scatter plots on 0-10 axes showing k-means iterations — initial assignment of points to k seed centers, recomputation of cluster means, reassignment of points, and the final converged clusters.]
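The iterations in the figure alternate two steps: assign each point to its nearest center, then move each center to the mean of its assigned points. A minimal sketch on 2-D points (data and seeding choice are ours, not from the slides):

```python
from math import dist
from statistics import mean

def k_means(points, k, iters=20):
    """Plain k-means: assign each point to the nearest center, recompute
    centers as cluster means, repeat for a fixed number of iterations."""
    centers = points[:k]  # naive seeding: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            (mean(x for x, _ in c), mean(y for _, y in c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated blobs (made-up coordinates)
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, clusters = k_means(pts, k=2)
# each blob ends up in its own cluster of three points
```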
Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
[Figure: objects a, b, c, d, e — agglomerative clustering (AGNES) proceeds step 0 → 4, merging a, b into ab and c, d, e into cde, then into abcde; divisive clustering (DIANA) runs the same steps in reverse, step 4 → 0.]
AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Continues in a non-descending fashion
- Eventually all nodes belong to the same cluster
[Figure: three scatter plots on 0-10 axes showing clusters being progressively merged.]
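The single-link criterion AGNES uses measures the dissimilarity between two clusters as the distance between their closest members; the algorithm repeatedly merges the least-dissimilar pair. A minimal O(n^3) sketch stopping at k clusters rather than one (a simplification of AGNES, with made-up data):

```python
from math import dist

def single_link(points, k):
    """Agglomerative clustering with the single-link criterion: repeatedly
    merge the two clusters whose closest members are nearest, until k remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link dissimilarity: distance of the closest pair
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the least-dissimilar pair
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
out = single_link(pts, k=3)
# the two tight pairs merge first; (10, 0) stays a singleton
```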
DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own
[Figure: three scatter plots on 0-10 axes showing one cluster being progressively split apart.]