CS 245A Intelligent Information Systems
Using Type Inference and Induced
Rules to Provide Intensional Answers
Wesley W. Chu
Rei-Chi Lee
Qiming Chen
What is Intensional Answer?
An intensional answer to a query provides the characteristics that describe the
database values (the extensional answers) that satisfy the query.
Intensional answers provide the users with:
Summarized or approximate descriptions of the extensional answers
Additional insight into the nature of the extensional answers
2
An Example of Intensional Answer
Consider a personnel database containing the relation:
EMPLOYEE = (ID, Name, Position, Salary)
To find the person whose annual salary is more than 100K, the
query can be specified as
Q = SELECT * FROM EMPLOYEE WHERE Salary >100K;
A traditional answer would be:
{“Smith”, “Jones”,...}
An intensional answer would be:
“All the managers.”
3
Prior Work:
Constraint-based approach for intensional query answering (Motro 89)
Aggregate response using type hierarchy (Shum 88)
Only limited forms of intensional answers can be generated
New Approach:
Use both the type hierarchy and database intensional knowledge
Two Phases:
Knowledge Acquisition
Use rule induction to derive intensional knowledge from the database content
Type Inference
Based on the type hierarchy, use the derived rules to generate specific
intensional answers
4
Traditional Views of Type Hierarchy
In semantic or object-oriented data modeling, there
are two traditional views of type hierarchy:
1. IS_A Hierarchy:
A IS_A B means every member of type A is also a member of
type B.
2. PART_OF Hierarchy:
A is PART_OF B means A is a component of B.
These two views are mainly used for data modeling, which provides a language for:
describing and storing data
accessing and manipulating the data
5
The Notion of Type Hierarchy
Classes and Types:
Any of the entities being modeled that share some common
characteristics are gathered into classes.
All elements of the class have the same class type.
Type Hierarchy is a partial order for the set of types:
Types (referred to as super-types) at higher positions are
more generalized than types at lower positions.
Types (referred to as sub-types) at lower positions are more
specialized than types at higher positions.
6
An IS_A Type Hierarchy Example
7
Type Inference
Type Inference is the process of traversing the type hierarchy based on the
query condition and the induced rules.
Traversal of the type hierarchy can be performed in two directions:
Forward Inference
Backward Inference
8
Deriving Intensional Answers Using Forward Inference
Forward Inference uses known facts to derive more facts. That is, given a rule
"If X then Y" and a fact "X is true", we can conclude "Y is true".
We perform forward inference by traversing the type hierarchies downward from
the type involved in the query (a small sketch follows below). As a result:
The search scope for answering the query can be reduced
The lowest (most specific) type descriptions satisfying the query condition are
returned as the intensional answers
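To make the traversal concrete, here is a minimal Python sketch of forward inference over an IS_A hierarchy with induced rules; the hierarchy, rule encoding, and names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: hierarchy, rules, and names are assumed, not the paper's code.
IS_A = {            # sub-type -> super-type edges of an IS_A hierarchy
    "SSBN": "SUBMARINE",
    "SSN": "SUBMARINE",
    "SUBMARINE": "SHIP",
}

# Induced rules: every instance of <type> has <attribute> strictly greater than <low>
# and at most <high>; the bounds below are made up for illustration.
RULES = {
    "SSBN": {"Displacement": (8000, 30000)},
    "SSN": {"Displacement": (2000, 8000)},
}

def subtypes(t):
    """Direct sub-types of t (one step downward in the hierarchy)."""
    return [s for s, sup in IS_A.items() if sup == t]

def forward_inference(query_type, attribute, threshold):
    """Traverse downward from query_type and collect the most specific types whose
    induced rule guarantees attribute > threshold; they form the intensional answer."""
    answers = []
    for sub in subtypes(query_type):
        low, _high = RULES.get(sub, {}).get(attribute, (None, None))
        if low is not None and low >= threshold:
            answers.append(sub)                  # the rule implies the query condition
        answers.extend(forward_inference(sub, attribute, threshold))
    return answers

# Query: SELECT * FROM SUBMARINE WHERE Displacement > 8000
print(forward_inference("SUBMARINE", "Displacement", 8000))
# -> ['SSBN'], read as the intensional answer "the answers are SSBN submarines"
```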
9
A Forward Inference Example
To find the submarine whose displacement is greater than 8000, the
query can be specified as:
Q = SELECT * FROM SUBMARINE WHERE Displacement > 8000.
The extensional answer to the query is:
id       name          class  type
SSBN730  Rhode Island  0101   SSBN
SSBN130  Typhoon       1301   SSBN
Using forward inference with R4, we can derive the following
intensional answer:
“Ships are SSBN submarines.”
10
Deriving Intensional Answers Using Backward Inference
Backward Inference uses the known facts to infer what must be true according to
the type hierarchies and induced rules.
Using backward inference, we traverse upward along the type hierarchies to
provide a set of types with constraints as the intensional answers.
11
A Backward Inference Example
To find the names and classes of the SSBN submarines, the query can be specified as:
Q=
SELECT Name, Class
FROM SUBMARINE, CLASS
WHERE Type = “SSBN”;
The extensional answer to the query is:
name                class
Nathaniel Hale      0103
Daniel Boone        0103
Sam Rayburn         0103
Lewis and Clark     0102
Mariano G. Vallejo  0102
Rhode Island        0101
Typhoon             1301
Using backward inference, we can derive the following intensional answer:
“Some ships have classes in the range of 0101 to 0103.”
12
Deriving Intensional Answers via Type
Inference
Using forward inference, the intensional answer gives a set of type
descriptions that includes the answers.
Using backward inference, the intensional answer gives only a
description of partial answers.
Therefore,
the intensional answers derived from forward inference characterize a set of
instances containing the extensional answers, whereas
the intensional answers derived from backward inference characterize a set of
answers contained in the extensional answers.
Forward inference and backward inference can also be combined to
derive more specific intensional answers.
13
A Forward and Backward Inference Example
To find the names, classes, and types of the SUBMARINES equipped with sonar BQS-04, the query can
be specified as:
Q=
SELECT SUBMARINE.Name, SUBMARINE.Class, CLASS.Type
FROM SUBMARINE, CLASS, INSTALL
WHERE SUBMARINE.Class = CLASS.Class
AND SUBMARINE.Id = INSTALL.Ship
AND INSTALL.Sonar = “BQS-04”
The extensional answer to the query is:
name           class  type
Bonefish       0215   SSN
Seadragon      0212   SSN
Snook          0209   SSN
Robert E. Lee  0208   SSN
Using both forward inference and backward inference, we can derive the following intensional answer:
“Ship type SSN with class in the range of 0208 to 0215 is equipped with sonar BQS-04.”
14
Conclusions
In this research, we have proposed an approach to provide
intensional answers using type inference and induced rules:
Type Inference
Forward inference
Backward inference
Combined forward and backward inference
Type inference with multiple type hierarchies
Rule Induction
Model-based inductive learning technique derives rules from database contents
For databases with a strong type hierarchy and semantic knowledge, type
inference is more effective than integrity constraints for deriving intensional
answers
15
16
Fault Tolerant DDBMS Via Data Inference
Network Partition
Causes:
Failures of:
Channels
Nodes
Effects:
Queries cannot be processed if the required data is inaccessible
Replicated files in different partitions may be inconsistent
Updates may only be allowed in one partition
Transactions may be aborted
17
Conventional Approach for Handling
Network Partitioning
Based on syntax to serialize the operations
To assure data consistency
Not all queries can be processed
Based on data availability, determine which partition is allowed to perform
database updates
POOR AVAILABILITY!!
18
New Approach
Exploit data and transaction semantics
Use the Data Inference Approach
Assumption: data are correlated
Examples:
Salary and rank
Ship type and weapon
Infer inaccessible data from the accessible data
Use semantic information to permit updates under network partitioning
19
Query Processing System with Data
Inference
Consists of:
DDBMS
Knowledge Base (rule-based)
Inference Engine
20
DDBMS with Data Inference
[Architecture diagram with components: Query Input, Query Parser and Analyzer,
Information Module (database fragments, allocation, availability), Inference
System (Inference Engine, Rule-Based Knowledge-Base System), DDBMS, Query Output]
21
Fault Tolerant DDBMS with Inference Systems
[Diagram: three sites (LA, SF, NY), each with its own database (DB1, DB2, DB3),
knowledge base (KB1, KB2, KB3), and inference engine (IE)]
KB rules:
SHIP(SID) -> INSTALL(TYPE)
INSTALL(TYPE) -> INSTALL(WEAPON)
22
Architecture of Distributed Database with Inference
23
Motivation of Open Data Inference
Correlated knowledge is incomplete:
Incomplete rules
Incomplete objects
24
Example of Incomplete Object
Type -------> Weapon
IF type in {CG, CGN} THEN weapon = SAM01
IF type = DDG THEN weapon = SAM02
TYPE  WEAPON
CG    SAM01
CGN   SAM01
DDG   SAM02
SSGN  ??
Result: Incomplete rules generate incomplete object.
25
Merge of Incomplete Objects
Observation:
Relational join is not adequate for combining incomplete objects
It loses information
Questions:
What kind of algebraic tools do we need to combine incomplete objects without
losing information?
Are there correctness criteria to evaluate the incomplete results?
26
Merge of Incomplete Objects
TYPE ---> WEAPON and WEAPON ---> WARFARE

Type  Weapon
CG    SAM01
CGN   SAM01
DDG   SAM02
SSGN  ?

Weapon  Warfare
SAM01   WF1C
SAM03   WF1D

Use relational join to combine the above two paths:
Type  Weapon  Warfare
CG    SAM01   WF1C
CGN   SAM01   WF1C

Other way to combine:
TYPE  WEAPON  WARFARE
CG    SAM01   WF1C
CGN   SAM01   WF1C
DDG   SAM02   ?
?     SAM03   WF1D
SSGN  ?       ?
27
New Algebraic Tools for Incomplete Objects
S-REDUCTION
Reduce redundant tuples in the object
OPEN S-UNION
Combine incomplete objects
28
S-Reduction
Remove redundant tuples in the object
Object RR with key attribute A is reduced to R
RR
A  B  C
a  1  aa
b  2  _
c  _  cc
a  1  aa
b  _  bb
c  _  _

R
A  B  C
a  1  aa
b  2  bb
c  _  cc
29
Open S-Union
Modify join operation to accommodate incomplete
information
Used to combine closed/open objects
R1
sid   type
s101  DD
s102  DD
s103  CG

R2
type  ---->  weapon
DD           SAM01
CG           -

R
sid   type  weapon
s101  DD    SAM01
s102  DD    SAM01
s103  CG    -
30
Open S-Union and Toleration
Performing open union on two objects R1, R2
generates the third object which tolerates both R1
and R2.
R1 U R2 ----> R

R1
sid   type
s101  DD
s102  DD
s103  CG

R2
type  weapon
DD    SAM01
CG    -

R
sid   type  weapon
s101  DD    SAM01
s102  DD    SAM01
s103  CG    -
R tolerates R1
R tolerates R2
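A small Python sketch of how s-reduction and open s-union might operate on incomplete tuples, with None standing for the unknown value ('_' / '-'); the dict representation and function names are assumptions for illustration, not the authors' operators.

```python
# Sketch of s-reduction and open s-union on incomplete tuples (assumed representation).

def tolerates(t1, t2):
    """t1 and t2 tolerate each other if they agree wherever both are non-null."""
    return all(t1.get(a) is None or t2.get(a) is None or t1[a] == t2[a]
               for a in set(t1) | set(t2))

def merge(t1, t2):
    """Combine two tolerating tuples, keeping the known values of both."""
    return {a: t1.get(a) if t1.get(a) is not None else t2.get(a)
            for a in set(t1) | set(t2)}

def s_reduce(rel, key):
    """S-REDUCTION: merge tuples that share the key and tolerate each other."""
    out = {}
    for t in rel:
        k = t[key]
        if k in out and tolerates(out[k], t):
            out[k] = merge(out[k], t)
        else:
            out.setdefault(k, t)
    return list(out.values())

def open_s_union(r1, r2, on):
    """OPEN S-UNION: combine two possibly incomplete objects on a shared attribute;
    unmatched tuples are kept with their unknowns left open instead of dropped."""
    result = []
    for t1 in r1:
        matches = [t2 for t2 in r2 if t1.get(on) == t2.get(on) and tolerates(t1, t2)]
        if matches:
            result.extend(merge(t1, t2) for t2 in matches)
        else:
            result.append(dict(t1))   # keep the tuple; unknown attributes stay open
    return result

R1 = [{"sid": "s101", "type": "DD"}, {"sid": "s102", "type": "DD"},
      {"sid": "s103", "type": "CG"}]
R2 = [{"type": "DD", "weapon": "SAM01"}, {"type": "CG", "weapon": None}]
print(open_s_union(R1, R2, on="type"))
# -> s101/DD/SAM01, s102/DD/SAM01, s103/CG with weapon unknown (tolerates R1 and R2)
```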
31
Example of Open Inference
site LA: SHIP(sid,sname,class)
site SF: INSTALL(sid,weapon)
site NY: CLASS(class,type,tname)
Query: Find the ship names that carry weapon ‘SAM01’
(assuming site SF is partitioned)
[Diagram: a network partition separates site SF (holding INSTALL) from sites LA
(holding SHIP) and NY (holding CLASS)]
Rule: If SHIP TYPE = DD, Then WEAPON = SAM01
32
Implementation
Derive missing relations from accessible relations
and correlated knowledge
Three types of derivations:
View mechanism to derive new relations based on certain source relations
Valuations of incomplete relations based on correlated knowledge
Combining two intermediate results via the open s-union operation
33
Example of Open Inference
DERIVATION 1: select sid, type from SHIP, CLASS
DERIVATION 2: CLASS(type) --> INSTALL(weapon)
R1
sid   type
s101  DD
s102  DD
s103  CG

R2
type  ---->  weapon
DD           SAM01
CG           -

INSTALL_INF
sid   type  weapon
s101  DD    SAM01
s102  DD    SAM01
s103  CG    -
INSTALL_INF can be used to replace missing relation INSTALL
34
Fault Tolerant DDBMS via Inference Techniques
Query Processing Under Network Partitioning
Open Inference: inference with incomplete information
Algebraic tools for manipulating incomplete objects
Toleration: a weaker correctness criterion for evaluating incomplete information
35
Conclusion
Data Inference is an effective method for
providing database fault tolerance
during network partitioning.
36
Intelligent Dictionary and Directory (IDD)
The role of the IDD and the emerging technology of
Object-Oriented Database Systems
The integration of Artificial Intelligence and
Database Management tools and techniques to
explore new architectures for the IDD
The support of future applications: heterogeneous,
distributed, cooperating data/knowledge systems
guided by active, intelligent dictionaries and
directories and managed by Data and Knowledge
Administrators
37
Object-Oriented Dictionary Modeling
Information Resource Dictionary System
Functional Specification of the IDD
Role of Machine Learning in IDD
Mining Knowledge from Data
 Schema Evolution

System Optimization Issues
Support for Hypermedia
38
The Knowledge/Data Model (KDM)
The KDM modeling primitives are:
Generalization: Generalization provides the facility in the KDM to
group similar objects into a more general object. This
generalization hierarchy defines the inheritance structure.
Classification: Classification provides a means whereby specific
object instances can be considered as a higher-level object-type
(an object-type is a collection of similar objects). This is done
through the use of the “is-instance-of” relationship.
Aggregation: Aggregation is an abstraction mechanism in which an
object is related to its components via the “is-part-of”
relationship.
Membership: Membership is an abstraction mechanism that
specifically supports the “is-a-member-of” relationship.
39
The Knowledge/Data Model
Temporal: Temporal relationship primitives relate object-types by means of
synchronous and asynchronous relationships.
Constraints: This primitive is used to place a constraint on some aspect of an
object, operation, or relationship via the "is-constraint-on" relationship.
Heuristic: A heuristic can be associated with an object via the
"is-heuristic-on" relationship. These are used to allow specifications of rules
and knowledge to be associated with an object. In this way, object properties
can be inferred using appropriate heuristics.
40
The KDL Template for Object-Type Specification
object-type: OBJECT-TYPE-NAME has
[attributes:
{ATTRIBUTE-NAME:
[set of/list of] VALUE-TYPE
/*default is single-valued
composed of [ATTRIBUTE-NAME,}]
with constraints {predicate,}]
with heuristics {RULE,}];}]
[subtypes:
{OBJECT-TYPE-NAME,}]
[supertypes:
{OBJECT-TYPE-NAME,}]
[constraints:
{predicate,}]
[heuristics:
{rule,}]
41
The KDL Template for Object-Type Specification
(Cont’d)
/*successors predecessors, and concurrents are temporal primitives
[successors:
{OBJECT-TYPE-NAME,}]
[predecessors:
{OBJECT-TYPE-NAME,}]
[concurrents:
{OBJECT-TYPE-NAME,}]
[members:
{MEMBER-NAME: MEMBER-TYPE}]
[instances:
{INSTANCE,}]
end-object-type
42
Three Services Database Schemata
43
Knowledge Source Schemata in the KDM Paradigm
44
KDL Object Type Specification Template
45
The THESAURUS_OBJECT MetaSchema
46
The THESAURUS_OBJECT
Meta-Object-Type Specification
47
The KNOWLEDGE_SOURCE_OBJECT
Meta-Object-Type Specification
48
Three-Level Specification – Local, FM, and Federation
49
A sample export data/knowledge/task schema for a
federation interface manager
50
Conclusions
An IDD based on the Knowledge/Data Model can provide the modeling power needed to:
1. Extend the notions of the Information Resource Dictionary System
2. Support Object-Oriented DBMS
3. Act as an Intelligent Thesaurus to support Cooperating Knowledge Sources for
Heterogeneous Databases
Schema Evolution will require a meta-level characterization of the KDM
constructs so that inference tools can reason about the effects of changes to
schemas.
51
52
Intelligent Heterogeneous Autonomous
Database Architecture (INHEAD)
Reference:
D. Weishar and L. Kershberg, “An Intelligent
Heterogeneous Autonomous Database
Architecture for Semantic Heterogeneity
Support”, Proceedings of the First International
Workshop on Interoperability in Multi-Database
Systems. Kyoto, Japan, pp. 152-155, 1991.
53
INHEAD
Place the query on the blackboard
KSs (domain experts) of the DBMS
The KSs cooperatively try to find a solution to the query
If no solution is found, further clarification and additional information are
requested from the users
The thesaurus performs semantic query processing of the user's original query
The controller provides:
the necessary query translation and optimization
integration of the results
54
IDD for an Intelligent Front End to
Heterogeneous Databases
55
BLACKBOARD
Dynamic Control - make inferences related to
solution formation at each step
Focus of Attention - determine what part of the
emerging solution should be attended to next
Flexibility of Programming the Control - knowledge
about how control should be applied in various
domains can be codified in control rules or in
complex control regimes
56
Modularity
Well suited to the class of problems possessing one or more of the following
characteristics:
The need to represent many specialized and distinct kinds of knowledge
The need to integrate disparate information
A natural domain hierarchy
Having continuous data input (e.g., signal tracking)
Having sparse knowledge/data
Supporting semantic heterogeneity in a system of heterogeneous autonomous
databases exhibits many of these characteristics.
57
Opportunistic Query Processing
Opportunistic - the query can be processed based on goal, sub-goal, and
hypothesis changes.
Redundant and overlapping data allow parallel processing
Incremental Query Processing - processing of the query can be halted when the
control structure determines that the query has been satisfied.
58
The Active and Intelligent Thesaurus
Validating and performing consistency checks on the input to the thesaurus itself
Indexing and converting data values
Translating queries using different variants of names
Actively participating in on-line HELP (i.e., offering suggestions)
The thesaurus can be used as:
A repository of knowledge about data items
An incorporation of newly discovered knowledge
An integration with existing knowledge
59
Data/Knowledge Packets
Object Encapsulation
Encapsulating:
Object structure
Relationships
Operations
Constraints
Rules
A Data/Knowledge Packet allows the specification of abstract object types at the
global level and the encapsulation of optional and structural semantics.
60
An Example: The Artillery Movement Problem
Goal: provision 10 M110 Howitzer Weapon Systems for departure to the Middle East
in 5 days.
1) Characteristics DB: describes the physical characteristics of the component
parts of the weapons system
2) Weapon system DB: describes the components of weapons systems
3) Logistics database: describes the logistics support required to sustain
weapons systems in combat
4) Personnel DB for crew requisitioning
5) Ship DB for obtaining space on seagoing vessels.
61
Overall Goal: Provision 10 M110 Howitzer Weapon Systems for departure to the
Middle East in 5 days.
Subgoals
1.0 Determine availability of 10 M110 Howitzer Weapon Systems
1.1 Determine the locations of such items, subject to constraints of being
within 500 miles of Norfolk, Virginia
1.2 Send requests for items to locations to hold for shipment
2.0 Determine Availability of Logistic Support Units
2.1 Specialize camouflage to desert conditions
2.2 Specialize radar to desert night vision
2.3 Specialize rations to high water content rations
2.4 Specialize clothing to lightweight, chemically resistant
62
3.0
Determine Availability of Sealift Capability along the
Eastern Seaboard
3.1 Calculate total weight and volume for each system
3.2 Provision crews for each system
3.3 Assign crews and weapons to ships
3.3.1 Notify Crews
3.3.2 Send shipment requisitions to sites holding weapons
systems
63
64
Uncertainty Management Using
Rough Sets
Why Deal with Uncertainty
Most tasks requiring intelligent behavior have some
uncertainty
Forms of uncertainty in KB systems
Uncertainty in the data
Missing data
Imprecise representation, etc.
Uncertainty in the knowledge base
Best guesses
Not applicable in all domains
66
Why Deal with Uncertainty (cont’d)
Some approaches to handle uncertainty:
Probability and Bayesian statistics
Confidence (or certainty) factors
Dempster Shafer theory of evidence
Fuzzy sets and fuzzy logic
Problems with these approaches
They make strong statistical assumptions, such as following a probability
distribution model
E.g., the Bayesian approach
They cannot recognize structural properties of data qualitatively; uncertainty
is represented only through numbers
E.g., fuzzy logic - the concepts of "Tall", "Very Tall", etc.
67
Rough Sets
Good for reasoning from qualitative and imprecise data
No approximation by numbers
No probability distribution model required
Uses set theory to provide insight into the structural properties of data
Theory developed by Z. Pawlak in 1982
Well-known experimental applications in:
Medical diagnosis (Pawlak, Slowinski & Slowinski, 1986)
Machine learning (Wong & Ziarko, 1986b)
Information retrieval (Gupta, 1988)
Conceptual engineering design (Arciszewski and Ziarko, 1986)
Approximate reasoning (Rasiowa and Epstein, 1987)
BASIC IDEA
Lower the degree of precision in the representation of objects
Make data regularities more visible and easier to characterize in terms of rules
68
Example 1
69
Rough Sets vs. Classical Sets
Classical sets have well defined boundaries since the data
representation is exact
Rough sets have fuzzy boundaries since knowledge is
insufficient to determine exact membership of an object in the set

Example:
U: the universal set of all cars
X: the set of all fuel-efficient cars
In the rough set approach, fuel efficiency is indirectly determined from
attributes such as:
Weight of car
Size of engine
Number of cylinders, etc.
Attribute Dependency
Qualitatively determine the significance of one or more attributes
(such as Weight, Size) on a decision attribute (such as Fuel eff.)
70
71
72
Indiscernibility Relation (IND) Equivalence Class
73
Definitions
74
Definition (cont’d)
Boundary region BND(X)
Consists of objects whose membership cannot be determined exactly.
$BND(X) = \overline{IND}(X) - \underline{IND}(X)$ (upper approximation minus lower approximation)
Negative region NEG(X)
Union of those elementary sets of IND that are entirely outside X.
$NEG(X) = U - \overline{IND}(X)$
Accuracy measure AM(X)
If the lower approximation is different from the upper approximation, the set is rough.
$AM(X) = Card(\underline{IND}(X)) / Card(\overline{IND}(X))$
75
Example 2
76
What is the accuracy measure when the set of melons is classified on the
attribute "size"?
Let c be the condition attribute 'size'
Let x be the 'set of all melons'
x = {p1, p2, p4, p5}
The elementary classes of the attribute 'size' are as follows:
x1 = {p3, p6, p7, p9}   size = small
x2 = {p1, p5}           size = med
x3 = {p2, p4, p8}       size = large
Lower approximation $\underline{IND}(x, c)$ = {p1, p5}
Upper approximation $\overline{IND}(x, c)$ = {p1, p2, p4, p5, p8}
BND(x, c) = {p2, p4, p8}
NEG(x, c) = {p3, p6, p7, p9}
Accuracy measure AM(x, c) = 2/5
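The same computation in a short Python sketch; the attribute encoding below is reconstructed from the numbers worked out above and is only illustrative.

```python
# Rough-set approximations for the melon/"size" example (reconstructed encoding).

SIZE = {"p1": "med", "p2": "large", "p3": "small", "p4": "large", "p5": "med",
        "p6": "small", "p7": "small", "p8": "large", "p9": "small"}
MELONS = {"p1", "p2", "p4", "p5"}      # x: the set being approximated
U = set(SIZE)                          # the universe of objects

def elementary_classes(attr):
    """Partition U into indiscernibility classes of a single attribute."""
    classes = {}
    for obj, val in attr.items():
        classes.setdefault(val, set()).add(obj)
    return list(classes.values())

def approximations(x, attr):
    classes = elementary_classes(attr)
    lower = set().union(*([c for c in classes if c <= x] or [set()]))
    upper = set().union(*([c for c in classes if c & x] or [set()]))
    return lower, upper

lower, upper = approximations(MELONS, SIZE)
print(sorted(lower))              # ['p1', 'p5']                     lower approximation
print(sorted(upper))              # ['p1', 'p2', 'p4', 'p5', 'p8']   upper approximation
print(sorted(upper - lower))      # ['p2', 'p4', 'p8']               BND
print(sorted(U - upper))          # ['p3', 'p6', 'p7', 'p9']         NEG
print(len(lower) / len(upper))    # 0.4 = 2/5                        accuracy AM
```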
77
Attribute Dependency
78
Example 3
How useful is {shape, taste} in determining the {kind of products}?
C: {shape, taste}
D: {kind of product}
D' = the elementary classes of IND(D)
   = {{set of all melons}, {set of all other fruits}}
   = {{p1, p2, p4, p5}, {p3, p6, p7, p8, p9}}
Elementary classes of IND(C):
{p1}               for (sph, sweet)
{p2, p4, p5, p6}   for (cyl, sweet)
{p3, p8}           for (sph, normal)
{p9}               for (cyl, normal)
{p7}               for (sph, sour)
{ }                for (cyl, sour)
79
POS(C, D) = the union of all positive regions
          = { p1       (contained in class Melon),
              p3, p8   (contained in class Other),
              p9       (contained in class Other),
              p7       (contained in class Other) }
Dependency = card(POS(C, D)) / card(U) = 5/9
Since 0 < 5/9 < 1, we have a partial dependency of D = {kind of product} on
C = {shape, taste}
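A sketch of how K(C, D) = card(POS(C, D)) / card(U) could be computed for this example; the individual shape and taste values are reconstructed from the elementary classes listed above and are assumptions to that extent.

```python
# Sketch of the dependency of D = {kind of product} on C = {shape, taste}.

SHAPE = {"p1": "sph", "p2": "cyl", "p3": "sph", "p4": "cyl", "p5": "cyl",
         "p6": "cyl", "p7": "sph", "p8": "sph", "p9": "cyl"}
TASTE = {"p1": "sweet", "p2": "sweet", "p3": "normal", "p4": "sweet", "p5": "sweet",
         "p6": "sweet", "p7": "sour", "p8": "normal", "p9": "normal"}
KIND = {p: ("melon" if p in {"p1", "p2", "p4", "p5"} else "other") for p in SHAPE}
U = set(SHAPE)

def partition(attrs):
    """Indiscernibility classes of a list of attribute dictionaries."""
    classes = {}
    for obj in U:
        classes.setdefault(tuple(a[obj] for a in attrs), set()).add(obj)
    return list(classes.values())

def dependency(condition_attrs, decision_attrs):
    decision_classes = partition(decision_attrs)
    pos = set()
    for c in partition(condition_attrs):
        if any(c <= d for d in decision_classes):   # c lies wholly inside one decision class
            pos |= c                                # ...so it belongs to POS(C, D)
    return len(pos) / len(U)

print(dependency([SHAPE, TASTE], [KIND]))   # 5/9, i.e. a partial dependency
```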
80
Interpretation of K(C,D)
IF K(C,D) = 1, we have full dependency

Any class of object in D can be completely determined by
the attributes in C.
IF 0 < K(C,D) < 1, we have only partial dependency

The class of only some objects in D can be completely
determined by attributes in C.
IF K(C,D) = 0, we have no dependency

No object in D can be completely determined by the
attributes in C.
81
Similarly, we can calculate the dependency of {kind of
product} on other attribute groupings:
Dependency on {shape, size} = 1
Dependency on {size, taste} = 1
Dependency on {shape, size, taste} = 1
82
Minimal Set of Attributes or REDUCTS
Objective
Find the minimal set (or sets) of interacting attributes that would have the
same discriminating power as the original set of attributes.
This allows us to eliminate irrelevant or noisy attributes without loss of
essential information.
In our example, {size, shape} and {size, taste} are minimal sets of attributes.
Advantages:
Irrelevant attributes can be eliminated from a diagnostic
procedure, thereby reducing the costs of testing and obtaining
those values.
The knowledge-base system can form decision rules based on
minimal sets.
For example, we can form the rules
if (size = large) and (taste = sweet) then kind of product = melon
if (shape = cyl) and (size = small) then kind of product = other
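A brute-force sketch of reduct search using the same dependency measure; it is exponential in the number of attributes and purely illustrative, with the attribute values reconstructed from the earlier examples.

```python
# Brute-force reduct search over the fruit example (reconstructed, illustrative data).
from itertools import combinations

ATTRS = {
    "size":  {"p1": "med", "p2": "large", "p3": "small", "p4": "large", "p5": "med",
              "p6": "small", "p7": "small", "p8": "large", "p9": "small"},
    "shape": {"p1": "sph", "p2": "cyl", "p3": "sph", "p4": "cyl", "p5": "cyl",
              "p6": "cyl", "p7": "sph", "p8": "sph", "p9": "cyl"},
    "taste": {"p1": "sweet", "p2": "sweet", "p3": "normal", "p4": "sweet", "p5": "sweet",
              "p6": "sweet", "p7": "sour", "p8": "normal", "p9": "normal"},
}
KIND = {p: ("melon" if p in {"p1", "p2", "p4", "p5"} else "other") for p in ATTRS["size"]}
U = sorted(KIND)

def dependency(attr_names):
    """K(C, D): fraction of objects whose decision class is determined by attributes C."""
    groups = {}
    for p in U:
        groups.setdefault(tuple(ATTRS[a][p] for a in attr_names), set()).add(p)
    pos = [g for g in groups.values() if len({KIND[p] for p in g}) == 1]
    return sum(len(g) for g in pos) / len(U)

full = dependency(list(ATTRS))
reducts = []
for r in range(1, len(ATTRS) + 1):
    for subset in combinations(ATTRS, r):
        if any(set(f) <= set(subset) for f in reducts):
            continue                          # a proper subset already suffices
        if dependency(subset) == full:
            reducts.append(subset)
print(reducts)   # [('size', 'shape'), ('size', 'taste')] -- the minimal sets above
```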
83
Deterministic and Non-Deterministic Rules
Deterministic rules have only one outcome. They are
obtained from the positive and negative regions of the
approximation space.
Non-deterministic rules can have more than one outcome.
They are formed from the boundary regions of the
approximation space.
Selection of the best minimal set
If there is more than one minimal set, which is the best one?
If we assign a cost function to the attributes, selection can be based on a
minimum-cost criterion
E.g., in the medical domain, some diagnostic procedures are more expensive than
others.
If there is no cost function, select the set with the minimal number of
attributes
84
Example 4
85
If Condition Attributes C = {size, cyl, turbo, fuelsys, displace,
comp, power, trans, weight} and Decision Attribute D =
{mileage}
K(C,D) was calculated to be 1
If C = {size, power}, D = {mileage}
K(C,D) was calculated to be 0.269
Thus, “size” and “power” are definitely not good enough to
determine mileage.
The following were determined to be minimal sets of attributes
{cyl, fuelsys, comp, power, weight}
{size, fuelsys, comp, power, weight}
{size, fuelsys, displace, weight}
{size, cyl, fuelsys, power, weight}
{cyl, turbo, fuelsys, displace, comp, trans, weight}
{size, cyl, fuelsys, comp, weight}
{size, cyl, turbo, fuelsys, trans, weight}
86
View of Table After Attribute Reduction
•Best minimal set:
{size, fuelsys, displace, weight}
87
Set of Rules Produced from the Reduced Table
•Blanks represent "don't cares"
•CNo is the number of cases in the original table that support the given rule.
It provides a measure of the strength of confidence in the rule: the higher the
CNo, the more the rule is confirmed.
•DNo is the number of cases in the table with the same decision value
Interpreting Row 5,
if (fuelsys = EFI) and (displace = small)
then mileage = high
88
Applications
Speech recognition

The method of reduct computation was used to eliminate unnecessary spectral
frequencies and find the best representation for a group of spoken words
Medical domain


Analysis of records of patients who suffered from duodenal
ulcer (Pawlak, Slowinski & Slowinski, 1986)
Analysis of clinical data of patients with Cardiac valve
diseases (Abdalla S. A. Mohammed, 1991)
Architecture

Structural design optimization by obtaining characteristic
design rules from a database of existing designs and verified
performance data (Arciszewski et al, 1987; Arciszewski,
Ziarko, 1986)
89
Summary
The theory of rough sets is very good for handling qualitative,
imprecise data.
In this respect it is an improvement over probabilistic and
statistical methods.
Since the data are not converted to numbers but handled in qualitative form, set
theory is used to identify structural relationships.
The strength of the dependency of any set of condition
attributes on a decision attribute can be determined
numerically.
By forming minimal sets of attributes we can filter noisy or
irrelevant attributes.
Minimal sets also identify strong data patterns that help the
KB system form rules
90
References
“Rough Sets”, Zdislaw Pawlak, Kluwer Academic Publishers, 1991.
“Rough Sets as the Basis of a Learning System” – Chapter 2, pp. 5-13.
“An Application of the Rough Sets Model to Analysis of Data
Tables” – Chapter 3, pp. 15-29.
“The Discovery, Analysis, and Representation of Data Dependencies in
Databases”, Wojciech Ziarko, Knowledge Discovery in Databases by
Shapiro, Frawley, pp. 195-209.
“Applications of Rough Set Theory for Clinical Data Analysis: A Case Study”,
Abdalla S. A. Mohammed, Journal of Mathematical and Computer Modeling, Vol. 15,
No. 10, pp. 19-37, 1991.
“Intelligent Information Retrieval Using Rough Set Approximations”,
Padmini Srinivasan, Information Processing and Management, Vol. 25,
No. 4, pp. 347-361, 1989.
“Uncertainty Management”, Avelino J. Gonzalez, Douglas D. Dankel,
The Engineering of Knowledge-Base Systems, Chapter 8, pp. 232-262.
91
92
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 4 —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
A Data Mining Query Language (DMQL)
Motivation

A DMQL can provide the ability to support ad-hoc and interactive
data mining

By providing a standardized language like SQL

Hope to achieve an effect similar to the one SQL has had on relational databases

Foundation for system development and evolution

Facilitate information exchange, technology transfer,
commercialization and wide acceptance
Design

DMQL is designed with the primitives described earlier
103
Syntax for DMQL
Syntax for specification of

task-relevant data

the kind of knowledge to be mined

concept hierarchy specification

interestingness measure

pattern presentation and visualization
Putting it all together — a DMQL query
104
Syntax for task-relevant data specification
use database database_name, or use data warehouse
data_warehouse_name
from relation(s)/cube(s) [where condition]
in relevance to att_or_dim_list
order by order_list
group by grouping_list
having condition
105
Specification of task-relevant data
106
Syntax for specifying the kind of
knowledge to be mined
Characterization
Mine_Knowledge_Specification ::=
mine characteristics [as pattern_name]
analyze measure(s)
Discrimination
Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
107
Syntax for specifying the kind of
knowledge to be mined (cont.)
Classification
Mine_Knowledge_Specification ::=
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
 Prediction
Mine_Knowledge_Specification ::=
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}

108
Syntax for concept hierarchy specification
To specify what concept hierarchies to use
use hierarchy <hierarchy> for <attribute_or_dimension>
We use different syntax to define different type of hierarchies


schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
109
Syntax for concept hierarchy specification (Cont.)


operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5)
< all(age)
rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost)< $50
level_1: medium-profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) <= $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
110
Syntax for interestingness measure specification
Interestingness measures and thresholds can be
specified by the user with the statement:
with <interest_measure_name> threshold =
threshold_value
Example:
with support threshold = 0.05
with confidence threshold = 0.7
111
Syntax for pattern presentation and
visualization specification
We have syntax which allows users to specify the display of
discovered patterns in one or more forms
display as <result_form>
To facilitate interactive viewing at different concept level, the
following syntax is defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension
| drill down on attribute_or_dimension
| add attribute_or_dimension
| drop attribute_or_dimension
112
Putting it all together: the full specification of
a DMQL query
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchases P, items_sold S, works_at W, branch B
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
and P.cust_ID = C.cust_ID and P.method_paid = ``AmEx''
and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
and B.address = ``Canada" and I.price >= 100
with noise threshold = 0.05
display as table
113
Other Data Mining Languages &
Standardization Efforts
Association rule language specifications

MSQL (Imielinski & Virmani’99)

MineRule (Meo Psaila and Ceri’96)

Query flocks based on Datalog syntax (Tsur et al’98)
OLEDB for DM (Microsoft’2000)

Based on OLE, OLE DB, OLE DB for OLAP

Integrating DBMS, data warehouse and data mining
CRISP-DM (CRoss-Industry Standard Process for Data Mining)

Providing a platform and process structure for effective data mining

Emphasizing on deploying data mining technology to solve business
problems
114
120
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 5 —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different
classes
Mining descriptive statistical measures in large databases
Discussion
Summary
125
Data Generalization and
Summarization-based Characterization
Data generalization

A process which abstracts a large set of task-relevant data in a database from
low conceptual levels to higher ones.

Approaches:
 Data cube approach(OLAP approach)
 Attribute-oriented induction approach
[Figure: conceptual levels 1-5]
126
Characterization: Data Cube Approach
(without using AO-Induction)
Perform computations and store results in data cubes
Strength

An efficient implementation of data generalization

Computation of various kinds of measures


e.g., count( ), sum( ), average( ), max( )
Generalization and specialization can be performed on a data cube by roll-up
and drill-down
Limitations

handle only dimensions of simple nonnumeric data and measures of simple
aggregated numeric values.

Lack of intelligent analysis, can’t tell which dimensions should be used and
what levels should the generalization reach
127
Attribute-Oriented Induction
Proposed in 1989 (KDD ‘89 workshop)
Not confined to categorical data nor particular measures.
How is it done?
Collect the task-relevant data (the initial relation) using a relational
database query
Perform generalization by attribute removal or attribute generalization.
Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts.
Interactive presentation with users.
128
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and
the result is the initial relation.
Attribute-removal: remove attribute A if there is a large set of
distinct values for A but (1) there is no generalization operator
on A, or (2) A’s higher level concepts are expressed in terms of
other attributes.
Attribute-generalization: If there is a large set of distinct
values for A, and there exists a set of generalization operators
on A, then select an operator and generalize A.
Attribute-threshold control: typical 2-8, specified/default.
Generalized relation threshold control: control the final
relation/rule size. see example
Basic Algorithm for Attribute-Oriented Induction
InitialRel: Query processing of task-relevant data, deriving
the initial relation.
PreGen: Based on the analysis of the number of distinct
values in each attribute, determine generalization plan for
each attribute: removal? or how high to generalize?
PrimeGen: Based on the PreGen plan, perform
generalization to the right level to derive a “prime
generalized relation”, accumulating the counts.
Presentation: User interaction: (1) adjust levels by drilling,
(2) pivoting, (3) mapping into rules, cross tabs,
visualization presentations.
See Implementation
See example
See complexity
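A compact Python sketch of the InitialRel/PreGen/PrimeGen idea: remove or generalize attributes through simple concept hierarchies and merge identical generalized tuples while accumulating counts. The toy data, hierarchies, and thresholds are invented for illustration and are not the book's implementation.

```python
# Sketch of attribute-oriented induction: attribute removal/generalization plus
# count accumulation. Data, hierarchies, and rules below are illustrative only.
from collections import Counter

initial_relation = [
    {"name": "Jim",   "major": "CS",      "birth_place": "Vancouver, Canada", "gpa": 3.67},
    {"name": "Scott", "major": "CS",      "birth_place": "Montreal, Canada",  "gpa": 3.70},
    {"name": "Laura", "major": "Physics", "birth_place": "Seattle, USA",      "gpa": 3.83},
]

# Simple generalization operators (one level of a concept hierarchy each).
GENERALIZE = {
    "major": lambda v: "Science" if v in {"CS", "Physics", "Math"} else "Other",
    "birth_place": lambda v: v.split(",")[-1].strip(),          # city -> country
    "gpa": lambda v: "excellent" if v >= 3.75 else "very good",
}
REMOVE = {"name"}        # many distinct values and no generalization operator

def attribute_oriented_induction(relation):
    counts = Counter()
    for tup in relation:
        generalized = tuple(sorted(
            (a, GENERALIZE.get(a, lambda v: v)(v))
            for a, v in tup.items() if a not in REMOVE))
        counts[generalized] += 1             # merge identical generalized tuples
    return counts

for tup, count in attribute_oriented_induction(initial_relation).items():
    print(dict(tup), "count =", count)
# e.g. {'birth_place': 'Canada', 'gpa': 'very good', 'major': 'Science'} count = 2
```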
Example
DMQL: Describe general characteristics of graduate students
in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence, phone#,
gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
131
Class Characterization: An Example
Initial Relation:
Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization: Name removed; Gender retained; Major generalized to {Sci, Eng, Bus};
Birth-Place generalized to Country; Birth_date generalized to Age range;
Residence generalized to City; Phone # removed; GPA generalized to {Excl, VG, ...}

Prime Generalized Relation:
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Crosstab (Gender vs. Birth_Region):
Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62

See Principles    See Algorithm    See Implementation    See Analytical Characterization
Presentation of Generalized Results
Generalized relation:

Relations where some or all attributes are generalized, with counts or
other aggregation values accumulated.
Cross tabulation:

Mapping results into cross tabulation form (similar to contingency
tables).

Visualization techniques:

Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:

Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
$\forall x,\ grad(x) \wedge male(x) \Rightarrow birth\_region(x) = \text{``Canada''}\,[t:53\%] \vee birth\_region(x) = \text{``foreign''}\,[t:47\%]$
Presentation—Generalized Relation
134
Presentation—Crosstab
135
Implementation by Cube Technology
Construct a data cube on-the-fly for the given data mining
query



Facilitate efficient drill-down analysis
May increase the response time
A balanced solution: precomputation of “subprime” relation
Use a predefined & precomputed data cube



Construct a data cube beforehand
Facilitate not only the attribute-oriented induction, but also attribute
relevance analysis, dicing, slicing, roll-up and drill-down
Cost of cube computation and the nontrivial storage overhead
136
Characterization vs. OLAP
Similarity:

Presentation of data summarization at multiple levels of
abstraction.

Interactive drilling, pivoting, slicing and dicing.
Differences:

Automated desired level allocation.

Dimension relevance analysis and ranking when there
are many relevant dimensions.

Sophisticated typing on dimensions and measures.

Analytical characterization: data dispersion analysis.
137
Attribute Relevance Analysis
Why?




Which dimensions should be included?
How high level of generalization?
Automatic vs. interactive
Reduce # attributes; easy to understand patterns
What?

statistical method for preprocessing data




filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
relevance related to dimensions and levels
analytical characterization, analytical comparison
138
Attribute relevance analysis (cont’d)
How?
Data Collection
Analytical Generalization
Use information gain analysis (e.g., entropy or other measures) to identify
highly relevant dimensions and levels.
Relevance Analysis
Sort and select the most relevant dimensions and levels.
Attribute-oriented Induction for class description
On the selected dimensions/levels
OLAP operations (e.g., drilling, slicing) on relevance rules
139
Relevance Measures
Quantitative relevance measure determines the
classifying power of an attribute within a set of data.
Methods





information gain (ID3)
gain ratio (C4.5)
gini index
χ² contingency table statistics
uncertainty coefficient
140
Entropy and Information Gain
S contains si tuples of class Ci for i = {1, …, m}
Information measures info required to classify any
arbitrary tuple
$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
Entropy of attribute A with values {a1, a2, ..., av}
$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \ldots + s_{mj}}{s}\, I(s_{1j}, \ldots, s_{mj})$
Information gained by branching on attribute A
$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
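A short Python sketch of I(), E(A), and Gain(A); with the class counts used in the analytical characterization example later in this chapter it reproduces the quoted 0.9988, 0.7873, and 0.2115 figures (up to rounding).

```python
# Sketch of the information-gain computation (the partition counts below are the
# graduate/undergraduate "major" counts used in the worked example further on).
from math import log2

def I(*counts):
    """Expected information needed to classify a tuple, given class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def entropy(partitions):
    """E(A): information weighted over the partitions induced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * I(*p) for p in partitions)

def gain(class_counts, partitions):
    return I(*class_counts) - entropy(partitions)

# 120 graduate vs. 130 undergraduate students, partitioned by major:
# Science (84, 42), Engineering (36, 46), Business (0, 42).
major = [(84, 42), (36, 46), (0, 42)]
print(I(120, 130))              # ~0.9988
print(entropy(major))           # ~0.7873
print(gain((120, 130), major))  # ~0.2115
```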
143
Example: Analytical Characterization
Task

Mine general characteristics describing graduate students
using analytical characterization
Given
attributes name, gender, major, birth_place, birth_date, phone#, and gpa
Gen(ai) = concept hierarchies on ai
Ui = attribute analytical thresholds for ai
Ti = attribute generalization thresholds for ai
R = attribute relevance threshold
144
Example: Analytical Characterization (cont’d)
1. Data collection


target class: graduate student
contrasting class: undergraduate student
2. Analytical generalization using Ui

attribute removal


attribute generalization



remove name and phone#
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major, birth_country,
age_range and gpa
145
Example: Analytical characterization (2)
gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18
Candidate relation for the target class: Graduate students (Σ = 120)

gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24
Candidate relation for the contrasting class: Undergraduate students (Σ = 130)
146
Example: Analytical characterization (3)
3. Relevance analysis
Calculate the expected info required to classify an arbitrary tuple:
$I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$
Calculate the entropy of each attribute, e.g., major:
For major = "Science":      s11 = 84 (graduate students in "Science"), s21 = 42 (undergraduate students in "Science"), I(s11, s21) = 0.9183
For major = "Engineering":  s12 = 36, s22 = 46, I(s12, s22) = 0.9892
For major = "Business":     s13 = 0,  s23 = 42, I(s13, s23) = 0
147
Example: Analytical Characterization (4)
Calculate the expected info required to classify a given sample if S is
partitioned according to the attribute:
$E(major) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$
Calculate the information gain for each attribute:
$Gain(major) = I(s_1, s_2) - E(major) = 0.2115$
Information gain for all attributes:
Gain(gender)        = 0.0003
Gain(birth_country) = 0.0407
Gain(major)         = 0.2115
Gain(gpa)           = 0.4490
Gain(age_range)     = 0.5971
148
Example: Analytical characterization (5)
4. Initial working relation (W0) derivation
R = 0.1
remove irrelevant/weakly relevant attributes from the candidate relation
=> drop gender, birth_country
remove the contrasting-class candidate relation
major        age_range  gpa        count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18
Initial target class working relation W0: Graduate students
5. Perform attribute-oriented induction on W0 using Ti
149
Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different
classes
Mining descriptive statistical measures in large databases
Discussion
Summary
150
Mining Class Comparisons
Comparison: Comparing two or more classes.
Method:





Partition the set of relevant data into the target class and the contrasting
class(es)
Generalize both classes to the same high level concepts
Compare tuples with the same high level descriptions
Present for every tuple its description and two measures:
 support - distribution within single class
 comparison - distribution between classes
Highlight the tuples with strong discriminant features
Relevance Analysis:

Find attributes (features) which best distinguish different classes.
Example: Analytical comparison
Task


Compare graduate and undergraduate students using discriminant
rule.
DMQL query
use Big_University_DB
mine comparison as “grad_vs_undergrad_students”
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for “graduate_students”
where status in “graduate”
versus “undergraduate_students”
where status in “undergraduate”
analyze count%
from student
152
Example: Analytical comparison (2)
Given





attributes name, gender, major, birth_place, birth_date,
residence, phone# and gpa
Gen(ai) = concept hierarchies on attributes ai
Ui = attribute analytical thresholds for attributes ai
Ti = attribute generalization thresholds for attributes ai
R = attribute relevance threshold
153
Example: Analytical comparison (3)
1. Data collection

target and contrasting classes
2. Attribute relevance analysis

remove attributes name, gender, major, phone#
3. Synchronous generalization


controlled by user-specified dimension thresholds
prime target and contrasting class(es) relations/cuboids
154
Example: Analytical comparison (4)
Birth_country  Age_range  Gpa        Count%
Canada         20-25      Good       5.53%
Canada         25-30      Good       2.32%
Canada         Over_30    Very_good  5.86%
…              …          …          …
Other          Over_30    Excellent  4.68%
Prime generalized relation for the target class: Graduate students

Birth_country  Age_range  Gpa        Count%
Canada         15-20      Fair       5.53%
Canada         15-20      Good       4.53%
…              …          …          …
Canada         25-30      Good       5.02%
…              …          …          …
Other          Over_30    Excellent  0.68%
Prime generalized relation for the contrasting class: Undergraduate students
155
Example: Analytical comparison (5)
4. Drill down, roll up and other OLAP operations on
target and contrasting classes to adjust levels of
abstractions of resulting description
5. Presentation


as generalized relations, crosstabs, bar charts, pie charts, or
rules
contrasting measures to reflect comparison between target
and contrasting classes

e.g. count%
156
Quantitative Discriminant Rules
Cj = target class
qa = a generalized tuple that covers some tuples of the target class, but may
also cover some tuples of the contrasting class(es)
d-weight
range: [0, 1]
$d\text{-}weight = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$
quantitative discriminant rule form
$\forall X,\ target\_class(X) \Leftarrow condition(X)\ [d : d\_weight]$
157
Example: Quantitative Discriminant Rule
Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210
Count distribution between graduate and undergraduate students for a generalized tuple
Quantitative discriminant rule:
$\forall X,\ graduate\_student(X) \Leftarrow birth\_country(X) = \text{``Canada''} \wedge age\_range(X) = \text{``25-30''} \wedge gpa(X) = \text{``good''}\ [d : 30\%]$
where d-weight = 90 / (90 + 210) = 30%
158
Class Description
Quantitative characteristic rule
$\forall X,\ target\_class(X) \Rightarrow condition(X)\ [t : t\_weight]$
necessary
Quantitative discriminant rule
$\forall X,\ target\_class(X) \Leftarrow condition(X)\ [d : d\_weight]$
sufficient
Quantitative description rule
$\forall X,\ target\_class(X) \Leftrightarrow condition_1(X)\ [t : w_1, d : w'_1] \vee \ldots \vee condition_n(X)\ [t : w_n, d : w'_n]$
necessary and sufficient
159
Example: Quantitative Description Rule
Location/item   TV                      Computer                Both_items
                Count  t-wt    d-wt     Count  t-wt    d-wt     Count  t-wt   d-wt
Europe          80     25%     40%      240    75%     30%      320    100%   32%
N_Am            120    17.65%  60%      560    82.35%  70%      680    100%   68%
Both_regions    200    20%     100%     800    80%     100%     1000   100%   100%
Crosstab showing the associated t-weight and d-weight values and the total
number (in thousands) of TVs and computers sold at AllElectronics in 1998
Quantitative description rule for target class Europe:
$\forall X,\ Europe(X) \Leftrightarrow (item(X) = \text{``TV''})\ [t : 25\%, d : 40\%] \vee (item(X) = \text{``computer''})\ [t : 75\%, d : 30\%]$
160
Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different
classes
Mining descriptive statistical measures in large databases
Discussion
Summary
161
Mining Data Dispersion Characteristics
Motivation

To better understand the data: central tendency, variation and spread
Data dispersion characteristics

median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals

Data dispersion: analyzed with multiple granularities of precision

Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures

Folding measures into numerical dimensions

Boxplot or quantile analysis on the transformed cube
162
Comparison of Entire vs. Factored Version Space
177
Incremental and Parallel Mining of
Concept Description
Incremental mining: revision based on newly added data ΔDB
Generalize ΔDB to the same level of abstraction as the generalized relation R to
derive ΔR
Union R ∪ ΔR, i.e., merge counts and other statistical information to produce a
new relation R'
Similar philosophy can be applied to data sampling,
parallel and/or distributed mining, etc.
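A tiny sketch of the merge step: the generalized increment ΔR is unioned with R by adding the counts of identical generalized tuples; the names and tuples are illustrative, not the book's code.

```python
# Sketch of incremental refresh: merge a generalized relation R with the
# generalized increment delta_R by summing counts of identical tuples.
from collections import Counter

R       = Counter({("Science", "Canada", "20-25"): 16, ("Science", "Foreign", "25-30"): 22})
delta_R = Counter({("Science", "Canada", "20-25"): 3,  ("Engineering", "Canada", "20-25"): 5})

R_new = R + delta_R        # union with accumulated counts
print(R_new)
# Counter({('Science', 'Foreign', '25-30'): 22, ('Science', 'Canada', '20-25'): 19, ...})
```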
178
Chapter 5: Concept Description:
Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different
classes
Mining descriptive statistical measures in large databases
Discussion
Summary
179
Summary
Concept description: characterization and discrimination
OLAP-based vs. attribute-oriented induction
Efficient implementation of AOI
Analytical characterization and comparison
Mining descriptive statistical measures in large databases
Discussion

Incremental and parallel mining of description

Descriptive mining of complex types of data
180
References
Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in relational databases. In G.
Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 213-228. AAAI/MIT Press, 1991.
S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM
SIGMOD Record, 26:65-74, 1997
C. Carter and H. Hamilton. Efficient attribute-oriented generalization for knowledge
discovery from large databases. IEEE Trans. Knowledge and Data Engineering, 10:193-208,
1998.
W. Cleveland. Visualizing Data. Hobart Press, Summit NJ, 1993.
J. L. Devore. Probability and Statistics for Engineering and the Science, 4th ed. Duxbury
Press, 1995.
T. G. Dietterich and R. S. Michalski. A comparative review of selected methods for learning
from examples. In Michalski et al., editor, Machine Learning: An Artificial Intelligence
Approach, Vol. 1, pages 41-82. Morgan Kaufmann, 1983.
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H.
Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and
sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
J. Han, Y. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational
databases. IEEE Trans. Knowledge and Data Engineering, 5:29-40, 1993.
181
References (cont.)
J. Han and Y. Fu. Exploration of the power of attribute-oriented induction in data mining. In
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in
Knowledge Discovery and Data Mining, pages 399-421. AAAI/MIT Press, 1996.
R. A. Johnson and D. A. Wichern. Applied Multivariate Statistical Analysis, 3rd ed. Prentice
Hall, 1992.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets.
VLDB'98, New York, NY, Aug. 1998.
H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer
Academic Publishers, 1998.
R. S. Michalski. A theory and methodology of inductive learning. In Michalski et al., editor,
Machine Learning: An Artificial Intelligence Approach, Vol. 1, Morgan Kaufmann, 1983.
T. M. Mitchell. Version spaces: A candidate elimination approach to rule learning. IJCAI'97,
Cambridge, MA.
T. M. Mitchell. Generalization as search. Artificial Intelligence, 18:203-226, 1982.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
D. Subramanian and J. Feigenbaum. Factorization in experiment generation. AAAI'86,
Philadelphia, PA, Aug. 1986.
182
http://www.cs.sfu.ca/~han/dmbook
Thank you !!!
183
184
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 6 —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 6: Mining Association
Rules in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
186
What Is Association Mining?
Association rule mining:

Finding frequent patterns, associations, correlations, or causal structures
among sets of items or objects in transaction databases, relational
databases, and other information repositories.
Applications:

Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, clustering, classification, etc.
Examples.



Rule form: "Body → Head [support, confidence]".
buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%]
187
Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each transaction is a list
of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with
that of another set of items

E.g., 98% of people who purchase tires and auto accessories also get
automotive services done
Applications




* → Maintenance Agreement (What should the store do to boost Maintenance
Agreement sales?)
Home Electronics → * (What other products should the store stock up on?)
Attached mailing in direct marketing
Detecting "ping-pong"ing of patients, faulty "collisions"
188
Rule Measures: Support and
Confidence
[Venn diagram: customers who buy beer, customers who buy diapers, customers who
buy both]
Find all the rules X & Y ⇒ Z with minimum confidence and support
support, s: probability that a transaction contains X ∪ Y ∪ Z
confidence, c: conditional probability that a transaction having X ∪ Y also
contains Z

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

With minimum support 50% and minimum confidence 50%, we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
189
Association Rule Mining: A Road Map
Boolean vs. quantitative associations (Based on the types of values handled)
buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]

Single dimension vs. multiple dimensional associations (see ex. Above)
Single level vs. multiple-level analysis

What brands of beers are associated with what brands of diapers?
Various extensions

Correlation, causality analysis

Association does not necessarily imply correlation or causality

Maxpatterns and closed itemsets
 Constraints enforced

E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
190
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
191
Mining Association Rules—An Example
Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Min. support 50%
Min. confidence 50%

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
192
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent
itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
Use the frequent itemsets to generate association rules.
193
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot
be a subset of a frequent k-itemset
Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
194
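A runnable Python rendering of the level-wise loop above; a sketch only, assuming transactions are given as sets and min_support is an absolute count (the name apriori is illustrative).

from itertools import combinations

def apriori(transactions, min_support):
    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    count = lambda c: sum(c <= t for t in transactions)
    Lk = {c for c in items if count(c) >= min_support}
    frequent = {}
    while Lk:
        frequent.update({c: count(c) for c in Lk})
        k = len(next(iter(Lk))) + 1
        # join step: unions of L(k-1) itemsets that yield a k-itemset
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must be in L(k-1)
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c in candidates if count(c) >= min_support}
    return frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_support=2))   # includes frozenset({2, 3, 5}): 2, matching the next slide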
The Apriori Algorithm — Example
Database D (min. support = 50%, i.e., 2 transactions):

TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (from L2): {2 3 5}
Scan D → L3: {2 3 5}:2
195
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
insert into Ck
select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
196
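The join and prune steps can also be written directly in Python over sorted tuples. This sketch mirrors the pseudo-SQL above and reproduces the L3 example shown two slides later (apriori_gen is an illustrative name).

from itertools import combinations

def apriori_gen(L_prev):
    # L_prev: the (k-1)-itemsets, each a sorted tuple of the same length
    L_set = set(L_prev)
    k = len(L_prev[0]) + 1
    Ck = []
    for p in L_prev:
        for q in L_prev:
            # join: agree on the first k-2 items, p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in L_set for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3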
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?


The total number of candidates can be very huge
One transaction may contain many candidates
Method:

Candidate itemsets are stored in a hash-tree

Leaf node of hash-tree contains a list of itemsets and counts

Interior node contains a hash table

Subset function: finds all the candidates contained in a transaction
197
Example of Generating Candidates
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3

abcd from abc and abd

acde from acd and ace
Pruning:

acde is removed because ade is not in L3
C4={abcd}
198
Methods to Improve Apriori’s Efficiency
Hash-based itemset counting: A k-itemset whose corresponding hashing
bucket count is below the threshold cannot be frequent
Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
Sampling: mining on a subset of given data, lower support threshold + a
method to determine the completeness
Dynamic itemset counting: add new candidate itemsets only when all of
their subsets are estimated to be frequent
199
Is Apriori Fast Enough? — Performance
Bottlenecks
The core of the Apriori algorithm:


Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
Use database scan and pattern matching to collect counts for the
candidate itemsets
The bottleneck of Apriori: candidate generation
Huge candidate sets:
10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
Multiple scans of database:
Needs (n + 1) scans, where n is the length of the longest pattern
200
Mining Frequent Patterns Without
Candidate Generation
Compress a large database into a compact, FrequentPattern tree (FP-tree) structure

highly condensed, but complete for frequent pattern mining

avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern
mining method

A divide-and-conquer methodology: decompose mining
tasks into smaller ones

Avoid candidate generation: sub-database test only!
201
Construct FP-tree from a Transaction DB
TID    Items bought                   (ordered) frequent items
100    {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200    {a, b, c, f, l, m, o}          {f, c, a, b, m}
300    {b, f, h, j, o}                {f, b}
400    {b, c, k, s, p}                {c, b, p}
500    {a, f, c, e, l, p, m, n}       {f, c, a, m, p}

min_support = 0.5 (i.e., an item must appear in at least 3 of the 5 transactions)

Steps:
1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Order frequent items in frequency descending order
3. Scan DB again, construct the FP-tree

Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

(Figure: the resulting FP-tree rooted at {}, with branches f:4 → c:3 → a:3 → m:2 → p:2, f:4 → c:3 → a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1; the header table's node-links chain together all nodes carrying the same item. A construction sketch in Python follows.)
202
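The two-scan construction above can be sketched in a few dozen lines of Python. This is illustrative code rather than the book's implementation; ties in the frequency ordering are broken alphabetically here, so the f/c order differs from the figure, but any fixed order yields a valid FP-tree.

from collections import defaultdict

class Node:
    # FP-tree node: item label, count, parent pointer, children, node-link
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children, self.link = {}, None

def build_fp_tree(transactions, min_count):
    # Scan 1: count items, keep the frequent ones in frequency-descending order
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root, header = Node(None, None), {item: None for item in order}
    # Scan 2: insert each transaction's frequent items in that order
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=lambda i: rank[i]):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                child.link, header[item] = header[item], child   # thread the node-link
            else:
                child.count += 1
            node = child
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, min_count=3)   # min_support 0.5 of 5 transactions -> 3

# Walk p's node-links to read off its prefix paths (its conditional pattern base).
node = header["p"]
while node is not None:
    path, cur = [], node.parent
    while cur.item is not None:
        path.append(cur.item)
        cur = cur.parent
    print("".join(reversed(path)), ":", node.count)
    node = node.link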
Benefits of the FP-tree Structure
Completeness:


never breaks a long pattern of any transaction
preserves complete information for frequent pattern mining
Compactness




reduce irrelevant information—infrequent items are gone
frequency descending ordering: more frequent items are more
likely to be shared
never larger than the original database (not counting node-links and counts)
Example: For Connect-4 DB, compression ratio could be over
100
203
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)

Recursively grow frequent pattern path using the FP-tree
Method



For each item, construct its conditional pattern-base, and
then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one
path (single path will generate all the combinations of its sub-paths,
each of which is a frequent pattern)
204
Major Steps to Mine FP-tree
1) Construct conditional pattern base for each node in
the FP-tree
2) Construct conditional FP-tree from each conditional
pattern-base
3) Recursively mine conditional FP-trees and grow
frequent patterns obtained so far

If the conditional FP-tree contains a single path, simply
enumerate all the patterns
205
Step 1: From FP-tree to Conditional
Pattern Base
Starting at the frequent header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item
Accumulate all transformed prefix paths of that item to form a
conditional pattern base
Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

(Figure: the same FP-tree as before; following an item's node-links from the header table yields its prefix paths.)

Conditional pattern bases:

item    cond. pattern base
c       f:3
a       fc:3
b       fca:1, f:1, c:1
m       fca:2, fcab:1
p       fcam:2, cb:1
206
Properties of FP-tree for Conditional
Pattern Base Construction
Node-link property

For any frequent item ai, all the possible frequent patterns
that contain ai can be obtained by following ai's node-links,
starting from ai's head in the FP-tree header
Prefix path property

To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.
207
Step 2: Construct Conditional FPtree
For each pattern-base


Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base
m-conditional pattern base: fca:2, fcab:1

Accumulating counts over this base gives f:3, c:3, a:3 (b:1 falls below min_support), so the m-conditional FP-tree is the single path:
{} → f:3 → c:3 → a:3

All frequent patterns concerning m:
m,
fm, cm, am,
fcm, fam, cam,
fcam
208
Mining Frequent Patterns by Creating
Conditional Pattern-Bases
Item   Conditional pattern-base        Conditional FP-tree
p      {(fcam:2), (cb:1)}              {(c:3)}|p
m      {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)}|m
b      {(fca:1), (f:1), (c:1)}         Empty
a      {(fc:3)}                        {(f:3, c:3)}|a
c      {(f:3)}                         {(f:3)}|c
f      Empty                           Empty
209
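The divide-and-conquer recursion can be shown without the tree data structure by carrying conditional pattern bases around as (itemset, count) pairs. A real FP-growth stores these bases compactly in conditional FP-trees, but the recursion below is the same idea and reproduces the table above (fp_growth is an illustrative name; min_count = 3 as before).

from collections import defaultdict

def fp_growth(weighted_db, min_count, suffix=()):
    # weighted_db: list of (items, count) pairs, i.e., a (conditional) pattern base
    counts = defaultdict(int)
    for items, w in weighted_db:
        for i in set(items):
            counts[i] += w
    frequent = [i for i in sorted(counts, key=lambda i: (-counts[i], i))
                if counts[i] >= min_count]
    rank = {i: r for r, i in enumerate(frequent)}
    for item in frequent:
        pattern = suffix + (item,)
        yield pattern, counts[item]
        # conditional pattern base of `item`: higher-ranked items of entries containing it
        cond = []
        for items, w in weighted_db:
            if item in items:
                prefix = tuple(i for i in items if i in rank and rank[i] < rank[item])
                if prefix:
                    cond.append((prefix, w))
        yield from fp_growth(cond, min_count, pattern)

db = [(("f","c","a","m","p"), 1), (("f","c","a","b","m"), 1), (("f","b"), 1),
      (("c","b","p"), 1), (("f","c","a","m","p"), 1)]
for pattern, count in fp_growth(db, min_count=3):
    print(pattern, count)     # e.g. ('m',) 3 ... ('m', 'f', 'c', 'a') 3 ... ('p', 'c') 3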
Step 3: Recursively mine the
conditional FP-tree
The m-conditional FP-tree is the path {} → f:3 → c:3 → a:3; mine it recursively:

Cond. pattern base of “am”: (fc:3)   →  am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3)    →  cm-conditional FP-tree: {} → f:3
Cond. pattern base of “cam”: (f:3)   →  cam-conditional FP-tree: {} → f:3
210
Single FP-tree Path Generation
Suppose an FP-tree T has a single path P
The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent patterns concerning m are:
m,
fm, cm, am,
fcm, fam, cam,
fcam
211
Principles of Frequent Pattern Growth
Pattern growth property
Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
“abcdef” is a frequent pattern, if and only if
“abcde” is a frequent pattern, and
“f” is frequent in the set of transactions containing “abcde”
212
Why Is Frequent Pattern Growth Fast?
Our performance study shows

FP-growth is an order of magnitude faster than Apriori,
and is also faster than tree-projection
Reasoning

No candidate generation, no candidate test

Use compact data structure

Eliminate repeated database scan

Basic operation is counting and FP-tree building
213
FP-growth vs. Apriori: Scalability With the
Support Threshold
(Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime.)
214
FP-growth vs. Tree-Projection: Scalability
with Support Threshold
(Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K, comparing D2 FP-growth and D2 TreeProjection.)
215
Presentation of Association Rules
(Table Form )
216
Visualization of Association Rule Using Plane Graph
217
Visualization of Association Rule Using Rule Graph
218
Iceberg Queries
Iceberg query: compute aggregates over one attribute or a set of attributes only for those groups whose aggregate values are above a certain threshold
Example:
select P.custID, P.itemID, sum(P.qty)
from purchase P
group by P.custID, P.itemID
having sum(P.qty) >= 10
Compute iceberg queries efficiently by Apriori:


First compute lower dimensions
Then compute higher dimensions only when all the lower ones are above
the threshold
219
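A small Python sketch of the Apriori-style strategy described above, on made-up purchase rows (not data from the slides): compute the one-dimensional totals first, and only accumulate (custID, itemID) pairs whose one-dimensional totals already reach the threshold, since the pair's sum can never exceed either of them.

from collections import defaultdict

purchase = [("c1", "i1", 6), ("c1", "i1", 5), ("c1", "i2", 3),
            ("c2", "i1", 2), ("c2", "i2", 12), ("c3", "i3", 4)]
threshold = 10

by_cust, by_item = defaultdict(int), defaultdict(int)
for c, i, q in purchase:                      # lower dimensions first
    by_cust[c] += q
    by_item[i] += q

by_pair = defaultdict(int)
for c, i, q in purchase:                      # higher dimension only where it can still qualify
    if by_cust[c] >= threshold and by_item[i] >= threshold:
        by_pair[(c, i)] += q                  # c3's rows are pruned here

print({k: v for k, v in by_pair.items() if v >= threshold})
# {('c1', 'i1'): 11, ('c2', 'i2'): 12}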
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
220
Multiple-Level Association Rules
Items often form hierarchy.
Items at the lower level are
expected to have lower support.
Rules regarding itemsets at
appropriate levels could be quite
useful.
Transaction database can be
encoded based on dimensions
and levels
We can explore shared multilevel mining
(Figure: item hierarchy with Food at the top, milk and bread below it, then e.g. skim and 2% milk, wheat and white bread, and brand-level items such as Fraser and Sunset.)

Encoded transaction table:

TID    Items
T1     {111, 121, 211, 221}
T2     {111, 211, 222, 323}
T3     {112, 122, 221, 411}
T4     {111, 121}
T5     {111, 122, 211, 221, 413}
221
Mining Multi-Level Associations
A top_down, progressive deepening approach:


First find high-level strong rules:
milk ⇒ bread [20%, 60%].
Then find their lower-level “weaker” rules:
2% milk ⇒ wheat bread [6%, 50%].
Variations of mining multiple-level association rules:
Level-crossed association rules:
2% milk ⇒ Wonder wheat bread
Association rules with multiple, alternative hierarchies:
2% milk ⇒ Wonder bread
222
Multi-level Association: Uniform Support vs.
Reduced Support
Uniform Support: the same minimum support for all levels

+ One minimum support threshold. No need to examine itemsets
containing any item whose ancestors do not have minimum support.

– Lower level items do not occur as frequently. If the support threshold is
too high ⇒ miss low-level associations
too low ⇒ generate too many high-level associations
Reduced Support: reduced minimum support at lower levels

There are 4 search strategies:




Level-by-level independent
Level-cross filtering by k-itemset
Level-cross filtering by single item
Controlled level-cross filtering by single item
223
Uniform Support
Multi-level mining with uniform support
Level 1
min_sup = 5%
Level 2
min_sup = 5%
Milk [support = 10%]
2% Milk [support = 6%]     Skim Milk [support = 4%]
With min_sup = 5% at both levels, 2% Milk is frequent but Skim Milk is not.
224
Reduced Support
Multi-level mining with reduced support
Level 1
min_sup = 5%
Level 2
min_sup = 3%
Milk [support = 10%]
2% Milk [support = 6%]     Skim Milk [support = 4%]
With min_sup = 5% at level 1 and 3% at level 2, both 2% Milk and Skim Milk are frequent.
225
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to “ancestor”
relationships between items.
Example


milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the
“expected” value, based on the rule’s ancestor.
226
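One way to read the “expected” value concretely: if, say, 2% milk accounts for a quarter of all milk sold (an assumed figure for illustration), the ancestor rule predicts roughly 8% x 1/4 = 2% support and about the same 70% confidence for the descendant, so the observed [2%, 72%] adds nothing new. A toy check in Python, with illustrative tolerances:

ancestor_support, ancestor_conf = 0.08, 0.70      # milk => wheat bread
descendant_support, descendant_conf = 0.02, 0.72  # 2% milk => wheat bread
share_of_parent = 0.25                            # assumed: 2% milk is 1/4 of milk sales

expected_support = ancestor_support * share_of_parent           # 0.02
redundant = (abs(descendant_support - expected_support) < 0.005
             and abs(descendant_conf - ancestor_conf) < 0.05)   # illustrative tolerances
print(expected_support, redundant)                              # 0.02 True -> filter the rule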
Multi-Level Mining: Progressive Deepening
A top-down, progressive deepening approach:


First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level “weaker” frequent itemsets:
2% milk (5%), wheat bread (4%)
Different min_support threshold across multi-levels
lead to different algorithms:

If adopting the same min_support across multi-levels
then toss t if any of t’s ancestors is infrequent.

If adopting reduced min_support at lower levels
then examine only those descendants whose ancestor's support is frequent/non-negligible.
227
Progressive Refinement of
Data Mining Quality
Why progressive refinement?

Mining operator can be expensive or cheap, fine or rough

Trade speed with quality: step-by-step refinement.
Superset coverage property:

Preserve all the positive answers: allow a false positive test but not a false negative test.
Two- or multi-step mining:

First apply rough/cheap operator (superset coverage)

Then apply expensive algorithm on a substantially reduced
candidate set (Koperski & Han, SSD’95).
228
Progressive Refinement Mining of
Spatial Association Rules
Hierarchy of spatial relationship:


“g_close_to”: near_by, touch, intersect, contain, etc.
First search for rough relationship and then refine it.
Two-step mining of spatial association:

Step 1: rough spatial computation (as a filter)


Using MBR or R-tree for rough estimation.
Step 2: Detailed spatial algorithm (as refinement)

Apply only to those objects which have passed the rough spatial
association test (no less than min_support)
229
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
230
Multi-Dimensional Association: Concepts
Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension association rules (no repeated predicates)
age(X, ”19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
Hybrid-dimension association rules (repeated predicates)
age(X, ”19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
Categorical Attributes

finite number of possible values, no ordering among values
Quantitative Attributes

numeric, implicit ordering among values
231
Techniques for Mining MD Associations
Search for frequent k-predicate set:


Example: {age, occupation, buys} is a 3-predicate set.
Techniques can be categorized by how quantitative attributes, such as age, are treated.
1. Using static discretization of quantitative attributes

Quantitative attributes are statically discretized by using
predefined concept hierarchies.
2. Quantitative association rules

Quantitative attributes are dynamically discretized into
“bins” based on the distribution of the data.
3. Distance-based association rules

This is a dynamic discretization process that considers the
distance between data points.
232
Static Discretization of Quantitative Attributes
Discretized prior to mining using concept hierarchy.
Numeric values are replaced by ranges.
In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans.
Data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, and mining from data cubes can be much faster.

(Figure: the lattice of cuboids (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys).)
233
Quantitative Association Rules
Numeric attributes are dynamically discretized

Such that the confidence or compactness of the rules mined is
maximized.
2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
Cluster “adjacent” association rules to form general rules using a 2-D grid.
Example:
age(X, ”30-34”) ∧ income(X, ”24K - 48K”) ⇒ buys(X, ”high resolution TV”)
234
ARCS (Association Rule Clustering System)
How does ARCS work?
1. Binning
2. Find frequent predicate sets
3. Clustering
4. Optimize
235
Limitations of ARCS
Only quantitative attributes on LHS of rules.
Only 2 attributes on LHS. (2D limitation)
An alternative to ARCS

Non-grid-based

equi-depth binning

clustering based on a measure of partial completeness.

“Mining Quantitative Association Rules in Large Relational Tables”
by R. Srikant and R. Agrawal.
236
Mining Distance-based Association Rules
Binning methods do not capture the semantics of interval data
Price($): 7, 20, 22, 50, 51, 53

Equi-width (width $10):  [0,10], [11,20], [21,30], [31,40], [41,50], [51,60]
Equi-depth (depth 2):    [7,20], [22,50], [51,53]
Distance-based:          [7,7], [20,22], [50,53]
Distance-based partitioning, more meaningful discretization
considering:


density/number of points in an interval
“closeness” of points in an interval
237
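The three partitionings of the price list can be compared with a short sketch. Equi-width here uses half-open bins anchored at 0, so the exact contents differ slightly from the closed intervals shown above, but the contrast with equi-depth is the point.

def equi_width(values, width):
    bins = {}
    for v in values:
        bins.setdefault(v // width, []).append(v)   # bin b covers [b*width, (b+1)*width)
    return [sorted(vs) for _, vs in sorted(bins.items())]

def equi_depth(values, depth):
    vs = sorted(values)
    return [vs[i:i + depth] for i in range(0, len(vs), depth)]

prices = [7, 20, 22, 50, 51, 53]
print(equi_width(prices, 10))   # [[7], [20, 22], [50, 51, 53]]
print(equi_depth(prices, 2))    # [[7, 20], [22, 50], [51, 53]]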
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
240
Interestingness Measurements
Objective measures
Two popular measurements:
 support; and
 confidence
Subjective measures (Silberschatz & Tuzhilin,
KDD95)
A rule (pattern) is interesting if
 it is unexpected (surprising to the user); and/or
 actionable (the user can do something with it)
241
Criticism to Support and Confidence
Example 1: (Aggarwal & Yu, PODS98)



Among 5000 students
  3000 play basketball
  3750 eat cereal
  2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

              basketball   not basketball   sum(row)
cereal        2000         1750             3750
not cereal    1000         250              1250
sum(col.)     3000         2000             5000
242
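The numbers in the table can be turned into a correlation check directly; a lift below 1 flags the negative correlation that support and confidence alone hide.

n = 5000
basketball, cereal, both = 3000, 3750, 2000           # counts from the table above

support = both / n                                    # 0.40
confidence = both / basketball                        # 0.667
lift = confidence / (cereal / n)                      # P(cereal | basketball) / P(cereal)
print(support, round(confidence, 3), round(lift, 3))  # 0.4 0.667 0.889 -> < 1: negatively correlated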
Criticism to Support and Confidence (Cont.)
X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Example 2:
X and Y: positively correlated
X and Z: negatively related
yet the support and confidence of X ⇒ Z dominate

Rule     Support   Confidence
X ⇒ Y    25%       50%
X ⇒ Z    37.50%    75%

We need a measure of dependent or correlated events:
corr(A, B) = P(A ∪ B) / (P(A) P(B))
P(B|A) / P(B) is also called the lift of rule A ⇒ B
243
Other Interestingness Measures: Interest
Interest (correlation, lift): P(A ∪ B) / (P(A) P(B))
takes both P(A) and P(B) into consideration
P(A ∪ B) = P(A) P(B) if A and B are independent events
A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

X  1 1 1 1 0 0 0 0
Y  1 1 0 0 0 0 0 0
Z  0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.50%    0.9
Y,Z       12.50%    0.57
244
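The interest values in the table follow directly from the eight rows; a quick check (the slide's 0.9 is 0.857 rounded):

rows = [(1,1,0), (1,1,1), (1,0,1), (1,0,1), (0,0,1), (0,0,1), (0,0,1), (0,0,1)]  # columns of X, Y, Z

def p(*idx):
    # probability that all listed variables are 1
    return sum(all(r[i] for i in idx) for r in rows) / len(rows)

X, Y, Z = 0, 1, 2
for a, b, name in [(X, Y, "X,Y"), (X, Z, "X,Z"), (Y, Z, "Y,Z")]:
    print(name, p(a, b), round(p(a, b) / (p(a) * p(b)), 2))
# X,Y 0.25 2.0    X,Z 0.375 0.86    Y,Z 0.125 0.57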
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
245
Constraint-Based Mining
Interactive, exploratory mining of gigabytes of data?

Could it be real? — Making good use of constraints!
What kinds of constraints can be used in mining?


Knowledge type constraint: classification, association, etc.
Data constraint: SQL-like queries
  Find product pairs sold together in Vancouver in Dec. ’98.
Dimension/level constraints:
  in relevance to region, price, brand, customer category.
Rule constraints:
  small sales (price < $10) triggers big sales (sum > $200).
Interestingness constraints:
  strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
246
Rule Constraints in Association Mining
Two kinds of rule constraints:

Rule form constraints: meta-rule guided mining.


P(x, y) ^ Q(x, w) ⇒ takes(x, “database systems”).
Rule (content) constraint: constraint-based query
optimization (Ng, et al., SIGMOD’98).

sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
1-variable vs. 2-variable constraints (Lakshmanan, et al.
SIGMOD’99):


1-var: A constraint confining only one side (L/R) of the rule,
e.g., as shown above.
2-var: A constraint confining both sides (L and R).

sum(LHS) < min(RHS) ^ max(RHS) < 5* sum(LHS)
247
Constrain-Based Association Query
Database: (1) trans (TID, Itemset ), (2) itemInfo (Item, Type, Price)
A constrained asso. query (CAQ) is in the form of {(S1, S2 )|C },

where C is a set of constraints on S1, S2 including frequency constraint
A classification of (single-variable) constraints:
  Class constraint: S ⊆ A. e.g. S ⊆ Item
  Domain constraint:
    S θ v, θ ∈ { =, ≠, <, ≤, >, ≥ }. e.g. S.Price < 100
    v θ S, θ is ∈ or ∉. e.g. snacks ∉ S.Type
    V θ S, or S θ V, θ ∈ { ⊆, ⊂, ⊄, =, ≠ }
      e.g. {snacks, sodas} ⊆ S.Type
  Aggregation constraint: agg(S) θ v, where agg is in {min, max, sum, count, avg}, and θ ∈ { =, ≠, <, ≤, >, ≥ }.
    e.g. count(S1.Type) = 1, avg(S2.Price) ≥ 100
248
Constrained Association Query
Optimization Problem
Given a CAQ = { (S1, S2) | C }, the algorithm should be:
  sound: it only finds frequent sets that satisfy the given constraints C
  complete: all frequent sets satisfying the given constraints C are found
A naïve solution:

Apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.
Our approach:

Comprehensive analysis of the properties of constraints
and try to push them as deeply as possible inside the
frequent set computation.
249
Anti-monotone and Monotone Constraints
A constraint Ca is anti-monotone iff. for any pattern S
not satisfying Ca, none of the super-patterns of S can
satisfy Ca
A constraint Cm is monotone iff. for any pattern S
satisfying Cm, every super-pattern of S also satisfies it
250
Succinct Constraint
A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator
SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I, s.t. SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
A constraint Cs is succinct provided SAT_Cs(I) is a succinct power set
251
Convertible Constraint
Suppose all items in patterns are listed in a total
order R
A constraint C is convertible anti-monotone iff a
pattern S satisfying the constraint implies that each
suffix of S w.r.t. R also satisfies C
A constraint C is convertible monotone iff a pattern S
satisfying the constraint implies that each pattern of
which S is a suffix w.r.t. R also satisfies C
252
Relationships Among Categories of Constraints
Succinctness
Anti-monotonicity
Monotonicity
Convertible constraints
Inconvertible constraints
253
Property of Constraints: Anti-Monotone
Anti-monotonicity: If a set S violates the constraint,
any superset of S violates the constraint.
Examples:
sum(S.Price) ≤ v is anti-monotone
sum(S.Price) ≥ v is not anti-monotone
sum(S.Price) = v is partly anti-monotone
Application:
Push “sum(S.Price) ≤ 1000” deeply into iterative frequent set computation (see the sketch after this slide).
254
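A sketch of what “pushing” the constraint means: once a candidate violates sum(S.Price) <= 1000, no superset of it is ever generated. Prices and items are made up for illustration, and support counting is omitted to keep the pruning logic visible.

prices = {"a": 300, "b": 500, "c": 700, "d": 900}   # illustrative item prices

def satisfies(itemset):
    # the anti-monotone constraint sum(S.Price) <= 1000
    return sum(prices[i] for i in itemset) <= 1000

level = [frozenset([i]) for i in prices if satisfies([i])]
while level:
    print(sorted(sorted(s) for s in level))
    nxt = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
    level = [s for s in nxt if satisfies(s)]     # violating sets are dropped before counting
# [['a'], ['b'], ['c'], ['d']]
# [['a', 'b'], ['a', 'c']]   <- supersets of dropped sets are never generated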
Characterization of
Anti-Monotonicity Constraints
Constraint                          Anti-monotone?
S θ v, θ ∈ { =, ≤, ≥ }              yes
v ∈ S                               no
S ⊇ V                               no
S ⊆ V                               yes
S = V                               partly
min(S) ≤ v                          no
min(S) ≥ v                          yes
min(S) = v                          partly
max(S) ≤ v                          yes
max(S) ≥ v                          no
max(S) = v                          partly
count(S) ≤ v                        yes
count(S) ≥ v                        no
count(S) = v                        partly
sum(S) ≤ v                          yes
sum(S) ≥ v                          no
sum(S) = v                          partly
avg(S) θ v, θ ∈ { =, ≤, ≥ }         convertible
(frequent constraint)               (yes)
255
Example of Convertible Constraints:
avg(S) ≥ v
Let R be the value-descending order over the set of items
  E.g. I = {9, 8, 6, 4, 3, 1}
avg(S) ≥ v is convertible monotone w.r.t. R
  If S is a suffix of S1, avg(S1) ≥ avg(S)
    {8, 4, 3} is a suffix of {9, 8, 4, 3}
    avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5
  If S satisfies avg(S) ≥ v, so does S1
    {8, 4, 3} satisfies the constraint avg(S) ≥ 4, so does {9, 8, 4, 3}
256
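A small numeric check of the convertibility argument, using the item values from this slide (avg is written out for clarity):

items = [9, 8, 6, 4, 3, 1]        # listed in the value-descending order R
v = 4

def avg(s):
    return sum(s) / len(s)

S  = [8, 4, 3]                    # a suffix of S1 w.r.t. R
S1 = [9, 8, 4, 3]
assert set(S1) <= set(items)      # both patterns draw from I

print(avg(S), avg(S1))            # 5.0 6.0 -> extending a suffix to the left cannot lower the average
print(avg(S) >= v, avg(S1) >= v)  # True True -> if the suffix satisfies avg(S) >= v, so does S1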
Property of Constraints:
Succinctness
Succinctness:


For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1
Example:
sum(S.Price) ≥ v is not succinct
min(S.Price) ≤ v is succinct
Optimization:
If C is succinct, then C is pre-counting prunable: the satisfaction of the constraint alone is not affected by iterative support counting.
257
Characterization of Constraints
by Succinctness
Constraint                          Succinct?
S θ v, θ ∈ { =, ≤, ≥ }              yes
v ∈ S                               yes
S ⊇ V                               yes
S ⊆ V                               yes
S = V                               yes
min(S) ≤ v                          yes
min(S) ≥ v                          yes
min(S) = v                          yes
max(S) ≤ v                          yes
max(S) ≥ v                          yes
max(S) = v                          yes
count(S) ≤ v                        weakly
count(S) ≥ v                        weakly
count(S) = v                        weakly
sum(S) ≤ v                          no
sum(S) ≥ v                          no
sum(S) = v                          no
avg(S) θ v, θ ∈ { =, ≤, ≥ }         no
(frequent constraint)               (no)
258
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
259
Why Is the Big Pie Still There?
More on constraint-based mining of associations
  Boolean vs. quantitative associations
  Association on discrete vs. continuous data
From association to correlation and causal structure analysis
  Association does not necessarily imply correlation or causal relationships
From intra-transaction association to inter-transaction associations
  E.g., break the barriers of transactions (Lu, et al. TOIS'99).
From association analysis to classification and clustering analysis
  E.g., clustering association rules
260
Chapter 6: Mining Association Rules
in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from
transactional databases
Mining multilevel association rules from transactional
databases
Mining multidimensional association rules from
transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
Summary
261
Summary
Association rule mining

probably the most significant contribution from the
database community in KDD

A large number of papers have been published
Many interesting issues have been explored
An interesting research direction

Association analysis in other types of data: spatial data,
multimedia data, time series data, etc.
262
References
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of
frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High
Performance Data Mining), 2000.
R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large
databases. SIGMOD'93, 207-216, Washington, D.C.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94 487-499,
Santiago, Chile.
R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle,
Washington.
S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to
correlations. SIGMOD'97, 265-276, Tucson, Arizona.
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules
for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99,
359-370, Philadelphia, PA, June 1999.
D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in large
databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg
queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
263
References (2)
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San
Diego, CA, Feb. 2000.
Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46,
Singapore, Dec. 1995.
T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized
association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288,
Tucson, Arizona.
J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99,
Sydney, Australia.
J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich,
Switzerland.
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX,
May 2000.
T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data
cubes. KDD'97, 207-210, Newport Beach, California.
M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large
sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
264
References (3)
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data
mining. VLDB'98, 582-593, New York, NY.
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD
Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle,
Washington.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94,
181-192, Seattle, WA, July 1994.
H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining
and Knowledge Discovery, 1:259-289, 1997.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133,
Bombay, India.
R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained
associations rules. SIGMOD'98, 13-24, Seattle, Washington.
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules.
ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
265
References (4)
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules.
SIGMOD'95, 175-186, San Jose, CA, May 1995.
J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets.
DMKD'00, Dallas, TX, 11-20, May 2000.
J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? KDD'00. Boston, MA. Aug.
2000.
G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J.
Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules.
SIGMOD'95, 175-186, San Jose, CA.
S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules.
VLDB'98, 368-379, New York, NY..
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems:
Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large
databases. VLDB'95, 432-443, Zurich, Switzerland.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of
customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
266
References (5)
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures.
VLDB'98, 594-605, New York, NY.
R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland,
Sept. 1995.
R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73,
Newport Beach, California.
H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept.
1996.
D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A
generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear
regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules.
Data Mining and Knowledge Discovery, 1:343-374, 1997.
M. Zaki. Generating Non-Redundant Association Rules. KDD'00. Boston, MA. Aug. 2000.
O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution
Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.
267
http://www.cs.sfu.ca/~han/dmbook
Thank you !!!
268
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— Chapter 7 —
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by backpropagation
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
271
Classification vs. Prediction
Classification:


predicts categorical class labels
classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
Prediction:

models continuous-valued functions, i.e., predicts unknown or missing
values
Typical Applications




credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
272
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes



Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or
mathematical formulae
Model usage: for classifying future or unknown objects

Estimate accuracy of the model
 The known label of test sample is compared with the classified
result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set, otherwise over-fitting will
occur
273
Classification Process (1): Model Construction
Training Data:

NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Classification Algorithms → Classifier (Model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
274
Classification Process (2): Use the
Model in Prediction
Classifier (from the previous step)

Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
275
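The two steps on these slides fit in a few lines of Python; the rule is the one shown above, and (Jeff, Professor, 4) is the unseen tuple from this slide.

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def classify(rank, years):
    # the learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# model construction sanity check: the rule reproduces every training label
print(all(classify(r, y) == t for _, r, y, t in training))   # True
# model usage on unseen data
print(classify("Professor", 4))                              # yes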
Supervised vs. Unsupervised Learning
Supervised learning (classification)

Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations

New data is classified based on the training set
Unsupervised learning (clustering)

The class labels of training data are unknown

Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
276
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by backpropagation
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
277
Issues regarding classification and prediction
(1): Data Preparation
Data cleaning

Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)

Remove the irrelevant or redundant attributes
Data transformation

Generalize and/or normalize data
278
Issues regarding classification and prediction
(2): Evaluating Classification Methods
Predictive accuracy
Speed and scalability


time to construct the model
time to use the model
Robustness

handling noise and missing values
Scalability

efficiency in disk-resident databases
Interpretability:

understanding and insight provided by the model
Goodness of rules


decision tree size
compactness of classification rules
279
Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by backpropagation
Classification based on concepts from association rule
mining
Other Classification Methods
Prediction
Classification accuracy
Summary
280
Classification by Decision Tree Induction
Decision tree




A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases


Tree construction
 At start, all the training examples are at the root
 Partition examples recursively based on selected attributes
Tree pruning
 Identify and remove branches that reflect noise or outliers
Use of decision tree: Classifying an unknown sample

Test the attribute values of the sample against the decision tree
281
Training Dataset
This
follows
an
example
from
Quinlan’s
ID3
age
<=30
<=30
30…40
>40
>40
>40
31…40
<=30
<=30
>40
<=30
31…40
31…40
>40
income student credit_rating
high
no fair
high
no excellent
high
no fair
medium
no fair
low
yes fair
low
yes excellent
low
yes excellent
medium
no fair
low
yes fair
medium
yes fair
medium
yes excellent
medium
no excellent
high
yes fair
medium
no excellent
buys_computer
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
282
Output: A Decision Tree for “buys_computer”
age?
  <=30    → student?
              no  → no
              yes → yes
  31..40  → yes
  >40     → credit rating?
              excellent → no
              fair      → yes
283
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)





Tree is constructed in a top-down recursive divide-and-conquer manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in
advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning



All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf
There are no samples left
284
Attribute Selection Measure
Information gain (ID3/C4.5)


All attributes are assumed to be categorical
Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)




All attributes are assumed continuous-valued
Assume there exist several possible split values for each attribute
May need other tools, such as clustering, to get the possible split values
Can be modified for categorical attributes
285
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Assume there are two classes, P and N

Let the set of examples S contain p elements of class P and n elements of
class N

The amount of information, needed to decide if an arbitrary example in S
belongs to P or N is defined as
I(p, n) = - (p / (p + n)) log2(p / (p + n)) - (n / (p + n)) log2(n / (p + n))
286
Information Gain in Decision Tree Induction
Assume that using attribute A a set S will be
partitioned into sets {S1, S2 , …, Sv}

If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify
objects in all subtrees Si is
E(A) = Σ (i = 1 to v) ((pi + ni) / (p + n)) I(pi, ni)

The encoding information that would be gained by branching on A:
Gain(A) = I(p, n) - E(A)
287
Attribute Selection by Information Gain
Computation

Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

Hence
Gain(age) = I(p, n) - E(age) = 0.246

Similarly
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048

(A worked computation in Python follows.)
288
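The computation above can be reproduced from the training table a few slides back; a sketch (the book rounds intermediate values, so it reports Gain(age) = 0.246 where the unrounded difference prints as 0.247):

from math import log2

data = [("<=30","no"), ("<=30","no"), ("31..40","yes"), (">40","yes"), (">40","yes"),
        (">40","no"), ("31..40","yes"), ("<=30","no"), ("<=30","yes"), (">40","yes"),
        ("<=30","yes"), ("31..40","yes"), ("31..40","yes"), (">40","no")]   # (age, buys_computer)

def info(p, n):
    # I(p, n); an empty class contributes 0
    total = p + n
    return sum(-x / total * log2(x / total) for x in (p, n) if x)

p = sum(c == "yes" for _, c in data)
n = len(data) - p
e_age = 0.0
for value in {"<=30", "31..40", ">40"}:
    labels = [c for a, c in data if a == value]
    pi, ni = labels.count("yes"), labels.count("no")
    e_age += (pi + ni) / len(data) * info(pi, ni)

print(f"{info(p, n):.3f} {e_age:.3f} {info(p, n) - e_age:.3f}")   # 0.940 0.694 0.247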
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - Σ (j = 1 to n) pj^2

where pj is the relative frequency of class j in T.
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

gini_split(T) = (N1 / N) gini(T1) + (N2 / N) gini(T2)

The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
289
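A minimal gini computation using the class labels of the buys_computer data, split on age <= 30 (an illustrative split; the algorithm would enumerate all candidate splits):

def gini(labels):
    # gini(T) = 1 - sum_j p_j^2
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    # weighted gini of a binary split
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

left  = ["no", "no", "no", "yes", "yes"]       # the 5 tuples with age <= 30
right = ["yes"] * 7 + ["no"] * 2               # the 9 tuples with age > 30
print(round(gini(left + right), 3), round(gini_split(left, right), 3))   # 0.459 0.394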
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40”
THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
290
Avoid Overfitting in Classification
The generated tree may overfit the training data


Too many branches, some may reflect anomalies due to noise or
outliers
Result is in poor accuracy for unseen samples
Two approaches to avoid overfitting


Prepruning: Halt tree construction early—do not split a node if this
would result in the goodness measure falling below a threshold
 Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
 Use a set of data different from the training data to decide which
is the “best pruned tree”
291
Approaches to Determine the Final Tree Size
Separate training (2/3) and testing (1/3) sets
Use cross validation, e.g., 10-fold cross validation
Use all the data for training

but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve the
entire distribution
Use minimum description length (MDL) principle:

halting growth of the tree when the encoding is minimized
292
Enhancements to basic decision tree induction
Allow for continuous-valued attributes

Dynamically define new discrete-valued attributes that
partition the continuous attribute value into a discrete set of
intervals
Handle missing attribute values


Assign the most common value of the attribute
Assign probability to each of the possible values
Attribute construction


Create new attributes based on existing ones that are
sparsely represented
This reduces fragmentation, repetition, and replication
293
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
Why decision tree induction in data mining?




relatively faster learning speed (than other classification methods)
convertible to simple and easy to understand classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
294
Scalable Decision Tree Induction
Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.)

builds an index for each attribute and only class list and the current
attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.)

constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)

integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)


separates the scalability aspects from the criteria that determine the quality
of the tree
builds an AVC-list (attribute, value, class label)
295
Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree
induction (Kamber et al’97).
Classification at primitive concept levels



E.g., precise temperature, humidity, outlook, etc.
Low-level concepts, scattered classes, bushy
classification-trees
Semantic interpretation problems.
Cube-based multi-level classification


Relevance analysis at multi-levels.
Information-gain analysis with dimension + level.
296
Presentation of Classification Results
297
References (I)
C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation
Computer Systems, 13, 1997.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees.
Wadsworth International Group, 1984.
P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling
machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95),
pages 39-44, Montreal, Canada, August 1995.
U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI
Conf., pages 601-606, AAAI Press, 1994.
J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest: A framework for fast decision tree
construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427,
New York, NY, August 1998.
M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree
induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues
on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.
298
References (II)
J. Magidson. The Chaid approach to segmentation modeling: Chi-squared automatic interaction
detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159.
Blackwell Business, Cambridge Massechusetts, 1994.
M. Mehta, R. Agrawal, and J. Rissanen. SLIQ : A fast scalable classifier for data mining. In Proc.
1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
S. K. Murthy, Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey,
Data Mining and Knowledge Discovery 2(4): 345-389, 1998
J. R. Quinlan. Bagging, boosting, and c4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence
(AAAI'96), 725-730, Portland, OR, Aug. 1996.
R. Rastogi and K. Shim. Public: A decision tree classifier that integrates building and pruning. In
Proc. 1998 Int. Conf. Very Large Data Bases, 404-415, New York, NY, August 1998.
J. Shafer, R. Agrawal, and M. Mehta. SPRINT : A scalable parallel classifier for data mining. In
Proc. 1996 Int. Conf. Very Large Data Bases, 544-555, Bombay, India, Sept. 1996.
S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction
Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufman,
1991.
299
http://www.cs.sfu.ca/~han
Thank you !!!
300