Data Mining:
Concepts and Techniques
— Slides for Textbook —
— potpourri —
©Jiawei Han and Micheline Kamber
http://www.cs.sfu.ca
Potpourri composed by
Yannis Theodoridis (May 2001)
January 20, 2006
Data Mining: Concepts and Techniques
1
Contents
Introduction
Data Warehouses
Data Preprocessing
Data Mining Functionality
Association Rules
Classification
Clustering
Trend Analysis
Social Impact
A prototype system: DBMiner
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases
Alternative names and their “inside stories”:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
What is not data mining?
Expert systems or small ML/statistical programs
Data Mining Applications
Data mining is a young discipline with wide and diverse applications
A nontrivial gap exists between general principles of data
mining and domain-specific, effective data mining tools for
particular applications
Some application domains (covered in this chapter)
Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry
Commercial Data Mining tools
Commercial data mining systems have little in common
Different data mining functionality or methodology
May even work with completely different kinds of data sets
Need a multidimensional view in selection
Data types: relational, transactional, text, time sequence, spatial?
System issues
Running on only one or on several operating systems?
A client/server architecture?
Provide Web-based interfaces and allow XML data as input and/or
output?
Commercial Data Mining tools
Data sources
ASCII text files, multiple relational data sources
Support ODBC connections (OLE DB, JDBC)?
Data mining functions and methodologies
One vs. multiple data mining functions
One vs. variety of methods per function
More data mining functions and methods per function provide the user with
greater flexibility and analysis power
Coupling with DB and/or data warehouse systems
Four forms of coupling: no coupling, loose coupling, semitight
coupling, and tight coupling
Ideally, a data mining system should be tightly coupled with a database
system
Commercial Data Mining tools
Scalability
Row (or database size) scalability
Column (or dimension) scalability
Curse of dimensionality: it is much more challenging to make a
system column scalable than row scalable
Visualization tools
“A picture is worth a thousand words”
Visualization categories: data visualization, mining result
visualization, mining process visualization, and visual data mining
Data mining query language and graphical user interface
Easy-to-use and high-quality graphical user interface
Essential for user-guided, highly interactive data mining
Examples of Data Mining Systems (1)
IBM Intelligent Miner
A wide range of data mining algorithms
Scalable mining algorithms
Toolkits: neural network algorithms, statistical methods, data
preparation, and data visualization tools
Tight integration with IBM's DB2 relational database system
Microsoft SQL Server 2000
Integrates DB and OLAP with mining
Supports the OLE DB for DM standard
Examples of Data Mining Systems (2)
SGI MineSet
Multiple data mining algorithms and advanced statistics
Advanced visualization tools
SAS Enterprise Miner
A variety of statistical analysis tools
Data warehouse tools and multiple data mining algorithms
Clementine (SPSS)
An integrated data mining development environment for end-users
and developers
Multiple data mining algorithms and visualization tools
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]
Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Layers, top to bottom:]
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown beside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA
Architecture of a Typical DM System
[Figure: layered architecture, top to bottom:]
Graphical user interface
Pattern evaluation
Data mining engine (connected to the Knowledge-base)
Database or data warehouse server
Data cleaning & data integration; Filtering
Databases; Data Warehouse
Data Mining Functionalities (1)
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
Association (correlation and causality)
Multi-dimensional vs. single-dimensional association
age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”)
[support = 2%, confidence = 60%]
contains(T, “computer”) ⇒ contains(T, “software”)
[support = 1%, confidence = 75%]
Data Mining Functionalities (2)
Classification and Prediction
Finding models (functions) that describe and distinguish classes or
concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas
mileage
Presentation: decision-tree, classification rule, neural network
Prediction: Predict some unknown or missing numerical values
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Clustering based on the principle: maximizing the intra-class similarity
and minimizing the interclass similarity
Data Mining Functionalities (3)
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the
data
It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality
data
Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each transaction is a list of items
(purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of
another set of items
E.g., 98% of people who purchase tires and auto accessories also get
automotive services done
Applications
* ⇒ Maintenance Agreement (What the store should do to boost
Maintenance Agreement sales)
Home Electronics ⇒ * (What other products should the store stock
up on?)
Attached mailing in direct marketing
Detecting “ping-pong”ing of patients, faulty “collisions”
Rule Measures: Support & Confidence
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]
Find all the rules X & Y ⇒ Z with
minimum confidence and support
support, s: probability that a
transaction contains {X, Y, Z}
confidence, c: conditional probability
that a transaction having {X, Y}
also contains Z

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Let minimum support be 50% and minimum
confidence be 50%; then we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
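The two measures can be checked by hand on the four transactions of this slide. The following is a minimal sketch, not part of the original slides; the function names `support` and `confidence` are illustrative choices.

```python
# Transactions from the slide's example table (TID -> items bought).
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"A", "C"}, transactions))       # support of A => C
print(confidence({"A"}, {"C"}, transactions))  # confidence of A => C
print(confidence({"C"}, {"A"}, transactions))  # confidence of C => A
```

Both rules share the same support because support depends only on the combined itemset {A, C}; the confidences differ because the denominators (support of A versus support of C) differ.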
Visualization of Association Rule Using Plane Graph
Visualization of Association Rule Using Rule Graph
Rule Mining: A Road Map
Boolean vs. quantitative associations (based on the types of values handled)
buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “UMiner”) [0.2%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
Single dimension vs. multiple dimensional associations (see examples above)
Single level vs. multiple-level analysis
What brands of beers are associated with what brands of diapers?
Various extensions
Correlation, causality analysis
Association does not necessarily imply correlation or causality
Maxpatterns and closed itemsets
Constraints enforced
E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
Mining Association Rules—An Example
Min. support 50%
Min. confidence 50%

Transaction ID   Items Bought
2000             A,B,C
1000             A,C
4000             A,D
5000             B,E,F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} must
be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemsets)
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Join Step: Ck is generated by joining Lk-1 with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
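The pseudo-code above could be turned into runnable Python roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `apriori` and the parameter `min_count` (an absolute support count rather than a percentage) are illustrative, candidates are counted with a plain scan instead of a hash-tree, and the join simply unions pairs of frequent k-itemsets before pruning.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count candidates contained in each transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {frozenset([i]) for t in transactions for i in t}
    level = count(items)           # L1
    frequent = dict(level)
    k = 1
    while level:
        # Join: unions of two frequent k-itemsets that have exactly k+1 items.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = count(candidates)  # L(k+1)
        frequent.update(level)
        k += 1
    return frequent

# A small database of four transactions, with a minimum support count of 2.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, 2)
print(result[frozenset({2, 3, 5})])
```

The loop stops as soon as a level comes back empty, which mirrors the `Lk != ∅` condition in the pseudo-code.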
The Apriori Algorithm — Example
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with support counts
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3
itemset
{2 3 5}

Scan D → L3
itemset   sup
{2 3 5}   2
Candidates Generation
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
Example of Generating Candidates
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
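The join and prune steps of this example can be reproduced with a short Python sketch. It assumes each itemset is stored as a sorted tuple of items and that the input list is non-empty apart from the explicit guard; the name `apriori_gen` is an illustrative choice.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(prev_frequent)
    if not prev:
        return []
    prev_set = set(prev)
    k_minus_1 = len(prev[0])
    candidates = []
    # Step 1 (self-join): pair itemsets that agree on all but their last item.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    # Step 2 (prune): drop candidates that have an infrequent (k-1)-subset.
    return [c for c in candidates
            if all(s in prev_set for s in combinations(c, k_minus_1))]

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])
```

The join produces abcd and acde; pruning then discards acde because its subset ade is missing from L3, leaving only abcd.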
Mining Distance-based Association Rules
Binning methods do not capture the semantics of interval data
Price($): 7, 20, 22, 50, 51, 53
Equi-width (width $10): [0,10], [11,20], [21,30], [31,40], [41,50], [51,60]
Equi-depth (depth 2): [7,20], [22,50], [51,53]
Distance-based: [7,7], [20,22], [50,53]
Distance-based partitioning, more meaningful discretization considering:
density/number of points in an interval
“closeness” of points in an interval
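The difference between equi-depth and distance-based partitioning can be seen by running both on the six prices. This is a rough sketch only: the slides do not specify how the distance-based intervals are computed, so the gap rule below (start a new interval when the jump to the next value exceeds `max_gap`) and the threshold of 5 are assumptions made for illustration.

```python
def equi_depth(values, depth):
    """Split sorted values into consecutive groups of `depth` points each."""
    ordered = sorted(values)
    groups = [ordered[i:i + depth] for i in range(0, len(ordered), depth)]
    return [(g[0], g[-1]) for g in groups]

def distance_based(values, max_gap):
    """Start a new interval whenever the gap to the previous point exceeds `max_gap`."""
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for v in ordered[1:]:
        if v - groups[-1][-1] > max_gap:
            groups.append([v])
        else:
            groups[-1].append(v)
    return [(g[0], g[-1]) for g in groups]

prices = [7, 20, 22, 50, 51, 53]
print(equi_depth(prices, 2))      # intervals holding two points each
print(distance_based(prices, 5))  # intervals holding points that lie close together
```

Equi-depth places 22 and 50 in the same interval even though they are far apart, while the gap rule keeps 20 with 22 and 50 with 51 and 53.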
Clusters and Distance Measurements
S[X] is a set of N tuples t1, t2, …, tN , projected on the attribute set X
The diameter of S[X]:
$$ d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \mathrm{dist}_X(t_i[X], t_j[X])}{N(N-1)} $$
dist_X: distance metric, e.g., Euclidean distance or Manhattan distance
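As a small worked example of the diameter, the sketch below sums the distance over all ordered pairs of tuples and divides by N(N − 1). The helper names and the choice to return 0.0 for fewer than two tuples are assumptions for illustration, not part of the slides.

```python
def diameter(points, dist):
    """Average pairwise distance: sum over all ordered pairs divided by N(N-1)."""
    n = len(points)
    if n < 2:
        return 0.0  # assumption: fewer than two tuples have zero diameter
    total = sum(dist(p, q) for p in points for q in points)
    return total / (n * (n - 1))

def manhattan(p, q):
    """Manhattan distance between two tuples of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Euclidean distance between two tuples of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

print(diameter([(20,), (22,)], manhattan))
print(diameter([(50,), (51,), (53,)], manhattan))
print(diameter([(0, 0), (3, 4)], euclidean))
```

Pairs with i = j contribute a distance of zero, so including them in the double sum does not change the numerator.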
Clusters and Distance Measurements(Cont.)
The diameter, d, assesses the density of a cluster C_X, where
$$ d(C_X) \le d_0^X, \qquad |C_X| \ge s_0 $$
Finding clusters and distance-based rules
the density threshold, d0 , replaces the notion of support
modified version of the BIRCH clustering algorithm
Interestingness Measurements
Objective measures
Two popular measurements:
support; and
confidence
Subjective measures (Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if
it is unexpected (surprising to the user); and/or
actionable (the user can do something with it)
Criticism to Support and Confidence
Example 1: (Aggarwal & Yu, PODS98)
Among 5000 students
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading because
the overall percentage of students eating cereal is 75% which is
higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence
             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
Criticism to Support and Confidence
Example 2:
X and Y: positively correlated
X and Z: negatively related
support and confidence of X ⇒ Z dominates

X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1

Rule     Support   Confidence
X ⇒ Y   25%       50%
X ⇒ Z   37.50%    75%

We need a measure of dependent or correlated events
$$ \mathrm{corr}_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)} $$
P(B|A)/P(B) is also called the lift of rule A ⇒ B
Other Interestingness Measures: Interest
Interest (correlation, lift)
$$ \frac{P(A \wedge B)}{P(A)\,P(B)} $$
taking both P(A) and P(B) into consideration
P(A ∧ B) = P(B) · P(A), if A and B are independent events
A and B are negatively correlated if the value is less than 1, and
positively correlated if it is greater than 1

X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.50%    0.9
Y,Z       12.50%    0.57
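The interest values in the table can be recomputed directly from the three indicator rows. A minimal sketch, with the helper names `prob` and `interest` chosen for illustration:

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]

def prob(*columns):
    """Fraction of positions at which every given column is 1."""
    rows = list(zip(*columns))
    return sum(all(r) for r in rows) / len(rows)

def interest(a, b):
    """Interest (lift): P(A and B) / (P(A) * P(B))."""
    return prob(a, b) / (prob(a) * prob(b))

for name, (a, b) in {"X,Y": (X, Y), "X,Z": (X, Z), "Y,Z": (Y, Z)}.items():
    print(name, prob(a, b), round(interest(a, b), 2))
```

For X,Z the computed value is about 0.86, which the table rounds to 0.9. It is below 1 even though X ⇒ Z has the higher support and confidence, which is the point of the criticism.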
Association Rules:Summary
Association rule mining
probably the most significant contribution from the database
community in KDD
A large number of papers have been published
Many interesting issues have been explored
An interesting research direction
Association analysis in other types of data: spatial data,
multimedia data, time series data, etc.
Training Dataset
This follows an example from Quinlan’s ID3

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
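ID3 chooses the attribute to split on by information gain, a criterion the slide itself does not spell out. The sketch below computes the gain of each attribute on the 14 training tuples; the names `entropy`, `info_gain`, `ROWS`, and `ATTRS` are illustrative, and the age range is written `31..40` throughout.

```python
from collections import Counter
from math import log2

# (age, income, student, credit_rating, buys_computer), as in the training table.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, col):
    """Entropy of the class label minus the weighted entropy after splitting on `col`."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {a: info_gain(ROWS, i) for i, a in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, {a: round(g, 3) for a, g in gains.items()})
```

On this data the attribute with the largest gain is age, which is why a tree built this way tests age at the root.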
Output: A Decision Tree for “buys_computer”
age?
    <=30: student?
        no: no
        yes: yes
    31…40: yes
    >40: credit rating?
        excellent: no
        fair: yes
Presentation of Classification Results
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., Splus
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three scatter plots, both axes ranging from 0 to 10]
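The merging procedure described on this slide can be sketched in a few lines of Python. This is an illustrative, unoptimized version (it recomputes distances on every pass rather than maintaining a dissimilarity matrix); the function names and the `num_clusters` stopping parameter are assumptions, and `math.dist` requires Python 3.8 or later.

```python
from math import dist

def single_link(c1, c2):
    """Dissimilarity between two clusters: distance of their closest pair of points."""
    return min(dist(p, q) for p in c1 for q in c2)

def agnes(points, num_clusters=1):
    """Merge the two least dissimilar clusters until `num_clusters` remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters.pop(j)    # j > i, so index i is unaffected by the pop
        clusters[i].extend(merged)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agnes(points, 3))
```

With the default `num_clusters=1` the loop continues until all points belong to the same cluster, as the last bullet states.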
DBSCAN: Density Based Spatial Clustering
of Applications with Noise
Relies on a density-based notion of cluster: A cluster is defined as a
maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]
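The three kinds of points in the figure can be told apart with a short Python sketch. It covers only the point classification, not the full cluster expansion of DBSCAN, and it assumes that points are distinct tuples and that a point counts as a member of its own Eps-neighborhood. The function name `classify_points` is illustrative.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier' in the DBSCAN sense."""
    neighbors = {
        p: [q for q in points if dist(p, q) <= eps]  # neighborhood includes p itself
        for p in points
    }
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"   # not dense itself, but within eps of a core point
        else:
            labels[p] = "outlier"
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

In this toy data the four corners of the unit square are core points, (2, 1) is a border point reachable from (1, 1), and (10, 10) is an outlier.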
[Figure: reachability plot showing the reachability-distance (with thresholds ε and ε′; some values undefined) against the cluster-order of the objects]
Constraint-Based Clustering Analysis
Clustering analysis: fewer parameters but more user-desired constraints, e.g., an
ATM allocation problem
Mining Time-Series and Sequence Data
Time-series plot
Mining Time-Series and Sequence Data:
Trend analysis
A time series can be illustrated as a time-series graph which
describes a point moving with the passage of time
Categories of Time-Series Movements
Long-term or trend movements (trend curve)
Cyclic movements or cycle variations, e.g., business cycles
Seasonal movements or seasonal variations
i.e., almost identical patterns that a time series appears
to follow during corresponding months of successive
years
Irregular or random movements
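One common way to estimate the long-term trend curve is a moving average, which smooths out the shorter-lived movements. The slide does not name a method, so the sketch below is an assumption added for illustration; the function name and window size are arbitrary.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

series = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(series, 3))
```

A larger window gives a smoother curve but also shortens the result, since a window of n values yields len(series) − n + 1 averages.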
Social Impacts: Threat to Privacy and Data
Security?
Is data mining a threat to privacy and data security?
“Big Brother”, “Big Banker”, and “Big Business” are carefully
watching you
Profiling information is collected every time
You use your credit card, debit card, supermarket loyalty card, or
frequent flyer card, or apply for any of the above
You surf the Web, reply to an Internet newsgroup, subscribe to a
magazine, rent a video, join a club, fill out a contest entry form
You pay for prescription drugs, or present your medical care number
when visiting the doctor
Collection of personal data may be beneficial for companies
and consumers, but there is also potential for misuse
Protect Privacy and Data Security
Fair information practices
International guidelines for data privacy protection
Cover aspects relating to data collection, purpose, use,
quality, openness, individual participation, and
accountability
Purpose specification and use limitation
Openness: Individuals have the right to know what
information is collected about them, who has access to the
data, and how the data are being used
Develop and use data security-enhancing techniques
Blind signatures
Biometric encryption
Anonymous databases
OLAP (Summarization) Display Using MS/Excel 2000
Market-Basket-Analysis (Association)—Ball graph
Display of Association Rules in Rule Plane Form
Display of Decision Tree (Classification Results)
Display of Clustering (Segmentation) Results
3D Cube Browser
Trends in Data Mining (1)
Scalable data mining methods
Constraint-based mining: use of constraints to guide data
mining systems in their search for interesting patterns
Application exploration
Development of application-specific data mining systems
Invisible data mining (mining as built-in function)
Integration of data mining with database systems, data warehouse
systems, and Web database systems
Quality assessment
Trends in Data Mining (2)
Standardization of data mining language
A standard will facilitate systematic development, improve
interoperability, and promote the education and use of data mining
systems in industry and society
Visual data mining
Uncertainty handling
New methods for mining complex types of data
More research is required towards the integration of data mining
methods with existing data analysis techniques for the complex
types of data
Web mining
Privacy protection and information security in data mining