Download Data Mining: Concepts and Techniques

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Data Mining:
Concepts and Techniques
— Slides for Textbook —
— potpourri —
©Jiawei Han and Micheline Kamber
http://www.cs.sfu.ca
Potpourri composed by
Yannis Theodoridis (May 2001)
January 20, 2006
Data Mining: Concepts and Techniques
1
Contents
„
„
„
„
„
Introduction
Data Warehouses
„ Data Preprocessing
Data Mining Functionality
„ Association Rules
„ Classification
„ Clustering
„ Trend Analysis
Social Impact
A prototype system: DBMiner
January 20, 2006
Data Mining: Concepts and Techniques
2
1
What Is Data Mining?
„
Data mining (knowledge discovery in databases):
„
Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) information or patterns from
data in large databases
„
Alternative names and their “inside stories”:
„
„
„
Data mining: a misnomer?
Knowledge discovery(mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
What is not data mining?
„
January 20, 2006
Expert systems or small ML/statistical programs
Data Mining: Concepts and Techniques
3
Data Mining Applications
„
Data mining is a young discipline with wide and diverse applications
9
„
a nontrivial gap exists between general principles of data
mining and domain-specific, effective data mining tools for
particular applications
Some application domains (covered in this chapter)
9
9
9
9
Biomedical and DNA data analysis
Financial data analysis
Retail industry
Telecommunication industry
January 20, 2006
Data Mining: Concepts and Techniques
4
2
Commercial Data Mining tools
„
Commercial data mining systems have little in common
9
9
„
„
„
Different data mining functionality or methodology
May even work with completely different kinds of data sets
Need multiple dimensional view in selection
Data types: relational, transactional, text, time sequence, spatial?
System issues
9
9
9
running on only one or on several operating systems?
a client/server architecture?
Provide Web-based interfaces and allow XML data as input and/or
output?
January 20, 2006
Data Mining: Concepts and Techniques
5
Commercial Data Mining tools
„
Data sources
9
9
„
ASCII text files, multiple relational data sources
support ODBC connections (OLE DB, JDBC)?
Data mining functions and methodologies
9
9
One vs. multiple data mining functions
One vs. variety of methods per function
More data mining functions and methods per function provide the user with
greater flexibility and analysis power
Coupling with DB and/or data warehouse systems
„
„
9
Four forms of coupling: no coupling, loose coupling, semitight
coupling, and tight coupling
„
January 20, 2006
Ideally, a data mining system should be tightly coupled with a database
system
Data Mining: Concepts and Techniques
6
3
Commercial Data Mining tools
„
Scalability
9
9
9
„
Visualization tools
9
9
„
Row (or database size) scalability
Column (or dimension) scalability
Curse of dimensionality: it is much more challenging to make a
system column scalable that row scalable
“A picture is worth a thousand words”
Visualization categories: data visualization, mining result
visualization, mining process visualization, and visual data mining
Data mining query language and graphical user interface
9
9
Easy-to-use and high-quality graphical user interface
Essential for user-guided, highly interactive data mining
January 20, 2006
Data Mining: Concepts and Techniques
7
Examples of Data Mining Systems (1)
„
„
IBM Intelligent Miner
„ A wide range of data mining algorithms
„ Scalable mining algorithms
„ Toolkits: neural network algorithms, statistical methods, data
preparation, and data visualization tools
„ Tight integration with IBM's DB2 relational database system
Mirosoft SQLServer 2000
„ Integrate DB and OLAP with mining
„ Support OLEDB for DM standard
January 20, 2006
Data Mining: Concepts and Techniques
8
4
Examples of Data Mining Systems (2)
„
SGI MineSet
„ Multiple data mining algorithms and advanced statistics
„ Advanced visualization tools
„
SAS Enterprise Miner
„ A variety of statistical analysis tools
„ Data warehouse tools and multiple data mining algorithms
„
Clementine (SPSS)
„ An integrated data mining development environment for end-users
and developers
„ Multiple data mining algorithms and visualization tools
January 20, 2006
Data Mining: Concepts and Techniques
9
Data Mining: A KDD Process
Pattern Evaluation
„
Data mining: the core of
knowledge discovery process.
Data Mining
Task-relevant Data
Data Warehouse
Selection
Data Cleaning
Data Integration
Databases
January 20, 2006
Data Mining: Concepts and Techniques
10
5
Data Mining and Business Intelligence
Increasing potential
to support
business decisions
End User
Making
Decisions
Business
Analyst
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
Data
Analyst
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Paper, Files, Information Providers, Database Systems, OLTP
January 20, 2006
DBA
Data Mining: Concepts and Techniques
11
Architecture of a Typical DM System
Graphical user interface
Pattern evaluation
Data mining engine
Knowledge-base
Database or data
warehouse server
Data cleaning
& data integration
Databases
January 20, 2006
Filtering
Data
Warehouse
Data Mining: Concepts and Techniques
12
6
Data Mining Functionalities (1)
„
Concept description: Characterization and discrimination
„
„
Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet regions
Association (correlation and causality)
„
„
„
Multi-dimensional vs. single-dimensional association
age(X, “20..29”) ^ income(X, “20..29K”) Æ buys(X, “PC”)
[support = 2%, confidence = 60%]
contains(T, “computer”) Æ contains(x, “software”) [1%,
75%]
January 20, 2006
Data Mining: Concepts and Techniques
13
Data Mining Functionalities (2)
„
Classification and Prediction
„
„
„
Finding models (functions) that describe and distinguish classes or
concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas
mileage
„
Presentation: decision-tree, classification rule, neural network
„
Prediction: Predict some unknown or missing numerical values
Cluster analysis
„
„
January 20, 2006
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Clustering based on the principle: maximizing the intra-class similarity
and minimizing the interclass similarity
Data Mining: Concepts and Techniques
14
7
Data Mining Functionalities (3)
„
Outlier analysis
„
Outlier: a data object that does not comply with the general behavior of the
data
„
It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
„
„
Trend and evolution analysis
„
Trend and deviation: regression analysis
„
Sequential pattern mining, periodicity analysis
„
Similarity-based analysis
Other pattern-directed or statistical analyses
January 20, 2006
Data Mining: Concepts and Techniques
15
Why Data Preprocessing?
„
„
Data in the real world is dirty
„ incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
„ noisy: containing errors or outliers
„ inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
„ Quality decisions must be based on quality data
„ Data warehouse needs consistent integration of quality
data
January 20, 2006
Data Mining: Concepts and Techniques
16
8
Association Rule: Basic Concepts
„
„
Given: (1) database of transactions, (2) each transaction is a list of items
(purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of
another set of items
„ E.g., 98% of people who purchase tires and auto accessories also get
automotive services done
„
Applications
„
„
„
„
* ⇒ Maintenance Agreement (What the store should do to boost
Maintenance Agreement sales)
Home Electronics ⇒ * (What other products should the store stocks
up?)
Attached mailing in direct marketing
Detecting “ping-pong”ing of patients, faulty “collisions”
January 20, 2006
Data Mining: Concepts and Techniques
17
Rule Measures: Support & Confidence
Customer
buys both
Customer
buys beer
Customer
buys diaper
„
Find all the rules X & Y ⇒ Z with
minimum confidence and support
„ support, s, probability that a
transaction contains {X Y Z}
„ confidence, c, conditional probability
that a transaction having {X Y}
also contains Z
Transaction ID Items Bought Let minimum support 50%, and minimum
confidence 50%, we have
2000
A,B,C
„ A ⇒ C (50%, 66.6%)
1000
A,C
„ C ⇒ A (50%, 100%)
4000
A,D
5000
B,E,F
January 20, 2006
Data Mining: Concepts and Techniques
18
9
Visualization of Association Rule Using Plane Graph
January 20, 2006
Data Mining: Concepts and Techniques
19
Visualization of Association Rule Using Rule Graph
January 20, 2006
Data Mining: Concepts and Techniques
20
10
Rule Mining: A Road Map
„
Boolean vs. quantitative associations (Based on the types of values handled)
9
9
„
buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “UMiner”) [0.2%, 60%]
age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
„
Single dimension vs. multiple dimensional associations (see ex. Above)
Single level vs. multiple-level analysis
„
Various extensions
9
9
9
9
What brands of beers are associated with what brands of diapers?
Correlation, causality analysis
„ Association does not necessarily imply correlation or causality
Maxpatterns and closed itemsets
Constraints enforced
„ E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
January 20, 2006
Data Mining: Concepts and Techniques
21
Mining Association Rules—An Example
Transaction ID
2000
1000
4000
5000
For rule A ⇒ C:
Items Bought
A,B,C
A,C
A,D
B,E,F
Min. support 50%
Min. confidence 50%
Frequent Itemset
{A}
{B}
{C}
{A,C}
Support
75%
50%
50%
50%
support = support({A &C}) = 50%
confidence = support({A &C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
January 20, 2006
Data Mining: Concepts and Techniques
22
11
Mining Frequent Itemsets: the Key Step
„
Find the frequent itemsets: the sets of items that have minimum support
9
A subset of a frequent itemset must also be a frequent itemset
y
9
„
i.e., if {AB} is a frequent itemset, both {A} and {B} should
be a frequent itemset
Iteratively find frequent itemsets with cardinality from 1 to k
(k-itemset)
Use the frequent itemsets to generate association rules.
January 20, 2006
Data Mining: Concepts and Techniques
23
The Apriori Algorithm
„
„
„
Join Step: Ck is generated by joining Lk-1with itself
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent kitemset
Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
January 20, 2006
Data Mining: Concepts and Techniques
24
12
The Apriori Algorithm — Example
Database D
TID
100
200
300
400
itemset sup.
{1}
2
{2}
3
Scan D
{3}
3
{4}
1
{5}
3
C1
Items
134
235
1235
25
L1 itemset sup.
{1}
{2}
{3}
{5}
C2 itemset
C2 itemset sup
L2 itemset sup
2
2
3
2
{1
{1
{1
{2
{2
{3
C3 itemset
{2 3 5}
Scan D
{1
{2
{2
{3
3}
3}
5}
5}
January 20, 2006
2}
3}
5}
3}
5}
5}
1
2
1
2
3
2
2
3
3
3
Scan D
{1
{1
{1
{2
{2
{3
2}
3}
5}
3}
5}
5}
L3 itemset sup
{2 3 5} 2
Data Mining: Concepts and Techniques
25
Candidates Generation
„
Suppose the items in Lk-1 are listed in an order
„
Step 1:
1 self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
„
Step 2:
2 pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
January 20, 2006
Data Mining: Concepts and Techniques
26
13
How to Count Supports of Candidates?
Why counting supports of candidates a problem?
„
The total number of candidates can be very huge
„
One transaction may contain many candidates
„
Method:
„
„
Candidate itemsets are stored in a hash-tree
„
Leaf node of hash-tree contains a list of itemsets and counts
„
Interior node contains a hash table
„
Subset function: finds all the candidates contained in a
transaction
January 20, 2006
Data Mining: Concepts and Techniques
27
Example of Generating Candidates
„
L3={abc, abd, acd, ace, bcd}
„
Self-joining: L3*L3
„
„
abcd from abc and abd
„
acde from acd and ace
Pruning:
„
„
acde is removed because ade is not in L3
C4={abcd}
January 20, 2006
Data Mining: Concepts and Techniques
28
14
Mining Distance-based Association Rules
„
„
Binning methods do not capture the semantics of interval data
Price($)
Equi-width
(width $10)
Equi-depth
(depth 2)
Distancebased
7
20
22
50
51
53
[0,10]
[11,20]
[21,30]
[31,40]
[41,50]
[51,60]
[7,20]
[22,50]
[51,53]
[7,7]
[20,22]
[50,53]
Distance-based partitioning, more meaningful discretization considering:
„
„
density/number of points in an interval
“closeness” of points in an interval
January 20, 2006
Data Mining: Concepts and Techniques
29
Clusters and Distance Measurements
„
„
S[X] is a set of N tuples t1, t2, …, tN , projected on the attribute set X
The diameter of S[X]:
d ( S [ X ]) =
„
∑ ∑
N
N
i =1
j =1
dist X ( t i[ X ], t j[ X ])
N ( N − 1)
distx:distance metric, e.g. Euclidean distance or Manhattan
January 20, 2006
Data Mining: Concepts and Techniques
30
15
Clusters and Distance Measurements(Cont.)
„
The diameter, d, assesses the density of a cluster CX , where
d (C X ) ≤ d 0 X
CX ≥ s 0
„
Finding clusters and distance-based rules
„
the density threshold, d0 , replaces the notion of support
„
modified version of the BIRCH clustering algorithm
January 20, 2006
Data Mining: Concepts and Techniques
31
Interestingness Measurements
„
Objective measures
Two popular measurements:
œ support; and
 confidence
„
Subjective measures (Silberschatz & Tuzhilin, KDD95)
A rule (pattern) is interesting if
œ it is unexpected (surprising to the user); and/or
 actionable (the user can do something with it)
January 20, 2006
Data Mining: Concepts and Techniques
32
16
Criticism to Support and Confidence
„
Example 1: (Aggarwal & Yu, PODS98)
„
„
„
Among 5000 students
„ 3000 play basketball
„ 3750 eat cereal
„ 2000 both play basket ball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading because
the overall percentage of students eating cereal is 75% which is
higher than 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is far more
accurate, although with lower support and confidence
cereal
not cereal
sum(col.)
January 20, 2006
basketball
not basketball sum(row)
2000
1750
3750
1000
250
1250
3000
2000
5000
Data Mining: Concepts and Techniques
33
Criticism to Support and Confidence
„
Example 2:
„
„
„
„
X and Y: positively correlated,
X and Z, negatively related
support and confidence of
X=>Z dominates
We need a measure of dependent or
correlated events
corrA, B =
„
P ( A∪ B )
P ( A) P ( B )
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
Rule Support
X=>Y
25%
X=>Z 37,50%
Confidence
50%
75%
P(B|A)/P(B) is also called the lift of rule A =>
B
January 20, 2006
Data Mining: Concepts and Techniques
34
17
Other Interestingness Measures: Interest
„
P( A ∧ B)
P ( A) P( B)
Interest (correlation, lift)
„
taking both P(A) and P(B) in consideration
„
P(A^B)=P(B)*P(A), if A and B are independent events
„
A and B negatively correlated, if the value is less than 1; otherwise A and B
positively correlated
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
January 20, 2006
Itemset
Support
Interest
X,Y
X,Z
Y,Z
25%
37.50%
12.50%
2
0.9
0.57
Data Mining: Concepts and Techniques
35
Association Rules:Summary
„
Association rule mining
„
probably the most significant contribution from the database
community in KDD
„
A large number of papers have been published
„
Many interesting issues have been explored
„
An interesting research direction
„
Association analysis in other types of data: spatial data,
multimedia data, time series data, etc.
January 20, 2006
Data Mining: Concepts and Techniques
36
18
Training Dataset
This
follows an
example
from
Quinlan’s
ID3
age
<=30
<=30
30…40
>40
>40
>40
31…40
<=30
<=30
>40
<=30
31…40
31…40
>40
January 20, 2006
income student credit_rating
high
no fair
high
no excellent
high
no fair
medium
no fair
low
yes fair
low
yes excellent
low
yes excellent
medium
no fair
low
yes fair
medium
yes fair
medium
yes excellent
medium
no excellent
high
yes fair
medium
no excellent
buys_computer
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
Data Mining: Concepts and Techniques
37
Output: A Decision Tree for “buys_computer”
age?
<=30
30..40
overcast
yes
student?
>40
credit rating?
no
yes
excellent
fair
no
yes
no
yes
January 20, 2006
Data Mining: Concepts and Techniques
38
19
Presentation of Classification Results
January 20, 2006
Data Mining: Concepts and Techniques
39
AGNES (Agglomerative Nesting)
„
Introduced in Kaufmann and Rousseeuw (1990)
„
Implemented in statistical analysis packages, e.g., Splus
„
Use the Single-Link method and the dissimilarity matrix.
„
Merge nodes that have the least dissimilarity
„
Go on in a non-descending fashion
„
Eventually all nodes belong to the same cluster
10
10
10
9
9
9
8
8
8
7
7
7
6
6
6
5
5
5
4
4
4
3
3
3
2
2
2
1
1
0
0
0
1
2
January 20, 2006
3
4
5
6
7
8
9
10
1
0
0
1
2
3
4
5
6
7
8
9
10
Data Mining: Concepts and Techniques
0
1
2
3
4
5
6
7
8
9
10
40
20
DBSCAN: Density Based Spatial Clustering
of Applications with Noise
„
„
Relies on a density-based notion of cluster: A cluster is defined as a
maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
Outlier
Border
Eps = 1cm
Core
January 20, 2006
MinPts = 5
Data Mining: Concepts and Techniques
41
Reachabilitydistance
undefined
ε
ε‘
January 20, 2006
ε
Data Mining: Concepts and Techniques
Cluster-order
of the objects
42
21
Constraint-Based Clustering Analysis
„
Clustering analysis: less parameters but more user-desired constraints, e.g., an
ATM allocation problem
January 20, 2006
Data Mining: Concepts and Techniques
43
Mining Time-Series and Sequence Data
Time-series plot
January 20, 2006
Data Mining: Concepts and Techniques
44
22
Mining Time-Series and Sequence Data:
Trend analysis
„
„
A time series can be illustrated as a time-series graph which
describes a point moving with the passage of time
Categories of Time-Series Movements
„
Long-term or trend movements (trend curve)
„
Cyclic movements or cycle variations, e.g., business cycles
„
Seasonal movements or seasonal variations
„
„
i.e, almost identical patterns that a time series appears
to follow during corresponding months of successive
years.
Irregular or random movements
January 20, 2006
Data Mining: Concepts and Techniques
45
Social Impacts: Threat to Privacy and Data
Security?
„
Is data mining a threat to privacy and data security?
„ “Big Brother”, “Big Banker”, and “Big Business” are carefully
watching you
„ Profiling information is collected every time
„
„
„
„
You use your credit card, debit card, supermarket loyalty card, or
frequent flyer card, or apply for any of the above
You surf the Web, reply to an Internet newsgroup, subscribe to a
magazine, rent a video, join a club, fill out a contest entry form,
You pay for prescription drugs, or present you medical care number
when visiting the doctor
Collection of personal data may be beneficial for companies
and consumers, there is also potential for misuse
January 20, 2006
Data Mining: Concepts and Techniques
46
23
Protect Privacy and Data Security
„
„
Fair information practices
„ International guidelines for data privacy protection
„ Cover aspects relating to data collection, purpose, use,
quality, openness, individual participation, and
accountability
„ Purpose specification and use limitation
„ Openness: Individuals have the right to know what
information is collected about them, who has access to the
data, and how the data are being used
Develop and use data security-enhancing techniques
„ Blind signatures
„ Biometric encryption
„ Anonymous databases
January 20, 2006
Data Mining: Concepts and Techniques
47
OLAP (Summarization) Display Using MS/Excel 2000
January 20, 2006
Data Mining: Concepts and Techniques
48
24
Market-Basket-Analysis (Association)—Ball graph
January 20, 2006
Data Mining: Concepts and Techniques
49
Display of Association Rules in Rule Plane Form
January 20, 2006
Data Mining: Concepts and Techniques
50
25
Display of Decision Tree (Classification Results)
January 20, 2006
Data Mining: Concepts and Techniques
51
Display of Clustering (Segmentation) Results
January 20, 2006
Data Mining: Concepts and Techniques
52
26
3D Cube Browser
January 20, 2006
Data Mining: Concepts and Techniques
53
Trends in Data Mining (1)
„
„
„
„
Scalable data mining methods
„ Constraint-based mining: use of constraints to guide data
mining systems in their search for interesting patterns
Application exploration
„ development of application-specific data mining system
„ Invisible data mining (mining as built-in function)
Integration of data mining with database systems, data warehouse
systems, and Web database systems
Quality assessment
January 20, 2006
Data Mining: Concepts and Techniques
54
27
Trends in Data Mining (2)
„
„
„
„
„
„
Standardization of data mining language
„ A standard will facilitate systematic development, improve
interoperability, and promote the education and use of data mining
systems in industry and society
Visual data mining
Uncertainty handling
New methods for mining complex types of data
„ More research is required towards the integration of data mining
methods with existing data analysis techniques for the complex
types of data
Web mining
Privacy protection and information security in data mining
January 20, 2006
Data Mining: Concepts and Techniques
55
28