Data Mining
Page 1
Outline
• What is data mining?
• Data Mining Tasks
– Association
– Classification
– Clustering
• Data mining Algorithms
• Are all the patterns interesting?
Page 2
What is Data Mining:
• The huge number of databases and web pages makes information
extraction next to impossible (remember the favored statement: I
will bury them in data!)
• Many other disciplines (statistics, AI, information retrieval)
lack scalable algorithms to extract information and/or rules
from databases
• The necessity to find relationships among data
Page 3
What is Data Mining:
• Discovery of useful, possibly unexpected data patterns
• Subsidiary issues:
– Data cleansing
– Visualization
– Warehousing
Page 4
Examples
• A big objection to such large-scale intelligence-gathering was
that it was looking for so many vague connections that it was
sure to find things that were bogus and thus violate innocents’
privacy.
• The Rhine Paradox: a great
example of how not to conduct
scientific research.
Page 5
Rhine Paradox --- (1)
• David Rhine was a parapsychologist in the 1950’s who
hypothesized that some people had Extra-Sensory
Perception.
• He devised an experiment where subjects were asked to
guess 10 hidden cards --- red or blue.
• He discovered that almost 1 in 1000 had ESP --- they
were able to get all 10 right!
Page 6
Rhine Paradox --- (2)
• He told these people they had ESP and called them in
for another test of the same type.
• Alas, he discovered that almost all of them had lost
their ESP.
• What did he conclude?
– Answer on next slide.
Page 7
Rhine Paradox --- (3)
• He concluded that you shouldn’t tell people they have
ESP; it causes them to lose it.
Page 8
A Concrete Example
• This example illustrates a problem with intelligence-gathering.
• Suppose we believe that certain groups of evil-doers are
meeting occasionally in hotels to plot doing evil.
• We want to find people who at least twice have stayed at
the same hotel on the same day.
Page 9
The Details
• 10^9 people being tracked.
• 1000 days.
• Each person stays in a hotel 1% of the time (10 days out of 1000).
• Hotels hold 100 people (so 10^5 hotels).
• If everyone behaves randomly (i.e., no evil-doers) will the data
mining detect anything suspicious?
Page 10
Calculations --- (1)
• Probability that persons p and q will be at the same hotel on day d:
– 1/100 * 1/100 * 10^-5 = 10^-9.
• Probability that p and q will be at the same hotel on two given days:
– 10^-9 * 10^-9 = 10^-18.
• Pairs of days:
– 5*10^5.
Page 11
Calculations --- (2)
• Probability that p and q will be at the same hotel on some two days:
– 5*10^5 * 10^-18 = 5*10^-13.
• Pairs of people:
– 5*10^17.
• Expected number of suspicious pairs of people:
– 5*10^17 * 5*10^-13 = 250,000.
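
A small Python sketch (my own; the parameter values are the ones assumed on this slide) that reproduces these back-of-envelope numbers:

```python
# Expected number of "suspicious" pairs under purely random behavior.
people = 1e9
days = 1000
hotels = 1e5
p_stay = 0.01                       # a given person is in some hotel on a given day

p_same_hotel_one_day = p_stay * p_stay * (1 / hotels)       # 10^-9
p_same_hotel_two_given_days = p_same_hotel_one_day ** 2     # 10^-18
day_pairs = days * (days - 1) / 2                           # ~5 * 10^5
people_pairs = people * (people - 1) / 2                    # ~5 * 10^17

p_suspicious_pair = day_pairs * p_same_hotel_two_given_days # ~5 * 10^-13
expected_suspicious_pairs = people_pairs * p_suspicious_pair

print(f"expected suspicious pairs ~ {expected_suspicious_pairs:,.0f}")  # ~250,000
```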
Page 12
Conclusion
• Suppose there are (say) 10 pairs of evil-doers who
definitely stayed at the same hotel twice.
• Analysts have to sift through 250,010 candidates to
find the 10 real cases.
– Not gonna happen.
– But how can we improve the scheme?
Page 13
Appetizer
• Consider a file consisting of 24471 records.
File contains at least two condition attributes: A and D

  A \ D    0       1      total
  0        9272    232    9504
  1        14695   272    14967
  total    23967   504    24471
Page 14
Appetizer (con’t)
• Probability that a person has A: P(A) = 0.6
• Probability that a person has D: P(D) = 0.02
• Conditional probability that a person has D given that they have A:
P(D|A) = P(AD)/P(A) = (272/24471)/0.6 = 0.02
• P(A|D) = P(AD)/P(D) = 0.54
• What can we say about dependencies between A and D?
  A \ D    0       1      total
  0        9272    232    9504
  1        14695   272    14967
  total    23967   504    24471
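
For concreteness, a small Python sketch (variable names are mine) that recomputes these probabilities directly from the counts in the table:

```python
# Conditional probabilities from the 2x2 A/D contingency table above.
n_total = 24471
n_A = 14967          # row total for A = 1
n_D = 504            # column total for D = 1
n_AD = 272           # A = 1 and D = 1

p_A = n_A / n_total              # ~0.61
p_D = n_D / n_total              # ~0.02
p_D_given_A = n_AD / n_A         # ~0.018
p_A_given_D = n_AD / n_D         # ~0.54

# If A and D were independent we would expect P(D|A) = P(D);
# comparing the two is one way to look for a dependency.
print(p_A, p_D, p_D_given_A, p_A_given_D, p_D_given_A / p_D)
```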
Page 15
Appetizer(3)
• So far we have not asked anything that statistics would not have
asked. So is data mining just another word for statistics?
• We hope that the response will be a resounding NO
• The major difference is that statistical methods work with random
data samples, whereas the data in databases is not necessarily random
• The second difference is the size of the data set
• The third difference is that statistical samples do not contain
“dirty” data
Page 16
Architecture of a Typical Data Mining
System
(layers, top to bottom)
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
• Database or data warehouse server
• Data cleaning & data integration (filtering)
• Databases, data warehouse
Page 17
Data Mining Tasks
• Association (correlation and causality)
– Multi-dimensional vs. single-dimensional association
– age(X, “20..29”) ^ income(X, “20..29K”) -> buys(X, “PC”) [support = 2%, confidence = 60%]
– contains(T, “computer”) -> contains(T, “software”) [1%, 75%]
– What is support? The percentage of tuples in the database that have
age between 20 and 29, income between 20K and 29K, and a PC purchase.
– What is confidence? The probability that a person aged between 20
and 29 with income between 20K and 29K buys a PC.
• Clustering: grouping data that are close together into the same cluster.
– What does “close together” mean?
Page 18
Distances between data
• Distance between data items is a measure of their dissimilarity;
a distance function satisfies
d(i,j) >= 0;  d(i,j) = d(j,i);  d(i,j) <= d(i,k) + d(k,j)
• Euclidean distance between <x1, x2, …, xk> and <y1, y2, …, yk>:
d = sqrt( (x1-y1)^2 + (x2-y2)^2 + … + (xk-yk)^2 )
• Standardize variables by finding the standard deviation and
dividing each xi by the standard deviation of X
• Covariance(X,Y) = (1/k) * Sum_i (xi - mean(X)) (yi - mean(Y))
• Boolean variables and their distances
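
A minimal Python sketch of these quantities (plain Python, all names are mine; a library such as NumPy would normally be used instead):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def standardize(values):
    """Divide each value by the standard deviation of the variable."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [v / std for v in values]

def covariance(x, y):
    """Cov(X, Y) = (1/k) * sum_i (x_i - mean(X)) * (y_i - mean(Y))."""
    k = len(x)
    mx, my = sum(x) / k, sum(y) / k
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / k

print(euclidean([1, 2, 3], [4, 6, 3]))   # 5.0
```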
Page 19
Data Mining Tasks
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of
the data
– It can be considered as noise or exception but is quite useful in fraud
detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Sequential pattern mining, periodicity analysis
– Similarity-based analysis
• Other pattern-directed or statistical analyses
Page 20
Are All the “Discovered” Patterns
Interesting?
• A data mining system/query may generate thousands of patterns, but
not all of them are interesting.
– Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures: A pattern is interesting if it is easily
understood by humans, valid on new or test data with some degree
of certainty, potentially useful, novel, or validates some hypothesis
that a user seeks to confirm
• Objective vs. subjective interestingness measures:
– Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
– Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
Page 21
Are All the “Discovered” Patterns
Interesting? - Example
              coffee=0   coffee=1   total
  tea=0       5          70         75
  tea=1       5          20         25
  total       10         90         100

The conditional probability that a customer who buys coffee also buys tea
is 20/90 = 2/9.
The conditional probability that a customer who buys tea also buys coffee
is 20/25 = 0.8.
However, the probability that a customer buys coffee is 90/100 = 0.9.
So, is it a significant inference that if a customer buys tea she also buys
coffee?
Are buying tea and buying coffee independent activities?
Page 22
How to measure Interestingness
• RI = |X,Y| − |X||Y| / N   (rule interest)
• Support and confidence: support(X->Y) = |XY| / N; confidence(X->Y) = |XY| / |X|
• Chi-square: (|XY| − E[|XY|])^2 / E[|XY|]
• J-measure: J(X->Y) = P(Y) ( P(X|Y) log(P(X|Y)/P(X)) + (1 − P(X|Y)) log((1 − P(X|Y))/(1 − P(X))) )
• Sufficiency(X->Y) = P(X|Y)/P(X|!Y); Necessity(X->Y) = P(!X|Y)/P(!X|!Y).
Interestingness of Y->X:
NC++ = 1 − N(X->Y)*P(Y) if N(X->Y) is less than 1, or 0 otherwise
Page 23
Can We Find All and Only Interesting
Patterns?
• Find all the interesting patterns: Completeness
– Can a data mining system find all the interesting patterns?
– Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization
– Can a data mining system find only the interesting patterns?
– Approaches
• First generate all the patterns and then filter out the uninteresting
ones.
• Generate only the interesting patterns—mining query optimization
Page 24
Clustering
• Partition data set into clusters, and one can store cluster
representation only
• Can be very effective if data is clustered but not if data
is “smeared”
• Can have hierarchical clustering and be stored in multidimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms.
Page 25
Example: Clusters
[Figure: a 2-D scatter plot in which the points fall into several dense
clusters; a few isolated points are outliers.]
Page 26
Sampling
• Allow a mining algorithm to run in complexity that is potentially
sub-linear in the size of the data
• Choose a representative subset of the data
– Simple random sampling may have very poor performance in the
presence of skew
• Develop adaptive sampling methods
– Stratified sampling:
• Approximate the percentage of each class (or subpopulation of
interest) in the overall database
• Used in conjunction with skewed data
• Sampling may not reduce database I/Os (page at a time).
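
A minimal sketch of the stratified-sampling idea above (all names are mine; it assumes each record carries a class label and samples every stratum at the same rate so skewed classes stay represented):

```python
import random
from collections import defaultdict

def stratified_sample(records, class_of, rate, seed=0):
    """Draw roughly `rate` of each stratum, keeping at least one record per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[class_of(r)].append(r)
    sample = []
    for cls, rows in strata.items():
        k = max(1, round(rate * len(rows)))
        sample.extend(rng.sample(rows, k))
    return sample

data = [{"id": i, "label": "rare" if i % 50 == 0 else "common"} for i in range(1000)]
print(len(stratified_sample(data, lambda r: r["label"], rate=0.1)))   # ~100 records
```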
Page 27
Sampling
[Figure: a random sample drawn from the raw data.]
Page 28
Sampling
[Figure: the raw data reduced to a cluster/stratified sample.]
Page 29
Discretization
• Three types of attributes:
– Nominal — values from an unordered set
– Ordinal — values from an ordered set
– Continuous — real numbers
• Discretization:
divide the range of a continuous attribute into intervals
– Some classification algorithms only accept categorical
attributes.
– Reduce data size by discretization
– Prepare for further analysis
Page 30
Discretization
• Discretization
– reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels
can then be used to replace actual data values.
Page 31
Discretization
Typical discretization loop:
1. Sort the attribute values.
2. Select a cut point.
3. Evaluate the measure; if it is not satisfied, select another cut point (step 2).
4. Otherwise split (or merge) at that point.
5. If the stopping criterion is not met, continue from step 2; otherwise done.
Page 32
Discretization
• Dynamic vs Static
• Local vs Global
• Top-Down vs Bottom-Up
• Direct vs Incremental
Page 33
Discretization – Quality Evaluation
• Total number of Intervals
• The Number of Inconsistencies
• Predictive Accuracy
• Complexity
Page 34
Discretization - Binning
• Equal-width: the range between the min and max values is split
into intervals of equal width
• Equal-frequency: each bin contains approximately the same number
of data points
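
A short sketch of the two binning schemes (plain Python, names are mine):

```python
def equal_width_bins(values, k):
    """Split [min, max] into k intervals of equal width; return a bin index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Each bin gets roughly the same number of values; return a bin index per value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))      # width-based bins
print(equal_frequency_bins(prices, 3))  # ~3 values per bin
```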
Page 35
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy after
partitioning is
E(S,T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)
• The boundary that minimizes the entropy function over all
possible boundaries is selected as a binary discretization.
• The process is recursively applied to partitions obtained
until some stopping criterion is met, e.g.,
Ent(S) − E(T,S) > δ
• Experiments show that it may reduce data size and
improve classification accuracy
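
A rough sketch of one level of entropy-based splitting, assuming the definitions above (labels are the class values; all names are mine):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_boundary(values, labels):
    """Try each midpoint boundary T and keep the one minimizing E(S, T)."""
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        if not left or not right:
            continue
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_boundary(values, labels))   # boundary 6.5 with entropy 0.0
```

The process would then be applied recursively to each partition until the stopping criterion above is met.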
Page 36
Data Mining Primitives, Languages, and
System Architectures
• Data mining primitives: What defines a data
mining task?
• A data mining query language
• Design graphical user interfaces based on a
data mining query language
• Architecture of data mining systems
Page 37
Why Data Mining Primitives and
Languages?
• Data mining should be an interactive process
– User directs what to be mined
• Users must be provided with a set of primitives to be
used to communicate with the data mining system
• Incorporating these primitives in a data mining query
language
– More flexible user interaction
– Foundation for design of graphical user interface
– Standardization of data mining industry and practice
Page 38
What Defines a Data Mining Task ?
• Task-relevant data
• Type of knowledge to be mined
• Background knowledge
• Pattern interestingness measurements
• Visualization of discovered patterns
Page 39
Task-Relevant Data (Minable View)
• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria
Page 40
Types of knowledge to be mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks
Page 41
A Data Mining Query Language
(DMQL)
• Motivation
– A DMQL can provide the ability to support ad-hoc and interactive
data mining
– By providing a standardized language like SQL
• Hope to achieve an effect similar to the one SQL has had on relational
databases
• Foundation for system development and evolution
• Facilitate information exchange, technology transfer, commercialization
and wide acceptance
• Design
– DMQL is designed with the primitives described earlier
Page 42
Syntax for DMQL
• Syntax for specification of
– task-relevant data
– the kind of knowledge to be mined
– concept hierarchy specification
– interestingness measure
– pattern presentation and visualization
• Putting it all together — a DMQL query
Page 43
Syntax for task-relevant data
specification
• use database database_name, or use data
warehouse data_warehouse_name
• from relation(s)/cube(s) [where condition]
• in relevance to att_or_dim_list
• order by order_list
• group by grouping_list
• having condition
Page 44
Specification of task-relevant data
Page 45
Syntax for specifying the kind of
knowledge to be mined
• Characterization
Mine_Knowledge_Specification ::=
mine characteristics [as pattern_name]
analyze measure(s)
• Discrimination
Mine_Knowledge_Specification ::=
mine comparison [as pattern_name]
for target_class where target_condition
{versus contrast_class_i where contrast_condition_i}
analyze measure(s)
• Association
Mine_Knowledge_Specification ::=
mine associations [as pattern_name]
Page 46
Syntax for specifying the kind of
knowledge to be mined (cont.)
• Classification
Mine_Knowledge_Specification ::=
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
• Prediction
Mine_Knowledge_Specification ::=
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
Page 47
Syntax for concept hierarchy
specification
• To specify what concept hierarchies to use
use hierarchy <hierarchy> for <attribute_or_dimension>
• We use different syntax to define different type of hierarchies
– schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
– set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Page 48
Syntax for concept hierarchy
specification (Cont.)
– operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age,
5) < all(age)
– rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost)< $50
level_1: medium-profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) <= $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
Page 49
Syntax for interestingness measure
specification
• Interestingness measures and thresholds can be
specified by the user with the statement:
with <interest_measure_name> threshold = threshold_value
• Example:
with support threshold = 0.05
with confidence threshold = 0.7
Page 50
Syntax for pattern presentation and
visualization specification
• We have syntax which allows users to specify the
display of discovered patterns in one or more forms
display as <result_form>
• To facilitate interactive viewing at different concept levels,
the following syntax is defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension
| drill down on attribute_or_dimension
| add attribute_or_dimension
| drop attribute_or_dimension
Page 51
Putting it all together: the full
specification of a DMQL query
use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchases P, items_sold S, works_at W,
branch B
where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID
and P.cust_ID = C.cust_ID and P.method_paid = “AmEx”
and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID and
B.address = “Canada” and I.price >= 100
with noise threshold = 0.05
display as table
Page 52
DMQL and SQL
• DMQL: Describe general characteristics of graduate
students in the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in “graduate”
• Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date, residence,
phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
Page 53
Decision Trees
Example:
• Conducted survey to see what customers were
interested in new model car
• Want to select customers for advertising campaign
sale (training set):

  custId  car     age  city  newCar
  c1      taurus  27   sf    yes
  c2      van     35   la    yes
  c3      van     40   sf    yes
  c4      taurus  22   sf    yes
  c5      merc    50   la    no
  c6      taurus  25   la    no
Page 54
One Possibility
(The sale training table from the previous slide is repeated on this slide.)

  age < 30?
  ├─ Y: city = sf?
  │     ├─ Y: likely
  │     └─ N: unlikely
  └─ N: car = van?
        ├─ Y: likely
        └─ N: unlikely
Page 55
Another Possibility
(The sale training table from the previous slide is repeated on this slide.)

  car = taurus?
  ├─ Y: city = sf?
  │     ├─ Y: likely
  │     └─ N: unlikely
  └─ N: age < 45?
        ├─ Y: likely
        └─ N: unlikely
Page 56
Issues
• A decision tree cannot be “too deep”
• otherwise there would not be statistically significant
amounts of data for the lower decisions
• Need to select tree that most
reliably predicts outcomes
Page 57
Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}
  Outlook
  ├─ sunny: Humidity
  │     ├─ high: no
  │     └─ normal: yes
  ├─ overcast: yes
  └─ rain: Wind
        ├─ strong: no
        └─ weak: yes
Page 58
Entropy and Information Gain
• S contains si tuples of class Ci for i = {1, …, m}
• Information measures info required to classify
any arbitrary tuple
I(s1, s2, ..., sm) = − Σ_{i=1}^{m} (si / s) * log2(si / s)
• Entropy of attribute A with values {a1,a2,…,av}
s1 j  ...  smj
I ( s1 j ,..., smj )
s
j 1
v
E(A)  
• Information gained by branching on attribute A
Gain(A) = I(s1, s2, ..., sm) − E(A)
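
A small Python sketch of I(...), E(A) and Gain(A) as defined above, working on a list of records; all function and variable names are mine:

```python
import math
from collections import Counter

def info(class_counts):
    """I(s1, ..., sm) = -sum_i (si/s) * log2(si/s)."""
    s = sum(class_counts)
    return -sum(si / s * math.log2(si / s) for si in class_counts if si > 0)

def info_gain(rows, attribute, target):
    """Gain(A) = I(s1, ..., sm) - E(A) for one categorical attribute A."""
    i_total = info(list(Counter(r[target] for r in rows).values()))
    e_attr = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        counts = Counter(r[target] for r in subset)
        e_attr += len(subset) / len(rows) * info(list(counts.values()))
    return i_total - e_attr

rows = [{"outlook": o, "play": c} for o, c in
        [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
         ("rain", "yes"), ("rain", "no"), ("overcast", "yes")]]
print(round(info_gain(rows, "outlook", "play"), 3))   # 0.667
```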
Page 59
Example: Analytical Characterization
• Task
– Mine general characteristics describing graduate students
using analytical characterization
• Given
– attributes name, gender, major, birth_place, birth_date, phone#,
and gpa
– Gen(ai) = concept hierarchies on ai
– Ui = attribute analytical thresholds for ai
– Ti = attribute generalization thresholds for ai
– R = attribute relevance threshold
Page 60
Example: Analytical
Characterization (cont’d)
• 1. Data collection
– target class: graduate student
– contrasting class: undergraduate student
• 2. Analytical generalization using Ui
– attribute removal
• remove name and phone#
– attribute generalization
• generalize major, birth_place, birth_date and gpa
• accumulate counts
– candidate relation: gender, major, birth_country, age_range
and gpa
Page 61
Example: Analytical characterization (3)
• 3. Relevance analysis
– Calculate expected info required to classify an arbitrary tuple
I(s 1, s 2 )  I( 120,130 )  
120
120 130
130
log 2

log 2
 0.9988
250
250 250
250
– Calculate the entropy of each attribute, e.g. major
(s1j = number of graduate students with that major,
 s2j = number of undergraduate students with that major):

  For major = “Science”:      s11 = 84, s21 = 42, I(s11, s21) = 0.9183
  For major = “Engineering”:  s12 = 36, s22 = 46, I(s12, s22) = 0.9892
  For major = “Business”:     s13 = 0,  s23 = 42, I(s13, s23) = 0
Page 62
Example: Analytical Characterization (4)
• Calculate the expected info required to classify a given sample
if S is partitioned according to the attribute:

  E(major) = (126/250) I(s11, s21) + (82/250) I(s12, s22) + (42/250) I(s13, s23) = 0.7873

• Calculate the information gain for each attribute:

  Gain(major) = I(s1, s2) − E(major) = 0.2115

– Information gain for all attributes:

  Gain(gender)        = 0.0003
  Gain(birth_country) = 0.0407
  Gain(major)         = 0.2115
  Gain(gpa)           = 0.4490
  Gain(age_range)     = 0.5971
Page 63
Example: Analytical characterization (5)
• 4. Initial working relation (W0) derivation
– R = 0.1
– remove irrelevant/weakly relevant attributes from candidate relation =>
drop gender, birth_country
– remove contrasting class candidate relation
  major        age_range  gpa        count
  Science      20-25      Very_good  16
  Science      25-30      Excellent  47
  Science      20-25      Excellent  21
  Engineering  20-25      Excellent  18
  Engineering  25-30      Excellent  18

Initial target class working relation W0: Graduate students
• 5. Perform attribute-oriented induction on W0 using Ti
Page 64
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal
structures among sets of items or objects in transaction databases,
relational databases, and other information repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, clustering, classification, etc.
• Examples:
– Rule form: “Body → Head [support, confidence]”.
– buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
– major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
Page 65
Association Rule Mining
sales records (market-basket data):

  tran1  cust33  p2, p5, p8
  tran2  cust45  p5, p8, p11
  tran3  cust12  p1, p9
  tran4  cust40  p5, p8, p11
  tran5  cust12  p2, p9
  tran6  cust12  p9
• Trend: Products p5, p8 are often bought together
• Trend: Customer 12 likes product p9
Page 66
Association Rule
• Rule: {p1, p3, p8}
• Support: number of baskets where
these products appear
• High-support set: support ≥ threshold s
• Problem: find all high support sets
Page 67
Association Rule: Basic Concepts
• Given: (1) database of transactions, (2) each transaction is
a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of
items with that of another set of items
– E.g., 98% of people who purchase tires and auto accessories also
get automotive services done
• Applications
– * ⇒ Maintenance Agreement (What should the store do to boost
Maintenance Agreement sales?)
– Home Electronics ⇒ * (What other products should the store
stock up on?)
– Attached mailing in direct marketing
– Detecting “ping-pong”ing of patients, faulty “collisions”
Page 68
Rule Measures: Support and
Confidence
[Figure: Venn diagram of customers who buy beer, customers who buy
diapers, and customers who buy both.]

• Find all the rules X & Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {X ∪ Y ∪ Z}
– confidence, c: conditional probability that a transaction having
{X ∪ Y} also contains Z

  Transaction ID  Items Bought
  2000            A,B,C
  1000            A,C
  4000            A,D
  5000            B,E,F

Let minimum support be 50% and minimum confidence 50%; then we have:
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
Page 69
Mining Association Rules—An Example
  Transaction ID  Items Bought
  2000            A,B,C
  1000            A,C
  4000            A,D
  5000            B,E,F

Min. support 50%, min. confidence 50%

  Frequent Itemset  Support
  {A}               75%
  {B}               50%
  {C}               50%
  {A,C}             50%

For rule A ⇒ C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle:
Any subset of a frequent itemset must be frequent
Page 70
Mining Frequent Itemsets: the
Key Step
• Find the frequent itemsets: the sets of items that have
minimum support
– A subset of a frequent itemset must also be a frequent itemset
• i.e., if {AB} is a frequent itemset, both {A} and {B} should be a
frequent itemset
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
• Use the frequent itemsets to generate association
rules.
Page 71
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a
subset of a frequent k-itemset
• Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
are contained in t
that
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
Page 72
The Apriori Algorithm — Example
Database D:

  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D for counts of the candidate 1-itemsets:

  C1: {1}: 2,  {2}: 3,  {3}: 3,  {4}: 1,  {5}: 3
  L1: {1}: 2,  {2}: 3,  {3}: 3,  {5}: 3

Generate C2 from L1 and scan D again:

  C2: {1 2}: 1,  {1 3}: 2,  {1 5}: 1,  {2 3}: 2,  {2 5}: 3,  {3 5}: 2
  L2: {1 3}: 2,  {2 3}: 2,  {2 5}: 3,  {3 5}: 2

Generate C3 from L2 and scan D once more:

  C3: {2 3 5}
  L3: {2 3 5}: 2
Page 73
How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Page 74
How to Count Supports of
Candidates?
• Why is counting the supports of candidates a problem?
– The total number of candidates can be very huge
– One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction
Page 75
Example of Generating Candidates
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– acde is removed because ade is not in L3
• C4={abcd}
Page 76
Criticism to Support and Confidence
• Example 1: (Aggarwal & Yu, PODS98)
– Among 5000 students
• 3000 play basketball
• 3750 eat cereal
• 2000 both play basketball and eat cereal
– play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the
overall percentage of students eating cereal is 75%, which is higher than
66.7%.
– play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate,
although it has lower support and confidence

              basketball   not basketball   sum(row)
  cereal      2000         1750             3750
  not cereal  1000         250              1250
  sum(col.)   3000         2000             5000
Page 77
Criticism to Support and Confidence
(Cont.)
• Example 2:

  X  1 1 1 1 0 0 0 0
  Y  1 1 0 0 0 0 0 0
  Z  0 1 1 1 1 1 1 1

– X and Y: positively correlated
– X and Z: negatively related
– support and confidence of X=>Z dominate

• We need a measure of dependent or correlated events:

  corr(A,B) = P(A ∧ B) / (P(A) P(B))

  Rule   Support  Confidence
  X=>Y   25%      50%
  X=>Z   37.50%   75%

• P(B|A)/P(B) is also called the lift of rule A => B
Page 78
Other Interestingness Measures: Interest
• Interest (correlation, lift):

  P(A ∧ B) / (P(A) P(B))

– takes both P(A) and P(B) into consideration
– P(A ∧ B) = P(A) P(B) if A and B are independent events
– A and B are negatively correlated if the value is less than 1;
otherwise A and B are positively correlated

  X  1 1 1 1 0 0 0 0
  Y  1 1 0 0 0 0 0 0
  Z  0 1 1 1 1 1 1 1

  Itemset  Support  Interest
  X,Y      25%      2
  X,Z      37.50%   0.9
  Y,Z      12.50%   0.57
Page 79
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts unknown or
missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Page 80
Classification Process: Model
Construction
Training Data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Classification algorithms produce a classifier (model), e.g.:

  IF rank = ‘professor’ OR years > 6
  THEN tenured = ‘yes’
Page 81
Classification Process: Use the
Model in Prediction
Testing Data (used with the classifier):

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Page 82
Supervised vs. Unsupervised
Learning
• Supervised learning (classification)
– Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of training data are unknown
– Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
Page 83
Training Dataset
This follows an example from Quinlan’s ID3.

  age    income  student  credit_rating
  <=30   high    no       fair
  <=30   high    no       excellent
  31…40  high    no       fair
  >40    medium  no       fair
  >40    low     yes      fair
  >40    low     yes      excellent
  31…40  low     yes      excellent
  <=30   medium  no       fair
  <=30   low     yes      fair
  >40    medium  yes      fair
  <=30   medium  yes      excellent
  31…40  medium  no       excellent
  31…40  high    yes      fair
  >40    medium  no       excellent
Page 84
Output: A Decision Tree for “buys_computer”
  age?
  ├─ <=30: student?
  │     ├─ no: no
  │     └─ yes: yes
  ├─ 31…40: yes
  └─ >40: credit rating?
        ├─ excellent: no
        └─ fair: yes
Page 85
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are discretized in
advance)
– Examples are partitioned recursively based on selected attributes
– Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning – majority voting
is employed for classifying the leaf
– There are no samples left
Page 86
Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
– Let the set of examples S contain p elements of class P and n
elements of class N
– The amount of information, needed to decide if an arbitrary example
in S belongs to P or N is defined as
p
p
n
n
I ( p, n)  
log 2

log 2
pn
pn pn
pn
Page 87
Information Gain in Decision Tree
Induction
• Assume that using attribute A a set S will be partitioned
into sets {S1, S2 , …, Sv}
– If Si contains pi examples of P and ni examples of N, the
entropy, or the expected information needed to classify objects
in all subtrees Si is
E(A) = Σ_{i=1}^{v} ((pi + ni) / (p + n)) * I(pi, ni)

• The encoding information that would be gained by branching on A:

Gain(A) = I(p, n) − E(A)
Page 88
Attribute Selection by Information Gain
Computation
• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

  age    pi  ni  I(pi, ni)
  <=30   2   3   0.971
  30…40  4   0   0
  >40    3   2   0.971

  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.69

Hence

  Gain(age) = I(p, n) − E(age) = 0.25

Similarly

  Gain(income)        = 0.029
  Gain(student)       = 0.151
  Gain(credit_rating) = 0.048
Page 89
Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index
gini(T) is defined as

  gini(T) = 1 − Σ_{j=1}^{n} pj^2

where pj is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes
N1 and N2 respectively, the gini index of the split data is
defined as

  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

• The attribute that provides the smallest gini_split(T) is chosen to
split the node (need to enumerate all possible splitting points
for each attribute).
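
A small sketch of gini(T) and gini_split(T) as defined above (all names are mine):

```python
def gini(labels):
    """gini(T) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    """Weighted gini of a binary split T -> (T1, T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["yes", "yes", "yes", "no", "no"]
print(gini(labels))                                        # 0.48
print(gini_split(["yes", "yes", "yes"], ["no", "no"]))     # 0.0 -- a pure split
```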
Page 90
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example:
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40”
THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
Page 91
Avoid Overfitting in Classification
• The generated tree may overfit the training data
– Too many branches, some may reflect anomalies due to noise
or outliers
– The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early—do not split a node if
this would result in the goodness measure falling below a
threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
• Use a set of data different from the training data to decide
which is the “best pruned tree”
Page 92
Approaches to Determine the Final Tree
Size
• Separate training (2/3) and testing (1/3) sets
• Use cross validation, e.g., 10-fold cross
validation
• Use all the data for training
– but apply a statistical test (e.g., chi-square) to estimate
whether expanding or pruning a node may improve
the entire distribution
• Use minimum description length (MDL) principle:
– halting growth of the tree when the encoding is minimized
Page 93
Scalable Decision Tree Induction
Methods in Data Mining Studies
• SLIQ (EDBT’96 — Mehta et al.)
– builds an index for each attribute and only class list and the current
attribute list reside in memory
• SPRINT (VLDB’96 — J. Shafer et al.)
– constructs an attribute list data structure
• PUBLIC (VLDB’98 — Rastogi & Shim)
– integrates tree splitting and tree pruning: stop growing the tree
earlier
• RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
– separates the scalability aspects from the criteria that determine the
quality of the tree
– builds an AVC-list (attribute, value, class label)
Page 94
Bayesian Theorem
• Given training data D, the posterior probability of a
hypothesis h, P(h|D), follows Bayes’ theorem:

  P(h|D) = P(D|h) P(h) / P(D)

• MAP (maximum a posteriori) hypothesis:

  h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
• Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
Page 95
Bayesian classification
• The classification problem may be formalized
using a-posteriori probabilities:
• P(C|X) = prob. that the sample tuple
X=<x1,…,xk> is of class C.
• E.g. P(class=N | outlook=sunny,windy=true,…)
• Idea: assign to sample X the class label C such
that P(C|X) is maximal
Page 98
Estimating a-posteriori probabilities
• Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative freq of class C samples
• C such that P(C|X) is maximum =
C such that P(X|C)·P(C) is maximum
• Problem: computing P(X|C) is unfeasible!
Page 99
Naïve Bayesian Classification
• Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
• If i-th attribute is categorical:
P(xi|C) is estimated as the relative freq of
samples having value xi as i-th attribute in class
C
• If i-th attribute is continuous:
P(xi|C) is estimated thru a Gaussian density
function
• Computationally easy in both cases
Page 100
Play-tennis example:
estimating P(xi|C)
  Outlook   Temperature  Humidity  Windy  Class
  sunny     hot          high      false  N
  sunny     hot          high      true   N
  overcast  hot          high      false  P
  rain      mild         high      false  P
  rain      cool         normal    false  P
  rain      cool         normal    true   N
  overcast  cool         normal    true   P
  sunny     mild         high      false  N
  sunny     cool         normal    false  P
  rain      mild         normal    false  P
  sunny     mild         normal    true   P
  overcast  mild         high      true   P
  overcast  hot          normal    false  P
  rain      mild         high      true   N

P(p) = 9/14
P(n) = 5/14

outlook:
  P(sunny|p) = 2/9      P(sunny|n) = 3/5
  P(overcast|p) = 4/9   P(overcast|n) = 0
  P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:
  P(hot|p) = 2/9        P(hot|n) = 2/5
  P(mild|p) = 4/9       P(mild|n) = 2/5
  P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:
  P(high|p) = 3/9       P(high|n) = 4/5
  P(normal|p) = 6/9     P(normal|n) = 1/5
windy:
  P(true|p) = 3/9       P(true|n) = 3/5
  P(false|p) = 6/9      P(false|n) = 2/5
Page 101
Play-tennis example: classifying X
• An unseen sample X = <rain, hot, high, false>
• P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
• P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286
• Sample X is classified in class n (don’t play)
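
A short sketch (all names are mine) that reproduces these two scores with a relative-frequency naive Bayes scorer over the table from the previous slide (no smoothing):

```python
data = [  # (outlook, temperature, humidity, windy, class)
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def score(x, cls):
    """P(X|class) * P(class) with relative-frequency estimates."""
    rows = [r for r in data if r[-1] == cls]
    s = len(rows) / len(data)                       # P(class)
    for i, value in enumerate(x):                   # product of P(x_i | class)
        s *= sum(1 for r in rows if r[i] == value) / len(rows)
    return s

x = ("rain", "hot", "high", "false")
print(round(score(x, "P"), 6), round(score(x, "N"), 6))   # 0.010582 vs 0.018286
```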
Page 102
Association-Based Classification
• Several methods for association-based classification
– ARCS: Quantitative association mining and clustering of
association rules (Lent et al’97)
• It beats C4.5 in (mainly) scalability and also accuracy
– Associative classification: (Liu et al’98)
• It mines high support and high confidence rules in the form of
“cond_set => y”, where y is a class label
– CAEP (Classification by aggregating emerging patterns) (Dong et
al’99)
• Emerging patterns (EPs): the itemsets whose support increases
significantly from one class to another
• Mine EPs based on minimum support and growth rate
Page 103
What Is Prediction?
• Prediction is similar to classification
– First, construct a model
– Second, use model to predict unknown value
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
– Classification refers to predict categorical class label
– Prediction models continuous-valued functions
Page 104
Regression Analysis and Log-Linear
Models in Prediction
• Linear regression: Y = α + β X
– Two parameters, α and β, specify the line and are to be estimated
by using the data at hand,
– applying the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2.
– Many nonlinear functions can be transformed into the above.
• Log-linear models:
– The multi-way table of joint probabilities is approximated by a
product of lower-order tables.
– Probability: p(a, b, c, d) = αab βac χad δbcd
Page 105
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Page 107
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
• Land use: Identification of areas of similar land use in an
earth observation database
• Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
• City-planning: Identifying groups of houses according to
their house type, value, and geographical location
• Earth-quake studies: Observed earth quake epicenters
should be clustered along continent faults
Page 108
What Is Good Clustering?
• A good clustering method will produce high quality
clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Page 109
Types of Data in Cluster Analysis
• Data matrix (n objects × p variables):

  [ x11  ...  x1f  ...  x1p ]
  [ ...  ...  ...  ...  ... ]
  [ xi1  ...  xif  ...  xip ]
  [ ...  ...  ...  ...  ... ]
  [ xn1  ...  xnf  ...  xnp ]

• Dissimilarity matrix (n × n, lower triangular):

  [ 0                          ]
  [ d(2,1)  0                  ]
  [ d(3,1)  d(3,2)  0          ]
  [ :       :       :          ]
  [ d(n,1)  d(n,2)  ...   0    ]
Page 110
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, which is typically metric:
d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal
and ratio variables.
• Weights should be associated with different variables
based on applications and data semantics.
• It is hard to define “similar enough” or “good enough”
– the answer is typically highly subjective.
Page 111
Similarity and Dissimilarity Between
Objects
• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include the Minkowski distance:

  d(i,j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q )^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

  d(i,j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
Page 112
Similarity and Dissimilarity Between
Objects
• If q = 2, d is the Euclidean distance:

  d(i,j) = sqrt( |xi1 − xj1|^2 + |xi2 − xj2|^2 + … + |xip − xjp|^2 )

– Properties
• d(i,j) >= 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) <= d(i,k) + d(k,j)
• One can also use weighted distance, parametric Pearson
product moment correlation, or other dissimilarity measures.
Page 113
Binary Variables
• A contingency table for binary data (object i vs. object j):

               object j
                1      0      sum
  object i  1   a      b      a+b
            0   c      d      c+d
          sum   a+c    b+d    p

• Simple matching coefficient (invariant, if the binary
variable is symmetric):

  d(i,j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant if the binary variable is
asymmetric):

  d(i,j) = (b + c) / (a + b + c)
Page 114
Dissimilarity between Binary
Variables
• Example

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack  M       Y      N      P       N       N       N
  Mary  F       Y      N      P       N       P       N
  Jim   M       Y      P      N       N       N       N

– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
Page 115
Major Clustering Methods
• Partitioning algorithms: Construct various partitions and
then evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition
of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters and
the idea is to find the best fit of the data to the given model
Page 116
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
• Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen’67): Each cluster is represented by the
center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
Page 117
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4
steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the
current partition. The centroid is the center (mean point) of
the cluster.
– Assign each object to the cluster with the nearest seed point.
– Go back to step 2; stop when there are no more new assignments.
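
A minimal sketch of these four steps for 2-D points (plain Python, all names are mine):

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # initial seed points
    for _ in range(iterations):
        # assignment step: each point goes to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # update step: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:                # no more new assignments
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```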
Page 118
The K-Means Clustering Method
• Example
[Figure: a 2-D data set (both axes 0 to 10) shown over successive
k-means iterations; points are reassigned to the nearest centroid and
the centroids are recomputed until the assignment no longer changes.]
Page 119
Comments on the K-Means Method
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical
data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Page 120