Transcript
by Paweł Olszewski and Krzysztof Bryś
Data Mining
- projects in pairs
- report (user documentation, technical documentation, experiments; more information about it next time)
- mark

lab – project (more information next week)
lecture – test (a, b, c, d... multiple choice, ~20 questions)

pass
If I pass the test, I will get the mark from the labs.
STATGRAPHICS
1. Introduction
Data Mining model
Data Mining techniques
Data warehousing
Potential applications
2. Clustering
3. Association rule discovery
4. Classification algorithms
5. Estimation and regression techniques
6. Deviation detection techniques
7. Visualization of Data Mining results
Books:
1. Lecture notes :)
2. J. Han, M. Kamber – "Data Mining: Concepts and Techniques", Morgan Kaufmann 2000
3. N. Indurkhya, S. M. Weiss – "Predictive Data Mining: A Practical Guide", Morgan Kaufmann 1997
4. I. Witten, E. Frank – "Data Mining", Morgan Kaufmann 2000
5. M. Berry, G. Linoff – "Mastering Data Mining"
6. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy – "Advances in Knowledge Discovery and Data Mining", AAAI Press 1996
7. P. Cichosz – "Systemy uczące się" (Learning Systems), WNT 2000
Data mining = knowledge discovery
Data mining is the process of extracting useful (useful for the user) patterns and regularities from large bodies of data (too large to be analysed by hand).
Remark:
Since the data sets we work with are very large, simple methods should be used.
Potential applications of data mining methods.
 data compression
 marketing
 internet (e-business)
 banking (financial market)
 medicine
1) Regularities found in a data set are used for compact encoding and approximation.
2) Sales data: the aim is discovering patterns which might explain purchasing behaviour.
For example:
Set of items:
I = {Milk, Coke, Orange juice}
Set of transactions:
{M, O} | {C, O} | {M} | {C, O}
 1000    1000    1000   1000

Co-occurrence counts (the matrix is symmetric, so only the lower triangle is used):

      M      C      O
M     x
C     0      x
O     1000   2000   x

C –(M)– O: Coke and Orange juice are strongly associated; Milk could be placed between them (see point 3 below).
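A small illustrative Python sketch (not part of the lecture) showing how such a co-occurrence table can be counted from the transactions above:

from itertools import combinations
from collections import Counter

# the 4000 transactions from the example above
transactions = (
    [{"Milk", "Orange juice"}] * 1000
    + [{"Coke", "Orange juice"}] * 2000
    + [{"Milk"}] * 1000
)

# count how often each unordered pair of items occurs together
pair_counts = Counter()
for t in transactions:
    for a, b in combinations(sorted(t), 2):
        pair_counts[(a, b)] += 1

print(pair_counts)
# Counter({('Coke', 'Orange juice'): 2000, ('Milk', 'Orange juice'): 1000})
# the Milk-Coke pair never occurs together, so it is absent (count 0)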
3) Instead of Milk, Orange juice and Coke we may have web pages, and find out whether, for example, a person who likes cars also likes pets.
66% of customers buy Coke and Orange juice together. Putting these products near each other may increase their sales; and putting something like Milk between them may increase the sales of the added product, because it is in the middle of the items needed by the customer.
4) Identify stock trading rules from historical market data.
[chart: price plotted over time] If such a situation happens many times, then we may expect that the price will increase here (second picture).
5) Characterise patient behaviour, identify successful therapies.
Classes of Data Mining methods
The classification of data mining methods:
1) Classification
2) Clustering (Segmentation)
3) Associations
4) Sequential (temporal) patterns
5) Dependency modeling (regression)
ad. 1)
The user defines some classes in the training data set. The method must find a set of questions that have to be answered; the data mining system constructs descriptions of the classes, i.e. methods that, after answering several questions (Q1, Q2, Q3, ...), allow us to choose the proper class (C1, C2, ..., C6).
The system should find the best rule for each class.
a rule:
LHS => RHS
(left hand side) => (right hand side)
The categories of rules:
- exact rules (no exceptions) = must be satisfied by each member of a class
- strong rules (allow some exceptions) = must be satisfied by "almost" all members of a class (a limited number of exceptions)
- probabilistic rules (exceptions limited by a probability) = relate P(RHS|LHS) to P(RHS); when they are almost equal, P(RHS|LHS) ≈ P(RHS) [the rule operates on probabilities]
ad. 2)
Clustering is a process of creating a partition of the whole data set such that all the members of each class of the partition are similar according to some distance measure. If we change the measure, we also change the partition; the classes are characterised by the measure.
[figure: centres of the classes in the data set] A new record is put into the class with the nearest centre; we measure the distance from the centres of the classes.
ad. 3)
We find rules IF LHS THEN RHS by using some
associations function, which returns patterns that exist among
collection of items.
We use for the data set which consists of records each of
them contains some number of items. (given a set of items and
each of the records in the data set is a subset of this set.)
example:
{A,B,C,D,E} – the set of items
Data Set
{A, C}
{A, C, D}
{A, B, C}
{B, C, D}
{A, B, D}
{A, C, E}
80% of records containing A also contain C,
so we find a rule:
P(if A then C) = 0.8
P(if C then A) = 0.8
but it’s a probabilistic rule.
ad. 4)
sequential patterns (temporal)
Given: a set of sequences of records over a period of time.
example:

DAY   SMITH   GATES         I. JONES
1.    Milk    Coke, Milk    Milk
2.    Beer    Beer, Juice   Milk
3.    Milk    Coke          Milk
4.    Beer    Beer          Milk
5.    Milk    Milk          Milk, Juice

A sequence of records for Mr. Smith:
Milk ──► Beer ──► Milk ──► Beer ──► Milk
Mr. Smith buys Milk and then Beer, so we put other drinks between them; maybe he will also want to buy them :)
ad. 5)
The goal is to find a model which describes important dependencies between the variables. Using the association rule method we can find, for example, that 80% of all cases contain A and C.
A <=> C ?
A => C   (A – the independent variable, C – the dependent variable)
The most important DM methods
- statistical methods (regression) (1) (4) (5) (2)
- probabilistic methods (bayesian methods, apriori
algorithm) (2)
- neural networks (2)
- genetic algorithms (2)
- decision trees (2)
- nearest neighbour method (3)
- data visualisation (2) (4)
- rule induction (1)
Knowledge Discovery Process

  DATA SET ─────► ANSWER
      ▲
      │
  QUESTION

  DATA SET ─────► DM ─────► ANSWER
      ▲
      │
  QUESTION
1) Creating a target data set (for example selection from the
larger data set)
2) Clean and preprocess the data (eliminate errors and noise, fill in missing data)
   e.g. dataset D1 uses the coding 0 – Yes, 1 – No, while dataset D2 uses 0 – No, 1 – Yes;
   we have to correct the answers so that Yes = 0 and No = 1 in both cases.
3) Data reduction (delete attributes which are not useful)
4) Pattern extraction and discovery = data mining
a) choose data mining goal
b) choose data mining algorithm(s)
c) search for patterns of interest
5) Visualization of the data
6) Interpretation of discovered patterns
7) Evaluation of discovered knowledge (how we may use it?)
Decision trees
X – a training set of cases (examples)
Each case is described by n attributes and the class to which it belongs:
  values of the attributes ──► decision (the number of the class)
X    Outlook    Temp.(F)  Humidity  Windy?  Class
1    sunny      75        70        T       P
2    sunny      80        90        T       N
3    sunny      85        85        F       N
4    sunny      72        95        F       N
5    sunny      69        70        F       P
6    overcast   72        90        T       P
7    overcast   83        78        F       P
8    overcast   64        65        T       P
9    overcast   81        75        F       P
10   rain       71        80        T       N
11   rain       65        70        T       N
12   rain       75        80        F       P
13   rain       68        80        F       P
14   rain       70        96        F       P
An attribute is a function
a: X ─► A
A – the set of attribute values
a – outlook
A = {sunny, overcast, rain}
X is partitioned into some classes C1, ..., Cl (l – the number of categories).
By a test we mean a function
  t: X ──► Rt   (Rt – the set of possible outcomes of the test)
A decision tree is a classifier which consists of:
- leaves, which correspond to classes
- decision nodes, which correspond to tests (each branch and subtree starting in a decision node corresponds to one possible outcome of the test)
Remark: we will consider one-attribute tests.
Examples of tests:
- membership test (x belongs to some set)
- equality test
- inequality test
e.g. if outlook = overcast then play
next week – algorithms, how to construct decision trees and
how to choose tests (training set)
X – the set of examples
a1: X ──► A1
  ...                      (the attributes)
an: X ──► An
c: X ──► C – the classifying function (C – the set of classes)
X = C1 ∪ ... ∪ Ck   (the classes)
a test t: X ──► Rt   (Rt – the set of test values (outcomes); we deal with one-attribute tests)
Classification of tests:
1º Identity test
   t(x) = a(x),  x ∈ X
2º Equality test
   t(x) = 1 if a(x) = v,  2 if a(x) ≠ v
3º Membership test
   t(x) = 1 if a(x) ∈ V,  2 if a(x) ∉ V
4º Partition test
   t(x) = i if a(x) ∈ Vi,  i = 1, ..., n
5º Inequality test (V = (−∞, v))
   t(x) = 1 if a(x) ≤ v,  2 if a(x) > v
Algorithm of constructing a decision tree
1º If X contains one or more examples, all belonging to the same class Cj, then the decision tree for the set X is a leaf identifying the class Cj.
2º If X contains no examples, then the decision tree in this node is a leaf, but the class to be associated with this leaf must be determined from information other than X (for example the most frequent class in the parent node).
3º If X contains a mixture of classes, then choose a test based on a single attribute with possible outcomes o1, ..., on. X is partitioned into subsets X1, ..., Xn, where Xi contains all cases in X that have outcome oi of the chosen test. The decision tree for X consists of a decision node identifying the test and one branch for each possible outcome.

We choose a test t in each node.
X = {x1, ..., xk}
Xd = {x ∈ X : c(x) = d}
Xtr = {x ∈ X : t(x) = r}
Xtrd = {x ∈ X : c(x) = d ∧ t(x) = r}
(X – the set of examples, d – the label of a class, S – the set of possible tests)
For each attribute we may use many tests; a binary attribute has only one test.
function build_tree(X, d, S)
  IF STOP(X, S) THEN
    form a leaf L
    dL = CLASS(X, d)
    RETURN(L)
  ELSE
    form a node n
    tn = CHOOSE_TEST(X, S)
    dn = CLASS(X, d)
    FOR each r ∈ Rtn
      n(r) = build_tree(Xtnr, dn, S \ {tn})
    RETURN(n)

STOP(X, S) = 1 if one of the possible stop criteria holds, for example:
  - the set X is empty, or
  - the set S is empty, or
  - all ("almost" all) examples in X belong to the same class;
otherwise 0. If one of these conditions is satisfied, the algorithm stops.

CLASS(X, d) = the most frequent class of the examples in the set X, or d if X is empty.
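A minimal Python sketch of the build_tree procedure above, assuming examples are given as attribute dictionaries with a class label; choose_test is left as a placeholder for the information-gain choice defined below (names are illustrative, not from the lecture):

from collections import Counter

def choose_test(X, tests):
    # placeholder: the lecture chooses the test t with the largest information
    # gain g_t(X) (defined below); here we simply take the first available attribute
    return tests[0]

def build_tree(X, d, tests):
    # X: list of (attributes_dict, class_label) pairs, d: default class,
    # tests: names of attributes still usable as one-attribute tests
    if not X:                                    # STOP: X is empty
        return ("leaf", d)
    classes = [c for _, c in X]
    d_n = Counter(classes).most_common(1)[0][0]  # CLASS(X, d): most frequent class
    if not tests or len(set(classes)) == 1:      # STOP: S empty or one class only
        return ("leaf", d_n)
    t = choose_test(X, tests)
    children = {}
    for r in sorted({attrs[t] for attrs, _ in X}):   # one branch per outcome r of t
        X_tr = [(attrs, c) for attrs, c in X if attrs[t] == r]
        children[r] = build_tree(X_tr, d_n, [s for s in tests if s != t])
    return ("node", t, children)

data = [({"outlook": "overcast"}, "P"), ({"outlook": "sunny"}, "N")]
print(build_tree(data, "P", ["outlook"]))
# ('node', 'outlook', {'overcast': ('leaf', 'P'), 'sunny': ('leaf', 'N')})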
CHOOSE_TEST(X, S)
For the set X, the expected information:
  I(X) = − Σd∈C (|Xd| / |X|) · log(|Xd| / |X|)
where
  Xd – the set of those examples that belong to class d
  |X| – the cardinality of X
  |Xd| / |X| – the probability that an example belongs to the class d
X = X1 ∪ ... ∪ Xn
The entropy is large if the cardinalities are (almost) equal, |X1| ≈ ... ≈ |Xn|, and small if they differ strongly.
example:
  |X| = 1000, X = X1 ∪ X2
  |X1| = |X2| = 500       – large entropy
  |X1| = 1, |X2| = 999    – small entropy
For a test t, the expected entropy of the test t:
  Et(X) = Σr∈Rt (|Xtr| / |X|) · Etr(X)
where
  Etr(X) = − Σd∈C (|Xtrd| / |Xtr|) · log(|Xtrd| / |Xtr|)
  – the expected information (entropy) in the set Xtr.
Entropy measures the average amount of information (number of steps) which is needed to identify the class of an example when the test t is used.
The gain of information when the test is used:
  gt(X) = I(X) − Et(X)
We choose from the set S the test t with the largest gt(X).
(I(X) = − Σd∈C (|Xd| / |X|) · log(|Xd| / |X|) – how many steps we need on average to find a leaf node, i.e. the class of the object.)
For a continuous attribute (value) we need many tests.
Example: t = outlook (playing tennis)

Outcomes (test values) →   sunny   overcast   rain
PLAY                         2        4         3     (9)
DON'T PLAY                   3        0         2     (5)
                             5        4         5     (14)

I(X) = −(9/14)·log(9/14) − (5/14)·log(5/14) = 0.940
Et(X) = (5/14)·( −(2/5)·log(2/5) − (3/5)·log(3/5) )
      + (4/14)·( −(4/4)·log(4/4) − (0/4)·log(0/4) )
      + (5/14)·( −(3/5)·log(3/5) − (2/5)·log(2/5) )
      = 0.694
gt(X) = I(X) − Et(X) = 0.940 − 0.694 = 0.246
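The numbers above can be reproduced with a few lines of Python (logarithm base 2; an illustrative sketch, not from the lecture):

from math import log2

def info(counts):
    # expected information I for a list of class counts
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# class counts (PLAY, DON'T PLAY) per outcome of the test t = outlook
outcomes = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}

n = sum(sum(c) for c in outcomes.values())                     # 14 examples
I_X = info([9, 5])                                             # 0.940
E_t = sum(sum(c) / n * info(c) for c in outcomes.values())     # 0.694
print(round(I_X, 3), round(E_t, 3), round(I_X - E_t, 3))
# 0.94 0.694 0.247  (≈ 0.246 when computed from the rounded values)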
Remark
The gain of information prefers tests with many outcomes.
The information value:
  IVt(X) = − Σr∈Rt (|Xtr| / |X|) · log(|Xtr| / |X|)
  (|X| / |Xtr| – the average number of subtrees)
The gain ratio:
  gt(X) / IVt(X)
We choose the test with the largest gain ratio.
P(t) – the cost of the test t.
The test value is denoted by
  Vt(X) = gt²(X) / Pt(X)
Remark: For a discrete attribute, one test with as many outcomes as the number of distinct values of the attribute is considered. For a continuous attribute, the data should be sorted (with respect to the attribute) and the entropy gains should be calculated for each possible binary test a(x) < Z (each Z gives one test, e.g. a(x) < 81).
P – processors
N – data items (examples)
Parallel algorithm for constructing a decision tree.
P(n) – the set of processors which handle the node n.
If the node n is partitioned into child nodes n1, n2, ..., nk, then the processor group P(n) is also partitioned into k groups P1, ..., Pk such that Pi handles ni.
ALGORITHM
1) Expand a node n.
IF the number of child nodes < |P(n)| THEN
  2) Assign a subset of processors to each child node in such a way that the number of processors assigned to a child node is proportional to the number of data items contained in the node.
  3) Follow the above steps recursively.
ELSE
  2) Partition the child nodes into |P(n)| groups such that each group has about an equal number of data items. Assign each processor to one node group.
  3) Follow the computation for each processor independently.
NEXT WEEK: Association rules discovery
rule LHS → RHS
in a decision tree
LHS = the outcomes of the tests
RHS = the class
In each node we have information about the answer
Example:
5 rules. Each path from the root to a leaf is a rule, e.g.:
5: if outlook = overcast
   then play
1: if outlook = sunny
   and humidity ≤ 75
   then play
Association Rules Discovery
Example:
1000 x {beer, milk, juice}
1000 x {beer, milk, water}
1000 x {beer, milk}
1000 x {beer, water}
P(if b then m) < P(if m then b)
P(if m then b) = 1
P(if b then m) = P(m|b) = P(m ∩ b) / P(b) = (3/4) / 1 = 0.75
association rule = a rule which implies some association relationship between attribute values in the data set.
We may remove Windy in this case.
I = {s, r, o, t, f, p, d}
T = { (1,s), (1,r), (1,o),
      (2,t), (2,f),
      (3,p), (3,d) }
IF (windy = false, outlook = sunny)
THEN decision = play
T – a set of transactions (sets of items)
I – a set of items (attributes, values, decisions)
I = {s, o, r, t, f, p, d}
T = { {s,t,p}, {s,t,d}, ... }
An association rule "if X then Y" is denoted by X, Y, where X, Y ⊆ I.
For each X, Y ⊆ I:
  S_T(X, Y) = |{t ∈ T : X ⊆ t, Y ⊆ t}| / |T|
  – the support of the rule X, Y in the set T
  Conf_T(X, Y) = |{t ∈ T : X ⊆ t, Y ⊆ t}| / |{t ∈ T : X ⊆ t}|
  – the confidence of the rule X, Y in the set T
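A short Python sketch of these two definitions (the toy transaction set below is made up for illustration, using the items s, o, r, t, f, p, d):

def support(T, X, Y):
    # S_T(X, Y) = |{t in T : X ⊆ t and Y ⊆ t}| / |T|
    return sum(1 for t in T if X <= t and Y <= t) / len(T)

def confidence(T, X, Y):
    # Conf_T(X, Y) = |{t in T : X ⊆ t and Y ⊆ t}| / |{t in T : X ⊆ t}|
    return sum(1 for t in T if X <= t and Y <= t) / sum(1 for t in T if X <= t)

T = [{"s", "t", "p"}, {"s", "t", "d"}, {"s", "f", "p"}, {"r", "t", "d"}]
print(support(T, {"s"}, {"p"}))      # 0.5
print(confidence(T, {"s"}, {"p"}))   # 0.666...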
Example:
X = {car, cigarettes}, Y = {plane}, |T| = 1000
  TX = {t ∈ T : X ⊆ t} – the transactions containing X
  TX,Y = {t ∈ T : X ⊆ t, Y ⊆ t}
Suppose |TX| = |TX,Y| = 1 (a single transaction t = {car, cigarettes, plane}); then
  S_T(X, Y) = |TX,Y| / |T| = 1/1000
  Conf_T(X, Y) = |TX,Y| / |TX| = 1
so the confidence is large ("large" means e.g. ≫ 1/2) but the support is very small.
The support of the item set X:
  S_T(X) = |TX| / |T|
Since TX ⊇ TX,Y and TY ⊇ TX,Y, for all X, Y ⊆ I:
  S_T(X, Y) ≤ S_T(X)  and  S_T(X, Y) ≤ S_T(Y)
large confidence and large support => the rule is good
We look for rules with Conf_T(X, Y) ≥ r and S_T(X, Y) ≥ s.
TX ⊆ T;  2^I – the family of all subsets of I;  for each t ∈ T, t ⊆ I.
The number of all possible pairs X, Y, where X and Y are subsets of I, is of the order of 2^|I| · 2^|I|, so checking all of them directly is infeasible.
SUPERSET of the set X is the set containing X.
Apriori Algorithm
s = minimum support
r = minimum confidence
1) Find all combinations (sets) of items that have support above the minimum support s. Call those combinations frequent itemsets.
2) Use the frequent itemsets to generate the desired rules.
"if A,B then C,D" is a frequent rule if ABCD is a frequent itemset, S_T(ABCD) ≥ s, and
  Conf_T(AB, CD) = S_T(ABCD) / S_T(AB) ≥ r
DEF: A rule X, Y is frequent if S_T(X, Y) ≥ s and Conf_T(X, Y) ≥ r.
A set X is frequent if S_T(X) ≥ s.

AprioriAlg()
{
  L1 = { frequent 1-element itemsets }   // for each item we check the support
  FOR (k = 2; Lk-1 ≠ ∅; k++)
  {
    Ck = apriori_gen(Lk-1)   // new candidates
    // Remark: each subset of a frequent set is also frequent.
    for all transactions t in the dataset do
    {
      for all candidates c in Ck contained in t do
      {
        c.count++
      }
    }
    Lk = { c ∈ Ck : c.count ≥ s·|T| }
  }
  return ∪k Lk
}
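A compact Python version of this loop (a sketch, assuming the minimum support is given as an absolute count; only the join part of candidate generation is inlined here, the full apriori_gen is described next):

def apriori(transactions, min_count):
    # min_count = s * |T|, the minimum support as an absolute count
    items = sorted({i for t in transactions for i in t})
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_count}]      # L1
    k = 2
    while L[-1]:
        # join step of apriori_gen: unions of two frequent (k-1)-itemsets of size k
        # (the prune step is shown separately after apriori_gen is described)
        C = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
        counts = {c: sum(c <= t for t in transactions) for c in C}
        L.append({c for c, n in counts.items() if n >= min_count})  # Lk
        k += 1
    return [lk for lk in L if lk]

T = [frozenset(t) for t in
     [{"A","C"}, {"A","C","D"}, {"A","B","C"}, {"B","C","D"}, {"A","B","D"}, {"A","C","E"}]]
print(apriori(T, 3))
# two levels: the frequent 1-itemsets {A},{B},{C},{D} and the frequent 2-itemset {A,C}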
next time: more about apriori_gen(Lk-1)
Weka system – a collection of machine learning algorithms for data mining
http://www.cs.waikato.ac.nz/ml/weka
Databases / Databases generator
http://www.datgen.com
Links to the datasets in the net
http://mainseek.pl/ca/557351/datasets
apriori_gen
I – the set of items
T – a set of transactions; a transaction is a set of items (a subset of I)
For each set A ⊆ I:
  S_T(A) = |{t ∈ T : A ⊆ t}| / |T|
A set A is frequent <=> S_T(A) ≥ s (the minimum support).
Ck = apriori_gen(Lk-1)
  Lk – the set of all k-element frequent itemsets
  Ck – the set of k-element candidates
Remark:
A set A is frequent => each subset of A is frequent.
1) Join Lk-1 with Lk-1; the joining condition is that k−2 items are the same; this gives the set Ck.
For example, for A, B ∈ Lk-1:
  A = {v1, v2, ..., vk-2, v'k-1}
  B = {v1, v2, ..., vk-2, v''k-1}
  A ∪ B = {v1, v2, ..., vk-2, v'k-1, v''k-1},  |A ∪ B| = k
2) Delete from Ck those itemsets that have some (k−1)-element subset not in Lk-1.
Remark:
If there is some subset of A which is not frequent, then there is some (k−1)-element superset of this subset (still a subset of A) which is not frequent.

apriori_gen(Lk-1)
1) for each A ∈ Lk-1 do
   {
     for each B ∈ Lk-1 do
     {
       if |A ∩ B| = k−2
       then Ck = Ck ∪ {A ∪ B}
     }
   }
2) for each D ∈ Ck do
   {
     repeat
       d = new_element(D)        // the next element of D not considered yet
     until (d = null or D \ {d} ∉ Lk-1)
     if d ≠ null then Ck = Ck \ {D}
   }

(during the counting phase: if c ⊆ t then c.count++)
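The two steps of apriori_gen, as a hypothetical Python function over frozensets (the example L2 below is made up):

from itertools import combinations

def apriori_gen(L_prev, k):
    # generate k-element candidates from the (k-1)-element frequent itemsets L_prev
    L_prev = set(L_prev)
    # 1) join step: unite two (k-1)-itemsets sharing k-2 items
    C = {a | b for a in L_prev for b in L_prev if len(a & b) == k - 2}
    C = {c for c in C if len(c) == k}
    # 2) prune step: drop candidates with a (k-1)-element subset that is not frequent
    return {c for c in C
            if all(frozenset(sub) in L_prev for sub in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("A","C"), ("A","D"), ("C","D"), ("B","C")]}
print(sorted(map(sorted, apriori_gen(L2, 3))))   # [['A', 'C', 'D']]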
Frequent itemsets can be found in the following way:
1) Take a sample of the data (main memory sized)
2) Run apriori algorithm for this data (find frequent itemsets in
the sample)
3) Verify that the frequent itemsets of the sample are
frequent in the whole dataset
Remarks:
1. It will miss sets that are frequent in the whole dataset but
not in the sample.
2. the minimum support in the sample should be lower than
the minimum support in the whole dataset. (risk : there will
be too many “candidates”)
Paweł Olszewski
Sequential analysis
Example:
Customers X, Y, Z – a sequence of transactions for Mr X, a sequence for Mr Y, a sequence for Mr Z.
We are looking for sequential patterns.
(b, j, m, j, w) – a pattern;  (j, m, b) – a pattern contained in no sequence;  P((b, m, j)) = 1/2.
Input data: a set of sequences called data sequences (each sequence is an ordered list of transactions/itemsets).
Typically there is a time associated with each transaction.
A sequential pattern = a sequence of sets of items (NOT NECESSARILY TRANSACTIONS):
  b = (b1, ..., bt),  where b1, ..., bt ⊆ I
I – the set of items
T – the set of transactions
X – the set of sequences
  X = { (a1, ..., al) : a1, ..., al ∈ T }
Problem:
Find all sequential patterns which are "frequent" (with a user-specified minimum support).
The support of a sequential pattern b in the set X:
  S_X(b) = |{x ∈ X : b is contained in x}| / |X|
the length of a sequence = the number of itemsets in the sequence
k-sequence = a sequence of length k
1-sequence = itemset
frequent 1-sequence = frequent itemset
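A sketch of how containment of a sequential pattern in a data sequence (and hence the support S_X(b)) could be checked in Python (illustrative, not from the lecture):

def contains(x, b):
    # True if the sequence of itemsets b is contained in the data sequence x:
    # each itemset of b is a subset of a distinct transaction of x, in order
    i = 0
    for transaction in x:
        if i < len(b) and set(b[i]) <= set(transaction):
            i += 1
    return i == len(b)

def seq_support(X, b):
    # S_X(b): fraction of data sequences in X that contain the pattern b
    return sum(contains(x, b) for x in X) / len(X)

# Mr Smith's purchases, one transaction per day (from the example earlier)
smith = [{"Milk"}, {"Beer"}, {"Milk"}, {"Beer"}, {"Milk"}]
print(contains(smith, [{"Milk"}, {"Beer"}, {"Milk"}]))   # True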
NEXT TIME: 2 methods for sequential patterns...
13-11-2002
1-sequence = itemset
frequent set = large set = large itemset = litemset = a set of items with minimum support.
The support of a sequence b in the dataset X:
  S_X(b) = |{x ∈ X : b is contained in x}| / |X|
Example:
Let A, B, C be itemsets.
X = { (A,C,B,A), (B,C,B,B) } – customer sequences
  S_X(B) = 2/2 = 1
X – the set of customer sequences
transaction = a set of items
customer sequence = an ordered list of transactions
T – the set of transactions
T = { A, C, B, A, B, C, B, B }
  S_T(B) = 4/8 = 1/2
DEF:
A sequence b is large (frequent) if S_X(b) ≥ r, where r is the user-defined minimum support.
Remark:
If a sequence is large, then each itemset contained in this sequence is also large:
  b = (b1, ..., bk) and S_X(b) ≥ r  =>  S_X(bi) ≥ r for i = 1, ..., k.
Moreover, each subsequence of b is also large.
AprioriAll: counts the large sequences of every length k = 1, 2, 3, 4, 5, 6, ... (Ck → Lk).
AprioriSome: avoids counting non-maximal large sequences; in the forward phase only some lengths are counted (candidates Ck may be generated from Ck-1), and the skipped lengths are handled in the backward phase (◄──).
The solution of the problem of mining sequential patterns
consists of the following phases:
1. Sort Phase (the set of transactions -> the set of customer
sequences)
2. Large Itemsets Phase (we find all large 1-sequences
(large itemsets) using Apriori)
3. Transformation Phase
   Each transaction t is replaced by the set of large itemsets contained in t.
   example: large itemsets A, B, C, D;  t = {a, d, f, g};  A = {a, d, g}  =>  t is replaced by {A}
   Each customer sequence is then the ordered list of these sets (a list of sets of itemsets).
if there is no large itemset contained in the transaction or
customer sequence, then we delete it. (but the number of
customers is not changed)
4. Sequence Phase (we find large sequences of each
length)
5. Maximal Phase (we find the maximal sequences among
the set of all large sequences)
Families of algorithms for finding patterns:
Count All = count all large sequences.
Count Some = count only the maximal large sequences; we count longer sequences first, so that we can avoid counting sequences which are contained in some longer large sequence.
Remark:
The time saved by not counting sequences contained in longer sequences may be less than the time wasted counting sequences without minimum support that would never have been counted, because their subsequences were not large.
Apriori All
// Forward Phase
L1 = the set of frequent itemsets (1-sequences)
for (k = 2; Lk-1 ≠ ∅; k++)
{
  Ck = the set of new candidates generated from Lk-1
  for each customer sequence c in the dataset
  {
    for each candidate d in Ck contained in c
    {
      d.count++
    }
  }
  Lk = { d ∈ Ck : d.count ≥ s }      // s = the minimum support
  last = k
}
// Return the maximal frequent sequences in ∪k Lk
// Backward Phase
for (k = last; k ≥ 2; k--)
{
  delete all sequences in Lk contained in some Li, i > k
  return(Lk)
}
Apriori Some
// Forward Phase
L1 = the set of frequent itemsets
C1 = L1
for (k = 2; Ck-1 ≠ ∅ and Llast ≠ ∅; k++)
{
  if (Lk-1 is known)
    Ck = the set of new candidates generated from Lk-1
  else
    Ck = the set of new candidates generated from Ck-1
  if (k = next(last)) then
  {
    for each customer sequence c
    {
      for each candidate d in Ck contained in c
      { d.count++ }
    }
    Lk = { d ∈ Ck : d.count ≥ s }
    last = k
  }
}
// Backward Phase
for (k = last; k ≥ 2; k--)
{
  if (Lk was not found in the forward phase) then
  {
    delete all sequences in Ck contained in some Li, i > k
    for each customer sequence c
    {
      for each candidate d in Ck contained in c
      { d.count++ }
    }
    Lk = { d ∈ Ck : d.count ≥ s }
  }
  else
  {
    delete all sequences in Lk contained in some Li, i > k
  }
  return(Lk)
} // of for

Remark:
For lower minimum supports there are longer large sequences, and hence more non-maximal large sequences are generated; in this case AprioriSome pays off.
Paweł Olszewski
How to generate the candidates Ck from Lk-1:
1) Join Lk-1 with Lk-1 (or Ck-1 with Ck-1): for each A, B ∈ Lk-1 (or Ck-1) we form A ∪ B.
2) Select those unions of sequences which have k−2 common itemsets (so that the union has length k).
3) Delete all sequences c such that some subsequence of c of length k−1 is not in Lk-1 (or Ck-1).
Example:
L1 = { 1, 2, 3, 4, 5 }        // 1, 2, 3, 4, 5 – the frequent itemsets
T = { (1,5,2,3,4),
      (1,3,4,3,5),
      (1,2,3,4),
      (1,3,5),
      (4,5) }
|T| = 5, minimum support s = 0.4 (40%), so s·|X| = 2

k = 2: candidate 2-sequences and their supports (row = first itemset, column = second):

      1     2     3     4     5
1     0     0.4   0.8   0.6   0.6
2     0     0     0.4   0.4   0
3     0     0     0     0.6   0.4
4     0     0     0.2   0     0.4
5     0     0.2   0.2   0.2   0

((1,2) is supported by a sequence if it contains ...,1,...,2,... in this order.)
L2 = the 2-sequences with support ≥ 0.4.
Now we join the remaining candidates by a common itemset (e.g. 12 + 24 => 124); we don't use those with support less than s = 0.4.

L3:               L4:
123   0.4         1234   0.4
124   0.4
134   0.6
135   0.4
145   0.2
234   0.4
235   0
245   0
345   0

STOP.
And now we go back (the backward phase).
Answer: (4,5), (1,3,5), (1,2,3,4)
Clustering
Given: points in some space X
Goal: Group these points into some number of clusters, each
cluster consists of points which are “near” (“similar”)
We have a set of points and we want to divide it into clusters such that, if we take a point from cluster A, then the distance from this point to any other point in cluster A is smaller than the distance between this point and any point that does not belong to cluster A.
The centre of a cluster Cp has coordinates
  m_pk = ( Σ_{xi ∈ Cp} xik ) / |Cp|
A distance measure d is any function d: XxX  R which
satisfies the following conditions:
1° d(x,x) = 0
for each xX

2° d(x,y) = d(y,x)
(symmetric)
3°

d(x,z) ≤ d(x,y) + d(y,z)
x , yX
x , y , zX
Examples of distance measures
1) Euclidean space, k-dimensional:  R^k = { (x1, ..., xk) : x1, ..., xk ∈ R }
   a. Euclidean distance
      d(x, y) = √( Σ_{i=1..k} (xi − yi)² ),  where x = (x1, ..., xk), y = (y1, ..., yk)
   b. Manhattan distance
      d(x, y) = Σ_{i=1..k} |xi − yi|
   c. Maximum over dimensions
      d(x, y) = max_{i=1..k} |xi − yi|
   d. Hamming distance
      d(x, y) = |{ i : xi ≠ yi }|
2) X – the space of all strings
Distance between two strings x, y:
  d(x, y) = |x| + |y| − 2·LCS(x, y)
where |x|, |y| are the lengths of x and y, and LCS(x, y) is the length of the longest common subsequence of x and y (the elements of the subsequence appear in both x and y, not necessarily consecutively; e.g. LCS(abcde, abe) = 3).
for example:
  x = abcdef,  y = bababcdfe
  LCS(x, y) = 5 (e.g. the common subsequence abcdf), so
  d(x, y) = 6 + 9 − 2·5 = 5
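The LCS length can be computed by standard dynamic programming; a small Python sketch of the distance:

def lcs_length(x, y):
    # length of the longest common subsequence of strings x and y
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def string_distance(x, y):
    # d(x, y) = |x| + |y| - 2 * LCS(x, y)
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(lcs_length("abcdef", "bababcdfe"))       # 5
print(string_distance("abcdef", "bababcdfe"))  # 5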
Clustering of data sets:
Given: S – the set of cases; each case consists of l values (corresponding to l variables), so
  S = { (xi1, xi2, ..., xil) : i = 1, ..., n },  |S| = n
(n cases, each described by l variables; xik – the value of the k-th variable in the i-th case; the data form an n × l table with entries xik)

W – the set of weights of the variables, W = { wk : k = 1, ..., l }
wk – the weight of the comparison of the k-th variable
  wk = 1 if the comparison of the k-th variable is valid, 0 if not
(if wk = 1 for each k = 1, ..., l, then Σ_{k=1..l} wk = l)

Why the scales matter:
              weight (kg)   height (mm)
Mr. Smith     200           1500
Mr. X         65            2100
The Euclidean distance between two cases:
  dij = d(xi, xj) = √( Σ_{k=1..l} (xik − xjk)² · wk / Σ_{k=1..l} wk )
For each cluster Cp we define:
- the mean of Cp:  m_p = (m_p1, ..., m_pl),  where for k = 1, ..., l
    m_pk = ( Σ_{i: xi ∈ Cp} xik ) / |Cp|
- the standard deviation of Cp:
    σ(Cp) = √( (1/|Cp|) · Σ_{i: xi ∈ Cp} Σ_{k=1..l} (xik − m_pk)² · wk / Σ_{k=1..l} wk )
  (roughly, σ(Cp) ≈ (d1 + d2 + d3)/3 for a cluster whose three members lie at distances d1, d2, d3 from the centre)
The distance between the i-th case xi and the p-th cluster Cp:
  d(xi, Cp) := d(xi, m_p)
The distance between clusters Cp and Cq:
  D_pq := d(Cp, Cq) := d(m_p, m_q)
Approaches to clustering
1. Centroid approaches (the number of clusters is fixed)
2. Hierarchical approaches (the number of clusters changes)
   - agglomerative
   - divisive

S = { (xi1, ..., xil) : i = 1, ..., n }
xik – the value of the k-th variable in the i-th case
xi, xj ∈ S,  d(xi, xj)
a cluster = a group of cases
for a cluster Cp:  d(xi, Cp) = d(xi, m_p);  here wk = 1 for each k
m_p = (m_p1, ..., m_pl) – the mean of the cluster Cp
If each xik (i = 1, ..., n, k = 1, ..., l) is a real number, then
  d(xi, xj) = √( Σ_{k=1..l} (xik − xjk)² · wk / Σ_{k=1..l} wk )
  (wk – the weight of the comparison of the k-th variable)
The mean does not have to be one of the points: m_p is called the centroid of the cluster Cp (m_p is not necessarily an element of S; m_p is any point in R^l).
For k = 1, ..., l:
  m_pk = ( Σ_{i: xi ∈ Cp} xik ) / |Cp|
Otherwise (if the values of some variables are not numbers), if some distance measure d(xi, xj) is given, then the "mean" of the cluster Cp is called the clustroid [not necessarily the centre, but one of the elements belonging to the cluster]: it is that element of the dataset belonging to Cp that minimizes the sum of the distances to the other points of this cluster.
Standardization of the data
All cases consist of real numbers.
Remark
The values of each variable should be standardized, since we wish the variables to be treated equally. Otherwise the clustering would be dominated by the variables with the largest diversity (largest standard deviation).
The standardized value of xik:
  x'ik = (xik − mk) / σk
where
  mk = ( Σ_{i=1..n} xik ) / n   – the mean of the k-th variable
  σk = √( (1/n) · Σ_{i=1..n} (xik − mk)² )   – the standard deviation of the k-th variable
Remark:
The standardized variables have a mean of 0 and a standard deviation of 1.
the standardized value = standard score = z-score
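A minimal Python sketch of this standardization (population standard deviation, as in the formula above):

from math import sqrt

def standardize(values):
    # z-scores: (x - mean) / standard deviation
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std for x in values]

v1 = [65.9, 90.5, 71.3, 46.4, 86.2]            # the first variable of the example below
print([round(z, 2) for z in standardize(v1)])  # the results have mean 0 and std 1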
example:
Raw data (5 cases x1, ..., x5, 5 variables v1, ..., v5):

      v1     v2     v3     v4    v5
x1    65.9   10.4   19.7   2.6   1.4
x2    90.5   1.7    1.4    6.2   0.4
x3    71.3   12.3   13.1   1.9   2.3
x4    46.4   9.7    42     0     0.85
x5    86.2   3.0    4.8    5.2   0.7

[table of the standardized values (z-scores); each standardized column sums to 0]

Distance Matrix (symmetric, so only the upper triangle is shown):

      x1    x2     x3     x4     x5
x1    0     4.02   0.83   2.01   2.45
x2          0      6.30   8.73   0.19
x3                 0      4.23   4.39
x4                        0      6.79
x5                               0
Possible clustering
Clustering methods:
1) k-means algorithms (the number of clusters is fixed)
2) Hierarchical clustering

1) k-means algorithm (the number of clusters is fixed)
a. Take k cases; each case is the centroid of its own cluster.
b. Each other case is assigned to the cluster that has the nearest centroid.
c. Calculate the new position of the centroid of each cluster: if xi is assigned to Cp, then
     m_pk := m_pk + (xik − m_pk) / np    for k = 1, ..., l, where np = |Cp|
d. Repeat steps b and c until the centroids no longer change their positions (when recalculated in step c).
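A small Python sketch of these steps (Euclidean distance, unit weights; the centroid update here simply recomputes each cluster mean instead of using the incremental formula from step c):

import random

def dist2(a, b):
    # squared Euclidean distance (unit weights)
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(cases, k, max_iter=100):
    centroids = random.sample(cases, k)              # step a: k cases as first centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in cases:                              # step b: assign to nearest centroid
            p = min(range(k), key=lambda q: dist2(x, centroids[q]))
            clusters[p].append(x)
        new_centroids = [                            # step c: recompute the means
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[p]
            for p, c in enumerate(clusters)]
        if new_centroids == centroids:               # step d: repeat until no change
            break
        centroids = new_centroids
    return centroids, clusters

cases = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.1), (4.8, 5.3), (5.2, 4.9)]
print(kmeans(cases, 2))
# typically: one cluster of the three points near (1, 1), one of the three near (5, 5)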
2) Hierarchical clustering
   – will be described next time, together with methods used in practice.

k-means algorithm
Possible modifications of the original k-means algorithm:
1. At the beginning we can choose the k clusters by picking k points (cases) which are sufficiently far away from one another.
2. During the computation the number of clusters can be
reduced (by joining two clusters, if the distance between
these clusters is smaller than some user-defined value r) or
increased (by splitting one cluster into two new clusters, if it
is sufficiently large)
3. We can split “the largest” cluster into two new clusters and
merge two other (the closest two) to keep the number of
clusters at k.
The standard deviation of the cluster Cp:
  σp = √( (1/|Cp|) · Σ_{i: xi ∈ Cp} Σ_{k=1..l} (xik − m_pk)² · wk / Σ_{k=1..l} wk )
where m_p = (m_p1, m_p2, ..., m_pl) is the centroid of the cluster Cp.
Compare σ_{p∪q} with σp + σq:
The increase of the sum of standard deviations (obtained by splitting one cluster Cp ∪ Cq into Cp and Cq):
  I_{p,q} = σp + σq − σ_{p∪q}
We can keep changing the number of clusters (k = 0, 1, ..., k0) while |Ik| is small enough, where Ik is the increase obtained when the number of clusters is increased from k to k+1.
Hierarchical Clustering
1) Agglomerative Clustering
We start with n clusters (each cluster contains only one case).
a. Compare all pairs of clusters and find the nearest pair.
b. The distance between this closest pair (denote it by D) is compared to some user-defined value r: if D < r, then we join the nearest two clusters into one and RETURN to a; else STOP.
Possible measures of "closeness" of two clusters (see the sketch below):
- the distance between their centroids
- the maximum (or minimum, or average) distance between points in the compared clusters (one point from one cluster and the second from the other one)
- the increase of the standard deviation caused by joining the two clusters into one
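A hypothetical Python sketch of this agglomerative procedure, using the distance between centroids as the closeness measure and a user-defined threshold r:

from math import dist   # Euclidean distance, Python 3.8+

def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def agglomerative(cases, r):
    # start with one cluster per case; repeatedly join the nearest pair while
    # the distance between their centroids is below the threshold r
    clusters = [[x] for x in cases]
    while len(clusters) > 1:
        pairs = [(dist(centroid(a), centroid(b)), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        if d >= r:
            break                                    # nearest pair too far apart: STOP
        clusters[i] = clusters[i] + clusters[j]      # join the nearest two clusters
        del clusters[j]
    return clusters

cases = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(agglomerative(cases, r=1.0))
# [[(0.0, 0.0), (0.1, 0.2)], [(5.0, 5.0), (5.1, 4.9)]]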
2) Divisive Clustering
We start with one cluster containing all points.
a. The distances between all pairs of cases within the same cluster Cp are calculated and the pair xi, xj with the largest distance is selected.
b. The maximal distance Dp = max_{xi, xj ∈ Cp} d(xi, xj) between two cases in Cp is compared to some user-defined value s. If Dp > s, then Cp is divided into two new clusters: the points xi and xj are the seed points of these new clusters and each case in Cp is placed into the new cluster with the nearest seed point; RETURN to a.
ELSE STOP
Mixed Clustering
For assigning a new object to the clusters we use 4 operators:
1) assign the new object to one of the existing clusters
2) form a new cluster for the new object (if all existing clusters are far away from the new object)
3) split one cluster into 2 new clusters and assign the new object to one of them (the nearest) (if the standard deviation of the cluster would otherwise become too large)
4) join two existing clusters into one and assign the new object to this cluster (if, after adding the new object to one of these clusters, the two clusters would be too close; the obtained cluster must not be too large)
Bayes' Theorem
  P(Ai | B) = P(B | Ai) · P(Ai) / P(B)
We classify a case x = (x1, ..., xL) (e.g. { outlook = sunny, temp = 75, humidity = 101% }) into the cluster Ci for which P(Ci | x) is maximal, i.e. for which P(x | Ci) · P(Ci) is maximal (1/d(x, Ci) may also be used as a score).
  P(x | Ci) · P(Ci) = P(x1 = v1 | x ∈ Ci) · ... · P(xL = vL | x ∈ Ci) · P(x ∈ Ci)
where
  x – a case, Ci – the i-th cluster
  P(x ∈ Ci) = |Ci| / |X|
  P(xk = vk | x ∈ Ci) = |{ y ∈ Ci : yk = vk }| / |Ci|
The value of a clustering into k clusters C1, ..., Ck:
  V(C) = (1/k) · Σ_{p=1..k} (|Cp| / |X|) · [ Σ_{j=1..l} Σ_{v ∈ Vj} P²(xj = v | x ∈ Cp)  −  Σ_{j=1..l} Σ_{v ∈ Vj} P²(xj = v) ]
(the probabilities after clustering minus the probabilities before clustering)
Vj – the set of all values of the j-th attribute
Remark:
A clustering C1 is better than a clustering C2 if V(C1) > V(C2).
Application of clustering for finding missing values:
  xi = (xi1, ..., xik, ..., xil),  xi ∈ Cp,  and the value xik is missing
  m_p = (m_p1, ..., m_pk, ..., m_pl) – the centroid of the cluster Cp
we substitute the missing value xik by m_pk.
Use this only if at most 50% of all values in the cluster are missing.
Fuzzy Clustering
S = {x1, ..., xn} – the set of cases, with a given clustering into clusters C1, ..., Ck;
d(xi, Cp) – the distance between the case xi and the cluster Cp.
In hard clustering P(xi ∈ Cp) ∈ {0, 1}: each case belongs to exactly one cluster.
The membership function:
  m_ip = P(xi ∈ Cp) = d(xi, Cp)^(−2/(β−1)) / Σ_{q=1..k} d(xi, Cq)^(−2/(β−1))
For each i:  Σ_{p=1..k} m_ip = 1.
Example (β = 2, k = 3 clusters at distances d1, d2, d3):
  m_i1 = (1/d1²) / (1/d1² + 1/d2² + 1/d3²)
  m_i2 = (1/d2²) / (1/d1² + 1/d2² + 1/d3²)
  m_i3 = (1/d3²) / (1/d1² + 1/d2² + 1/d3²)
The objective function:
  y = Σ_{i=1..n} Σ_{p=1..k} m_ip^β · d²(xi, Cp)
For β = 1 (the "hard" clustering) this reduces to
  y = Σ_{i=1..n} Σ_{p=1..k} 1_{xi ∈ Cp} · d²(xi, Cp),
i.e. each case contributes only the distance to its own cluster. If we are sure that x belongs to a given cluster, its memberships in the other clusters are very small and contribute very little to y.
β – the degree of fuzziness,  β ∈ (1, ∞)
Paweł Olszewski
If d1 ≈ d2 ≈ d3 we don't know where to put xi; in fuzzy clustering the differences between d1, d2, d3 become visible because we take powers of these values.
  Σ_{p=1..k} m_ip = 1
If β → ∞ then m_i1 ≈ m_i2 ≈ m_i3 ≈ 1/3 (for k = 3).
If m_i1 ≈ m_i2 ≈ 0.01, then m_i1^β and m_i2^β are even smaller (β – the degree of fuzziness).
Remarks:
1) If β = 1 then we get the "hard" clustering.
2) As β approaches infinity, the differences between the membership functions become very small.
   e.g. m_i1 = 0.1, m_i2 = 0.9:  m_i1³ = 0.001, m_i2³ = 0.729, so m_i2^β ≫ m_i1^β.
If x is at a large distance from each cluster, then we may choose whichever cluster we want :), because for large β the differences will be very small (see Remark 2).
We choose the clustering for which the objective function is minimal.
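A small Python sketch of the membership function as reconstructed above (β > 1 and the distances, assumed positive, are illustrative):

def memberships(distances, beta=2.0):
    # m_ip for one case: distances = [d(x_i, C_1), ..., d(x_i, C_k)], beta > 1
    e = -2.0 / (beta - 1.0)
    weights = [d ** e for d in distances]        # d^(-2/(beta-1))
    total = sum(weights)
    return [w / total for w in weights]          # the memberships sum to 1

print([round(m, 3) for m in memberships([1.0, 2.0, 4.0], beta=2.0)])
# the nearest cluster gets the largest membership
print([round(m, 3) for m in memberships([1.0, 2.0, 4.0], beta=100.0)])
# for large beta the memberships approach 1/k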
Naive Bayes Classification
Bayes' Theorem:
  P(B | A) = P(B) · P(A | B) / P(A)
P(B) – the a priori probability
P(B | A) – the a posteriori probability
Naive Bayes assumption:
The evidence can be split into independent parts (the attributes of the instance), A = A1 ∩ ... ∩ An, so that
  P(A | B) = P(A1 | B) · P(A2 | B) · ... · P(An | B)
  P(B | A) = P(B) · P(A1 | B) · P(A2 | B) · ... · P(An | B) / P(A)
Attribute values: Outlook ∈ {sunny, overcast, rainy}, Temperature ∈ {hot, mild, cool}, Humidity ∈ {high, normal}, Windy ∈ {true, false}; Decision ∈ {Yes, No}.
Outlook   Temperature   Humidity   Windy   Decision
S         h             H          F       N
S         h             H          T       N
O         h             H          F       Y
R         m             H          F       Y
R         c             N          F       Y
R         c             N          T       N
O         c             N          T       Y
S         m             H          F       N
S         c             N          F       Y
R         m             N          F       Y
O         m             H          T       Y
O         h             N          F       Y
R         m             H          T       N
S         m             N          T       Y
A = { outlook = s, temp = c, humid = H, windy = true }
a) P(decision = Y | A) = P(B1) · P(A | B1) / P(A)
b) P(decision = N | A) = P(B2) · P(A | B2) / P(A)
P(A) is the same in both cases (it could be estimated from the 15 = 14 + 1 examples, counting the new one), so it is enough to compare the numerators.
P(B1) = 9/14,  P(B2) = 5/14
P(A | B1) = P(outlook = s | B1) · P(temp = c | B1) · P(humid = H | B1) · P(windy = true | B1)
          = 2/9 · 3/9 · 3/9 · 3/9
P(A | B2) = P(outlook = s | B2) · P(temp = c | B2) · P(humid = H | B2) · P(windy = true | B2)
          = 3/5 · 1/5 · 4/5 · 3/5
a) 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.005
b) 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.021
Since 0.021 > 0.005, the decision = N.
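The same computation as a short Python function over the table above (no smoothing, exactly as in the example; the attribute names are illustrative):

def naive_bayes(data, new_case):
    # data: list of (attribute_dict, decision); returns the most probable decision
    # by comparing P(decision) * product over attributes of P(value | decision)
    scores = {}
    for d in {dec for _, dec in data}:
        rows_d = [a for a, dec in data if dec == d]
        score = len(rows_d) / len(data)              # the prior P(B)
        for attr, value in new_case.items():         # the P(A_i | B) factors
            score *= sum(a[attr] == value for a in rows_d) / len(rows_d)
        scores[d] = score
    return max(scores, key=scores.get), scores

rows = [("S","h","H","F","N"), ("S","h","H","T","N"), ("O","h","H","F","Y"),
        ("R","m","H","F","Y"), ("R","c","N","F","Y"), ("R","c","N","T","N"),
        ("O","c","N","T","Y"), ("S","m","H","F","N"), ("S","c","N","F","Y"),
        ("R","m","N","F","Y"), ("O","m","H","T","Y"), ("O","h","N","F","Y"),
        ("R","m","H","T","N"), ("S","m","N","T","Y")]
attrs = ("outlook", "temp", "humid", "windy")
data = [(dict(zip(attrs, r[:4])), r[4]) for r in rows]

print(naive_bayes(data, {"outlook": "S", "temp": "c", "humid": "H", "windy": "T"}))
# decision 'N' (score ≈ 0.0206 for 'N' vs ≈ 0.0053 for 'Y')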
data warehouse – a decision support database that is maintained separately from the operational database.
It supports processing by providing a solid platform for data analysis.
OLTP – On-Line Transaction Processing
OLAP – On-Line Analytical Processing (for finding rules)

               OLAP                 OLTP
users          knowledge worker     clerk, IT programmer
function       decision support     day-to-day operation
DB design      subject oriented     application oriented
data           historical           current
queries        complex              simple
no. of users   hundreds             thousands
DB size        100 GB – ..TB        100 MB – ..GB