Contents
Data Mining – definitions
Inductive Decision Trees
- Algorithms ID3 and C4.5
- Tools: WEKA, Orange, Cader, Statistica
Discovery of Association Rules
- Apriori and Hash Trees
- Tools: WEKA, Orange, Associator, Merja
Clustering algorithms
- Classical algorithms: K-means
- Scalable algorithms: CURE, DBSCAN
Future Research
AI IN FINANCE AND ECONOMICS
Data Mining
Jerzy KORCZAK
email: [email protected]
http://www.korczak-leliwa.pl
http://citi-lab.pl
What is Data Mining?
Data Mining
Many definitions
Non-trivial extraction of implicit, previously unknown and potentially
useful information from data
Exploration & analysis, by automatic or semi-automatic means, of large
quantities of data in order to discover meaningful patterns
Data mining (knowledge discovery in databases, KDD)
Extraction of interesting, non-trivial, implicit, previously unknown
and potentially useful information (knowledge) or patterns from
data in large databases or other information repositories
Scientific point of view: data abstraction and KDD
Commercial point of view: competitive pressure
Necessity is the mother of invention
Data is everywhere — data mining should be everywhere, too!
Understand and use data — an imminent task!
Data Mining Tasks: Classification and rule extraction
Classification has been an essential theme in data mining and statistics research
- Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.
- Tree-pruning, boosting, bagging techniques

Sample training data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Efficient and scalable classification methods
- Exploration of attribute-class pairs
- CURE, DBSCAN, SLIQ, SPRINT, RainForest, BOAT, etc.
Classification of semi-structured and non-structured data
- Classification by clustering association rules
- Association-based classification
- Web document classification
Classification
Given:
A database of tuples, each assigned a class label
Develop a model/profile for each class
Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)
Sample applications:
- Credit card approval (good, bad)
- Bank locations (good, fair, poor)
- Treatment effectiveness (good, fair, poor)

Statistics vs Data Mining
Statistics: a discipline dedicated to data analysis
What are the differences?
- Huge amounts of data (gigabytes to terabytes)
- Fast computers: quick response, interactive analysis
- Multi-dimensional, powerful, thorough analysis
- High-level, "declarative" interfaces for the user's ease and control
- Automated or semi-automated: mining functions hidden or built into many systems
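As a minimal illustration (not from the slides), such a class profile can be written directly as a predicate; the thresholds below are the ones from the example above:

```python
# Hypothetical sketch of the "good credit" profile as a Python predicate.
def good_credit(age, income, married):
    """True if the customer matches (25 <= age <= 40 and income > 40k) or married."""
    return (25 <= age <= 40 and income > 40_000) or married

print(good_credit(age=30, income=45_000, married=False))  # True
print(good_credit(age=50, income=30_000, married=True))   # True
print(good_credit(age=50, income=30_000, married=False))  # False
```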
Bayesian Classification
Bayesian classification is based on Bayes' theorem.
Naive Bayesian classifiers assume class conditional independence
(an attribute value on a given class is independent of the values of the other attributes).
Bayesian belief networks allow dependencies.

Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
P(H|X) is the posterior probability of H conditioned on X.
P(H) is the prior probability of H.
P(X|H) is the probability of X conditioned on H.
P(X) is the probability of X.

Naive Bayesian Classification
1) Let D be a training set of tuples with associated class labels C1, C2, …, Cm, where each tuple is X = (x1, x2, …, xn).
2) Given a tuple X, the classifier predicts that X belongs to the class having the highest posterior probability conditioned on X.
3) P(Ci|X) = P(X|Ci) P(Ci) / P(X)
4) P(X) is constant for all classes, so only P(X|Ci) P(Ci) needs to be maximized. If the class prior probabilities are not known, the classes are assumed equally likely, so we maximize P(X|Ci).
5) To reduce computation, class conditional independence is assumed:
P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) * P(x2|Ci) * … * P(xn|Ci),
where P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training set.
6) To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci.

Performance: comparable with decision tree and neural network classifiers.
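A minimal sketch of steps 1-6 for categorical attributes; the tiny training set and attribute layout below are illustrative assumptions, not from the slides, and no smoothing is applied:

```python
# Sketch of a naive Bayesian classifier for categorical attributes (steps 1-6).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(Ci) and P(xk|Ci) from the training set by counting."""
    class_counts = Counter(labels)
    value_counts = defaultdict(int)     # (attribute index, value, class) -> count
    for x, c in zip(rows, labels):
        for k, v in enumerate(x):
            value_counts[(k, v, c)] += 1
    return class_counts, value_counts

def predict_nb(x, class_counts, value_counts):
    """Choose the class Ci maximizing P(Ci) * prod_k P(xk|Ci)."""
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n                                   # P(Ci)
        for k, v in enumerate(x):
            score *= value_counts[(k, v, c)] / cc        # P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy data: attributes (Outlook, Windy), class Play.
rows = [("sunny", "T"), ("sunny", "F"), ("rainy", "F"), ("overcast", "T")]
labels = ["No", "No", "Yes", "Yes"]
cc, vc = train_nb(rows, labels)
print(predict_nb(("rainy", "T"), cc, vc))   # -> "Yes"
```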
Decision Tree Construction: Data -> Decision Tree -> Decision Rules
Discovery of decision rules: GOLF example
Which attribute to select? How to branch on an attribute?
ID3, C4.5 [Quinlan]
- Gain function based on entropy
- Normalized gain
Entropy characterizes the (im)purity, or homogeneity, of an arbitrary collection of instances.

GOLF data:
Outlook   Temp  Humidity  Windy  Play?
sunny     85    85        F      No
sunny     80    90        T      No
overcast  83    78        F      Yes
rainy     70    96        F      Yes
rainy     68    80        F      Yes
rainy     65    70        T      No
overcast  64    65        T      Yes
sunny     72    95        F      No
sunny     69    70        F      Yes
rainy     75    80        F      Yes
sunny     75    70        T      Yes
overcast  72    90        T      Yes
overcast  81    75        F      Yes
rainy     71    80        T      No

Resulting tree (used for prediction on unseen data):
Outlook
  sunny    -> Humidity <= 75 -> Yes ; Humidity > 75 -> No
  overcast -> Yes
  rainy    -> Windy = T -> No ; Windy = F -> Yes

IF Outlook=sunny & H<=75 THEN we play
IF Outlook=sunny & H>75 THEN we don't play
Constructing decision trees
«Divide and conquer»
Heuristic for the "best" splitting criterion: minimization of the expected number of tests needed to classify a given tuple.
While a node does not contain objects of a single class, split its set of objects E into partitions E1, E2, …, En.
Splitting criterion:
X = a (for qualitative attributes)
X <= a (for quantitative attributes)
Discrete-valued and continuous-valued attributes

Attribute selection measures: Shannon entropy
Entropy of S:
H(S) = - Σ_i (n_i / card(S)) * log2(n_i / card(S))
where n_i is the number of tuples of S that belong to class i.
Entropy
Interpretation:
- Shannon entropy is a measure of the average information content
- the minimum number of bits needed to encode a message S
- an optimal code uses -log2(p) bits for a message with probability p

Attribute selection measure: Information gain
- Let H be the entropy of the «non-partitioned» population.
- Let H(i,j) be the entropy of the subpopulation having value j of attribute i.
- Let G(i) be the information gain of attribute i:
G(i) = H - Σ_j (|Sj| / |S|) * H(i,j)
The attribute with the highest gain G(i) is chosen as the splitting attribute at node N.
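A short sketch of H(S) and G(i) as defined above, assuming the data is given as a list of attribute dictionaries plus a parallel list of class labels (the names are illustrative):

```python
# Sketch of entropy H(S) and information gain G(attr).
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i (n_i / card(S)) * log2(n_i / card(S))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """G(attr) = H(S) - sum_j |S_j|/|S| * H(S_j), one S_j per value j of attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [c for r, c in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

rows = [{"Outlook": "sunny"}, {"Outlook": "sunny"},
        {"Outlook": "overcast"}, {"Outlook": "rainy"}]
labels = ["No", "No", "Yes", "Yes"]
print(information_gain(rows, labels, "Outlook"))   # 1.0 for this toy split
```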
Example: GOLF
(GOLF data table as shown above)
Two classes: {Yes, No}
5 attributes
14 tuples
H(S) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.94 bits

GOLF: Discretization
Simple discretization: Breakpoint(Temp) = (64+85)/2 = 74.5
IF Temp < 74.5 THEN cool ELSE hot
IF Humidity < 80.5 THEN normal ELSE high

Discretized data:
Outlook   Temp  Humidity  Windy  Play
sunny     hot   high      F      No
sunny     hot   high      T      No
overcast  hot   normal    F      Yes
rainy     cool  high      F      Yes
rainy     cool  normal    F      Yes
rainy     cool  normal    T      No
overcast  cool  normal    T      Yes
sunny     cool  high      F      No
sunny     cool  normal    F      Yes
rainy     hot   normal    F      Yes
sunny     hot   normal    T      Yes
overcast  cool  high      T      Yes
overcast  hot   normal    F      Yes
rainy     cool  normal    T      No
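A small sketch of this midpoint discretization; the humidity breakpoint 80.5 is assumed here to be the analogous midpoint (65+96)/2, which matches the slide's value:

```python
# Sketch of the simple breakpoint discretization used above.
def discretize_temp(t, breakpoint=(64 + 85) / 2):        # 74.5
    return "cool" if t < breakpoint else "hot"

def discretize_humidity(h, breakpoint=(65 + 96) / 2):     # 80.5 (assumed midpoint)
    return "normal" if h < breakpoint else "high"

print(discretize_temp(85), discretize_humidity(85))   # hot high
print(discretize_temp(70), discretize_humidity(78))   # cool normal
```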
Example: GOLF – information gain of Outlook
(GOLF data as above)
H(Outlook, sunny) = -2/5*log2(2/5) - 3/5*log2(3/5) = 0.971
H(Outlook, overcast) = -4/4*log2(4/4) - 0 = 0
H(Outlook, rainy) = -3/5*log2(3/5) - 2/5*log2(2/5) = 0.971
G(Outlook) = H - 5/14*H(Outlook,sunny) - 4/14*H(Outlook,overcast) - 5/14*H(Outlook,rainy) = 0.246

Similarly, we compute
G(Temp) = 0.029, G(Humidity) = 0.151 and G(Windy) = 0.048.
MAX(G(Outlook), G(Temp), G(Humidity), G(Windy)) = G(Outlook)
Outlook has the highest information gain, so the root node splits on Outlook (sunny, overcast, rainy).
Branches will grow for each outcome of Outlook.
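A quick check of this arithmetic (the slide's 0.246 differs only by rounding):

```python
# Verifying H(S) and G(Outlook) for the GOLF data.
from math import log2

def H(pos, neg):
    """Binary entropy of a node with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

H_S = H(9, 5)                                         # overall entropy
G_outlook = H_S - 5/14 * H(2, 3) - 4/14 * H(4, 0) - 5/14 * H(3, 2)
print(round(H_S, 3), round(G_outlook, 3))             # 0.94 0.247 (slide: 0.246)
```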
Selecting the next attribute: GOLF computation
(GOLF data as above; within the sunny branch, Humidity is discretized into high/normal)

Partial tree:
Outlook [9+,5-]
  sunny    -> ? [2+,3-]
  overcast -> Yes [4+,0-]
  rainy    -> ? [3+,2-]

For the sunny subset S_sunny:
G(S_sunny, Humidity) = 0.970 - (3/5)*0 - (2/5)*0 = 0.970
G(S_sunny, Temp) = 0.970 - (2/5)*0 - (2/5)*1 - (1/5)*0 = 0.570
G(S_sunny, Windy) = 0.970 - (2/5)*1 - (3/5)*0.918 = 0.019
Decision tree: choice «Outlook=sunny -> Humidity»
Outlook [9+,5-]
  sunny    -> Humidity
                normal -> Yes [2+,0-]
                high   -> No  [0+,3-]
  overcast -> Yes
  rainy    -> ? [3+,2-]

Next attribute for «Outlook=rainy»?
G(S_rainy, Humidity) = 0.970 - (2/5)*1 - (3/5)*0.918 = 0.019
G(S_rainy, Temp) = 0.970 - (0/5)*0 - (3/5)*0.918 - (2/5)*1 = 0.019
G(S_rainy, Windy) = 0.970 - (2/5)*0 - (3/5)*0 = 0.970
Example GOLF – the final decision tree
Outlook
  sunny    -> Humidity
                normal -> Yes
                high   -> No
  overcast -> Yes
  rainy    -> Windy
                false -> Yes
                true  -> No

Rule extraction
Traversal of the decision tree using the depth-first strategy
(expands one of the nodes at the deepest level of the tree).
A decision rule is a path from the root to a leaf node.
Decision rules:
IF Outlook=sunny AND Humidity=normal THEN Yes
IF Outlook=sunny AND Humidity=high THEN No
IF Outlook=overcast THEN Yes
IF Outlook=rainy AND Windy=false THEN Yes
IF Outlook=rainy AND Windy=true THEN No
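A minimal sketch of this depth-first rule extraction, with the final GOLF tree represented as nested dictionaries (a representation assumed here purely for illustration):

```python
# Depth-first rule extraction from a tree of nested dicts (leaves = class labels).
tree = {
    "Outlook": {
        "sunny": {"Humidity": {"normal": "Yes", "high": "No"}},
        "overcast": "Yes",
        "rainy": {"Windy": {"false": "Yes", "true": "No"}},
    }
}

def extract_rules(node, conditions=()):
    """Each root-to-leaf path becomes one IF ... THEN ... rule."""
    if isinstance(node, str):                       # leaf: class label
        yield "IF " + " AND ".join(conditions) + " THEN " + node
        return
    (attr, branches), = node.items()                # single test attribute per node
    for value, child in branches.items():
        yield from extract_rules(child, conditions + (f"{attr}={value}",))

for rule in extract_rules(tree):
    print(rule)
```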
Why should we try to find a concise tree?
Occam's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
- The more generic the hypothesis, the more chances there are to refute it.
- The size of a hypothesis depends on the knowledge representation (KR) language.

Overfitting
Effect of noisy examples: (sunny, >75, normal, true, No)
Experimental error vs. real error
Extensions
Avoiding overfitting:
- Stop growing the tree when further splits are not statistically significant.
- Generate the full tree, then prune it.
Handling continuous attributes
Gain ratio
Missing attribute values
Cost of attributes
Software: Statistica, SAS, WEKA, Sipina, …

Problems of quality of decision rules
Selection of the best decision tree:
- Evaluate performance on distinct (held-out) data
- Carry out statistical tests
- MDL: minimise size(tree) + size(classification errors(tree))
Rule post-pruning (C4.5)
Rule precision
Precision-based rule ordering
Cost complexity
C4.5
C4.5 is an extension of the basic ID3 [Quinlan, 93]:
- Avoiding overfitting the data: reduced-error pruning, rule post-pruning
- Handling continuous attributes
- Choosing an appropriate attribute selection measure
- Handling training data with missing attribute values
- Improving computational efficiency
- Handling attributes with differing costs: Gain^2(S,A) / Cost(A)

Gain Ratio
The information gain is biased toward tests with many outcomes! Sometimes such a partitioning is useless for classification.
C4.5 applies a kind of normalization:
SplitInformation(S,A) = - Σ_{i=1..c} (|Si| / |S|) * log2(|Si| / |S|)
GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)

CADER
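A short sketch of the gain ratio for the Outlook split, reusing the values derived in the GOLF example above (Gain(S, Outlook) = 0.246, partition sizes 5, 4 and 5 out of 14):

```python
# GainRatio(S, Outlook) = Gain(S, Outlook) / SplitInformation(S, Outlook).
from math import log2

def split_information(partition_sizes):
    """SplitInformation(S,A) = -sum_i (|Si|/|S|) * log2(|Si|/|S|)."""
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes)

split_info = split_information([5, 4, 5])            # ~1.577
print(round(0.246 / split_info, 3))                  # GainRatio(Outlook) ~0.156
```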
Decision Trees
Pros
Fast execution time
Generated rules are easy to interpret by humans
Scale well for large data sets
Can handle high dimensional data
Cons
Cannot capture correlations among attributes
Consider only axis-parallel cuts
Mining Association Rules
Given:
A database of customer transactions
Each transaction is a set of items
Find all rules X => Y that correlate the presence of one set of items X with another set of items Y
Example: 98% of people who purchase diapers and baby food also buy beer.
Any number of items may appear in the consequent/antecedent of a rule.
It is possible to specify constraints on rules (e.g., find only rules involving expensive imported products).

Association Rules - Sample Applications
Market basket analysis
Attached mailing in direct marketing
Fraud detection for medical insurance
Department store floor/shelf planning
Confidence and Support
A rule must have some minimum user-specified confidence:
1 & 2 => 3 has 90% confidence if, when a customer bought 1 and 2, in 90% of cases the customer also bought 3.
A rule must have some minimum user-specified support:
1 & 2 => 3 should hold in some minimum percentage of transactions to have business value.

MinSupport and MinConfidence
Find rules X & Y ⇒ Z with support > s and confidence > c
support s: the probability that a transaction contains {X, Y, Z}
confidence c: the conditional probability of the consequent Z of the rule given its antecedent {X, Y}
Confidence = support(X,Y,Z) / support(X,Y)
(Figure: customers buying chips, customers buying beer, and customers that buy both)

Example transaction database:
TID   Items
2000  A,B,C
1000  A,C
4000  A,D
5000  B,E,F
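A minimal sketch of these support and confidence definitions, computed on the transaction database above:

```python
# Support and confidence on the four example transactions.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"A", "C"}))                   # 0.5   (50%)
print(round(confidence({"A"}, {"C"}), 3))    # 0.667 (A => C)
print(confidence({"C"}, {"A"}))              # 1.0   (C => A)
```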
Algorithm Apriori
Fk: set of frequent itemsets of size k
Ck: set of candidate itemsets of size k

F1 = {frequent 1-itemsets}
for (k = 1; Fk != ∅; k++) do {
    Ck+1 = new candidates generated from Fk
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t
    Fk+1 = candidates in Ck+1 with minimum support
}
Answer = ∪k Fk

Join step: Ck is generated by joining Lk-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

Apriori – Example (min_support = 2)
Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D -> C1: {1}:2  {2}:3  {3}:3  {4}:1  {5}:3
L1 (support >= 2): {1}:2  {2}:3  {3}:3  {5}:3
C2: {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}
Scan D -> C2 counts: {1 2}:1  {1 3}:2  {1 5}:1  {2 3}:2  {2 5}:3  {3 5}:2
L2: {1 3}:2  {2 3}:2  {2 5}:3  {3 5}:2
C3: {2 3 5}
Scan D -> L3: {2 3 5}:2
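A compact sketch of the Apriori loop above (join, prune, count, filter), run on the example database D; this is an illustration, not the lecture's reference implementation:

```python
# Sketch of Apriori: generate candidates from frequent (k)-itemsets, prune by
# the subset property, count supports, and keep the frequent ones.
from itertools import combinations

def apriori(transactions, min_support=2):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # F1: frequent 1-itemsets
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= min_support}
    all_frequent, k = set(freq), 1
    while freq:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count supports and keep the candidates meeting min_support.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset in sorted(apriori(D), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))        # L1, L2 and {2, 3, 5} as in the example
```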
Computing Associations Requires Exponential Computation
Frequent itemsets ≠ association rules
(Figure: lattice of itemsets over {a, b, c, d}, from the single items up to {a,b,c,d})
Given m items, there are 2^m - 1 possible item combinations.

APRIORI's Rule Derivation
For every non-empty proper subset A of a frequent itemset X:
1. Let B = X - A.
2. A → B is an association rule if confidence(A → B) >= minConf,
   where confidence(A → B) = support(AB) / support(A)
   and support(A → B) = support(AB).

Example: X = {2,3,5} is a frequent itemset with minSupp = 50%.
The association rules are:
(2,3) → 5   confidence 100%
2 → (3,5)   confidence 67%
…

For the transaction database with items A-F shown earlier, with minimum support 50% and minimum confidence 50%:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
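A small sketch of this rule-derivation step for X = {2,3,5}; the supports are the ones found by the Apriori example above, and minConf is set to 60% here so that the 67%-confidence rules also appear:

```python
# Deriving association rules A -> X-A from a frequent itemset X.
from itertools import combinations

support = {frozenset(s): c for s, c in [
    ({2}, 3), ({3}, 3), ({5}, 3),
    ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2),
    ({2, 3, 5}, 2),
]}

def derive_rules(X, min_conf=0.6):
    """Yield every rule A -> X-A with confidence(A -> X-A) >= min_conf."""
    X = frozenset(X)
    for r in range(1, len(X)):                    # every non-empty proper subset A
        for A in map(frozenset, combinations(X, r)):
            conf = support[X] / support[A]        # support(AB) / support(A)
            if conf >= min_conf:
                yield sorted(A), sorted(X - A), conf

for antecedent, consequent, conf in derive_rules({2, 3, 5}):
    print(f"{antecedent} -> {consequent}  confidence {conf:.0%}")
    # includes [2, 3] -> [5] at 100% and [2] -> [3, 5] at 67%, as on the slide
```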
Drawbacks of Apriori
Principle: use frequent (k-1)-itemsets to generate candidate k-itemsets.
Ways of improvement: the process of candidate generation.
Counting candidate sets:
- 10^4 frequent 1-itemsets generate about 10^7 candidate 2-itemsets.
- To find a 100-itemset, one has to generate 2^100 ≈ 10^30 candidates.
Many database scans:
- One has to do (n + 1) scans to find the frequent n-itemsets.
- Database scans can take a prohibitive amount of time.
Inefficient data structures are used to store the candidate and frequent sets.

Handling Exponential Complexity
Given n transactions and m different items:
- number of possible association rules: O(m * 2^(m-1))
- computation complexity: O(n * m * 2^m)
Systematic search for all patterns, based on the support constraint [Agrawal & Srikant]:
- If {A,B} has support at least α, then both A and B have support at least α.
- If either A or B has support less than α, then {A,B} has support less than α.
Use patterns of n-1 items to find patterns of n items.
Improving the efficiency of Apriori
Hash-based itemset counting
Transaction reduction (a transaction containing no frequent k-itemsets need not be scanned in future iterations)
Partitioning (to find candidate itemsets)
Sampling (mining on a subset of the given data)
Dynamic itemset counting (adding candidate itemsets at different points during a scan)

Algorithm: Hashing-based itemset counting
Database D:
TID   List of items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Hash table, h(I,J) = (num(I)*10 + num(J)) mod 7:
Address  Count  Content
0        2      {I1,I4} {I3,I5}
1        2      {I1,I5} {I1,I5}
2        4      {I2,I3} {I2,I3} {I2,I3} {I2,I3}
3        2      {I2,I4} {I2,I4}
4        2      {I2,I5} {I2,I5}
5        4      {I1,I2} {I1,I2} {I1,I2} {I1,I2}
6        4      {I1,I3} {I1,I3} {I1,I3} {I1,I3}
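A short sketch of the hash-based counting above; it reproduces the bucket counts 2, 2, 4, 2, 2, 4, 4 from the hash table:

```python
# Hash every 2-itemset of every transaction into one of 7 buckets and count.
from itertools import combinations

transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]

def h(i, j):
    """Hash the 2-itemset {Ii, Ij} (i < j) with h(I,J) = (i*10 + j) mod 7."""
    return (i * 10 + j) % 7

buckets = [0] * 7
for t in transactions:
    for i, j in combinations(sorted(t), 2):
        buckets[h(i, j)] += 1

print(buckets)   # [2, 2, 4, 2, 2, 4, 4]
# A 2-itemset can be frequent only if its bucket count reaches min_support.
```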
Multilevel association rules
The concept hierarchy for the items
(Figure: Product at the top; Computer with Laptop and Desktop, Software with Office and AntiVir; brands HP and Dell)
Data can be generalized by replacing low-level concepts by their ancestors.
Items on the lower levels of the hierarchy have lower support rates.
A top-down strategy is employed.

TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}

Multilevel exploration
Top-down strategy:
Find "strong" rules on the upper levels, e.g. computer => software [20%, 60%],
then pass to the lower levels (with lower support thresholds), e.g. laptop => Office [6%, 50%].
Variations: level crossing, e.g. laptop => MS Office.
Sequential Patterns
[Agrawal, Srikant 95], [Srikant, Agrawal 96]
Given:
A sequence of customer transactions
Each transaction is a set of items
Find all maximal sequential patterns (sequential dependencies) supported by more than a user-specified percentage of customers, e.g. (A B) C (D E).
Example: 10% of customers who bought a PC did a memory upgrade in a subsequent transaction (10% is the support of the pattern).
Apriori-style algorithms can be used to compute frequent sequences.

Sequential Patterns and Time-Series Analysis
Trend analysis
- Trend movement vs. cyclic variations, seasonal variations and random fluctuations
Similarity search in time-series databases
- Handling gaps, scaling, etc.
- Indexing methods and query languages for time series
Sequential pattern mining
- Various kinds of sequences, various methods
- From GSP to PrefixSpan
Periodicity analysis
- Full periodicity, partial periodicity, cyclic association rules
Efficient Methods for Mining Association Rules
Apriori algorithm [Agrawal, Srikant 94]
DHP (Apriori + hashing) [Park, Chen, Yu 95]
- A k-itemset is in Ck only if it is hashed into a bucket satisfying minimum support
Partitioning [Savasere, Omiecinski, Navathe 95]
- Any potential frequent itemset appears as a frequent itemset in at least one of the partitions
Random sampling [Toivonen 96]
Dynamic Itemset Counting [Brin, Motwani, Ullman, Tsur 97]
- During a pass, if an itemset becomes frequent, then start counting support for all its supersets (with frequent subsets)
FUP [Cheung, Han, Ng, Wang 96]: incremental algorithm
PDM [Park, Chen, Yu 95]
- Uses a hashing technique to identify k-itemsets from the local database
Parallel and distributed methods [Agrawal, Shafer 96]: count distribution
FDM [Cheung, Han, Ng, Fu, Fu 96]

Partitioning and Clustering
(Figure: partitioning of the data set into clusters; s = 50, p = 2, s/p = 25, s/pq = 5)
Clustering
Given:
Data points and the number of desired clusters K
Group the data points into K clusters
Data points within clusters are more similar than across clusters
Sample applications:
- Customer segmentation
- Market basket customer analysis
- Attached mailing in direct marketing
- Clustering companies with similar growth

Traditional Algorithms
Hierarchical clustering (agglomerative and divisive)
Nested partitions
Tree structure
K-Means: Example
A = {1,2,3,6,7,8,13,15,17}. Create 3 clusters in A.
Take 3 objects at random, e.g. 1, 2 and 3:
C1={1}, M1=1, C2={2}, M2=2, C3={3}, M3=3
Each object is assigned to the closest cluster.
So 6 is assigned to C3 because dist(M3,6) < dist(M2,6) and dist(M3,6) < dist(M1,6).
The result is:
C1={1}, M1=1
C2={2}, M2=2
C3={3,6,7,8,13,15,17}, M3=69/7=9.86

K-Means: Example (continued)
dist(3,M2) < dist(3,M3), so 3 moves to C2; the other objects stay in C3.
C1={1}, M1=1, C2={2,3}, M2=2.5, C3={6,7,8,13,15,17}, M3=66/6=11
dist(6,M2) < dist(6,M3), so 6 moves to C2; the other objects do not move.
C1={1}, M1=1, C2={2,3,6}, M2=11/3=3.67, C3={7,8,13,15,17}, M3=12
dist(2,M1) < dist(2,M2), so 2 moves to C1.
dist(7,M2) < dist(7,M3), so 7 moves to C2; the other objects do not move.
C1={1,2}, M1=1.5, C2={3,6,7}, M2=5.34, C3={8,13,15,17}, M3=13.25
dist(3,M1) < dist(3,M2), so 3 moves to C1.
dist(8,M2) < dist(8,M3), so 8 moves to C2.
C1={1,2,3}, M1=2, C2={6,7,8}, M2=7, C3={13,15,17}, M3=15
Nothing changes. End.
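A minimal 1-D K-means sketch reproducing this example; it uses batch reassignment rather than the slide's object-by-object moves, but it converges to the same three clusters:

```python
# 1-D K-means with the first k points as initial centers (1, 2 and 3 here).
def kmeans_1d(points, k, max_iter=100):
    centers = points[:k]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign each point to the closest center
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Recompute the means (this sketch assumes no cluster becomes empty).
        new_centers = [sum(c) / len(c) for c in clusters]
        if new_centers == centers:             # nothing changes: stop
            return clusters, centers
        centers = new_centers
    return clusters, centers

A = [1, 2, 3, 6, 7, 8, 13, 15, 17]
clusters, centers = kmeans_1d(A, 3)
print(clusters)   # [[1, 2, 3], [6, 7, 8], [13, 15, 17]]
print(centers)    # [2.0, 7.0, 15.0]
```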
Algorithm K-Means: Example
(Figure: scatter plots of 2-D points illustrating successive K-means iterations)

Clustering: Summary of Drawbacks of Traditional Methods
- Partition-based algorithms split large clusters.
- Centroid-based methods split large and non-hyperspherical clusters; centers of subclusters can be far apart.
- The minimum spanning tree algorithm is sensitive to outliers and to slight changes in position; it exhibits a chaining effect on strings of outliers.
- They cannot scale up for large databases.
Clustering: Scalable Clustering Algorithms
(from the database and machine learning communities)
CLARANS – sampling the database
DBSCAN – density-based method
BIRCH – partitions objects hierarchically using a tree structure
CLIQUE – integrates density-based and grid-based methods
CURE
ROCK – merges clusters based on their interconnectivity
COBWEB and CLASSIT
Neural networks: SOM, GNG
…

CURE (Clustering Using REpresentatives)
CURE (1998):
Stops at k clusters
Based on representative points
(Figure: the classical methods generate the clusters shown in (b))
CURE: merging representative points
(Figure: clusters in the x-y plane with their representative points)
The representative points are shrunk toward the center by a factor α (this handles outliers!).
The representative points allow the shape of the cluster to be defined.

Outlier Discovery
Given:
Data points and the number n of outliers to find
Find the top n outlier points
Outliers are considerably dissimilar from the remainder of the data.
Sample applications:
- Credit card fraud detection
- Telecom fraud detection
- Customer segmentation
- Medical analysis
Data Mining with WEKA: the software
Machine learning/data mining software written in Java (distributed under the GNU General Public License)
Used for research, education, and applications
Complements "Data Mining" by Witten & Frank
Main features:
- Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
- Graphical user interfaces (incl. data visualization)
- Environment for comparing learning algorithms

Social network analysis
Which people are powerful?
Which people influence other people?
How does information spread within the network?
Who is relatively isolated, and who is well connected?
…
Internet search engines
- Google search engine: PageRank algorithm
Marketing
- Viral marketing: "word-of-mouth" advertising
- Hotmail – free email service
Fraud detection
- AML systems
…
Link analysis
Link analysis techniques are applied to data that can be represented as nodes and links.
A node (vertex): a person, a bank account, a document, …
A link: a relationship, e.g. between two bank accounts
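
Social Network Analysis – measures
SNA is the science of using network theory to construct, view and analyze social networks.
Degree Centrality – the number of direct relationships that an entity has.
Closeness Centrality – how quickly an entity can access more entities in the network.
Betweenness Centrality – identifies an entity's position within a network in terms of its ability to make connections to other pairs or groups.
Eigenvalue (eigenvector) Centrality – measures how close an entity is to other highly central entities within a network.

A brief sketch of these four measures using the networkx library (assumed to be available); the small friendship graph below is made up purely for illustration:

```python
# Computing the four SNA measures on a tiny illustrative graph with networkx.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Ann", "Bob"), ("Ann", "Carl"), ("Bob", "Carl"),
    ("Carl", "Dana"), ("Dana", "Eva"),
])

print(nx.degree_centrality(G))        # number of direct relationships (normalized)
print(nx.closeness_centrality(G))     # how quickly an entity reaches the others
print(nx.betweenness_centrality(G))   # position on shortest paths between pairs
print(nx.eigenvector_centrality(G))   # closeness to other highly central entities
```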
Social networks (figure)
Degree Centrality (figure)
Large Graph Mining [C. Faloutsos et al., KDD 2009] (figures)
(Figure: network of Latvian political parties: National Alliance, Unity, Harmony Center, Greens/Farmers, Zatler's Reform Party. The ruling coalition of Unity, Zatler's Reform Party and the National Alliance, all majority ethnic Latvian.)

Future Research Issues
Incorporating constraints into existing data mining techniques
Traditional algorithms:
- Disproportionate computational cost for selective users
- Overwhelming volume of potentially useless results
Need for user-controlled focus in the mining process:
- Association rules containing certain items
- Sequential patterns containing certain patterns
Tight coupling with DBMS:
- Most data mining algorithms are based on flat-file data (i.e. loose coupling with the DBMS)
- A set of standard data mining operators (e.g. a sampling operator)