δ-Clusters: Capturing Subspace
Correlation in a Large Data Set
Authors: Jiong Yang, Wei Wang, et al. (ICDE 2002)
Presenter: Xuehua Shen
May 22, 2017
Data Mining: Concepts and Techniques
1
Presentation Layout

Overview of Clustering

Related Work of δ-Clusters

δ-Clusters Model

FLOC algorithm
Clustering

Clustering: the process of grouping a set of
objects into classes of similar objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Major Clustering Methods

Partitioning algorithms

Hierarchical algorithms

Density-based

Grid-based

Model-based
Similarity


Clustering: the process of grouping a set of
objects into classes of similar objects
But how to define similarity?
Similarity cont.

Traditional clustering model: based on distance functions

Some popular ones include the Minkowski distance:
$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

But strong correlations may still exist among a set of
objects even if they are far apart from each other as
measured by the distance function
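
To make the last point concrete, here is a minimal Python sketch of the Minkowski distance. The two objects are made-up illustration data, not taken from the slides: the second is the first shifted by a constant offset, so they follow exactly the same pattern and are still far apart under the distance function.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimensionality")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical objects: b has the same profile as a, shifted by +100.
a = [1.0, 5.0, 2.0, 8.0]
b = [v + 100.0 for v in a]

manhattan = minkowski(a, b, 1)  # q = 1
euclidean = minkowski(a, b, 2)  # q = 2
print(manhattan, euclidean)
```

Both distances are large even though the two objects rise and fall together on every dimension.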
Similarity cont.

δ-Clusters model: objects are similar when they exhibit a
coherent pattern on a subset of dimensions

Can cluster objects that show a shifting pattern or a
scaling pattern
Similarity cont.

Example of Coherent Pattern:
Shifting Pattern
Scaling Pattern
Subspace Clustering

From high dimensional clustering (problematic)
To subspace clustering

Not restricted to a fixed ordering of columns,
in contrast to patterns in time-series data

Challenge: curse of dimensionality!
Subspace Clustering cont.

Example of subspace clustering

         CH1I   CH1B   CH1D   CH2I   CH2B
CTFC3    4392    284   4108    280    228
VPS8      401    281    120    275    298
EFB1      318    280     37    277    215
SSA1      401    292    109    580    238
FUN14    2857    285   2576    271    226
SP07      228    290     48    285    224
MDM10     538    272    266    277    236
CYS3      322    288     41    278    219

Submatrix on a subset of genes and conditions:

         CH1I   CH1D   CH2B
VPS8      401    120    298
EFB1      318     37    215
CYS3      322     41    219
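
As a quick check of the example, the sketch below tests whether the three selected genes differ from one another by a constant offset on the three selected conditions, which is what a shifting pattern means. The helper name `row_offsets` is an illustrative assumption, not something from the slides.

```python
# Values of the three selected genes on the three selected conditions.
sub = {
    "VPS8": [401, 120, 298],
    "EFB1": [318, 37, 215],
    "CYS3": [322, 41, 219],
}


def row_offsets(rows, reference):
    """For each row, collect the set of column-wise differences from the reference row."""
    ref = rows[reference]
    return {name: {v - r for v, r in zip(vals, ref)} for name, vals in rows.items()}


offsets = row_offsets(sub, "EFB1")
# A perfect shifting pattern leaves exactly one distinct offset per row.
is_shifting = all(len(diffs) == 1 for diffs in offsets.values())
print(offsets, is_shifting)
```

Each gene differs from EFB1 by a single constant across all three conditions, so the submatrix is a perfect shifting pattern even though it is hidden in the full table.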
Applications

Microarray Data Analysis in Biology

E-Commerce
Microarray Data Analysis


Matrix (Dense)
Rows: Genes
Columns: Various Samples
experiment conditions or tissues
Values in Matrix: expression level
relative abundance of the mRNA of a gene under
a specific condition
Microarray Data Analysis cont.

From Scaling Pattern to Shifting Pattern
$$d_{ij} = \log\left(\frac{\text{RedIntensity}}{\text{GreenIntensity}}\right)$$
Red: gene of interest, Green: control gene

Investigations show that several genes contribute
to a disease, which motivates researchers to
identify a subset of genes whose expression levels
rise and fall coherently under a subset of
conditions
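
The step from scaling pattern to shifting pattern follows from log(c * x) = log(c) + log(x). A small sketch with made-up intensity ratios (not data from the slides) illustrates it:

```python
import math

# Hypothetical intensity ratios: `scaled` is `base` multiplied by a constant factor of 3.
base = [2.0, 8.0, 4.0, 16.0]
scaled = [3.0 * v for v in base]

log_base = [math.log(v) for v in base]
log_scaled = [math.log(v) for v in scaled]

# After the log transform, the scaling factor becomes a constant additive offset.
diffs = [s - b for s, b in zip(log_scaled, log_base)]
print(diffs)
```

Every difference equals log(3) up to floating-point error, so a model that handles shifting patterns also covers scaling patterns once the data is log-transformed.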
E-Commerce



Example: rating of movies (1: lowest rating, 10: highest rating)

            Movie 1   Movie 2   Movie 3   Movie 4
Viewer 1       1         2         3         6
Viewer 2       2         3         4         7
Viewer 3       4         5         6         9

Shifting Pattern
If, for a new movie, the 1st viewer gives a rating of 7 and the 3rd
viewer a rating of 9, the 2nd viewer will probably like this
movie too
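
One simple way to turn this observation into a number is sketched below. It is an illustrative assumption, not a method from the slides: estimate the 2nd viewer's rating from each other viewer's rating plus the average offset between the two viewers, then average the estimates.

```python
ratings = {
    "viewer1": [1, 2, 3, 6],
    "viewer2": [2, 3, 4, 7],
    "viewer3": [4, 5, 6, 9],
}


def mean_offset(target, other):
    """Average amount by which `target` rates above `other` on the shared movies."""
    return sum(t - o for t, o in zip(target, other)) / len(target)


# Ratings of the new movie by the viewers who have already seen it.
new_movie = {"viewer1": 7, "viewer3": 9}

estimates = [
    score + mean_offset(ratings["viewer2"], ratings[name])
    for name, score in new_movie.items()
]
predicted = sum(estimates) / len(estimates)
print(estimates, predicted)
```

The two estimates disagree slightly because the new ratings do not follow the earlier offsets exactly, but both place the 2nd viewer's rating in the upper part of the scale.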
Presentation Layout

Overview of clustering

Related Work of δ-Clusters

δ-Clusters Model

FLOC algorithm
Related Work

CLIQUE, ORCLUS, PROCLUS (subspace clustering)

Can capture neither the shifting pattern nor the
scaling pattern

Bicluster model proposed as a measure of
coherence of genes and conditions in a submatrix
of a DNA array
Bicluster

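Model: mean squared residue score of a submatrix:
$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} \left( d_{ij} - d_{iJ} - d_{Ij} + d_{IJ} \right)^2$$
where
$$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \quad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \quad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}$$
A submatrix $A_{IJ}$ is called a δ-bicluster if $H(I, J) \le \delta$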

Algorithm: A random algorithm to give an
approximate answer
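
The mean squared residue itself is straightforward to compute. The sketch below covers only the score, not the randomized search algorithm, and assumes a fully specified submatrix given as a list of rows. The first test matrix reuses the movie ratings from the E-Commerce example; the second changes one entry to break the pattern.

```python
def mean_squared_residue(matrix):
    """Mean squared residue H(I, J) of a fully specified submatrix (list of rows)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_mean = [sum(row) / n_cols for row in matrix]
    col_mean = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    all_mean = sum(row_mean) / n_rows
    total = 0.0
    for i, row in enumerate(matrix):
        for j, d in enumerate(row):
            total += (d - row_mean[i] - col_mean[j] + all_mean) ** 2
    return total / (n_rows * n_cols)


shifting = [[1, 2, 3, 6], [2, 3, 4, 7], [4, 5, 6, 9]]
noisy = [[1, 2, 3, 6], [2, 9, 4, 7], [4, 5, 6, 9]]  # one entry changed from 3 to 9

h_shifting = mean_squared_residue(shifting)
h_noisy = mean_squared_residue(noisy)
print(h_shifting, h_noisy)
```

A perfect shifting pattern scores zero up to floating-point error, and the score grows as entries deviate from the pattern.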
Weakness of bicluster

Missing Values

Constraints
Presentation Layout

Overview

Related Work

δ-Clusters Model

FLOC algorithm
Occupancy Threshold

A parameter α controls the percentage of missing
values allowed in a submatrix: for each object i,
$$\frac{|J'_i|}{|J|} \ge \alpha$$

$|J'_i|$ is the number of specified attributes for object i in the δ-cluster

$|J|$ is the number of attributes in the δ-cluster
Occupancy Threshold cont.

A similar occupancy threshold applies to each attribute j in the δ-cluster:
$$\frac{|I'_j|}{|I|} \ge \alpha$$

Example: α = 0.6
[Figure: example matrix with missing entries]
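
The occupancy check can be sketched as follows. Missing entries are represented as `None`, and the small matrix is made up for illustration because the slide's example figure did not survive in the transcript.

```python
def satisfies_occupancy(matrix, alpha):
    """True if every row and every column has at least a fraction alpha of specified entries."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    rows_ok = all(
        sum(v is not None for v in row) / n_cols >= alpha for row in matrix
    )
    cols_ok = all(
        sum(row[j] is not None for row in matrix) / n_rows >= alpha
        for j in range(n_cols)
    )
    return rows_ok and cols_ok


# Hypothetical submatrix; None marks a missing value.
cluster = [
    [1, 3, None, 4],
    [3, None, 5, 4],
    [None, 3, 4, 3],
]

ok_at_06 = satisfies_occupancy(cluster, 0.6)
ok_at_07 = satisfies_occupancy(cluster, 0.7)
print(ok_at_06, ok_at_07)
```

Every row here is 3/4 specified and the sparsest columns are 2/3 specified, so the submatrix passes at α = 0.6 and fails at α = 0.7.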
Volume


The volume of a δ-cluster (I, J) is the number of
specified entries $d_{ij}$ in (I, J)
Example
volume is 3 * 3 = 9
[Figure: example matrix]
Base

Object Base
$$d_{iJ} = \frac{\sum_{j \in J'_i} d_{ij}}{|J'_i|}$$

Attribute Base
$$d_{Ij} = \frac{\sum_{i \in I'_j} d_{ij}}{|I'_j|}$$
Base cont.

δ-Cluster Base
$$d_{IJ} = \frac{\sum_{i \in I,\, j \in J} d_{ij}}{v_{IJ}}$$
where $v_{IJ}$ is the volume of the δ-cluster

For a perfect δ-cluster
$$d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$$
Residue

Entry Residue
$$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ} & \text{if } d_{ij} \text{ is specified} \\ 0 & \text{otherwise} \end{cases}$$
Residue cont.

δ-Cluster Residue
$$r_{IJ} = \frac{\sum_{i \in I,\, j \in J} |r_{ij}|}{v_{IJ}}$$

A δ-cluster is an r-residue δ-cluster if its residue is less than
or equal to r
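
The volume, bases, and residue can be combined into one short sketch. It makes two assumptions: missing entries are represented as `None`, and the cluster residue is taken as the mean of the absolute entry residues over the specified entries. It also assumes every row and column has at least one specified entry. The test data is the three-gene submatrix from the subspace clustering example.

```python
def cluster_residue(matrix):
    """Return (volume, residue) of a submatrix whose missing entries are None."""
    n_cols = len(matrix[0])
    specified = [
        (i, j, v)
        for i, row in enumerate(matrix)
        for j, v in enumerate(row)
        if v is not None
    ]
    volume = len(specified)  # number of specified entries

    # Object (row) and attribute (column) bases use specified entries only.
    row_vals = [[v for v in row if v is not None] for row in matrix]
    col_vals = [[row[j] for row in matrix if row[j] is not None] for j in range(n_cols)]
    row_base = [sum(vals) / len(vals) for vals in row_vals]
    col_base = [sum(vals) / len(vals) for vals in col_vals]
    cluster_base = sum(v for _, _, v in specified) / volume

    # Assumed aggregation: mean absolute entry residue.
    total = sum(
        abs(v - row_base[i] - col_base[j] + cluster_base) for i, j, v in specified
    )
    return volume, total / volume


full = [[401, 120, 298], [318, 37, 215], [322, 41, 219]]
with_gap = [[401, 120, 298], [318, None, 215], [322, 41, 219]]

volume_full, residue_full = cluster_residue(full)
volume_gap, residue_gap = cluster_residue(with_gap)
print(volume_full, residue_full, volume_gap, residue_gap)
```

The fully specified submatrix has volume 9 and residue 0, so it is an r-residue δ-cluster for any r ≥ 0. Dropping one entry reduces the volume to 8.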
Presentation Layout

Overview of Clustering

Related Work of δ-Clusters

δ-Clusters Model

FLOC algorithm (Flexible Overlapping Clustering)
Flow Chart
1. Generate initial clusters
2. Determine the best action for each row and each column
3. Perform the best actions sequentially
4. Improved? Y: return to step 2. N: stop.
Initial Cluster

Randomly generate k initial clusters

Different parameter values produce initial clusters of different sizes
Choose best actions

For every object or attribute, there are k candidate
actions, one for each cluster

Choose the best action among the k candidates
according to gain

Gain is the difference between the original residue
and the residue after the action is performed on
the cluster
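
A minimal sketch of the gain computation for a row, under several assumptions that are not in the slides: an action toggles the row's membership in a cluster, the data has no missing values, the residue is the mean absolute entry residue, and a toggle never empties a cluster. All function names are illustrative. The first three data rows are VPS8, EFB1 and CYS3 from the earlier example; the fourth is SSA1 on the same three conditions.

```python
def residue(data, rows, cols):
    """Mean absolute entry residue of the submatrix data[rows][cols] (no missing values)."""
    rows, cols = sorted(rows), sorted(cols)
    row_base = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(
        abs(data[i][j] - row_base[i] - col_base[j] + base) for i in rows for j in cols
    )
    return total / (len(rows) * len(cols))


def row_action_gain(data, rows, cols, i):
    """Gain of toggling row i: residue before the action minus residue after it."""
    toggled = rows ^ {i}  # insert the row if absent, remove it if present
    return residue(data, rows, cols) - residue(data, toggled, cols)


def best_row_action(data, clusters, i):
    """Index and gain of the best of the k candidate actions for row i."""
    gains = [row_action_gain(data, rows, cols, i) for rows, cols in clusters]
    best = max(range(len(clusters)), key=lambda c: gains[c])
    return best, gains[best]


data = [
    [401, 120, 298],  # VPS8
    [318, 37, 215],   # EFB1
    [322, 41, 219],   # CYS3
    [401, 109, 238],  # SSA1, does not follow the shifting pattern
]
clusters = [
    ({0, 1, 2, 3}, {0, 1, 2}),  # contains row 3: the action removes it
    ({0, 1}, {0, 1, 2}),        # lacks row 3: the action inserts it
]

best_cluster, best_gain = best_row_action(data, clusters, 3)
print(best_cluster, best_gain)
```

Removing the ill-fitting row from the first cluster lowers its residue to zero, which gives a positive gain. Inserting it into the second cluster raises that cluster's residue, which gives a negative gain.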
Choose Best Actions cont.

Sometimes the action is performed even if its gain
is negative, in order to reach the global
optimum
Do the actions sequentially

Generate the action sequence using one of:
1) the same order in all iterations
2) a random order
3) a weighted random order
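
The three ordering strategies could look like the sketch below. The weighting scheme is an assumption made for illustration, since the slides do not specify it: actions are drawn without replacement, with probability proportional to their gain after shifting all gains to be positive. The action labels and gains are made up.

```python
import random


def fixed_order(actions):
    """1) The same order in every iteration."""
    return list(actions)


def random_order(actions, rng):
    """2) A uniformly random permutation."""
    order = list(actions)
    rng.shuffle(order)
    return order


def weighted_random_order(actions, gains, rng):
    """3) Sample without replacement; actions with larger gain tend to come earlier."""
    shift = min(gains)
    # Shift gains so every weight is strictly positive (an assumption of this sketch).
    weights = [g - shift + 1e-9 for g in gains]
    remaining = list(range(len(actions)))
    order = []
    while remaining:
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        remaining.remove(pick)
        order.append(actions[pick])
    return order


rng = random.Random(0)  # seeded for reproducibility
actions = ["row0", "row1", "col0", "col1"]  # hypothetical action labels
gains = [0.5, -0.2, 1.3, 0.0]               # hypothetical gains

order_fixed = fixed_order(actions)
order_random = random_order(actions, rng)
order_weighted = weighted_random_order(actions, gains, rng)
print(order_fixed, order_random, order_weighted)
```

All three functions return a permutation of the same actions; they differ only in how likely each action is to be performed early.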
Output the Best cluster

When, after some iterations, the minimum residue
no longer improves, the algorithm stops and outputs
the k best clusters
End

Thank you!