Chapter 10. Prediction and Cluster Analysis
Prediction
Definition:
Models continuous-valued functions, i.e., predicts unknown or missing values.
Similarities to Classification:
 construct a model
 use the model to predict a continuous or ordered value for a given input
Differences from Classification:
 Classification predicts categorical (discrete) class labels
 Prediction models continuous-valued functions
Major method for prediction: regression
 model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis:
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear regression:
Involves a response variable y and a single predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
Example:
Finding the regression equation with Excel:
1. Create a chart from the existing data
2. Right-click the data in the chart and choose Add Trendline
3. On the Type tab, choose the desired regression type
4. On the Options tab, check Display equation on chart
5. Press OK
Adding data to the chart:
1. Enter the new data below the existing data
2. Click the data in the chart
3. Drag the blue outline down until it covers the new data
4. The chart will update to reflect the added data
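The trendline equation that Excel displays for a linear fit is an ordinary least-squares fit. As an alternative, the coefficients w0 and w1 can be computed directly; a minimal sketch in Python, with made-up sample data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # Closed-form least-squares estimates of slope (w1) and intercept (w0)
    w1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    w0 = y.mean() - w1 * x.mean()
    print(f"y = {w0:.3f} + {w1:.3f} x")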
Cluster Analysis
Definition 1:
Clustering is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets (clusters), so that the
data in each subset (ideally) share some common trait - often proximity
according to some defined distance measure. Data clustering is a
common technique for statistical data analysis, which is used in many
fields, including machine learning, data mining, pattern recognition,
image analysis and bioinformatics.
Besides the term data clustering (or just clustering), there are a number
of terms with similar meanings, including cluster analysis, automatic
classification, numerical taxonomy, botryology and typological analysis.
Definition 2:
Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to identify homogeneous subgroups of cases in a population. That is, cluster analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation.
Definition 3:
What is Cluster Analysis?
 Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: two well-separated groups of points, annotated to show that intra-cluster distances are minimized while inter-cluster distances are maximized. Figures in this chapter are © Tan, Steinbach, Kumar, Introduction to Data Mining, 2004.]
Applications of Cluster Analysis
- Understanding
  – Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization
  – Reduce the size of large data sets
What is not Cluster Analysis?
- Supervised classification
  – Have class label information
- Simple segmentation
  – Dividing students into different registration groups alphabetically, by last name
- Results of a query
  – Groupings are a result of an external specification
Notion of a Cluster can be Ambiguous
[Figure: how many clusters? The same set of points can be read as two, four, or six clusters.]
Types of Clusterings
- A clustering is a set of clusters
- Partitional Clustering
  – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
Partitional Clustering
[Figure: original points and a partitional clustering of them.]
Hierarchical Clustering
[Figure: traditional and non-traditional hierarchical clusterings of points p1–p4, each with its corresponding dendrogram.]
Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering
K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- Number of clusters, K, must be specified
- The basic algorithm is very simple
K-means Clustering – Details
- Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
  Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2, and z3 = (x3 + y3)/2.
- 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
  Euclidean distance is the most common distance measure. A given pair of cases is plotted on two variables, which form the x and y axes. The Euclidean distance is the square root of the sum of the square of the x difference plus the square of the y difference: d = sqrt((x1 - x2)^2 + (y1 - y2)^2).
- K-means will converge for the common similarity measures mentioned above.
- Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to 'Until relatively few points change clusters'
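A minimal sketch of the basic algorithm in Python, following the details above (random initial centroids, mean centroids, Euclidean distance); the function name and the data are illustrative, not from any particular library:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Choose k initial centroids at random from the data points
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each point to the cluster with the closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of the points in its cluster
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):  # stop once the centroids no longer move
                break
            centroids = new
        return labels, centroids

    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
    labels, centers = kmeans(X, k=2)
    print(labels, centers)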
Two different K-means Clusterings
[Figure: the same original points clustered two ways by K-means; one run produces the optimal clustering, the other a sub-optimal clustering, depending on the initial centroids.]
Limitations of K-means
- K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
- K-means has problems when the data contains outliers.
Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means (3 clusters).]
Limitations of K-means: Differing Density
[Figure: original points vs. K-means (3 clusters).]
Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means (2 clusters).]
Overcoming K-means Limitations
One solution is to use many clusters: K-means then finds parts of the natural clusters, which must be put back together afterwards.
[Figures: original points vs. K-means clusters for the differing-size, differing-density, and non-globular examples above, each using many small clusters.]
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large data sets. Its disadvantage is that it does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.
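Because of this run-to-run variation, implementations commonly restart K-means from several random initializations and keep the best result. A hedged sketch using scikit-learn, assuming it is installed (n_init controls the number of restarts):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    # Run 10 random initializations and keep the clustering with the lowest
    # within-cluster sum of squared distances (reported as inertia_)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_, km.cluster_centers_, km.inertia_)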
Hierarchical Clustering
 Produces a set of nested clusters organized as a hierarchical tree
 Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of merges or splits
[Figure: a nested clustering of six points and its dendrogram; the vertical axis records the heights at which clusters are merged.]
Strengths of Hierarchical Clustering
- Do not have to assume any particular number of clusters
  – Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
- They may correspond to meaningful taxonomies
  – Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...)
Limitation of Hierarchical Clustering
Hierarchical clustering is appropriate for smaller samples (typically < 250). When n is large, the algorithm is very slow to reach a solution.
Two main types of hierarchical clustering
– Agglomerative (Forward Clustering):
   Start with the points as individual clusters
   At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive (Backward Clustering):
   Start with one, all-inclusive cluster
   At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
- More popular hierarchical clustering technique
- Basic algorithm is straightforward:
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
- Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
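A minimal sketch of steps 1-6 in Python, using the minimum distance between clusters (single linkage, defined in the next section) as the proximity; all names and data are illustrative:

    import numpy as np

    def agglomerative(points, k=1):
        # Step 2: let each data point be its own cluster
        clusters = [[i] for i in range(len(points))]
        # Step 1: proximity matrix of pairwise Euclidean distances
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        while len(clusters) > k:
            # Steps 3-4: find and merge the two closest clusters (single linkage)
            best, best_dist = (0, 1), np.inf
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                    if dist < best_dist:
                        best_dist, best = dist, (a, b)
            a, b = best
            clusters[a] += clusters[b]  # merge the two closest clusters
            del clusters[b]             # step 5 is implicit: proximities are recomputed from d
        return clusters

    pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    print(agglomerative(pts, k=2))  # [[0, 1], [2, 3]]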
How to Define Inter-Cluster Similarity (Proximity)
1. The minimum distance between elements of each cluster (also called single linkage clustering)
   Similarity of two clusters is based on the two most similar (closest) points in the different clusters
2. The maximum distance between elements of each cluster (also called complete linkage clustering)
   Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
3. The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA, the unweighted pair-group method using averages)
   Proximity of two clusters is the average of the pairwise proximities between points in the two clusters
4. The distance between the centroids of the two clusters (also called centroid linkage clustering)
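These four definitions correspond to the method argument of SciPy's linkage function, assuming SciPy is available; a hedged comparison on made-up points:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9], [9.0, 0.0]])
    for method in ("single", "complete", "average", "centroid"):
        Z = linkage(X, method=method)                    # full merge history
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
        print(method, labels)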
Applications
- Biology
- Market research
- Image segmentation: Clustering can be used to divide a digital image
into distinct regions for border detection or object recognition.
- Data mining: Many data mining applications involve partitioning data
items into related subsets; the marketing applications discussed above
represent some examples. Another common application is the division of
documents, such as World Wide Web pages, into genres.
- Search result grouping: In the process of intelligently grouping files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web-based clustering tools, such as Clusty.