Chapter 9
The k-Means Algorithm and
Genetic Algorithm
Contents
• k-Means algorithm
• Genetic algorithm
• Rough set approach
• Fuzzy set approaches
Data Warehouse and Data Mining
Chapter 8
The K-Means Algorithm
The K-Means algorithm is a simple yet effective
statistical clustering technique.
Here is the algorithm:
1. Choose a value for K, the total number of clusters to
be determined.
2. Choose K instances (data points) within the dataset
at random. These are the initial cluster centers.
3. Use simple Euclidean distance to assign the
remaining instances to their closest cluster center.
4. Use the instances in each cluster to calculate a new
mean for each cluster.
5. If the new mean values are identical to the mean
values of the previous iteration the process
terminates. Otherwise, use the new means as
cluster centers and repeat steps 3-5.
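The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `k_means`, the `seed` and `max_iter` parameters, and the rule that an empty cluster keeps its previous center are all assumptions not stated in the slides.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: choose K instances at random as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each instance to its closest center (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Step 4: compute a new mean for each cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Step 5: terminate when the means are identical to the previous ones.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
```

On this small, well-separated sample the two groups are recovered regardless of the random start. In general the result depends on the initial centers, so running the function with several seeds is a common precaution.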
An Example Using K-Means
General Considerations
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space.
• The nearest neighbors are defined in terms of
Euclidean distance.
• The target function could be discrete- or real-valued.
[Figure: positive (+) and negative (−) training examples surrounding the query point xq]
• For a discrete-valued target function, k-NN returns the most
common value among the k training examples
nearest to xq.
• Voronoi diagram: the decision surface induced
by 1-NN for a typical set of training examples.
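The discrete-valued case can be sketched as a majority vote over the k closest training examples. The function name `knn_classify` and the sample data are hypothetical, and ties are broken arbitrarily by `Counter`, which the slides do not specify.

```python
import math
from collections import Counter


def knn_classify(training, xq, k):
    """training: list of (point, label) pairs; returns the majority label."""
    # Sort the training examples by Euclidean distance to the query point.
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], xq))[:k]
    # Return the most common label among the k nearest examples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"), ((5, 5), "-"), ((6, 5), "-")]
label = knn_classify(training, (1.5, 1.5), k=3)
```

With k = 1 the same function reproduces the 1-NN rule whose decision surface is the Voronoi diagram mentioned above.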
Discussion on the k-NN Algorithm
• The k-NN algorithm for continuous-valued target
functions
– Calculate the mean of the target values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
– Weight the contribution of each of the k neighbors
according to its distance to the query point xq
• giving greater weight to closer neighbors
– Similarly, for real-valued target functions
$$ w = \frac{1}{d(x_q, x_i)^2} $$
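A minimal sketch of the distance-weighted variant for a real-valued target follows, using the inverse-square-distance weight. Two details are assumptions not shown on the slide: the prediction is normalized by the sum of the weights, and a query that coincides with a training point returns that point's value. The name `weighted_knn_regress` is hypothetical.

```python
import math


def weighted_knn_regress(training, xq, k):
    """training: list of (point, value) pairs with real-valued targets."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], xq))[:k]
    num = den = 0.0
    for xi, value in nearest:
        d = math.dist(xi, xq)
        if d == 0.0:
            # Assumption: an exact match returns its own target value.
            return value
        w = 1.0 / d ** 2          # closer neighbors get greater weight
        num += w * value
        den += w
    # Assumption: the weighted sum is normalized by the total weight.
    return num / den


samples = [((0.0,), 0.0), ((3.0,), 3.0)]
estimate = weighted_knn_regress(samples, (1.0,), k=2)
```

Here the neighbor at distance 1 receives weight 1 and the neighbor at distance 2 receives weight 0.25, so the estimate is pulled toward the closer point.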
Genetic Learning
Here we present a basic genetic learning algorithm.
1. Initialize a population P of n elements, often
referred to as chromosomes, each representing a potential solution.
2. Until a specified termination condition is satisfied:
a. Use a fitness function to evaluate each element of
the current population. If an element passes the fitness
criteria, it remains in P.
b. The population now contains m elements
(m <= n). Use genetic operators to create (n - m) new
elements. Add the new elements to the population.
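The loop above can be sketched on bit-string chromosomes. Everything specific here is an assumption chosen for illustration: the fitness function (number of 1 bits), the criterion "passes if ranked in the top half", the 10% mutation rate, drawing parents from the whole current population, and the names `fitness` and `genetic_search`.

```python
import random


def fitness(chromosome):
    """Hypothetical fitness: the number of 1 bits in the string."""
    return chromosome.count("1")


def genetic_search(n=20, length=12, generations=40, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize a population P of n random chromosomes.
    population = ["".join(rng.choice("01") for _ in range(length)) for _ in range(n)]
    for _ in range(generations):
        # Step 2: terminate early once a perfect chromosome exists.
        if max(fitness(c) for c in population) == length:
            break
        # Step 2a: elements that pass the fitness criterion remain in P
        # (assumption: "pass" means ranking in the top half).
        survivors = sorted(population, key=fitness, reverse=True)[: n // 2]
        # Step 2b: create (n - m) new elements with genetic operators.
        children = []
        while len(survivors) + len(children) < n:
            p1, p2 = rng.sample(population, 2)
            cut = rng.randint(1, length - 1)          # single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:                    # occasional mutation
                i = rng.randrange(length)
                flipped = "1" if child[i] == "0" else "0"
                child = child[:i] + flipped + child[i + 1:]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), population


best, final_population = genetic_search()
```

Because the top-ranked elements always survive, the best fitness in the population never decreases from one generation to the next under these assumptions.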
Genetic Algorithms and Supervised Learning
Genetic Algorithms and Unsupervised Clustering
General Considerations
Here is a list of considerations when using a
problem-solving approach based on genetic learning:
• Genetic algorithms are designed to find globally
optimized solutions. However, there is no guarantee
that any given solution is the result of a global
rather than a local optimization.
• The fitness function determines the computational
complexity of a genetic algorithm. A fitness function
involving several calculations can be
computationally expensive.
• Genetic algorithms explain their results to the
extent that the fitness function is understandable.
• Transforming the data to a form suitable for a
genetic algorithm can be a challenge.
Genetic Algorithms
• GA: based on an analogy to biological evolution
• Each rule is represented by a string of bits
• An initial population is created consisting of randomly
generated rules
• Based on the notion of survival of the fittest, a new
population is formed to consist of the fittest rules
and their offspring
• The fitness of a rule is represented by its
classification accuracy on a set of training examples
• Offspring are generated by crossover and mutation
• Population-based technique for
discovery of knowledge structures
• Based on the idea that evolution represents
a search for an optimum solution set
• Massively parallel
The Vocabulary of GAs
• Population
– Set of individuals, each represented by one
or more strings of characters
• Chromosome
– The string representing an individual
[Figure: the chromosome 011010 with one gene marked (allele = "0", locus = 5)]
• Gene: The basic informational unit on a chromosome
• Allele: The value of a specific gene
• Locus: The ordinal place on a chromosome where a specific gene is found
Genetic operators
• Reproduction
– Increase representations of strong individuals
• Crossover
– Explore the search space
• Mutation
– Recapture “lost” genes due to crossover
Genetic operators illustrated
Simple reproduction
  Parent 1: 011010    Offspring 1: 011010
  Parent 2: 000110    Offspring 2: 000110
Reproduction with crossover at locus 3
  Parent 1: 011010    Offspring 1: 011110
  Parent 2: 000110    Offspring 2: 000010
Simple reproduction with mutation at locus 3 for offspring 1
  Parent 1: 011010    Offspring 1: 010010
  Parent 2: 000110    Offspring 2: 000110
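The crossover and mutation cases illustrated above can be reproduced with two small helpers. The conventions are inferred from the slide's bit strings and are therefore assumptions: crossover at locus 3 swaps everything after the first three bits, and mutation at locus 3 flips the third bit counting from 1. The names `crossover` and `mutate` are hypothetical.

```python
def crossover(parent1, parent2, locus):
    """Swap the tails of two bit strings after the first `locus` bits."""
    child1 = parent1[:locus] + parent2[locus:]
    child2 = parent2[:locus] + parent1[locus:]
    return child1, child2


def mutate(chromosome, locus):
    """Flip the bit at 1-based position `locus`."""
    i = locus - 1
    flipped = "1" if chromosome[i] == "0" else "0"
    return chromosome[:i] + flipped + chromosome[i + 1:]


offspring = crossover("011010", "000110", 3)   # ("011110", "000010")
mutant = mutate("011010", 3)                   # "010010"
```

Simple reproduction needs no helper, since each offspring is an unchanged copy of its parent.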
GAs rely on the concept of “fitness”
• Ability of an individual to survive into
the next generation
• “Survival of the fittest”
• Usually calculated in terms of an
objective fitness function
• Maximization
• Minimization
• Other functions
Genetic Programming
• Based on adaptation and evolution
• Structures undergoing adaptation are
computer programs of varying size and
shape
• Computer programs are genetically
“bred” over time
The Learning Classifier System
• Rule-based knowledge discovery and
concept learning tool
• Operates by means of evaluation, credit
assignment, and discovery applied to a
population of “chromosomes” (rules) each
with a corresponding “phenotype”
(outcome)
Components of a
Learning Classifier System
• Performance
– Provides interaction between environment and rule base
– Performs matching function
• Reinforcement
– Rewards accurate classifiers
– Punishes inaccurate classifiers
• Discovery
– Uses the genetic algorithm to search for plausible rules
Rough Set Approach
• Rough sets are used to approximately or “roughly” define
equivalence classes
• A rough set for a given class C is approximated by two sets:
– a lower approximation (certain to be in C)
– an upper approximation (cannot be described as not belonging to C)
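The two approximations can be sketched as set operations. The sketch assumes the usual rough-set setup, which the slide does not spell out: objects with identical attribute values are indistinguishable and form one equivalence class. The function name `rough_approximations` and the sample data are hypothetical.

```python
from collections import defaultdict


def rough_approximations(attributes_of, target):
    """attributes_of: object id -> tuple of attribute values.
    target: set of object ids known to belong to class C."""
    # Group objects that are indistinguishable by their attribute values.
    blocks = defaultdict(set)
    for obj, values in attributes_of.items():
        blocks[values].add(obj)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # whole block inside C: certain to be in C
            lower |= block
        if block & target:       # block overlaps C: cannot be ruled out of C
            upper |= block
    return lower, upper


attributes_of = {1: ("a",), 2: ("a",), 3: ("b",), 4: ("b",), 5: ("c",)}
lower, upper = rough_approximations(attributes_of, target={1, 2, 3})
```

Objects 3 and 4 share attribute values but only 3 is in C, so both fall in the upper approximation and neither in the lower one.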
Fuzzy Set Approaches
• Fuzzy logic uses truth values between 0.0 and 1.0 to
represent the degree of membership (for example, as read
from a fuzzy membership graph)
• Attribute values are converted to fuzzy values
– e.g., income is mapped into the discrete categories
{low, medium, high}, with a fuzzy value calculated for each
• For a given new sample, more than one fuzzy
value may apply
• Each applicable rule contributes a vote for
membership in the categories
• Typically, the truth values for each predicted
category are summed.
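A minimal sketch of both ideas follows: converting an income into fuzzy values for {low, medium, high}, and summing the truth values that rules contribute to each predicted category. The breakpoints (20, 40, 60, 80, read as thousands), the piecewise-linear membership shapes, and the names `income_memberships` and `predict` are all assumptions made for illustration.

```python
def ramp_up(x, a, b):
    """0 below a, 1 above b, linear in between."""
    return max(0.0, min(1.0, (x - a) / (b - a)))


def ramp_down(x, a, b):
    """1 below a, 0 above b, linear in between."""
    return max(0.0, min(1.0, (b - x) / (b - a)))


def income_memberships(income):
    """Map an income to fuzzy values; breakpoints are hypothetical."""
    return {
        "low": ramp_down(income, 20, 40),
        "medium": min(ramp_up(income, 20, 40), ramp_down(income, 60, 80)),
        "high": ramp_up(income, 60, 80),
    }


def predict(rule_votes):
    """rule_votes: list of (category, truth value) from the applicable rules.
    The truth values for each category are summed; the largest total wins."""
    totals = {}
    for category, truth in rule_votes:
        totals[category] = totals.get(category, 0.0) + truth
    return max(totals, key=totals.get), totals


memberships = income_memberships(30)
winner, totals = predict([("approve", 0.6), ("reject", 0.5), ("approve", 0.3)])
```

An income of 30 is partly low and partly medium at the same time, which is the sense in which more than one fuzzy value may apply to a sample.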
Chapter Summary
• The K-Means algorithm is a statistical unsupervised
clustering technique.
• All input attributes to the algorithm must be numeric and the
user is required to make a decision about how many clusters
are to be discovered.
• The algorithm begins by randomly choosing one data point to
represent each cluster.
• Each data instance is then placed in the cluster to which it is
most similar.
• New cluster centers are computed and the process continues
until the cluster centers do not change.
• The K-Means algorithm is easy to implement and
understand. However,
– the algorithm is not guaranteed to converge to a
globally optimal solution,
– it lacks the ability to explain what has been found, and
– it is unable to tell which attributes are significant in
determining the formed clusters.
• Despite these limitations, the K-Means algorithm is
among the most widely used clustering techniques.
• Genetic algorithms apply the theory of evolution to
inductive learning.
• Genetic learning can be supervised or unsupervised.
• It is typically used for problems that cannot be solved with
traditional techniques.
• A standard genetic approach to learning applies a fitness
function to a set of data elements to determine which
elements survive from one generation to the next.
• Those elements not surviving are used to
create new instances to replace deleted
elements.
• In addition to being used for supervised
learning and unsupervised clustering, genetic
techniques can be employed in conjunction
with other learning techniques.
Key Terms
Affinity analysis. The process of determining which
things are typically grouped together.
Confidence. Given a rule of the form “If A then B,”
confidence is defined as the conditional
probability that B is true when A is known to be
true.
Crossover. A genetic learning operation that creates
new population elements by combining parts of
two or more elements from the current population.
Genetic algorithm. A data mining technique based on
the theory of evolution.
Mutation. A genetic learning operation that creates a
new population element by randomly modifying a
portion of an existing element.
Selection. A genetic learning operation that adds copies of
current population elements with high fitness scores to the
next generation of the population.
Reference
Data Mining: Concepts and Techniques (Chapter 7 Slide for
textbook), Jiawei Han and Micheline Kamber, Intelligent Database
Systems Research Lab, School of Computing Science, Simon Fraser
University, Canada