Unsupervised learning
In machine learning, unsupervised learning refers to the problem of trying to find hidden
structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no
error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning
from supervised learning and reinforcement learning.
Unsupervised learning is closely related to the problem of density estimation in statistics.
However, unsupervised learning also encompasses many other techniques that seek to summarize
and explain key features of the data.
Many methods employed in unsupervised learning are based on data mining methods used to
preprocess data.
Approaches to unsupervised learning include:

- clustering (e.g., k-means, mixture models, hierarchical clustering; a minimal k-means sketch follows this list);
- blind signal separation using feature extraction techniques for dimensionality reduction (e.g., principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition).
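As a concrete illustration of the clustering approach, here is a minimal k-means sketch in plain NumPy; the data matrix X (one row per example) and the cluster count k are assumed inputs, not something given in the text:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return labels, centroids
```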
Among neural network models, the self-organizing map (SOM) and adaptive resonance
theory (ART) are commonly used unsupervised learning algorithms.
The SOM learns a topographic organization in which nearby locations in the map represent inputs
with similar properties.
The ART model allows the number of clusters to vary with problem size and lets the user control
the degree of similarity between members of the same cluster by means of a user-defined
constant called the vigilance parameter.
ART networks are also used for many pattern recognition tasks, such as automatic target
recognition and seismic signal processing. The first version of ART was "ART1", developed by
Carpenter and Grossberg (1988).
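To make the vigilance idea concrete, here is a deliberately simplified, hypothetical sketch; it is not Carpenter and Grossberg's ART1 dynamics, only a leader-style clustering in which a new cluster is created whenever no existing prototype matches the input at least as well as the vigilance threshold rho:

```python
import numpy as np

def vigilance_clustering(inputs, rho=0.8):
    """Simplified vigilance-controlled clustering (illustrative only).

    rho in (0, 1]: higher vigilance -> stricter matching -> more clusters.
    Cosine similarity stands in for ART's actual match criterion.
    """
    prototypes = []   # one prototype vector per cluster
    assignments = []
    for x in inputs:
        x = np.asarray(x, dtype=float)
        sims = [np.dot(x, p) / (np.linalg.norm(x) * np.linalg.norm(p))
                for p in prototypes]
        if sims and max(sims) >= rho:
            j = int(np.argmax(sims))
            # Update the winning prototype toward the input.
            prototypes[j] = 0.5 * (prototypes[j] + x)
        else:
            # Vigilance test failed everywhere: create a new cluster.
            prototypes.append(x)
            j = len(prototypes) - 1
        assignments.append(j)
    return assignments, prototypes
```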
Supervised learning
Supervised learning is the machine learning task of inferring a function
from supervised (labeled) training data.
The training data consist of a set of training examples.
In supervised learning, each example is a pair consisting of an input object (typically a vector)
and a desired output value (also called the supervisory signal). A supervised learning algorithm
analyzes the training data and produces an inferred function, which is called a classifier (if the
output is discrete, see classification) or a regression function (if the output is continuous,
see regression). The inferred function should predict the correct output value for any valid input
object.
This requires the learning algorithm to generalize from the training data to unseen situations in a
"reasonable" way (see inductive bias).
There are four major issues to consider in supervised learning:
1. Bias-variance tradeoff
2. Function complexity and amount of training data
3. Dimensionality of the input space
4. Noise in the output values
Generalizations of supervised learning
There are several ways in which the standard supervised learning problem can be generalized:
1. Semi-supervised learning: In this setting, the desired output values are provided only
for a subset of the training data. The remaining data is unlabeled.
2. Active learning: Instead of assuming that all of the training examples are given at the
start, active learning algorithms interactively collect new examples, typically by making
queries to a human user. Often, the queries are based on unlabeled data, which is a
scenario that combines semi-supervised learning with active learning.
3. Structured prediction: When the desired output value is a complex object, such as a
parse tree or a labeled graph, then standard methods must be extended.
4. Learning to rank: When the input is a set of objects and the desired output is a ranking
of those objects, then again the standard methods must be extended.
Applications

- Bioinformatics, Cheminformatics
- Database marketing, Handwriting recognition, Information retrieval
- Quantitative structure–activity relationship
- Learning to rank
- Object recognition in computer vision, Optical character recognition, Spam detection, Pattern recognition, Speech recognition
How supervised learning algorithms work
Given a set of training examples of the form (x1, y1), ..., (xN, yN), where xi is the feature
vector of the i-th example and yi is its label, a learning algorithm seeks a function
g : X → Y, where X is the input space and Y is the output space. The function g is an element
of some space of possible functions G, usually called the hypothesis space. It is sometimes
convenient to represent g using a scoring function f : X × Y → R such that g is defined as
returning the y value that gives the highest score:

g(x) = arg max_y f(x, y).

Let F denote the space of scoring functions.
Although G can be any space of functions, many learning algorithms are probabilistic
models where g takes the form of a conditional probability model g(x) = P(y | x), or f takes
the form of a joint probability model f(x, y) = P(x, y).
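A minimal sketch of the scoring-function view, assuming a hypothetical linear scoring function f(x, y) = w_y · x with one weight vector per class (the weights and input below are arbitrary illustration values):

```python
import numpy as np

def g(x, W):
    """Classifier g(x) = argmax_y f(x, y), where f(x, y) = W[y] . x."""
    scores = W @ x            # f(x, y) for every class y at once
    return int(np.argmax(scores))

# Example: 3 classes, 4 input features, arbitrary weights.
W = np.array([[ 0.2, -0.1, 0.5, 0.0],
              [-0.3,  0.4, 0.1, 0.2],
              [ 0.1,  0.1, 0.1, 0.1]])
x = np.array([1.0, 0.5, -0.2, 0.3])
print(g(x, W))  # index of the highest-scoring class
```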
Reinforcement learning
Reinforcement learning is an area of machine learning in computer science, concerned with how
an agent ought to take actions in an environment so as to maximize some notion of
cumulative reward.
The problem, due to its generality, is studied in many other disciplines, such as game
theory, control theory, operations research, information theory, simulation-based
optimization, statistics, and genetic algorithms.
In machine learning, the environment is typically formulated as a Markov decision
process (MDP), and many reinforcement learning algorithms for this context are highly related
to dynamic programming techniques.
The main difference from these classical techniques is that reinforcement learning algorithms do
not need knowledge of the MDP, and they target large MDPs where exact methods become
infeasible.
Reinforcement learning differs from standard supervised learning in that correct input/output
pairs are never presented, nor are sub-optimal actions explicitly corrected.
Further, there is a focus on on-line performance, which involves finding a balance between
exploration (of uncharted territory) and exploitation (of current knowledge).
Introduction
The basic reinforcement learning model consists of:
1. a set of environment states S;
2. a set of actions A;
3. rules of transitioning between states;
4. rules that determine the scalar immediate reward of a transition; and
5. rules that describe what the agent observes.
Reinforcement learning is particularly well suited to problems which include a long-term versus
short-term reward trade-off. It has been applied successfully to various problems, including robot
control, elevator scheduling, telecommunications, backgammon, and checkers.
Two components make reinforcement learning powerful: the use of samples to optimize
performance, and the use of function approximation to deal with large environments. Thanks to
these two key components, reinforcement learning can be used in large environments in any of
the following situations:

- A model of the environment is known, but an analytic solution is not available;
- Only a simulation model of the environment is given (the subject of simulation-based optimization);
- The only way to collect information about the environment is by interacting with it.
The first two of these problems could be considered planning problems (since some form of the
model is available), while the last one could be considered a genuine learning problem.
However, under a reinforcement learning methodology, both planning problems would be
converted to machine learning problems.
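As an illustration of sample-based reinforcement learning, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration. The environment interface (env.reset(), env.step(a) returning (next_state, reward, done), and env.actions) is an assumption in the style of common RL toolkits, not something specified in the text:

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learns from sampled transitions, no MDP model needed."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Sample-based update toward the one-step lookahead target.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```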
Decision tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event outcomes, resource costs, and utility. It is
one way to display an algorithm.
Decision trees are commonly used in operations research, specifically in decision analysis, to
help identify a strategy most likely to reach a goal. If in practice decisions have to be taken
online with no recall under incomplete knowledge,
a decision tree should be paralleled by a probability model as a best choice model or online
selection model algorithm. Another use of decision trees is as a descriptive means for
calculating conditional probabilities.
A decision tree consists of three types of nodes:
1. Decision nodes - commonly represented by squares
2. Chance nodes - represented by circles
3. End nodes - represented by triangles
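As a sketch of how such a tree is evaluated ("rolled back"), chance nodes take the probability-weighted average of their children and decision nodes take the best child. The tuple encoding below is a hypothetical illustration, not a standard format:

```python
def evaluate(node):
    """Roll back a decision tree to its expected value.

    node is one of (hypothetical encoding):
      ("end", value)
      ("chance", [(prob, child), ...])   # probabilities should sum to 1
      ("decision", [child, ...])
    """
    kind = node[0]
    if kind == "end":
        return node[1]
    if kind == "chance":
        # Expected value over chance outcomes.
        return sum(p * evaluate(child) for p, child in node[1])
    if kind == "decision":
        # Choose the alternative with the highest expected value.
        return max(evaluate(child) for child in node[1])
    raise ValueError(f"unknown node kind: {kind}")

# Example: decide between a sure 40 and a gamble (60% of 100, 40% of -20).
tree = ("decision", [
    ("end", 40.0),
    ("chance", [(0.6, ("end", 100.0)), (0.4, ("end", -20.0))]),
])
print(evaluate(tree))  # 52.0: the gamble has the higher expected value
```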
Advantages
Decision trees:

- Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
- Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
- Use a white box model. If a given result is provided by a model, the explanation for the result is easily replicated by simple math.
- Can be combined with other decision techniques, for example Net Present Value calculations, PERT 3-point estimations, and distributions of expected outcomes.
Disadvantages
For data including categorical variables with different numbers of levels, information gain in
decision trees is biased in favor of attributes with more levels.
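This bias can be seen by computing information gain directly. A minimal sketch (entropy in bits, columns as Python lists; the toy data is illustrative only): an attribute with many distinct levels fragments the data and tends to score higher, even when it cannot generalize.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Gain = H(labels) - weighted sum of child entropies, one per attribute level."""
    n = len(labels)
    remainder = 0.0
    for level in set(attr_values):
        subset = [y for v, y in zip(attr_values, labels) if v == level]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

labels = ["yes", "no", "yes", "no"]
# A 2-level attribute that carries no class information: gain = 0.
print(information_gain(["a", "a", "b", "b"], labels))
# A unique-per-example attribute (like an ID): gain = H(labels), the maximum.
print(information_gain(["1", "2", "3", "4"], labels))
```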
LEARNING WITH COMPLETE DATA
Our development of statistical learning methods begins with the simplest task: parameter
learning with complete data.
A parameter learning task involves finding the numerical
parameters for a probability model whose structure is fixed. For example, we might be interested
in learning the conditional probabilities in a Bayesian network with a given structure. Data are
complete when each data point contains values for every variable in the probability model being
learned. Complete data greatly simplify the problem of learning the parameters of a
complex model.
1. Maximum-likelihood parameter learning: Discrete models
2. Naive Bayes models
Probably the most common Bayesian network model used in machine learning is the naive
Bayes model. In this model, the “class” variable C (which is to be predicted) is the root
and the “attribute” variables Xi are the leaves. The model is “naive” because it assumes that
the attributes are conditionally independent of each other, given the class. (The model in
Figure 20.2(b) is a naive Bayes model with just one attribute.) Assuming Boolean variables,
the parameters are
θ = P(C = true); θi1 = P(Xi = true | C = true); θi2 = P(Xi = true | C = false).
Once the model has been trained in this way, it can be used to classify new examples
for which the class variable C is unobserved. With observed attribute values x1, ..., xn,
the probability of each class is given by

P(C | x1, ..., xn) = α P(C) Π_i P(xi | C).
A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3
shows the learning curve for this method when it is applied to the restaurant problem from
Chapter 18. The method learns fairly well but not as well as decision-tree learning; this is
presumably because the true hypothesis—which is a decision tree—is not representable exactly
using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a
wide range of applications; the boosted version (Exercise 20.5) is one of the most effective
general-purpose learning algorithms. Naive Bayes learning scales well to very large problems:
with n Boolean attributes, there are just 2n + 1 parameters, and no search is required
to find hML, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning
has no difficulty with noisy data and can give probabilistic predictions when appropriate.
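A minimal sketch of this naive Bayes model with Boolean attributes, using the maximum-likelihood (frequency) estimates of the parameters θ, θi1, θi2 described above; no Laplace smoothing is applied, for brevity, and the toy data is illustrative only:

```python
import numpy as np

def train_naive_bayes(X, y):
    """X: (m, n) Boolean attribute matrix; y: (m,) Boolean class labels.
    Returns ML estimates: theta = P(C=true),
    theta1[i] = P(Xi=true | C=true), theta2[i] = P(Xi=true | C=false)."""
    theta = y.mean()
    theta1 = X[y].mean(axis=0)    # attribute frequencies among C=true examples
    theta2 = X[~y].mean(axis=0)   # attribute frequencies among C=false examples
    return theta, theta1, theta2

def predict(x, theta, theta1, theta2):
    """P(C=true | x) via Bayes' rule under conditional independence;
    the final normalization plays the role of alpha in the text's formula."""
    p_true = theta * np.prod(np.where(x, theta1, 1 - theta1))
    p_false = (1 - theta) * np.prod(np.where(x, theta2, 1 - theta2))
    return p_true / (p_true + p_false)

# Tiny example: 2 Boolean attributes, 6 training examples.
X = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]], dtype=bool)
y = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
print(predict(np.array([True, True]), *train_naive_bayes(X, y)))
```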
3. Maximum-likelihood parameter learning: Continuous models
4. Bayesian parameter learning
5. Learning Bayes net structures
LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM
The preceding section dealt with the fully observable case. Many real-world problems have
hidden variables (sometimes called latent variables) which are not observable in the data that
are available for learning.
1. Unsupervised clustering: Learning mixtures of Gaussians
2. Learning Bayesian networks with hidden variables
3. Learning hidden Markov models
The general form of the EM algorithm
We have seen several instances of the EM algorithm. Each involves computing expected
values of hidden variables for each example and then recomputing the parameters, using the
expected values as if they were observed values. Let x be all the observed values in all the
examples, let Z denote all the hidden variables for all the examples, and let θ be all the
parameters for the probability model. Then the EM algorithm is

θ(i+1) = argmax_θ Σ_z P(Z = z | x, θ(i)) L(x, Z = z | θ)

This equation is the EM algorithm in a nutshell. The E-step is the computation of the summation,
which is the expectation of the log likelihood of the “completed” data with respect
to the distribution P(Z = z | x, θ(i)), which is the posterior over the hidden variables, given
the data. The M-step is the maximization of this expected log likelihood with respect to the
parameters. For mixtures of Gaussians, the hidden variables are the Zij, where Zij is 1 if
example j was generated by component i. For Bayes nets, the hidden variables are the values
of the unobserved variables for each example. For HMMs, the hidden variables are the i → j
transitions. Starting from the general form, it is possible to derive an EM algorithm for a
specific application once the appropriate hidden variables have been identified.
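As a concrete instance, here is a minimal EM sketch for a one-dimensional mixture of two Gaussians in NumPy (the data array x is assumed given; the initialization is a crude assumption for illustration). The E-step computes the posterior responsibilities P(Z = z | x, θ) and the M-step re-estimates θ from them:

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, n_iters=50):
    """EM for a 2-component 1-D Gaussian mixture; theta = (weights, means, variances)."""
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])      # crude initialization
    var = np.array([x.var(), x.var()])
    for _ in range(n_iters):
        # E-step: responsibilities r[j, i] = P(Z_j = i | x_j, theta).
        r = np.stack([w[i] * normal_pdf(x, mu[i], var[i]) for i in range(2)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta, treating the responsibilities as soft counts.
        n_i = r.sum(axis=0)
        w = n_i / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_i
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_i
    return w, mu, var

# Example: data drawn from two well-separated Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))
```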
Learning Bayes net structures with hidden variables