Survey
Unsupervised learning

In machine learning, unsupervised learning refers to the problem of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal with which to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning and reinforcement learning. Unsupervised learning is closely related to the problem of density estimation in statistics, but it also encompasses many other techniques that seek to summarize and explain key features of the data. Many methods employed in unsupervised learning are based on data-mining methods used to preprocess data.

Approaches to unsupervised learning include clustering (e.g., k-means, mixture models, hierarchical clustering) and blind signal separation using feature-extraction techniques for dimensionality reduction (e.g., principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition).

Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter. ART networks are also used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing. The first version of ART was "ART1", developed by Carpenter and Grossberg (1988).

Supervised learning

Supervised learning is the machine learning task of inferring a function from supervised (labeled) training data. The training data consist of a set of training examples.
In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which is called a classifier (if the output is discrete; see classification) or a regression function (if the output is continuous; see regression). The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias).

There are four major issues to consider in supervised learning:
1. Bias-variance tradeoff
2. Function complexity and amount of training data
3. Dimensionality of the input space
4. Noise in the output values

Generalizations of supervised learning

There are several ways in which the standard supervised learning problem can be generalized:
1. Semi-supervised learning: the desired output values are provided only for a subset of the training data; the remaining data are unlabeled.
2. Active learning: instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often the queries are based on unlabeled data, a scenario that combines semi-supervised learning with active learning.
3. Structured prediction: when the desired output value is a complex object, such as a parse tree or a labeled graph, standard methods must be extended.
4. Learning to rank: when the input is a set of objects and the desired output is a ranking of those objects, again the standard methods must be extended.
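The core task described above, inferring a function from labeled (input, output) pairs, can be made concrete with a tiny example. The sketch below uses 1-nearest-neighbour only because it is among the shortest possible "inferred functions"; the function names are my own, not from the text.

```python
def nearest_neighbor_fit(examples):
    """Infer a classifier g(x) from labeled (input, output) pairs.

    examples: list of (x, y) where x is a tuple of numbers and y a label.
    The returned function g predicts the label of the closest training input,
    a minimal stand-in for a real supervised learner."""
    def g(x):
        # Find the training example whose input is nearest (squared distance).
        nearest = min(examples,
                      key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)))
        return nearest[1]
    return g
```

For instance, training on examples near (0, 0) labeled "neg" and near (1, 1) labeled "pos" yields a g that labels (0.9, 0.8) as "pos"; the labels and coordinates here are invented for illustration.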
Applications

Bioinformatics, cheminformatics, database marketing, handwriting recognition, information retrieval, quantitative structure–activity relationship, learning to rank, object recognition in computer vision, optical character recognition, spam detection, pattern recognition, and speech recognition.

How supervised learning algorithms work

Given a set of N training examples of the form {(x1, y1), ..., (xN, yN)}, where xi is an element of the input space X and yi is an element of the output space Y, a learning algorithm seeks a function g : X -> Y, where g is an element of some space of possible functions G, usually called the hypothesis space. It is sometimes convenient to represent g using a scoring function f : X × Y -> R, such that g is defined as returning the y value that gives the highest score: g(x) = argmax_y f(x, y). Let F denote the space of scoring functions. Although G and F can be any spaces of functions, many learning algorithms are probabilistic: g takes the form of a conditional probability model g(x) = P(y | x), or f takes the form of a joint probability model f(x, y) = P(x, y).

Reinforcement learning

Reinforcement learning is an area of machine learning in computer science concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward. The problem, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, statistics, and genetic algorithms. In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are closely related to dynamic programming techniques. The main difference from these classical techniques is that reinforcement learning algorithms do not need knowledge of the MDP, and they target large MDPs where exact methods become infeasible. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected.
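The scoring-function formulation of supervised learning given earlier, g(x) = argmax_y f(x, y), is easy to state in code. The sketch below is illustrative only: toy_score is a made-up stand-in for a learned conditional model P(y | x), not anything from the text.

```python
def predict(f, x, labels):
    """Scoring-function prediction: g(x) = argmax over y of f(x, y)."""
    return max(labels, key=lambda y: f(x, y))

def toy_score(x, y):
    """Hypothetical scoring function over token lists: score 'spam' highly
    when the token 'offer' appears (an invented rule for illustration)."""
    has_offer = "offer" in x
    if y == "spam":
        return 0.9 if has_offer else 0.1
    return 0.1 if has_offer else 0.9
```

Calling predict(toy_score, ["offer", "now"], ["spam", "ham"]) returns "spam", since that label maximizes the score for this input.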
Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

Introduction

The basic reinforcement learning model consists of:
1. a set of environment states S;
2. a set of actions A;
3. rules of transitioning between states;
4. rules that determine the scalar immediate reward of a transition; and
5. rules that describe what the agent observes.

Reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, and checkers.

Two components make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations:
1. A model of the environment is known, but an analytic solution is not available;
2. Only a simulation model of the environment is given (the subject of simulation-based optimization);
3. The only way to collect information about the environment is by interacting with it.

The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems.

Decision tree

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal.
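The reinforcement learning model described earlier (states, actions, transition rules, rewards) is commonly attacked with tabular Q-learning, one classic algorithm in this family. Below is a minimal sketch, not from the text; in particular the step(s, a) -> (next_state, reward, done) environment interface, the treatment of states[0] as the start state, and all parameter values are my own assumptions.

```python
import random

def q_learning(states, actions, step, episodes=500, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning on a small MDP (a sketch, not a reference implementation).

    step(s, a) -> (next_state, reward, done) is an assumed environment interface;
    the agent never sees the MDP's transition rules directly, only samples."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = states[0]          # assumption: first state is the start state
        done = False
        while not done:
            # Epsilon-greedy: explore with probability eps, otherwise exploit.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = step(s, a)
            # One-step temporal-difference update toward r + gamma * max Q(s', .)
            best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q
```

On a two-state toy environment where action "go" reaches the goal with reward 1, Q("start", "go") converges toward 1.0 while Q("start", "stay") stays near the discounted value 0.9, so the greedy policy learns to go.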
If in practice decisions have to be taken online with no recall under incomplete knowledge, a decision tree should be paralleled by a probability model as a best-choice model or online selection-model algorithm. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

A decision tree consists of three types of nodes:
1. Decision nodes - commonly represented by squares
2. Chance nodes - represented by circles
3. End nodes - represented by triangles

Advantages

Decision trees:
- Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
- Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
- Use a white-box model. If a given result is provided by a model, the explanation for the result is easily replicated by simple math.
- Can be combined with other decision techniques, for example Net Present Value calculations, PERT 3-point estimations, and a linear distribution of expected outcomes.

Disadvantages

For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.

LEARNING WITH COMPLETE DATA

Our development of statistical learning methods begins with the simplest task: parameter learning with complete data. A parameter learning task involves finding the numerical parameters for a probability model whose structure is fixed. For example, we might be interested in learning the conditional probabilities in a Bayesian network with a given structure. Data are complete when each data point contains values for every variable in the probability model being learned. Complete data greatly simplify the problem of learning the parameters of a complex model.
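The information-gain bias noted under Disadvantages can be demonstrated numerically: an attribute with one distinct value per example (like a record ID) drives every subset entropy to zero and therefore gets the maximum possible gain, even though it is useless for prediction. A minimal sketch; the attribute names in the example are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(attr) = H(labels) - sum_v |S_v|/|S| * H(labels within S_v),
    where S_v is the subset of examples with attr value v."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys)
                    for ys in by_value.values())
    return entropy(labels) - remainder
```

With four examples, a two-valued "outlook" attribute, and a unique "id" attribute, the gain for "id" equals the full label entropy and exceeds the gain for "outlook", illustrating the bias toward many-valued attributes.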
1. Maximum-likelihood parameter learning: discrete models
2. Naive Bayes models

Probably the most common Bayesian network model used in machine learning is the naive Bayes model. In this model, the "class" variable C (which is to be predicted) is the root and the "attribute" variables Xi are the leaves. The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with just one attribute.) Assuming Boolean variables, the parameters are

theta = P(C = true), theta_i1 = P(Xi = true | C = true), theta_i2 = P(Xi = true | C = false).

Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values x1, ..., xn, the probability of each class is given by

P(C | x1, ..., xn) = alpha * P(C) * product over i of P(xi | C),

where alpha is a normalization constant. A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 18. The method learns fairly well but not as well as decision-tree learning; this is presumably because the true hypothesis (which is a decision tree) is not representable exactly using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (Exercise 20.5) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just 2n + 1 parameters, and no search is required to find hML, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning has no difficulty with noisy data and can give probabilistic predictions when appropriate.

3. Maximum-likelihood parameter learning: continuous models
4. Bayesian parameter learning
5. Learning Bayes net structures

LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM

The preceding section dealt with the fully observable case.
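The maximum-likelihood naive Bayes procedure described above can be sketched directly: relative frequencies give the theta parameters, and classification normalizes the product P(C) * prod_i P(xi | C). A minimal sketch for Boolean attributes (function names are my own; no smoothing is applied, so it assumes both classes appear in the training data).

```python
def train_naive_bayes(examples):
    """Maximum-likelihood estimates for a Boolean naive Bayes model.

    examples: list of (attributes, cls) with attributes a tuple of booleans.
    Returns (theta, theta1, theta2) where theta = P(C=true),
    theta1[i] = P(Xi=true | C=true), theta2[i] = P(Xi=true | C=false).
    Assumes both classes occur in the data (no smoothing)."""
    pos = [x for x, c in examples if c]
    neg = [x for x, c in examples if not c]
    n = len(examples[0][0])
    theta = len(pos) / len(examples)
    theta1 = [sum(x[i] for x in pos) / len(pos) for i in range(n)]
    theta2 = [sum(x[i] for x in neg) / len(neg) for i in range(n)]
    return theta, theta1, theta2

def classify(params, x):
    """Posterior P(C=true | x): normalize P(C) * prod_i P(xi | C) over classes."""
    theta, theta1, theta2 = params
    p_true, p_false = theta, 1 - theta
    for i, xi in enumerate(x):
        p_true *= theta1[i] if xi else 1 - theta1[i]
        p_false *= theta2[i] if xi else 1 - theta2[i]
    return p_true / (p_true + p_false)
```

On a toy training set where the first attribute perfectly tracks the class and the second is uninformative, the posterior for a (True, True) input is essentially 1 and for (False, False) essentially 0; choosing the more likely class gives the deterministic prediction mentioned above.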
Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning.

1. Unsupervised clustering: learning mixtures of Gaussians
2. Learning Bayesian networks with hidden variables
3. Learning hidden Markov models

The general form of the EM algorithm

We have seen several instances of the EM algorithm. Each involves computing expected values of hidden variables for each example and then recomputing the parameters, using the expected values as if they were observed values. Let x be all the observed values in all the examples, let Z denote all the hidden variables for all the examples, and let theta be all the parameters for the probability model. Then the EM algorithm is

theta^(i+1) = argmax_theta sum over z of P(Z = z | x, theta^(i)) log P(x, Z = z | theta).

This equation is the EM algorithm in a nutshell. The E-step is the computation of the summation, which is the expectation of the log likelihood of the "completed" data with respect to the distribution P(Z = z | x, theta^(i)), the posterior over the hidden variables given the data. The M-step is the maximization of this expected log likelihood with respect to the parameters. For mixtures of Gaussians, the hidden variables are the Zij, where Zij is 1 if example j was generated by component i. For Bayes nets, the hidden variables are the values of the unobserved variables for each example. For HMMs, the hidden variables are the i -> j transitions. Starting from the general form, it is possible to derive an EM algorithm for a specific application once the appropriate hidden variables have been identified.

Learning Bayes net structures with hidden variables
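For the mixture-of-Gaussians instance of EM discussed above, the E-step computes the responsibilities P(Zij = 1 | xj) and the M-step re-estimates weights, means, and variances using those responsibilities as soft counts. A minimal one-dimensional sketch; the initialization, iteration count, and variance floor are arbitrary choices of mine, not from the text.

```python
from math import exp, pi, sqrt

def gauss(x, mu, var):
    """Density of a univariate Gaussian N(mu, var) at x."""
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_two_gaussians(data, iters=50):
    """EM for a 1-D mixture of two Gaussians (hidden variable Zij says
    which component generated example j)."""
    mu = [min(data), max(data)]   # crude initialization at the data extremes
    var = [1.0, 1.0]
    w = [0.5, 0.5]                # mixing weights
    for _ in range(iters):
        # E-step: responsibility of each component i for each point x.
        resp = []
        for x in data:
            p = [w[i] * gauss(x, mu[i], var[i]) for i in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate parameters, using responsibilities as soft counts.
        for i in range(2):
            ni = sum(r[i] for r in resp)
            w[i] = ni / len(data)
            mu[i] = sum(r[i] * x for r, x in zip(resp, data)) / ni
            var[i] = sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, data)) / ni
            var[i] = max(var[i], 1e-6)   # guard against variance collapse
    return w, mu, var
```

On data drawn from two well-separated clumps (near 0 and near 5), the estimated means converge to the two cluster centers and the mixing weights sum to one.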