
Supervised Learning Recap
Machine Learning
Last Time
• Support Vector Machines
• Kernel Methods
• Review of Supervised Learning
• Unsupervised Learning
– (Soft) K-means clustering
– Expectation Maximization
– Spectral Clustering
– Principal Components Analysis
– Latent Semantic Analysis
Supervised Learning
• Linear Regression
• Logistic Regression
• Graphical Models
– Hidden Markov Models
• Neural Networks
• Support Vector Machines
– Kernel Methods
Major concepts
Gaussian, Multinomial, Bernoulli Distributions
Joint vs. Conditional Distributions
Maximum Likelihood
Risk Minimization
Gradient Descent
Feature Extraction, Kernel Methods
Some favorite distributions
• Bernoulli
• Multinomial
• Gaussian
Maximum Likelihood
• Identify the parameter values that yield the
maximum likelihood of generating the
observed data.
• Take the partial derivative of the likelihood function
• Set to zero
• Solve
• NB: maximum likelihood parameters are the
same as maximum log likelihood parameters
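The three-step recipe above has a closed form for simple distributions. As a minimal sketch (the data values are illustrative), the Bernoulli MLE works out to the sample mean:

```python
# Maximum likelihood for a Bernoulli parameter p, as a minimal sketch.
# The log likelihood of observations x_1..x_n in {0, 1} is
#   L(p) = sum_i [ x_i*log(p) + (1 - x_i)*log(1 - p) ].
# Taking dL/dp, setting it to zero, and solving gives p_hat = mean(x).

def bernoulli_mle(xs):
    """Closed-form maximum likelihood estimate of p."""
    return sum(xs) / len(xs)

data = [1, 1, 0, 1, 0, 1, 1, 0]  # illustrative coin flips
print(bernoulli_mle(data))  # 0.625
```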
Maximum Log Likelihood
• Why do we like the log function?
• It turns products (difficult to differentiate) into sums (easy to differentiate)
• log(xy) = log(x) + log(y)
• log(x^c) = c log(x)
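The note above (maximum likelihood parameters equal maximum log likelihood parameters) can be checked numerically, since log is monotonic. This sketch uses illustrative Bernoulli data and a parameter grid:

```python
import math

# Confirm numerically that the likelihood and the log likelihood
# peak at the same parameter value (log is a monotonic transform).
data = [1, 0, 1, 1, 0, 1]

def likelihood(p):
    out = 1.0
    for x in data:
        out *= p if x == 1 else (1 - p)
    return out

def log_likelihood(p):
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

grid = [i / 100 for i in range(1, 100)]
best_lik = max(grid, key=likelihood)
best_log = max(grid, key=log_likelihood)
print(best_lik, best_log)  # identical; ≈ 4/6, the sample mean
```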
Risk Minimization
• Pick a loss function
– Squared loss
– Linear loss
– Perceptron (classification) loss
• Identify the parameters that minimize the loss
– Take the partial derivative of the loss function
– Set to zero
– Solve
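For squared loss, the derivative steps above can also be followed iteratively rather than solved in closed form. A minimal gradient descent sketch (toy data; the learning rate and iteration count are illustrative):

```python
# Risk minimization with squared loss via gradient descent,
# fitting y = w*x + b to toy data.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # partial derivatives of the mean squared loss w.r.t. w and b
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * dw
    b -= lr * db

print(round(w, 3), round(b, 3))  # ≈ 2.0 and 1.0
```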
Frequentists v. Bayesians
• Point estimates vs. Posteriors
• Risk Minimization vs. Maximum Likelihood
• L2-Regularization
– Frequentists: Add a constraint on the size of the
weight vector
– Bayesians: Introduce a zero-mean prior on the
weight vector
– Result is the same!
• Frequentists:
– Introduce a cost on the size of the weights
• Bayesians:
– Introduce a prior on the weights
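The claimed equivalence can be checked numerically: an L2 penalty of λ = σ²/τ² gives the same point estimate as a zero-mean Gaussian prior with variance τ². All data and variances below are illustrative:

```python
# The frequentist L2 penalty and the Bayesian zero-mean Gaussian
# prior yield the same point estimate when lam = sigma2 / tau2.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
sigma2, tau2 = 1.0, 0.5          # noise and prior variances (illustrative)
lam = sigma2 / tau2              # equivalent ridge penalty

def ridge_loss(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + lam * w * w

def neg_log_posterior(w):
    # Gaussian likelihood + zero-mean Gaussian prior, constants dropped
    return (sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * sigma2)
            + w * w / (2 * tau2))

grid = [i / 1000 for i in range(0, 4000)]
w_ridge = min(grid, key=ridge_loss)
w_map = min(grid, key=neg_log_posterior)
print(w_ridge, w_map)  # identical minimizers
```

The two objectives differ only by the positive factor 2σ², so they share the same minimizer.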
Types of Classifiers
• Generative Models
– Highest resource requirements.
– Need to approximate the joint probability
• Discriminative Models
– Moderate resource requirements.
– Typically fewer parameters to approximate than
generative models
• Discriminant Functions
– Can be trained probabilistically, but the output does not
include confidence information
Linear Regression
• Fit a line to a set of points
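A sketch of the line fit using the standard closed-form least-squares solution (toy points, lying exactly on a line):

```python
# Closed-form least-squares line fit: slope = cov(x, y) / var(x).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx
print(slope, intercept)  # 2.0 1.0
```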
Linear Regression
• Extension to higher dimensions
– Polynomial fitting
– Arbitrary function fitting
• Wavelets
• Radial basis functions
• Classifier output
Logistic Regression
• Fit Gaussians to the data for each class
• The decision boundary is where the PDFs cross
• No “closed form” solution when the gradient is set to zero.
• Gradient Descent
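Since setting the gradient to zero has no closed-form solution, the parameters are found by gradient descent. A minimal 1-D sketch (data, learning rate, and iteration count are illustrative):

```python
import math

# Logistic regression trained by batch gradient descent
# on a tiny 1-D two-class problem.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b, lr = 0.0, 0.0, 0.1
for _ in range(5000):
    # gradient of the negative log likelihood
    dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys))
    db = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys))
    w -= lr * dw
    b -= lr * db

preds = [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]
print(preds)  # [0, 0, 0, 1, 1, 1]
```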
Graphical Models
• General way to describe the dependence
relationships between variables.
• Junction Tree Algorithm allows us to efficiently
calculate marginals over any variable.
Junction Tree Algorithm
• Moralization
– “Marry the parents”
– Make undirected
• Triangulation
– Add edges (chords) so that no chordless cycle of length 4 or more remains
• Junction Tree Construction
– Identify separators such that the running intersection
property holds
• Introduction of Evidence
– Pass slices around the junction tree to generate consistent marginals
Hidden Markov Models
• Sequential Modeling
– Generative Model
• Relationship between observations and state
(class) sequences
Perceptron
• Step function used for squashing.
• Classifier as Neuron metaphor.
Perceptron Loss
• Classification Error vs. Sigmoid Error
– Loss is only calculated on mistakes
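A sketch of the mistake-driven update: the weights change only when a point is misclassified (data and epoch count are illustrative):

```python
# Perceptron training: an update happens only on a mistake,
# i.e. when y * score <= 0.
xs = [(-2.0, 1.0), (-1.0, 0.5), (1.0, -0.5), (2.0, -1.0)]
ys = [-1, -1, 1, 1]

w = [0.0, 0.0]
b = 0.0
for _ in range(20):                      # epochs
    for (x1, x2), y in zip(xs, ys):
        score = w[0] * x1 + w[1] * x2 + b
        if y * score <= 0:               # mistake: loss (and update) only here
            w[0] += y * x1
            w[1] += y * x2
            b += y

preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in xs]
print(preds)  # [-1, -1, 1, 1]
```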
Neural Networks
• Interconnected Layers of Perceptrons or
Logistic Regression “neurons”
Neural Networks
• There are many possible configurations of
neural networks
– Vary the number of layers
– Size of layers
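As a sketch of layered logistic "neurons": hand-set weights (illustrative, not learned) let a two-layer network compute XOR, something no single neuron can do:

```python
import math

# A two-layer network of logistic neurons with hand-set weights
# chosen so the network computes XOR (all weights are illustrative).
def neuron(inputs, weights, bias):
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    h1 = neuron([x1, x2], [10, 10], -5)    # ≈ OR
    h2 = neuron([x1, x2], [10, 10], -15)   # ≈ AND
    out = neuron([h1, h2], [10, -10], -5)  # OR and not AND
    return out

outs = [round(xor_net(a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outs)  # [0, 1, 1, 0]
```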
Support Vector Machines
• Maximum Margin Classification
(Figure: small-margin vs. large-margin separating hyperplanes)
Support Vector Machines
• Optimization Function
• Decision Function
Visualization of Support Vectors
• Now would be a good time to ask questions
about Supervised Techniques.
Clustering
• Identify discrete groups of similar data points
• Data points are unlabeled
Recall K-Means
• Algorithm
– Select K – the desired number of clusters
– Initialize K cluster centroids
– For each point in the data set, assign it to the
cluster with the closest centroid
– Update the centroid based on the points assigned
to each cluster
– If any data point has changed
clusters, repeat
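The loop above, sketched for 1-D data with k = 2 (the data and the fixed initial centroids are illustrative):

```python
# k-means: assign points to the nearest centroid, update centroids,
# repeat until no assignment changes the centroids.
data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]

while True:
    # assign each point to the cluster with the closest centroid
    assign = [min(range(2), key=lambda k: abs(x - centroids[k])) for x in data]
    # update each centroid to the mean of its assigned points
    new = []
    for k in range(2):
        members = [x for x, a in zip(data, assign) if a == k]
        new.append(sum(members) / len(members) if members else centroids[k])
    if new == centroids:      # nothing changed: converged
        break
    centroids = new

print([round(c, 6) for c in centroids])  # [1.0, 8.0]
```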
k-means output
Soft K-means
• In k-means, we force every data point to exist
in exactly one cluster.
• This constraint can be relaxed.
• Hard assignment minimizes the entropy of the cluster assignments
Soft k-means example
Soft k-means
• We still define a cluster by a centroid, but we
calculate the centroid as the weighted mean
of all the data points
• Convergence is based on a stopping threshold
rather than changed assignments
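A sketch of soft k-means using softmax responsibilities over squared distances (the stiffness β and stopping threshold are assumptions here), showing the weighted-mean centroid update and threshold-based stopping:

```python
import math

# Soft k-means: each point carries a responsibility for every cluster,
# and centroids are the responsibility-weighted means of all points.
data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [2.0, 6.0]
beta = 1.0                 # stiffness (illustrative)

for _ in range(100):
    # soft assignment: responsibilities via softmax over -beta * distance^2
    resp = []
    for x in data:
        ws = [math.exp(-beta * (x - c) ** 2) for c in centroids]
        z = sum(ws)
        resp.append([w / z for w in ws])
    # weighted-mean centroid update over ALL data points
    new = [sum(r[k] * x for r, x in zip(resp, data)) /
           sum(r[k] for r in resp) for k in range(2)]
    # stop on a small-change threshold, not on changed assignments
    if max(abs(a - b) for a, b in zip(new, centroids)) < 1e-8:
        break
    centroids = new

print([round(c, 3) for c in centroids])  # ≈ [1.0, 8.0]
```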
Gaussian Mixture Models
• Rather than identifying clusters by “nearest”
• Fit a Set of k Gaussians to the data.
GMM example
Gaussian Mixture Models
• Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution.
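A sketch of evaluating such a mixture density. The weights, means, and variances below are illustrative; the weights form a distribution (non-negative, summing to one), so the mixture is itself a pdf:

```python
import math

# A Gaussian mixture density: a weighted sum of component pdfs.
def gauss_pdf(x, mu, sigma):
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi)))

weights = [0.3, 0.7]   # mixing distribution: non-negative, sums to 1
mus = [0.0, 5.0]
sigmas = [1.0, 2.0]

def mixture_pdf(x):
    return sum(w * gauss_pdf(x, m, s)
               for w, m, s in zip(weights, mus, sigmas))

# the mixture integrates to 1, like any pdf (crude Riemann check)
total = sum(mixture_pdf(-10 + i * 0.01) * 0.01 for i in range(3001))
print(round(total, 2))  # ≈ 1.0
```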
Graphical Models
with unobserved variables
• What if you have variables in a Graphical
model that are never observed?
– Latent Variables
• Training latent variable models is an
unsupervised learning application
Latent Variable HMMs
• We can cluster sequences using an HMM with
unobserved state variables
• We will train the latent variable models using
Expectation Maximization
Expectation Maximization
• Both GMMs and graphical models with latent variables are trained using Expectation Maximization
– Step 1: Expectation (E-step)
• Evaluate the “responsibilities” of each cluster with the
current parameters
– Step 2: Maximization (M-step)
• Re-estimate parameters using the existing responsibilities
• Related to k-means
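A minimal EM sketch for a two-component 1-D Gaussian mixture (initialization and iteration count are illustrative), alternating the E-step responsibilities and the M-step re-estimation described above:

```python
import math

# EM for a two-component 1-D Gaussian mixture model.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
pis = [0.5, 0.5]       # mixing weights
mus = [0.0, 6.0]
sigmas = [1.0, 1.0]

def gauss(x, mu, sigma):
    return (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi)))

for _ in range(50):
    # E-step: responsibility of each component for each point
    resp = []
    for x in data:
        ws = [p * gauss(x, m, s) for p, m, s in zip(pis, mus, sigmas)]
        z = sum(ws)
        resp.append([w / z for w in ws])
    # M-step: re-estimate parameters using the responsibilities
    for k in range(2):
        nk = sum(r[k] for r in resp)
        pis[k] = nk / len(data)
        mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigmas[k] = math.sqrt(
            sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk)

print([round(m, 2) for m in mus])  # ≈ [1.0, 5.0]
```

The soft responsibilities play the role that hard cluster assignments play in k-means, which is the sense in which EM is "related to k-means."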
• One more time for questions on supervised learning
Next Time
• Gaussian Mixture Models (GMMs)
• Expectation Maximization