Notes on Probabilistic Graphical Models 1
Created 09/12/13
Updated 11/20/14, Updated 04/30/16, Updated 10/03/16, Updated 02/23/17, Updated 03/25/17, Updated 04/24/17
Updated 05/03/17
Introduction
Most tasks require a person or an automated system to reason—to reach conclusions based on available information.
The framework of probabilistic graphical models provides a general approach for this task. The approach is model-based, allowing interpretable models to be constructed and then manipulated by reasoning algorithms. These models can
also be learned automatically from data, allowing the approach to be used in cases where manually constructing a model
is difficult or even impossible. Because uncertainty is an inescapable aspect of most real-world applications, the book
focuses on probabilistic models, which make the uncertainty explicit and provide models that are more faithful to
reality.
We began to study this topic during late 2002, and then used the courseware being created at Coursera on this topic
starting in 2012. The Coursera class has been offered several times since then. Here is the course introduction:
Welcome to Probabilistic Graphical Models 1: Representation! You're joining thousands of learners currently enrolled in
the course. I'm excited to have you in the class and look forward to your contributions to the learning community.
This course is based on my textbook, 'Probabilistic Graphical Models: Principles and Techniques' (2009). The textbook is
not required to complete the course. In the Textbook link under Resources, we've listed the sections of the textbook that
correspond to each of the lectures in this course.
To begin, I recommend taking a few minutes to explore the course site. Review the material we'll cover each week, and
preview the assignments you'll need to complete to pass the course. Click Discussion Forums to see forums where you
can discuss the course material with fellow students taking the class. Be sure to introduce yourself to everyone in the Meet
and Greet forum.
If you have questions about course content, please post them in the forums to get help from others in the course
community. For technical problems with the Coursera platform, visit the Learner Help Center.
Good luck as you get started, and I hope you enjoy the course!
Class Terms
There was an offering of the class starting on 03/27/17. It will run for 5 weeks.
Before you dive in, we have a few tips to help you succeed:
1. Set a schedule and mark your calendar. There’s a lot of material to learn, and our most successful learners tell
us they’re diligent about setting a schedule and sticking to it. In most Specializations you should plan to
complete one course every 4-6 weeks.
2. Find a study group. Learners are much more likely to successfully complete a course if they enroll with
friends, or connect with other learners in the course forums.
Honor code discussion: Don’t post your work or other solutions to public sites.
Time management: 15 hours per week. Budget at least this amount of time.
Textbook
This course is based on the textbook “Probabilistic Graphical Models: Principles and Techniques”. The textbook is not
required to complete the course. However, if you’d like to go beyond what’s covered in the course, it could be a good
reference text.
You can get a copy of the book from https://mitpress.mit.edu/books/probabilistic-graphical-models. List price
is $120. As a learner in this MOOC, you can use the discount code DKPGM12 to get 30% off the hardcover and e-book
versions. That would bring it down to $84, which is about the same price that it can be purchased for on Amazon.com.
Structure of the Class
The class ends 5/07/17.
Week 1: Introduction and Overview
Motivation (19 minutes)
Why use these? Assess and combine information.
Bayesian Networks provide a framework for this kind of application.
Bayesian Networks provide declarative models.
Distributions (5 minutes)
We begin with a discussion of Random Variables. Example is grades in a course.
From here we move to distributions and joint distributions. For instance, consider three variables: Intelligence,
Difficulty, and Grade (2 x 2 x 3 = 12 joint assignments).
Discussion of independence of parameters. Values not completely determined by other parameters.
There are operations you can perform on a distribution, including:
 Condition on a value, such as g1: remove all inconsistent assignments.
 Reduction, then renormalize in order to get a probability distribution.
 Marginalization: sum out a variable by adding the values of the matching assignments.
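As a concrete sketch of these operations, here is the hypothetical Intelligence/Difficulty/Grade joint with made-up probabilities (the numbers are illustrative, not the lecture's actual CPD values):

```python
# Hypothetical joint distribution P(I, D, G): keys are (i, d, g)
# assignments, values are made-up probabilities that sum to 1.
P = {(0, 0, 0): 0.126, (0, 0, 1): 0.168, (0, 0, 2): 0.126,
     (0, 1, 0): 0.009, (0, 1, 1): 0.045, (0, 1, 2): 0.126,
     (1, 0, 0): 0.252, (1, 0, 1): 0.0224, (1, 0, 2): 0.0056,
     (1, 1, 0): 0.060, (1, 1, 1): 0.036, (1, 1, 2): 0.024}

# Condition on G = g1 (g == 0): first reduce, removing all
# inconsistent assignments...
reduced = {(i, d): p for (i, d, g), p in P.items() if g == 0}
# ...then renormalize so the remaining entries form a distribution.
z = sum(reduced.values())
conditioned = {k: p / z for k, p in reduced.items()}

# Marginalization sums out a variable: P(I, D) = sum over g of P(I, D, g).
marginal = {}
for (i, d, g), p in P.items():
    marginal[(i, d)] = marginal.get((i, d), 0.0) + p
```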
Factors (6 minutes)
A factor is a function that assigns a number to each possible assignment of a set of random
variables (its arguments).
So a factor really is a function, or a table. It takes a bunch of arguments, in this case a set of random variables X1 up to
Xk, and just like any function it gives us a value for every assignment to those random variables. So it takes all possible
assignments in the cross-product space of X1 up to Xk.
That is all possible combinations of assignments and in this case it gives me a real value for each such combination.
And the set of variables X1 up to Xk is called the scope of the factor. That is, it's the set of arguments that a factor takes.
Let's look at some examples of factors. We've already seen a joint distribution, and a joint distribution is a factor. For
every combination for example here of the variables I, D, and G, it gives me a number. As it happens this number is
a probability. It happens that it sums to one but that doesn't matter. What's important is that for every value of I, D, and
G (a combination of values), I get a number. That's why it's a factor.
Here's a different factor: an unnormalized measure is a factor also. In this case we have a factor such as the probability
of I, D, g1, and notice that in this case the scope of the factor is actually I and D, because there is no dependence of
the factor on the value of the variable G; the variable G in this case is held constant. So this is a factor whose scope is
I and D.
Factor products: consider a factor on (A, B) times a factor on (B, C). This gives a larger factor which has all values of
(A, B, C). Each value is the product of the value from (A, B) times the matching-B value from (B, C).
Factor marginalization: form sums from the matching rows that are found by removing the specified variable. So
Factor(A, B) has values that are found by summing Factor(A, B, c1) + Factor(A, B, c2).
Factor reduction: this amounts to simply a filtering operation.
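The three factor operations above can be sketched as a small Python class (the names Factor, multiply, marginalize, and reduce are my own, not the course's Octave code):

```python
from itertools import product

class Factor:
    """A factor: a table mapping each assignment of its scope to a value.
    `scope` is a list of variable names; `card` maps variable -> cardinality;
    `table` maps assignment tuples (one index per scope variable) to values."""
    def __init__(self, scope, card, table):
        self.scope, self.card, self.table = scope, card, table

    def multiply(self, other):
        # Factor product: scope is the union; each entry is the product of
        # the matching entries of the two input factors.
        scope = self.scope + [v for v in other.scope if v not in self.scope]
        card = {**self.card, **other.card}
        table = {}
        for a in product(*(range(card[v]) for v in scope)):
            assign = dict(zip(scope, a))
            table[a] = (self.table[tuple(assign[v] for v in self.scope)] *
                        other.table[tuple(assign[v] for v in other.scope)])
        return Factor(scope, card, table)

    def marginalize(self, var):
        # Sum out `var`: add together the rows that agree on everything else.
        scope = [v for v in self.scope if v != var]
        table = {}
        for a, val in self.table.items():
            key = tuple(x for v, x in zip(self.scope, a) if v != var)
            table[key] = table.get(key, 0.0) + val
        return Factor(scope, self.card, table)

    def reduce(self, var, value):
        # Reduction is just filtering: keep rows where var == value, drop var.
        i = self.scope.index(var)
        scope = [v for v in self.scope if v != var]
        table = {a[:i] + a[i+1:]: val for a, val in self.table.items()
                 if a[i] == value}
        return Factor(scope, self.card, table)
```

For example, multiplying a factor over (A, B) with one over (B, C) yields a factor over (A, B, C), exactly as described above.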
Now, why factors? It turns out that factors are the fundamental building block in defining distributions over high-dimensional spaces. That is, the way in which we're going to define an exponentially large probability distribution over
N random variables is by taking a bunch of little pieces and putting them together by multiplying factors in order to
define these high dimensional probability distributions. It turns out also that the same set of basic operations that we
use to define the probability distributions in these high dimensional spaces are also what we use for manipulating them
in order to give us a set of basic inference algorithms.
Bayesian Network Fundamentals
Semantics and Factorization (17 minutes)
A Bayesian network is defined as an acyclic graph of conditional probabilities. The latter part described how to confirm
that these are valid conditional probabilities.
Reasoning (9 minutes)
Here we discussed causal reasoning, evidential reasoning, and inter-causal reasoning, which are three of the primary
cases of performing reasoning using the Bayes net. Discussed how evidence can flow in the network, to update the
resulting probabilities.
Flow of Probabilistic Inference (14 minutes)
This went through details of which nodes can influence which other nodes.
There were six main cases:
X → Y
X ← Y
X → W → Y
X ← W ← Y
X → W ← Y
X ← W → Y
Consider cases of observed and not observed. Think of it as flow of water, except that valves open and close; observing
a variable closes (or, for v-structures, opens) the corresponding valve. Formal definition:
A trail X1 – X2 – ... – Xk is active given Z if two conditions hold. First, every v-structure Xi-1 → Xi ← Xi+1 on the
trail is activated, and the only way to activate a v-structure is for Xi or one of its descendants to be observed (in Z).
Second, all the other valves have to be open: no Xi that is not the middle of a v-structure is in Z.
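This definition can be sketched directly as a check on a given trail (the graph encoding as a children dict and the function names are my own; endpoints are assumed unobserved):

```python
def descendants(children, node):
    # All nodes reachable from `node` via child edges (excluding node itself).
    out, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_active_trail(children, trail, observed):
    """Check whether a trail (list of adjacent nodes) is active given the
    observed set Z. `children` maps node -> list of children, so an edge
    X -> W means W is in children[X]. Only interior nodes are checked."""
    for i in range(1, len(trail) - 1):
        x, w, y = trail[i - 1], trail[i], trail[i + 1]
        v_structure = w in children.get(x, []) and w in children.get(y, [])
        if v_structure:
            # A v-structure valve opens only if W or a descendant of W
            # is observed.
            if w not in observed and not (descendants(children, w) & observed):
                return False
        else:
            # Any other valve is closed by observing W.
            if w in observed:
                return False
    return True
```

On the usual student network (D → G ← I, I → S, G → L), the trail D–G–I is inactive with nothing observed, but becomes active once G (or its descendant L) is observed.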
Bayesian Networks: Independences
Conditional Independence (12 mins)
So far, we've defined graphical models primarily as a data structure for encoding probability distributions. So we talked
about how you can take a probability distribution, and using a set of parameters that are somehow tied to the graph
structure, go ahead and represent a probability distribution over a high dimensional space in a factored form.
It turns out that one can view the graph structure in a graphical model from a completely complementary viewpoint: as a
representation of the set of independencies that the probability distribution must satisfy.
That theme turns out to be really enlightening, and thought provoking. So let's talk about that. We are going to begin
by just defining the notion of independencies that we're going to utilize in subsequent presentations. So let's start by just
defining the very basic notion of independence within a probability distribution. Initially we're just going to talk about
the independence of events alpha and beta within a probability distribution, and let me just go ahead and introduce the
notation P ⊨ (α ⊥ β). The "satisfies" symbol (⊨) looks like bar-equals, and the "independent" symbol (⊥) looks like an
upside-down T.
Now we will talk about conditional independence, which is a broader notion and occurs more often, since it is rare for
two things to be truly independent. Defined as: P ⊨ (X ⊥ Y | Z) if P(X, Y | Z) = P(X | Z) P(Y | Z).
Conditioning can also lose independence (for instance, observing a common effect makes its causes dependent).
Independencies in Bayesian Networks (18 mins)
One of the most elegant properties of probabilistic graphical models is the intricate connection between the factorization
of the distribution as a product of factors and the independence properties that it needs to satisfy. Now we're going to
talk about how that connection manifests in the context of directed graphical models, or Bayesian networks.
So, for example, the independence definition that P of X, Y is the product of two factors P of X and P of Y is the
definition of independence. And at the same time it's a factorization of the joint distribution as a product of two factors.
Similarly one of the definitions that we gave for conditional independence, which is the joint distribution over X, Y and
Z is a factor over X and Z times a factor over Y and Z, is the definition of conditional independence. So, once again,
independence is related to factorization.
Variables depend only on parents.
I-maps: a graph G is an I-map of a distribution P if every independence that G encodes (via d-separation) actually holds
in P. In effect, the graph is a compact representation of the independence structure of the Bayes net.
Naïve Bayes (9 mins)
One subclass of Bayesian networks is the class called Naïve Bayes, or sometimes, even more derogatorily, Idiot Bayes.
As we'll see, Naïve Bayes models are called that because they make independence assumptions that are indeed
very naive and overly simplistic. And yet they provide an interesting point on the tradeoff curve of model complexity
that sometimes turns out to be surprisingly useful.
So here is a naive Bayes model. This model is typically used for classification, that is, labeling an instance where we have
effectively observed a bunch of features.
The assumption is that all features are independent given the class. There are two variations: Bernoulli and Multinomial.
Surprisingly effective for text classification.
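A minimal from-scratch sketch of the Bernoulli variant, with Laplace smoothing so no probability is ever a hard zero (the data layout and function names are illustrative, not the course's code):

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """X: list of binary feature vectors; y: list of class labels.
    Returns class priors and per-class Bernoulli parameters, with Laplace
    smoothing (alpha) so no estimated probability is exactly zero."""
    classes = sorted(set(y))
    n_feat = len(X[0])
    prior, theta = {}, {}
    for c in classes:
        rows = [x for x, lab in zip(X, y) if lab == c]
        prior[c] = len(rows) / len(X)
        theta[c] = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                    for j in range(n_feat)]
    return prior, theta

def predict(prior, theta, x):
    # Naive Bayes assumption: features are independent given the class,
    # so the likelihood is a product over features (a sum of logs).
    best, best_lp = None, -math.inf
    for c in prior:
        lp = math.log(prior[c])
        for j, xj in enumerate(x):
            p = theta[c][j]
            lp += math.log(p if xj else 1.0 - p)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```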
Bayesian Networks: Knowledge Engineering
Application: Medical Diagnostics (9 mins)
One of the most common applications of Bayesian networks, or rather one of the earliest ones that are still very much in
use today, is for the purpose of diagnosis. By diagnosis I mean both medical as well as fault diagnosis. This dates
back to the early 90s and the Ph.D. thesis of David Heckerman, which won the ACM Doctoral Dissertation Award, in a
system called Pathfinder. This looked at a range of different pieces of evidence in order to help a doctor diagnose a set
of diseases.
The problem was to help a pathologist diagnose lymph node pathologies. There are 60 different diseases.
See David E. Heckerman, Eric J. Horvitz, and Bharat N. Nathwani
There were several versions of Pathfinder – the first was simply rules-based, the second was Naïve Bayes, and then
Pathfinder version 3 still used naive Bayes but combined it with better knowledge engineering.
At that point they actually understood some of the issues behind what makes a system like this work, and they were able
to fix them. Specifically, one thing that turns out to be really fundamental for the performance of any probabilistic
modeling system is to never put in zero probabilities, except for things that are true by definition, because once
you put in a zero, no matter how much evidence to the contrary you have, you will never be able to get rid of it.
The net was all manually constructed.
Then finally Pathfinder 4 was a full Bayesian network in all its glory, and it no longer made incorrect
assumptions about independencies between different symptoms given the disease. This made the model more correct, with
better estimation of probabilities. It was accurate in 50 cases out of 53, while the naïve Bayes version was only 47/53.
Another example was the CPCS network (medical) with about 500 variables, and about 4,000 parameters
(https://www.ncbi.nlm.nih.gov/pubmed/1762579). This was done by Max Middleton, Eric Horvitz and others at
Stanford. Eric Horvitz later went to Microsoft.
The final example is Fault Diagnosis, such as operating system failures: e.g. printer problems, etc.
A large web site for car repair is based on Bayesian networks.
Knowledge Engineering Example - SAMIAM (14 mins)
So now let’s look at an example of an actual network, and try to see what the CPD’s look like, what behavior we
get, and how we might augment the network to include additional things. Now, let me warn you right upfront that this is
a baby network; it’s not a real network, but it’s compact enough to look at, yet still interesting enough to get some non-trivial behaviors.
We’re going to use a system called SAMIAM. It was produced by Adnan Darwiche and his group at UCLA, and it’s
nice because it actually works on all sorts of different platforms. The name stands for Sensitivity Analysis, Modeling,
Inference, and More. Enables you to set up networks, edit the values, apply evidence, and monitor the results.
Here is a sample screen:
Here we can build Bayes Nets and try them. This software is version 3.0, and was released in 2010.
The basic operations are to Add Nodes and Add Edges, then edit the values, where the values are the name and
identifier, then the conditional probabilities. You can save and load these networks.
We are going to create a simple network that predicts the cost of insuring a driver. Starting with a node called “Cost”,
which has two values “High” and “Low”. In a real network, you would use a continuous variable instead.
In the lecture, she starts with a network of two nodes, shows the CPD’s and shows the results, then starts switching to
larger and larger networks, trying out different effects on the “Accident” node, which leads to the “Cost” node. There is
a way in SAMIAM to indicate an observation (then the node is shown in Red).
With about 8 nodes, and a few possible active trails through the network, the effects can be quite subtle. There is an
example of “Good_Student” and “Age” given. The lecture ends with a comment that this network can be downloaded
and tried on your own, but I could not find it.
Using Octave
There were 6-7 videos by Andrew Ng about basic concepts and operations in Octave. These have been moved into our
document “Notes on Octave”. Octave is much like R or Matlab. The most useful part for this class is that it has a way
to submit assignments:
Week 1 Honors Programming Assignment
This required writing code for factor operations such as conditionalization. There were several steps which built on
each other. You can submit code several times, and get a score for each task in turn.
I learned quite a bit about Octave, and completed all of the projects.
Week 1 Checkpoint and Summary (submit by April 2, 2017)
This is the end of the first week. I was quite pleased with the presentation and projects.
The time breakdown seems to have been:
 5-6 hours of videos
 5 hours on the quizzes
 About 2 hours to install Octave and learn the basic operations (made a comparison with R).
 2 hours of working with SAMIAM, with half of that time learning, and half setting up a network with correct
behavior from the CPD’s.
 5 hours on programming using Octave (instructor’s estimate was 6 hours).
Week 2: Template Models for Bayesian Networks
This builds on the first week.
In many cases, we need to model distributions that have a recurring structure. In this module, we describe
representations for two such situations. One is temporal scenarios, where we want to model a probabilistic structure that
changes over time; here, we use Hidden Markov Models, or, more generally, Dynamic Bayesian Networks. The other is
aimed at scenarios that involve multiple similar entities, each of whose properties is governed by a similar model; here,
we use Plate Models.
Overview of Template Models (10 mins)
Today's topic is an important extension of the language of graphical models. It's intended to deal with the very large
class of cases where what we'd like to do is not just write down one kind of graphical model for a particular
application, but rather come up with a general-purpose representation that allows us to solve multiple problems using
the same exact model. Hence, in this week, we will describe structures that occur often, and templates that can be
reused.
Our first example is genetic inheritance, where we are interested in reasoning about a trait. Templates allow sharing
between models, such as the idea that genotype drives blood type for each person, while genotype is inherited from the
two parents. Parameters are shared.
Another example is NLP Sequence Models, specifically named entity recognition (Is it a person? Is it a location? Etc.).
Then she discusses image segmentation models.
Let’s look at the “university example” more closely. This is ‘difficulty’, ‘intelligence’, ‘grade’ applied across many
students, at the scale of a university.
Another example is robot localization, where a robot can be in one of a set of rooms, determined via sensors. The robot
dynamics are fixed.
This leads us to examples of a Template Variable, which is something that we end up repeating over and over again.
These are indexed by “person”, “pixel”, “course”, “t”, etc.
This leads us to the definition of Template Models:
Temporal Models – Dynamic Bayesian Nets (23 mins)
There are many classes of models that allow us to represent, in a single concise representation, an overarching template
model that incorporates multiple copies of the same variable, and also allow us to represent multiple models as a
byproduct of a single representation. One of the most commonly used among those is for reasoning about template models
where we have a system that evolves over time.
We represent a distribution over template trajectories. So the first thing we want to do when representing a distribution
over continuous time is, in most cases (not always), to try to forget that time is actually continuous, because
continuous quantities are harder to deal with. So we're going to discretize time into units of "delta", which is the time
granularity at which we're going to measure time. Now, in many cases this is given to us by the granularity of our
sensor: for example, with a video or a robot there is a certain time granularity at which we obtain measurements, and
that's usually the granularity we'll pick.
So now we have a set of random variables written as X^(t) ("X super t"), the value of X at time t.
The first assumption we should make for simplification is the “Markov Assumption”. This means that the changes in
state are entirely dependent upon the current state, not the historical state. This is a “forgetting assumption”
The other assumption we often make is the “time invariance” assumption.
She then described that some models combine time-varying with time-invariant behavior in different parts of the model.
Here is an example:
In this case, the model for weather, velocity, location, and failure are directly copied over from the prior time slice,
while the observations are added at time slice t+1.
This is known as a 2-time-slice Bayesian network (2TBN):
An unrolled network is called a “ground network”.
One example network for car motion (from AAAI 1994) was then shown.
To summarize, dynamic Bayesian networks provide us with a language for encoding structured distributions over
time. By making the assumptions of Markovian evolution as well as time invariance, we can use a single
compact network to encode distributions over arbitrarily long time sequences.
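As a toy sketch, a single time-invariant transition CPD can be reused at every slice to unroll and sample an arbitrarily long trajectory (the weather states and probabilities are made up for illustration):

```python
import random

# Illustrative 2TBN transition model P(W' | W) for one state variable.
# Markov + time invariance: the same table is reused at every time slice.
TRANSITION = {'sun': {'sun': 0.8, 'rain': 0.2},
              'rain': {'sun': 0.4, 'rain': 0.6}}

def unroll_and_sample(initial, n_steps, seed=0):
    """Unroll the 2TBN into a ground network of n_steps slices and sample
    one trajectory by repeatedly applying the same transition CPD."""
    rng = random.Random(seed)
    traj = [initial]
    for _ in range(n_steps):
        dist = TRANSITION[traj[-1]]
        states = list(dist)
        weights = [dist[s] for s in states]
        traj.append(rng.choices(states, weights=weights)[0])
    return traj
```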
Temporal Models – Hidden Markov Models (12 mins)
A Hidden Markov Model (HMM) in its simplest form can be viewed as a probabilistic model that has a state variable S
and a single observation variable O. So the model really has only two probabilistic pieces, there is the transition model
that tells us the transition from one state to the next over time and then there is the observation model, that tells us in a
given state how likely we are to see different observations.
As opposed to something that manifests at the level of the 2TBN structure, this kind of simple structure is useful for a
broad range of applications, such as: robot localization, and speech recognition, where HMMs are really the method
of choice for all current speech recognition systems (some of this dates back to the Hearsay system in the 1970’s-80’s).
To me, this seems to resemble the Kalman filter models of the 1970’s (Wikipedia gives a description of Kalman filters
as: The underlying model is a Bayesian model similar to a hidden Markov model but where the state space of the latent
variables is continuous and where all latent and observed variables have Gaussian distributions).
The application in speech sequence recognition is based on the idea of representing a path through the states as a
transition matrix that derives S(n+1) from S(n).
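The standard HMM forward algorithm computes the probability of an observation sequence from exactly these two pieces, the transition model and the observation model (a compact sketch; variable names are conventional, not from the course code):

```python
def forward(pi, A, B, obs):
    """HMM forward pass. pi[i]: initial state probabilities; A[i][j]:
    transition probability from state i to state j (S(n) -> S(n+1));
    B[i][o]: probability of observation o in state i. Returns P(obs)."""
    n = len(pi)
    # Initialize with the first observation.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Propagate: transition, then weight by the next observation.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)
```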
Robot motion example
Speech recognition example. This had a discussion about analysis by phonemes.
Summary:
Plate Models (20 mins)
Let's start by modeling repetition. In this case, imagine that we're repeatedly tossing the same coin again and again.
So we have an outcome variable, and what we'd like to model is the repetition of multiple tosses. So we're going
to put a little box around that outcome variable. This box, which is called a plate, is a way of denoting that the
outcome variable is indexed by the different tosses of the coin, t. The reason for calling it a plate is the intuition
of a stack of identical plates; that's kind of where the idea comes from for a plate model.
Just repeating the exact same model over and over again isn’t that interesting.
Instead, we have parameters inside and outside the plates.
The examples will be around a university with multiple students, courses, difficulties, etc. The notation is that the
variables are inside the box and the box is labeled with the index.
These could be further nested.
For instance, nest the student plate inside the course plate; variables are then indexed by both S and C. If nested, a
variable carries the indices of all enclosing plates.
Or, overlapping plates:
These models are useful for collective inference.
Plate dependency model:
 Template variable, indexed
 Template parents
 Can’t have an index in the parent that doesn’t appear in the child.
Ground network: this is the unrolled version of a plate model.
Summary:
Quiz: Template Models (10 questions)
Get ready for this.
Overview of Structured CPD’s (8 mins)
So far we focused solely on the global structure of the distribution. The fact that you can take it and factorize it as a
product of factors that correspond to subsets of variables is useful, but it turns out that you also have other types of
structure that you might want to encode and that is actually really important for real world applications.
To motivate that, let's look at the tabular representation of conditional probability distributions, which is what we've
used universally until now in the examples we've given. A tabular representation is one where we have a row for each
assignment. As a reminder, this is G with parents I and D.
Here we have a row for each assignment of the parents, explicitly enumerating all of the entries that correspond to the
probabilities of the variable G. So this is great, but then what's the problem? It sounds like a perfectly reasonable and
very understandable representation.
In a medical application, there might be a variable called “cough”. There could be several reasons. This kind of
variable could have 10 or 20 parents, but if we have k parents (binary), then the number of entries in the CPD is 2 raised
to the k.
These situations are more common than not. Hence, tabular definitions are tedious.
But a fully-specified probability distribution is not required. Use a parameterized expression; just make sure the values
sum to one. Here are examples:
Note that decision trees are within this set. Also continuous variables (useful because you can’t use a table here).
Definition of context-specific independence.
Tree-Structured CPD’s (14 mins)
One of the most useful classes of structured CPDs is the class of CPDs that encode a dependence of a child on
a parent that only happens in certain contexts. One method for encoding that is the class of what are called
tree-structured CPDs.
So, to understand what tree-structured CPDs are, let's look at this simple example. Imagine that we have 3 graduate
students, and one of the students is applying for a job, indicated by the variable ‘Job’. The prospects of the student
getting the job depend on three variables: the quality of the recommendation letter that they get from the faculty
member, their SAT score, and whether the student chooses to apply for the job in the first place. So let's think
about one possible CPD for this model.
So here we have a tree structure that you can think of as a branching process. The first level is “apply for
job or not”. Here is the full example:
The variable A over here is the multiplexer, the selector variable. The selector variable takes on values in the space 1
to k, and it selects which of the Z_i's the Y copies. Notice that the Y here is deterministic, as we can see by the fact that
we have these two lines surrounding it, which is our way of indicating deterministic dependencies.
The idea of the multiplexer can be very useful.
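As a sketch, the multiplexer CPD is deterministic: it puts all of its probability mass on the value of the selected parent (the 1-based selector and function name are my own illustrative encoding):

```python
def multiplexer_prob(y, a, z):
    """P(Y = y | A = a, Z_1..Z_k = z): a deterministic CPD. The selector
    A takes a value in 1..k, and Y copies Z_a with probability 1.
    `a` is 1-based; `z` is the list [Z_1, ..., Z_k]."""
    return 1.0 if y == z[a - 1] else 0.0
```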
The use of tree CPD’s can almost give a rule-based system, such as that used for diagnosing operating system failures:
This is from Microsoft (Eric Horvitz)
The final discussion is about perceptual ambiguities.
Independence of Causal Influence (13 mins)
This lecture covered important topics that lead from binary true/false models, to continuous models.
First a discussion about the “Noisy OR” function.
Then a discussion about “Sigmoid” function in models, which can deal with multiple inputs independently, and avoid
some of the computational overload.
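A sketch of the noisy-OR CPD: each active parent independently fails to turn Y on with probability 1 − λᵢ, and a leak term lets Y turn on even with no active parents (parameter names are illustrative):

```python
def noisy_or(leak, lambdas, x):
    """Noisy-OR CPD: P(Y = 1 | X_1..X_k). Y stays off only if the leak
    and every active parent X_i all fail independently, where parent i
    fails with probability (1 - lambdas[i])."""
    p_off = 1.0 - leak
    for lam, xi in zip(lambdas, x):
        if xi:
            p_off *= (1.0 - lam)
    return 1.0 - p_off
```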
Consider a Bayesian network that works for traits that are controlled by multiple genes. For this part of the assignment,
you will consider the example of spinal muscular atrophy (SMA). Think about how incorporating multiple genes
changes the network.
We use a form of “weighted evidence”. Thus, there is a different weight for each allele of each gene, and the alleles that
are most involved in causing a person to have a trait have the highest corresponding weights. Note that we are assuming
that the weight for the copy of the gene from the mother is the same as the weight for the copy of the gene from the
father. The larger the value of f(X^1_1, ..., X^1_{n1}, ..., X^m_{nm}, Y_1, ..., Y_{nm}), the larger the likelihood that a
person has the trait.
Here f is a weighted summation followed by a sigmoid, which takes (–infinity, +infinity) and reduces it to (0, 1).
Not every input/output combination may be needed for a sigmoid CPD.
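A sketch of a sigmoid CPD along these lines (the weights and bias are illustrative placeholders, not the assignment's actual parameters; in the SMA-style model, x would hold allele indicators, with the same weight used for the maternal and paternal copy of each gene):

```python
import math

def sigmoid_cpd(weights, x, bias=0.0):
    """Sigmoid CPD: P(Y = 1 | X) = sigmoid(bias + w . x). The weighted
    sum ranges over (-inf, +inf) and the sigmoid squashes it into (0, 1),
    so each parent contributes independently through its weight."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))
```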
Finally, another discussion regarding CPCS, where the approaches listed here brought the number down to about 1000
parameters.
Continuous Variables (13 mins)
These are extremely valuable in the context of graphical models, because they allow us to represent information without
discretizing it into values such as “high”, “medium”, and “low”. In turn this allows for a much sparser
representation since you don’t have to enumerate out all of the CPD values for how each variable value influences each
other variable value in a tabular form. And when we have a network that involves continuous variables, tables are
simply not an option.
So let's look at some examples of networks that involve continuous variables, and see what kind of representations we
might want to incorporate here. The discussion in this lecture is entirely about modeling and representation, there is no
discussion about how to calculate the resulting distributions.
Example is temperature in room. Sensor S has normal distribution of errors.
A typical model is the linear Gaussian. This is a Gaussian whose mean is a linear combination of the parent values, and
which has a fixed variance.
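A minimal sketch of sampling from a linear Gaussian CPD under this definition (the function name is my own):

```python
import random

def sample_linear_gaussian(w0, weights, parents, sigma, rng=random):
    """Linear Gaussian CPD: Y ~ N(w0 + sum_i w_i * x_i, sigma^2).
    The mean is a linear function of the parent values; the variance
    sigma^2 is fixed and does not depend on the parents."""
    mean = w0 + sum(w * x for w, x in zip(weights, parents))
    return rng.gauss(mean, sigma)
```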
The next example was robot localization. In this case we are comparing the performance of the laser and sonar sensors.
She also explains the “wall-like peak” at the max range.
Finally, she gives a discussion of robot motion. The resulting position distributions were banana-shaped; the shape
becomes more diffuse as uncertainty increases.
Quiz: Structured CPD’s
This only had four questions, which were about structure and independencies in the various types of CPDs covered in
the lectures.
Week 2 Honors Programming Assignment
The project is about Genetic Inheritance models. You create these models in Octave code, and then run a utility to
convert them to SamIAM networks (in order to run the models). To confirm that you have done this correctly, you must
answer quiz questions about the behavior of the network.
There are two stages to the programming part, the second one being a decoupled network. Finally, you run a provided
SamIam model, and that will be used for more quiz questions about the behavior of the network.
The programming was not too difficult, but it did use more concepts of Octave than I had used before, helping me
understand more about working with matrices, using statistical functions, etc.
The quiz totaled 11 questions, and connected the lectures and project together much better than I expected. For instance, the lecture went by very fast on decoupling in networks and the use of sigmoids to simplify networks, but now one is programming it and answering questions.
Week 2 Checkpoint and Summary (submit by April 9, 2017)
Completed on time. The programming assignment took more like 11-12 hours, which matched the estimate for this week (interestingly, most of that time was reading and reviewing concepts of genetics; the actual amount of programming was not that large).
At this point we have covered a wide range of Bayesian network structures and applications. This was quite
comprehensive. The use of SamIam for evaluating the models was a good idea, as was the code that creates a SamIam
network from functions in Octave.
I am becoming more interested in the code for evaluating networks, which isn’t covered until the next 5-week class that
is part of this specialization.
Week 3: Markov Networks
Here is an introduction to the concept from Wikipedia:
A Markov logic network (MLN) is a probabilistic logic which applies the ideas of a Markov network to first-order logic,
enabling uncertain inference. Markov logic networks generalize first-order logic, in the sense that, in a certain limit,
all unsatisfiable statements have a probability of zero, and all tautologies have probability one.
The goal of inference in a Markov logic network is to find the stationary distribution of the system, or one that is close to it; that this may
be difficult or not always possible is illustrated by the richness of behavior seen in the Ising model. As in a Markov network, the stationary
distribution finds the most likely assignment of probabilities to the vertices of the graph; in this case, the vertices are the ground atoms of
an interpretation. That is, the distribution indicates the probability of the truth or falsehood of each ground atom. Given the stationary
distribution, one can then perform inference in the traditional statistical sense of conditional probability: obtain the probability that formula
A holds, given that formula B is true.
Inference in MLNs can be performed using standard Markov network inference techniques over the minimal subset of the relevant Markov
network required for answering the query. These techniques include Gibbs sampling, which is effective but may be excessively slow for
large networks, belief propagation, or approximation via pseudolikelihood.
Pairwise Markov Networks (10 mins)
We begin week three here and discuss Markov networks. By contrast with Bayesian networks, these are undirected: relationship lines go in any configuration and don’t have arrows. They are often used to indicate “pairwise closeness” or “pairwise similarity”. The example was four people in a study group, with relationships indicating who worked well with whom.
We can’t use many of the concepts from the prior lessons, since there are no parent variables to condition on.
But we will still use the definition of a factor. The factors typically indicate a measure of the pairwise relationship, and are often called “affinity functions”.
She asks us to think about the behavior within the network as a measure of “happiness” i.e., which nodes are in
agreement with which other nodes.
How do we turn this unnormalized measure into a probability distribution? We normalize it. The normalizing constant has a name: it is called the partition function, for historical reasons that come from its origins in statistical physics.
You can think of it simply as the normalizing constant that makes all of the entries sum to one. We get it by summing up all of the entries, which gives us the value Z. If we divide all of the entries by Z, we get a normalized probability distribution, and that is the probability distribution defined by this graph.
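As a tiny sketch of this normalization for a pairwise Markov network (the chain structure and the affinity values below are invented, not the lecture's study-group numbers):

```python
from itertools import product

# Pairwise factors over binary variables A, B, C arranged in a chain A-B-C.
phi_AB = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
phi_BC = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}

def unnormalized(a, b, c):
    # The unnormalized measure is the product of the factors.
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# The partition function Z sums the unnormalized measure over every
# joint assignment; dividing by Z yields a probability distribution.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))
P = {(a, b, c): unnormalized(a, b, c) / Z
     for a, b, c in product([0, 1], repeat=3)}
```

Note that enumerating every assignment is only feasible for toy networks; that is exactly why inference algorithms (covered in the next course) are needed.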
So now we can start to make some mathematical analysis of the information in the graph.
But an individual factor still does not have the meaning of a probability distribution in the sense of the prior weeks.
The final example given is a grid, which corresponds to pixels in an image. This will be important in the homework,
since we are analyzing images there.
Generalized Gibbs Distribution (15 mins)
So we've previously seen the notion of Pairwise Markov networks. But now we're going to define a much more general
notion, that is considerably more expressive than the Pairwise case. And that definition is called the Gibbs distribution.
Let’s consider having four random variables, each of which is linked to the other three.
So consider a fully connected pairwise Markov network over n random variables, and assume that each variable Xi has d values. The number of parameters is O(n² · d²).
But it is all a set of factors, not probability distributions.
Create an induced Markov network.
To define this general framework more formally: a Gibbs distribution is parameterized by a set of factors Phi, where each factor phi_k in Phi has a particular scope D_k. The induced Markov network, which we call H_Phi, has an edge between a pair of variables Xi and Xj whenever there exists a factor phi_k in Phi such that Xi and Xj are both in its scope. That is, two variables are connected whenever they appear together in the same scope.
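This construction is simple enough to sketch directly (the two factor scopes below are invented examples):

```python
from itertools import combinations

def induced_markov_network(scopes):
    """Edges of the Markov network induced by a set of factor scopes:
    connect every pair of variables that co-occur in some scope."""
    edges = set()
    for scope in scopes:
        for u, v in combinations(sorted(scope), 2):
            edges.add((u, v))
    return edges

# Factors phi1(A, B, C) and phi2(B, D) induce a triangle on A, B, C
# plus the single edge B-D.
edges = induced_markov_network([{"A", "B", "C"}, {"B", "D"}])
```

This also illustrates why the factorization cannot be read back from the graph: the triangle on A, B, C could equally have come from three pairwise factors.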
Factorization: Now we can go ahead and turn this around and define a notion, just like we have for Bayesian network,
of when a probability distribution P factorizes over a graph H, that is, at what point can I represent P over a
particular graph H?
Cannot read the factorization from the graph.
The definition of an active trail in a Markov net is simpler than that for a Bayesian net: a trail is active given Z as long as no variable along it is in Z, since there is only one type of (undirected) edge and hence only one kind of influence flow.
Summary
Conditional Random Fields (22 mins)
There is a note on the courseware that says: This lecture is an important one; over 70% of people review it more than
once and it likely covers some critical concepts.
A conditional random field looks very much like a Markov network, but serves a somewhat different purpose. This class of model is intended for task-specific prediction: we have a set of observed input variables X and a set of target variables Y that we are trying to predict. It is designed for cases where the same types of variables always serve as the inputs and the same types as the targets.
An example is image segmentation (classification of pixels).
Text processing is another example (classification of words such as person, location, etc.).
CRF representation: product of a series of Gibbs factors.
Apply a logistic model.
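In the simplest case — a single binary target with indicator features — the CRF's normalized product of exponentiated factors reduces to exactly the logistic model. A sketch with invented feature values and weights:

```python
import math

def crf_predict(x_features, weights):
    """P(Y=1 | x) for a CRF with one binary target: the product of
    exponentiated feature terms, once normalized over Y in {0, 1},
    collapses to a sigmoid of the weighted feature sum."""
    z = sum(w * f for w, f in zip(weights, x_features))
    return 1.0 / (1.0 + math.exp(-z))

p = crf_predict([1.0, 0.0, 2.0], [0.8, -1.2, 0.3])
```

Because the model conditions on X, correlations among the observed features never have to be modeled — which is the practical appeal of CRFs for image segmentation and text labeling.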
Summary
Quiz: Markov Networks (4 questions)
Was not too difficult.
Independencies in Markov Networks (4 mins)
We've shown the connection between independencies and the factorization of the distribution in the context of Bayesian networks. Now we show that the same kind of connection also holds for Markov networks. First we need a similar notion of what kinds of independencies are encoded by the structure of the graphical model. In this case we have a notion of separation, which is analogous to d-separation, except that there is no “d” for directed: it's just separation. It's actually a much simpler notion, because there are not multiple kinds of flows of influence. There is only one type of edge, the undirected edge, so it's very simple.
Hence we can use the separation property, and also re-use the concepts of i-maps (independency map).
I-map and Perfect Maps (20 mins)
This lecture is about independence in networks, and how the structure represents the independencies. We use the notation I(P) to represent the independencies in P, and then look at other networks to determine if they are equivalent, minimal, or comparable.
We would also like to have a sparse I(P), to reduce excess complexity.
An I-map is a graph whose independencies are a subset of the independencies of P; it may be incomplete. A perfect map captures exactly the independencies I(P); it is minimal, and it may not exist.
The rules for analysis of a MN are similar to those for a BN, but a MN cannot express certain patterns, such as a v-structure where A and B are independent marginally yet dependent given C. Hence the set of expressible independencies is smaller, and there may not be a perfect map as a MN that captures the correct information.
Is a perfect map unique? Yes, up to I-equivalence.
Quiz: Independencies in Markov Networks
Get ready, there are 3 questions
Log-Linear Models (Local Structure) (22 mins)
Local structure that doesn't require full table representations is important in both directed and undirected models. How
do we incorporate local structure into undirected models? The framework for that is called log-linear models for reasons
that will be clear in just a moment.
So whereas in the original representation of the unnormalized density we define P-tilde to be a product of factors phi_i(D_i), each of which is potentially a full table, we now shift to a representation that uses a linear form.
So here is a linear form that is subsequently exponentiated. That's why it's called log-linear: the logarithm is a linear function. This linear function has:
 things that are called coefficients (weights)
 things that are called features
Features are factors each having a scope, which is the set of variables on which the feature depends. Different features
can have the same scopes.
Example:
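A minimal numeric sketch of such a log-linear model (the two indicator features and their weights are invented):

```python
import math
from itertools import product

# A tiny log-linear model over two binary variables X1, X2.
def f_agree(x1, x2):   # indicator feature: the variables agree
    return 1.0 if x1 == x2 else 0.0

def f_x1_on(x1, x2):   # indicator feature: X1 takes value 1
    return 1.0 if x1 == 1 else 0.0

features = [f_agree, f_x1_on]
weights = [1.5, -0.5]

def p_tilde(x1, x2):
    # Unnormalized density: exp of a weighted sum of features -- hence
    # "log-linear", since the log is linear in the weights.
    return math.exp(sum(w * f(x1, x2) for w, f in zip(weights, features)))

Z = sum(p_tilde(a, b) for a, b in product([0, 1], repeat=2))
P = {(a, b): p_tilde(a, b) / Z for a, b in product([0, 1], repeat=2)}
```

Two features with four weights here replace what would otherwise be a four-entry table per factor; for large scopes this is where the savings come from.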
These models are a good way to represent the combination of multiple independent pieces of evidence.
The first example is for language (text) analysis.
The Ising model is discussed next:
The model consists of discrete variables that represent magnetic dipole moments of atomic spins that can be in
one of two states (+1 or −1). The spins are arranged in a graph, usually a lattice, allowing each spin to interact
with its neighbors. The model allows the identification of phase transitions, as a simplified model of reality.
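The Ising energy function is simple enough to write down directly; this sketch uses a 1-D chain of spins for brevity, whereas the lecture's lattice is 2-D:

```python
def ising_energy(spins, J=1.0, h=0.0):
    """Energy of a spin configuration: E = -J * (sum over neighboring
    pairs of s_i * s_j) - h * (sum of spins). Lower energy means
    higher probability under the Gibbs distribution exp(-E)."""
    interaction = sum(spins[i] * spins[i + 1] for i in range(len(spins) - 1))
    field = sum(spins)
    return -J * interaction - h * field

aligned = [1, 1, 1, 1]    # all spins agree: lowest energy
mixed = [1, -1, 1, -1]    # alternating spins: highest energy
```

With positive J, agreeing neighbors lower the energy — exactly the "pairwise agreement" behavior described for pairwise Markov networks above.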
Then a discussion about metric MRFs (metric Markov random fields).
Finally, a discussion about using metric MRFs for image analysis.
Shared Features in Log-Linear Models (8 mins)
Shared structures and shared parameters come up in directed models, but they come up as much or perhaps even more in undirected models. That is because, for reasons we've already discussed, eliciting parameters in an undirected model is much more difficult: they don't represent conditional probabilities, or probabilities at all. So it's a lot easier to represent them as templates composed of smaller building blocks.
Summary:
Programming Assignment: Markov Networks for OCR
This was rated at 15 hours, which is more than prior weeks, so plan ahead. Here is the summary:
In the last assignment, you used Bayesian networks to model real-world genetic inheritance networks. Your rival claims that this application to
genetic inheritance underscores the limited applicability of graphical models, because one doesn’t often find problems with network structures that
clear.
To prove him wrong, you decide to apply the graphical model framework to the task of optical character recognition (OCR), a problem that is
considerably messier than that of genetic inheritance. Your goal is to accept as input an image of text and output the text content itself.
The real-world applications of OCR are endless. Some examples include:
 The Google Books project has scanned millions of printed books from libraries around the country. Using OCR, these book scans can be converted to text files, making them available for searching and downloading as eBooks.
 There has long been research on OCR applied to handwritten documents. This research has been so successful that the US Postal Service can use OCR to pre-sort mail (based on the handwritten address on each envelope), and many ATMs now automatically read the checks you deposit so the funds are available sooner without the need for human intervention.
 Research on OCR for real-world photos has made it possible for visually impaired individuals to read the text of street and building signs with the assistance of only a small camera. The camera captures an image of the sign, runs OCR on that image, and then uses text-to-speech synthesis to read the contents of the sign.
In this assignment, we will give you sets of images corresponding to handwritten characters in a word. Your task is to build a graphical model to recognize the character in each image as accurately as possible. This assignment is based on an assignment developed by Professor Andrew McCallum, Sameer Singh, and Michael Wick, from the University of Massachusetts, Amherst.
This week, you will construct a Markov network with a variety of different factors to gain familiarity with undirected
graphical models and conditional random fields (CRFs) in particular. We provide code that will perform inference in the
network you construct, so you can find the best character assignment for every image and evaluate the performance of
the network.
This didn’t involve all that much programming, but it required an understanding of the models and concepts.
Week 3 Checkpoint and Summary (submit by April 16, 2017)
As of the end of this week, we have spent a whole week on Markov nets, and these are quite useful. The programming exercise made more of the concepts clear: for instance, we began by considering only one letter at a time, and moved up to three letters at a time, including the relationships between possible letters. The score improved with each revision. In a sense, the single-letter version was not Markovian at all; it just found the best match.
Week 4: Decision Theory and Decision Making
In this module, we discuss the task of decision making under uncertainty. We describe the framework of decision
theory, including some aspects of utility functions. We then talk about how decision making scenarios can be encoded
as a graphical model called an Influence Diagram, and how such models provide insight both into decision making and
the value of information gathering.
Maximum Expected Utility (25 mins)
This is another lecture that is marked in the courseware as being particularly critical.
We've shown how probabilistic graphical models can be used for a variety of inference tasks, like computing conditional probabilities or finding the MAP assignment. But often the thing that you actually want to do with a probability distribution is make decisions in the world. For example, if you're a doctor encountering a patient, it's not enough to figure out what disease the patient has; ultimately you need to decide what treatment to give the patient. How do we use a probability distribution, and specifically a probabilistic graphical model, to make decisions? It turns out that the theoretical foundations for this kind of reasoning were established long before probabilistic graphical models came to the fore, in the framework of maximum expected utility. So let's formulate the problem that we are trying to solve, and then define the principle of MEU, or maximum expected utility.
We organize the problem, including the utility function, then state the goal as one of maximization over the possible actions. This can require iteratively finding the maximum, e.g., using gradients.
Consider an influence diagram. This is an extension of the Bayesian network (i.e., random variables), plus action variables and utility nodes. There is a visual convention for each, used throughout the diagrams of the course and homework.
Actions don’t have a CPD.
Utility functions can represent the value of an outcome. The utility function may be driven by a large number of
random variables and actions, so it is common to split into different components of the utility function. In the diagram
shown in the lecture, she has broken down the utility function into constituent pieces.
We also represent decision rules at actions. These are effectively a CPD.
She never actually talks about how to find a maximum, but is very good at organizing the terms and formulas.
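For a single decision with a fully observed utility table, the maximization is just an enumeration. A minimal sketch (the scenario and all numbers are invented):

```python
# One chance variable (Market), one action (found a company or not),
# and a utility table U(action, market).
P_market = {"good": 0.5, "moderate": 0.3, "poor": 0.2}
utility = {
    ("found", "good"): 4.0,
    ("found", "moderate"): 1.0,
    ("found", "poor"): -7.0,
    ("dont", "good"): 0.0,
    ("dont", "moderate"): 0.0,
    ("dont", "poor"): 0.0,
}

def expected_utility(action):
    # EU(a) = sum over states of P(state) * U(a, state)
    return sum(p * utility[(action, m)] for m, p in P_market.items())

# The MEU principle: choose the action with highest expected utility.
meu_action = max(["found", "dont"], key=expected_utility)
meu = expected_utility(meu_action)
```

With decision rules that depend on observed variables, the same enumeration runs once per observation value, which is what the Week 4 assignment automates.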
Utility Functions (15 mins)
Utility functions are a combination of economics, math, and psychology. We ascribe a numerical utility to different
outcomes. For instance we can resolve the difficult problem of different payoffs and different probabilities. The
“lottery” analysis is used. Utilities are not linear in the amount of payoff (for most people). This discussion was similar
to that held in business school ‘decisions and data’ class.
Discussion about St. Petersburg paradox.
Discussion of certainty equivalents.
Multi-attribute utilities.
Example: pre-natal diagnosis.
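The St. Petersburg paradox and certainty equivalents connect nicely in a few lines. The game pays 2^k with probability 2^-k, so its expected monetary value diverges, but under a logarithmic (risk-averse) utility the expected utility converges; the sketch below truncates the sum at 60 rounds:

```python
import math

# Expected money: each round contributes (1/2)^k * 2^k = 1, so the sum
# grows without bound as more rounds are allowed (here it is just 60).
expected_money = sum((0.5 ** k) * (2 ** k) for k in range(1, 61))

# Expected log-utility converges: sum of (1/2)^k * ln(2^k).
expected_log_utility = sum((0.5 ** k) * math.log(2 ** k) for k in range(1, 61))

# The certainty equivalent is the sure amount with the same utility,
# i.e. exp(E[ln payoff]) -- about 4 monetary units.
certainty_equivalent = math.exp(expected_log_utility)
```

This is the classic illustration of why utility is not linear in payoff: a risk-averse agent would pay only around 4 units to play a game of unbounded expected value.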
Summary:
Value of Perfect Information (17 mins)
But often we want to answer a different type of question, which is what observations should I even make before making
a decision? For example, a doctor encountering a particular patient might have to decide which set of tests to perform
on that patient. Tests are not free, they cause pain to the patient, they come with a risk, and they cost money. So which
ones are worthwhile and which ones are not? The same kind of question comes up in many other scenarios. So for
example, if you're running a sensor network, which sensors should I measure? The sensor might require energy in order
to transmit the information and that may be something that we want to consider carefully.
Example:
The basic approach here was to construct a network, evaluate its optimized MEU, then construct variations that had various test conditions plugged in, and evaluate the resulting MEU. Subtract the baseline from the result to get the value of the new information.
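That recipe can be sketched numerically (the two-state model and all numbers are invented):

```python
# Value of perfect information: MEU with the observation, minus the
# baseline MEU without it.
P_state = {"good": 0.6, "bad": 0.4}
utility = {("act", "good"): 10.0, ("act", "bad"): -8.0,
           ("wait", "good"): 0.0, ("wait", "bad"): 0.0}

def eu(action):
    return sum(p * utility[(action, s)] for s, p in P_state.items())

# Baseline: one action must be chosen before seeing the state.
meu_baseline = max(eu("act"), eu("wait"))

# With perfect information: pick the best action separately for each
# state, then average over the states.
meu_informed = sum(p * max(utility[("act", s)], utility[("wait", s)])
                   for s, p in P_state.items())

vpi = meu_informed - meu_baseline  # always >= 0
```

The non-negativity of VPI is the formal justification for the doctor's question of which tests are worthwhile: information never hurts, but it may be worth less than the test costs.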
Summary:
Quiz (4 questions)
This was not too hard.
Week 4 Honors Programming Assignment
This is about decision theory. The assignment is rated at 20 hours, followed by a quiz with 9 questions. The
background reads:
ARVD is a heart disease that is characterized by progressive replacement of cardiac muscle by adipose (fat) cells and/or
fibrosis (scar tissue), starting in the right ventricle. ARVD patients can have potentially deadly arrhythmias (irregularities in
the heart rate). There is a large heritable component to the disease - up to half of cases are linked to a family history of the
disease. It is a progressive disease, and people with ARVD can remain asymptomatic for a long time without suspecting
that they have this condition. Currently, most patients with ARVD have an implantable cardioverter defibrillator (ICD)
surgically implanted. These devices are quite effective in reducing arrhythmias.
However, ARVD can be difficult to diagnose with certainty. While known genetic risk factors confer susceptibility to developing ARVD, developing it is far from certain. Furthermore, diagnostic tests that directly measure some characteristic of heart function are not perfect, with some false positives and false negatives occurring. Different diagnostic tests also have
function are not perfect, with some false positives and false negatives occurring. Different diagnostic tests also have
different risks - getting some electrodes put on your chest during a stress test has different risks than surgery. How can you
make sense of all this? In this assignment, we will bring the machinery of graphical models and decision theory to bear on
these sorts of problems.
As with week 3, this didn’t involve all that much programming, but it required an understanding of the models and concepts. About 2/3 of the needed code was supplied; the rest consisted of functions that you had to fill in.
Then there was a related 9-question quiz. These questions involved the results of evaluation from various networks that
you construct or model. The most applicable ones were questions where you were to calculate the value of perfect
information by starting with a baseline network (getting its MEU), then adding a link between a test result and a
decision (get the MEU of the optimized network, and subtract the baseline MEU).
Week 4 Checkpoint and Summary (submit by April 23, 2017)
The discussion about influence diagrams was completely on-target. The utility functions lecture was quite good, though
I had seen most of it before in business school classes.
The real knowledge was covered in the programming assignment. These were some of the most computationally complex tasks, and they also had to be numerically correct, as the results were used for questions in the related quiz.
Week 5: Knowledge Engineering and Summary
This module provides an overview of graphical model representations and some of the real-world considerations when
modeling a scenario as a graphical model. It also includes the course final exam.
Knowledge Engineering and Conclusion (23 mins)
Let's take a big step back and figure out how you might put this all together if you actually wanted to build a graphical model for some application that you care about. Let me start by saying that this is really not a science. Like any other design, it's much closer to an art, or even black magic, than a scientific endeavor. So the only thing one can do here is provide hints about how one might go about doing it.
Let's first identify some important distinctions, and then get concrete about particular examples. There are at least three main classes of design choices one needs to make: whether you have a template-based model versus a very specific model over a concrete, fixed set of random variables; whether the model is directed or undirected; and whether it's generative versus discriminative. These are all terms we've seen before, and we'll talk about them in just a moment.
But before we go into the trade-offs between each of these, let me emphasize this last point, which is probably the most critical thing to remember. It's often not the case that you just go in one direction or the other. In many models you will have, for example, template-based pieces as well as pieces that aren't at the template level. You might have directed as well as undirected components, and so on. These are not sharp boundaries, and it's useful to keep that in mind.
Example of template-based: image segmentation
Examples of specific: medical diagnosis.
Fault diagnosis is both. Based on where you are in this spectrum, it changes the way that you approach knowledge
engineering.
On the template-based side, it's really about feature engineering, as opposed to complicated model design; not entirely, but the features turn out to play a very big role. On the specific-model side you usually have a large number of unique variables, because unless you build small models, each variable is going to be unique.
Another distinction is discriminative models versus generative models. The former address a particular prediction task (classification), where I always measure one set of things and predict something else. The latter are more flexible (examples include medical diagnosis) and can be easier to construct.
When a prediction task is better solved by having richly expressive features, modeling it as a discriminative model allows me to avoid dealing with the correlations between those features, which usually gives high performance.
Variable types include Target and Observed, plus Latent. Example of Latent was GMT. These can simplify the model.
Next topic is Structure. You can use causal or non-causal ordering. You can reverse the ordering if that simplifies the
model.
Parameter values:
Watch out for zeros in the CPDs, since they end up excluding (zeroing out) a whole set of related probabilities.
Relative values make a larger difference than absolute values.
In most cases one would use structured CPDs of the forms we’ve discussed, plus other forms. Split them along two axes: discrete vs. continuous, and context-specific vs. aggregating.
Finally, as with other software projects, use iterative refinement: Model testing, adding features, adding dependencies,
and refining values. Have a process to determine errors in the model, and correct.
Final Exam (8 questions)
Get ready, as this exam could cover material from any part of the course.
Week 5 Checkpoint and Summary
There is no programming assignment here. But the exam had some difficult questions. Finally got a perfect score.