Introduction to Probabilistic Models for Computational Biology
Lecture 2 – Oct 3, 2011
CSE 527 Computational Biology, Fall 2011
Instructor: Su-In Lee
TA: Christopher Miles
Monday & Wednesday 12:00-1:20
Johnson Hall (JHN) 022
Review: Gene Regulation
A switch! ("transcription factor binding site")
Gene regulation
DNA: AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC
Gene → RNA (transcription) → Protein (translation); RNA is also subject to degradation.
RNA AUGUGGAUUGUU → protein MWIV
RNA AUGCGCGUC → protein MRV
RNA AUGUUACGCACCUAC → protein MLRTY
RNA AUGAUUGAU → protein MID
"Gene expression"
Genes regulate each other's expression and activity.
Genetic regulatory network
Review: Variations in the DNA
"Single nucleotide polymorphism (SNP)"
DNA: AGATATGTGGATTGTTAGGATTTATGCGCGTCAGTGACTACGCATGTTACGCACCTACGACTAGGTAATGATTGATC
[Figure: the DNA → RNA → protein diagram from the gene regulation review, with single-nucleotide substitutions marked along the DNA, and the affected RNAs (AUGUGGAUUGUU, AUGCGCGUC, AUGUUACGCACCUAC, AUGAUUGAU) and proteins (MWIV, MRV, MLRTY, MID) marked with X]
Sequence variations perturb the regulatory network.
Genetic regulatory network
Outline
Probabilistic models in biology
Model selection problems
Mathematical foundations
Bayesian networks
Reference: Probabilistic Graphical Models: Principles and Techniques, Koller & Friedman, The MIT Press
Learning from data
Maximum likelihood estimation
Expectation and maximization
Example 1
How are a nucleotide change in DNA, blood pressure, and heart disease related?
There can be several “models”…
[Figure: three candidate models, separated by "OR", each connecting the variables DNA alteration, Blood pressure, and Heart disease in a different way]
Example 2
How do genes A, B, and C regulate each other's expression levels (mRNA levels)?
There can be several models…
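[Figure: candidate models over genes A, B, and C, separated by "OR", each with a different set of regulatory edges; the last one is marked "?"]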
Candidate structures over genes A, B, and C: Model I OR Model II OR Model III OR … ?
Data: expression levels of Gene A, Gene B, and Gene C measured in Exp 1, Exp 2, …, Exp N (N instances)
Probabilistic graphical models: a graphical representation of statistical dependencies.
What are the statistical dependencies between the expression levels of genes A, B, and C?
Probability that model x is true given the data
Model selection: argmax_x P(model x is true | Data)
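The selection rule can be sketched in a few lines of Python. This is only an illustration, assuming that log P(Data | model x) has already been computed for each candidate and that the prior over models is uniform; the numbers and the helper name `model_posterior` are hypothetical and do not come from the lecture.

```python
import math

def model_posterior(log_likelihoods, log_priors):
    """Return P(model | Data) for each model via Bayes' rule, in log space."""
    log_joint = {m: log_likelihoods[m] + log_priors[m] for m in log_likelihoods}
    # Log-sum-exp gives a numerically stable normalizer log P(Data).
    peak = max(log_joint.values())
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in log_joint.values()))
    return {m: math.exp(v - log_evidence) for m, v in log_joint.items()}

# Hypothetical log P(Data | model x) values (illustration only).
log_likelihoods = {"Model I": -104.2, "Model II": -101.7, "Model III": -103.0}
# Uniform prior over the three models (an assumption).
log_priors = {m: math.log(1 / 3) for m in log_likelihoods}

posterior = model_posterior(log_likelihoods, log_priors)
best_model = max(posterior, key=posterior.get)  # argmax_x P(model x | Data)
print(best_model, posterior)
```

With a uniform prior the argmax is simply the model with the highest likelihood; a non-uniform prior would shift the posterior accordingly.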
Probability Theory Review
Assume random variables A and B with Val(A) = {a1, a2, a3} and Val(B) = {b1, b2}
Conditional probability
Definition
Chain rule
Bayes’ rule
Probabilistic independence
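For reference, the standard textbook forms of the four items above are written out below. The slide's own equations did not survive extraction, so the notation here is an assumption rather than a copy of the original.

```latex
\begin{align*}
P(A \mid B) &= \frac{P(A, B)}{P(B)}, \quad P(B) > 0 && \text{(definition of conditional probability)} \\
P(A, B) &= P(A)\, P(B \mid A) && \text{(chain rule)} \\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)} && \text{(Bayes' rule)} \\
A \perp B &\iff P(A, B) = P(A)\, P(B) && \text{(independence)}
\end{align*}
```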
Probabilistic Representation
Joint distribution P over {x1,…, xn}
xi is binary
2^n - 1 entries
If x’s are independent
P(x) = p(x1) … p(xn)
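A small sketch of how the two representations scale, assuming binary variables as on the slide. The function names are hypothetical, chosen for this illustration.

```python
def joint_table_parameters(n):
    """Independent parameters in a full joint table over n binary variables."""
    return 2 ** n - 1  # 2^n entries, minus one because they must sum to 1

def independent_parameters(n):
    """Independent parameters when the n binary variables are mutually independent."""
    return n  # one Bernoulli parameter p(x_i = 1) per variable

for n in (3, 10, 20):
    print(n, joint_table_parameters(n), independent_parameters(n))
```

The full table grows exponentially in n, while the fully independent model grows linearly; the models in the rest of the lecture sit between these two extremes.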
Conditional Parameterization
The Diabetes example
Genetic risk (G), Diabetes (D)
Val (G) = {g1,g0}, Val (D) = {d1,d0}
P(G,D) = P(G) P(D|G)
P(G): Prior distribution
P(D|G): Conditional probability distribution (CPD)
[Figure: Genetic risk → Diabetes]
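The conditional parameterization can be made concrete with a short Python sketch. The probability values below are hypothetical, chosen only so the arithmetic is easy to follow; they are not from the lecture.

```python
# Hypothetical numbers for illustration only.
p_g = {"g1": 0.2, "g0": 0.8}              # prior P(G)
p_d_given_g = {                           # CPD P(D | G)
    "g1": {"d1": 0.3, "d0": 0.7},
    "g0": {"d1": 0.05, "d0": 0.95},
}

# Joint via the chain rule: P(G, D) = P(G) P(D | G)
joint = {(g, d): p_g[g] * p_d_given_g[g][d] for g in p_g for d in ("d1", "d0")}

# Bayes' rule: P(G = g1 | D = d1) = P(g1, d1) / P(d1)
p_d1 = sum(joint[(g, "d1")] for g in p_g)
p_g1_given_d1 = joint[("g1", "d1")] / p_d1
print(joint, p_g1_given_d1)
```

Only three numbers are specified (one for the prior, one per row of the CPD), yet they determine all four joint entries.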
Naïve Bayes Model - Example
Elaborating the diabetes example,
Genetic Risk (G), Diabetes (D), Hypertension (H)
Val (G) = {g1,g0}, Val (D) = {d1,d0}, Val (H) = {h1,h0}
Full joint P(G,D,H): 8 entries (7 independent parameters)
If D and H are independent given G,
P(G,D,H) = P(G)P(D|G)P(H|G)
5 independent parameters; more compact than the joint
[Figure: Genetic risk → Diabetes, Genetic risk → Hypertension]
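As a sanity check on the factorization P(G,D,H) = P(G)P(D|G)P(H|G), the sketch below builds the 8-entry joint from 5 parameters and verifies numerically that D and H are conditionally independent given G. All probability values are hypothetical, and the helper names are made up for this illustration.

```python
from itertools import product

# Hypothetical CPD values for illustration only.
p_g = {1: 0.2, 0: 0.8}             # P(G)
p_d1_given_g = {1: 0.3, 0: 0.05}   # P(D = 1 | G)
p_h1_given_g = {1: 0.5, 0: 0.2}    # P(H = 1 | G)

def bernoulli(p_one, value):
    """P(X = value) for a binary X with P(X = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

# P(G, D, H) = P(G) P(D | G) P(H | G)
joint = {
    (g, d, h): p_g[g] * bernoulli(p_d1_given_g[g], d) * bernoulli(p_h1_given_g[g], h)
    for g, d, h in product((0, 1), repeat=3)
}

def conditionally_independent(joint, tol=1e-12):
    """Check P(D, H | G) == P(D | G) P(H | G) for every assignment."""
    for g in (0, 1):
        p_g_marg = sum(joint[(g, d, h)] for d in (0, 1) for h in (0, 1))
        for d, h in product((0, 1), repeat=2):
            p_dh = joint[(g, d, h)] / p_g_marg
            p_d = sum(joint[(g, d, hh)] for hh in (0, 1)) / p_g_marg
            p_h = sum(joint[(g, dd, h)] for dd in (0, 1)) / p_g_marg
            if abs(p_dh - p_d * p_h) > tol:
                return False
    return True

n_free_parameters = 1 + 2 + 2  # P(G): 1, P(D | G): 2, P(H | G): 2
print(len(joint), n_free_parameters, conditionally_independent(joint))
```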
Naïve Bayes Model
A class C where Val (C) = {c1,…,ck}.
Finding variables x1,…,xn
Naïve Bayes assumption
The findings are conditionally independent given the
individual’s class.
The model factorizes as: P(C, x1,…,xn) = P(C) P(x1|C) … P(xn|C)
The Diabetes example
class: Genetic risk, findings: Diabetes, Hypertension
Naïve Bayes Model - Example
Medical diagnosis system
Class C: disease
Findings X: symptoms
Computing the confidence:
Drawbacks
Strong assumptions
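One way to compute such a confidence is the posterior P(C | x1,…,xn), or the ratio of posteriors between two classes. The sketch below assumes binary findings and probabilities strictly between 0 and 1; the disease names, symptom probabilities, and the function `naive_bayes_posterior` are hypothetical and not taken from the lecture.

```python
import math

def naive_bayes_posterior(prior, likelihoods, findings):
    """P(C | x1..xn) under the naive Bayes assumption.

    prior:       {class: P(C = class)}
    likelihoods: {class: [P(x_i = 1 | C = class) for each finding i]}
    findings:    observed 0/1 values, one per finding
    """
    log_score = {}
    for c, p_c in prior.items():
        total = math.log(p_c)
        for p_one, x in zip(likelihoods[c], findings):
            total += math.log(p_one if x == 1 else 1.0 - p_one)
        log_score[c] = total
    # Normalize in a numerically stable way.
    peak = max(log_score.values())
    unnorm = {c: math.exp(s - peak) for c, s in log_score.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# Hypothetical classes and symptom probabilities (findings: fever, cough).
prior = {"flu": 0.1, "healthy": 0.9}
likelihoods = {"flu": [0.9, 0.8], "healthy": [0.1, 0.2]}

posterior = naive_bayes_posterior(prior, likelihoods, findings=[1, 1])
confidence_ratio = posterior["flu"] / posterior["healthy"]
print(posterior, confidence_ratio)
```

Because each finding contributes an independent factor, correlated symptoms get counted more than once, which is the practical cost of the strong independence assumption.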
Bayesian Network
Directed acyclic graph (DAG)
Node: a random variable
Edge: direct influence of one node on another
The Diabetes example revisited
Genetic risk (G), Diabetes (D), Hypertension (H)
Val (G) = {g1,g0}, Val (D) = {d1,d0}, Val (H) = {h1,h0}
[Figure: Bayesian network over Genetic risk, Diabetes, and Hypertension]
Bayesian Network Semantics
A Bayesian network structure G is a directed acyclic graph
whose nodes represent random variables X1,…,Xn.
Pa_Xi: parents of Xi in G
NonDescendants_Xi: variables in G that are not descendants of Xi.
G encodes the following set of conditional independence
assumptions, called the local Markov assumptions, and
denoted by I_L(G):
For each variable Xi: (Xi ⊥ NonDescendants_Xi | Pa_Xi)
[Figure: example DAG over nodes x1,…,x11]
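The local Markov assumptions can be read off a graph mechanically. The sketch below does this for a DAG given as a map from each node to its parents. The structure G → D, G → H for the diabetes example is an assumption (it matches the naive Bayes factorization), and the function names are hypothetical.

```python
def descendants(parents, node):
    """All nodes reachable from `node` by following parent -> child edges."""
    children = {v: [c for c, ps in parents.items() if v in ps] for v in parents}
    seen, stack = set(), list(children[node])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(children[v])
    return seen

def local_markov_assumptions(parents):
    """Map each X to (non-descendants of X excluding its parents, parents of X).

    Each entry reads: X is independent of the first set given the second.
    Dropping the parents from the first set does not change the statement.
    """
    result = {}
    for x, pa in parents.items():
        non_desc = set(parents) - descendants(parents, x) - {x} - set(pa)
        result[x] = (non_desc, set(pa))
    return result

# Assumed structure for the diabetes example: G -> D, G -> H.
parents = {"G": [], "D": ["G"], "H": ["G"]}
assumptions = local_markov_assumptions(parents)
for x, (indep_of, given) in assumptions.items():
    print(f"({x} ⊥ {sorted(indep_of)} | {sorted(given)})")
```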
The Genetics Example
Variables
B: blood type (a phenotype)
G: genotype of the gene that encodes a person’s blood
type; <A,A>, <A,B>, <A,O>, <B,B>, <B,O>, <O,O>
Bayesian Network Joint Distribution
Let G be a Bayesian network graph over the variables
X1,…,Xn. We say that a distribution P factorizes according
to G if P can be expressed as:
P(X1,…,Xn) = ∏_{i=1..n} P(Xi | Pa_Xi)
A Bayesian network is a pair (G,P) where P factorizes over
G, and where P is specified as a set of CPDs associated
with G’s nodes.
The Student Example
More complex scenario
Course difficulty (D), quality of the recommendation
letter (L), Intelligence (I), SAT (S), Grade (G)
Val(D) = {easy, hard}, Val(L) = {strong, weak},
Val(I) = {i1,i0}, Val (S) = {s1,s0}, Val (G) = {g1,g2,g3}
The joint distribution has 2·2·2·2·3 = 48 entries, i.e., 47 independent parameters
The Student Bayesian network
Joint distribution
P(I,D,G,S,L) = P(I) P(D) P(G|I,D) P(S|I) P(L|G)
from Koller & Friedman
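To see how much the network structure saves, the sketch below counts independent parameters for the full joint and for the factorized form. It assumes the student network structure from Koller & Friedman's textbook (D and I have no parents, G depends on I and D, S on I, L on G); the function names are hypothetical.

```python
from math import prod

# Number of values per variable, as listed for the student example.
cardinality = {"D": 2, "I": 2, "G": 3, "S": 2, "L": 2}

# Parent sets of the student network (structure from Koller & Friedman's textbook).
parents = {"D": [], "I": [], "G": ["I", "D"], "S": ["I"], "L": ["G"]}

def full_joint_parameters(cardinality):
    """Independent parameters of an unconstrained joint table."""
    return prod(cardinality.values()) - 1

def network_parameters(cardinality, parents):
    """Independent parameters summed over the CPDs P(X | Pa_X)."""
    total = 0
    for x, pa in parents.items():
        n_parent_configs = prod(cardinality[p] for p in pa)  # empty product is 1
        total += (cardinality[x] - 1) * n_parent_configs
    return total

print(full_joint_parameters(cardinality), network_parameters(cardinality, parents))
```

Under these assumptions the factorized network needs 15 independent parameters instead of 47.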
Parameter Estimation
Assumptions
Fixed network structure
Fully observed instances of the network variables: D={d[1],…,d[M]}
For example, {i0,d1,g1,l0,s0}
Maximum likelihood estimation (MLE)!
[Figure from Koller & Friedman: the student network with its CPDs, the "parameters" of the Bayesian network]
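With a fixed structure and fully observed data, the maximum likelihood estimate of each table CPD reduces to relative frequencies: count(parents, child) / count(parents). The sketch below illustrates this for P(S | I) and P(I); the data set is made up for the example, and `mle_cpd` is a hypothetical helper name.

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """Estimate P(child | parents) from fully observed instances by counting.

    Returns {(parent_values, child_value): probability}.
    """
    joint_counts = Counter()
    parent_counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint_counts[(pa, row[child])] += 1
        parent_counts[pa] += 1
    return {key: n / parent_counts[key[0]] for key, n in joint_counts.items()}

# Made-up, fully observed instances of I and S (illustration only).
data = [
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s0"},
    {"I": "i0", "S": "s1"},
    {"I": "i1", "S": "s1"},
    {"I": "i1", "S": "s1"},
    {"I": "i1", "S": "s0"},
    {"I": "i1", "S": "s1"},
]

cpd_s = mle_cpd(data, "S", ["I"])  # P(S | I)
cpd_i = mle_cpd(data, "I", [])     # P(I), a node with no parents
print(cpd_s, cpd_i)
```

Parent configurations that never occur in the data get no estimate at all in this sketch, which is one reason smoothing or priors are often added in practice.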
Acknowledgement
Profs Daphne Koller & Nir Friedman,
“Probabilistic Graphical Models”