Weisstein et al. A Hands-on Introduction to Hidden Markov Models
HMM Homework
1. Draw a state diagram for an HMM that will scan an un-annotated genome for
genes with a start codon, a stop codon, exons, introns, a 5’ splice site and a 3’
splice site.
a. State Diagram:
Answer:
b. What are the state types of this HMM?
Answer: {Start, Exon, 5’SS, Intron, 3’SS, Stop}
2. Consider the following simple state machine:
a. Using the state machine above, calculate the probability of each of the
following state paths:
Sequence:          A  C  G  C  G
Path 1:     Start  E  E  E  5  I  End
Path 2:     Start  E  E  5  I  I  End
Path 3:     Start  E  5  I  I  I  End
(E = exon, 5 = 5’ splice site, I = intron)
For each state path, multiply the transition and emission
probabilities together to get the probability of that individual state
path.
P1 = (1.0 × 0.9²) × 0.1 × 1.0 × 0.1 × (0.25³) × 0.95 × 0.4
P2 = 1.0 × 0.9 × 0.1 × 1.0 × 0.9 × 0.1 × 0.25² × 0 × 0.1 × 0.4
P3 = 1.0 × 0.1 × 1.0 × 0.9 × 0.9 × 0.1 × 0.25 × 0.95 × 0.4 × 0.1 × 0.4
P1 ≈ 4.81 × 10⁻⁵
P2 = 0
P3 ≈ 3.08 × 10⁻⁵
b. Specifically, which state path is the most likely to annotate the sequence
correctly?
State path 1 is the most likely because it has the highest
probability.
c. Note that each state path has the 5’ splice site at a different position in the
sequence. Which 5’ splice site is likely to be the correct one?
L(P1) = P1 / (P1 + P2 + P3) = (4.81 × 10⁻⁵) / (7.89 × 10⁻⁵) ≈ 0.610
L(P2) = P2 / (P1 + P2 + P3) = 0 / (7.89 × 10⁻⁵) = 0
L(P3) = P3 / (P1 + P2 + P3) = (3.08 × 10⁻⁵) / (7.89 × 10⁻⁵) ≈ 0.390
Path 1 is likely to contain the correct annotation for the 5’ splice site
because it has the highest likelihood.
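The arithmetic in parts a through c can be checked with a short script. This is only a sketch: the factor lists are copied from the expressions for P1, P2 and P3 above (the exponents are written out as repeated factors), and the names path_factors, path_probabilities and normalize are made up for illustration.

```python
from math import prod

# Factors copied from the worked expressions for each state path:
# transition probabilities first, then emission probabilities.
path_factors = {
    "Path 1": [1.0, 0.9, 0.9, 0.1, 1.0, 0.1, 0.25, 0.25, 0.25, 0.95, 0.4],
    "Path 2": [1.0, 0.9, 0.1, 1.0, 0.9, 0.1, 0.25, 0.25, 0.0, 0.1, 0.4],
    "Path 3": [1.0, 0.1, 1.0, 0.9, 0.9, 0.1, 0.25, 0.95, 0.4, 0.1, 0.4],
}


def path_probabilities(factors):
    """Multiply the factors of each path to get its probability."""
    return {name: prod(values) for name, values in factors.items()}


def normalize(probs):
    """Divide each path probability by the sum over all paths."""
    total = sum(probs.values())
    return {name: p / total for name, p in probs.items()}


probs = path_probabilities(path_factors)
likelihoods = normalize(probs)
for name in probs:
    print(f"{name}: P = {probs[name]:.4g}, L = {likelihoods[name]:.3f}")
```

Normalizing only over these three paths treats them as the only candidate annotations, which is the same simplification the exercise makes.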
d. Given what you know about splice sites in eukaryotic genes, does
anything in this model seem problematic?
It assumes that eukaryotic genes have only one splice site. Genes
in eukaryotes have two splice sites per intron. It also assumes that
splice sites only consist of one base. Splice sites in eukaryotes
typically consist of two bases (GT, AG).
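To make the two-base point concrete, the sketch below lists every GT and AG dinucleotide in a sequence as candidate donor and acceptor sites. The example sequence and the helper name find_dinucleotide are hypothetical and not part of the homework; a real gene finder would score these candidates rather than accept all of them.

```python
def find_dinucleotide(sequence, motif):
    """Return the 0-based start position of every occurrence of motif."""
    sequence = sequence.upper()
    width = len(motif)
    return [i for i in range(len(sequence) - width + 1)
            if sequence[i:i + width] == motif]


# Hypothetical sequence, used only for illustration.
example = "ATGGCCGTAAGTTTCAGGCTAA"
donor_candidates = find_dinucleotide(example, "GT")     # possible 5' splice sites
acceptor_candidates = find_dinucleotide(example, "AG")  # possible 3' splice sites
print("GT positions:", donor_candidates)
print("AG positions:", acceptor_candidates)
```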
3. HMMs assume that each base is an independent observation. For instance, they
assume that the probability of the ith base being in an intron does not affect
whether the (i+1)th base is in an intron. Give an example of a genomic feature
where this assumption does not hold. Explain your reasoning.
RNA secondary structure is one example. An HMM treats each base as
depending only on the current state, and each state only on the
previous state, so it cannot capture correlations between distant
positions. In a folded RNA, a base pairs with a complementary base
that may lie far away in the sequence, so the identity of one base
constrains the identity of its distant partner. (Students might
only answer this correctly if they have read the Eddy paper.)
4. For the next exercise, we will use the Excel spreadsheet provided by Anton
Weisstein to explore properties of Hidden Markov Models. Please read the
instructions for using this spreadsheet, called “HMM Manual.pdf.” After reading it,
open the Excel file HiddenMarkovModel.intron.xls. Before changing any of the
sliders on the spreadsheet, save a copy of the file under a different name. Then,
answer the following questions:
a. Holding the transition probabilities from exon to splice site constant at 0.4,
0.6, 0.0, 0.0 (first slider), change the transition probabilities from intron to
splice site by sliding the second slider. How does this affect the likelihood
profile of the 5’ splice site position?
Decreasing the probability of a transition from an intron to another
intron decreases the likelihood of the first splice site, and increases
the likelihood of the second splice site. Increasing the probability of
a transition from an intron to another intron increases the likelihood
of the first 5’ splice site and decreases the likelihood of the second
5’ splice site.
b. Holding the transition probabilities from intron to splice site constant at 0.0,
0.0, 0.9 and 0.1 (second slider), change the transition probabilities from
exon to splice site by sliding the first slider. How does this affect the
likelihood profile of the 5’ splice site position?
Decreasing the probability of a transition from an exon to a splice
site increases the likelihood of the first 5’ splice site. Increasing the
probability of a transition from an exon to a splice site decreases
the likelihood of the first 5’ splice site and increases the likelihood of
the second 5’ splice site.
c. If your HMM predicted that both splice sites were equally likely, how could
you use RNA-Seq data to reconcile the discrepancy between splice sites?
Since RNA-Seq reads can be mapped back to the genome, introns
would appear as regions of lower read coverage between the
exons. A splice junction supported by both the HMM prediction and
the RNA-Seq data is more reliable than one supported by the
prediction alone. (This is supposed to get the students to
realize that sometimes external information is necessary to make
sense of gene predictions.)
5. Consider the notion that transition probabilities in an evidence-based gene
prediction algorithm can be calculated based on the length of a genetic feature
(exons, introns, etc). That is, transition probabilities are a function of average
gene length. If an HMM were trained using the genome of an organism with
many short genes and few long genes, would you expect this HMM to predict
more long genes or more short genes on an un-annotated genome? Why?
We would expect it to predict mostly short genes because the
transition probabilities reflect gene length. If most of the genes are
short, then this will skew the transition probabilities such that long
genes will never be emitted by the HMM. Remember, evidence-based gene prediction methods do not “know” anything about the
genome they are annotating. They can only make judgments about
states using information gleaned from the training data (in this
case, the transition probabilities).
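The link between feature length and transition probabilities can be sketched numerically. In a plain HMM, a state that returns to itself with probability p lasts a geometrically distributed number of positions, with mean 1 / (1 − p). The snippet below assumes a single self-looping state and a hypothetical mean feature length of 100 bases; the function names are illustrative only.

```python
def self_transition_from_mean_length(mean_length):
    """Self-transition probability whose geometric duration has this mean."""
    if mean_length < 1:
        raise ValueError("mean length must be at least 1")
    return 1.0 - 1.0 / mean_length


def prob_length_at_least(p_stay, n):
    """P(duration >= n) for a state that repeats with probability p_stay."""
    return p_stay ** (n - 1)


# Hypothetical training set dominated by short features (mean 100 bases).
short_mean = 100
p_stay = self_transition_from_mean_length(short_mean)
for length in (100, 1000, 5000):
    print(length, prob_length_at_least(p_stay, length))
```

Under these assumptions the probability of staying in the state falls off exponentially with length, which is why a model trained mostly on short genes assigns very low probability to long ones.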