Weisstein et al. A Hands-on Introduction to Hidden Markov Models

HMM Homework

1. Draw a state diagram for an HMM that will scan an un-annotated genome for genes with a start codon, a stop codon, exons, introns, a 5’ splice site and a 3’ splice site.

a. State diagram:

Answer: [state diagram figure]

b. What are the nucleotide types (states) of this HMM?

Answer: {Start, Exon, 5’SS, Intron, 3’SS, Stop}

2. Consider the following simple state machine:

[state machine figure]

a. Using the state machine above, calculate the probability of each of the following state paths:

Base    Path 1   Path 2   Path 3
        Start    Start    Start
A       E        E        E
C       E        E        5
G       E        5        I
C       5        I        I
G       I        I        I
        End      End      End

(E = exon, 5 = 5’ splice site, I = intron)

For each state path, multiply the emission and transition probabilities together to get the probability of that path.

P1 = (1.0 × 0.9^2) × 0.1 × 1.0 × 0.1 × (0.25^3) × 0.95 × 0.4
P2 = 1.0 × 0.9 × 0.1 × 1.0 × 0.9 × 0.1 × 0.25^2 × 0 × 0.1 × 0.4
P3 = 1.0 × 0.1 × 1.0 × 0.9 × 0.9 × 0.1 × 0.25 × 0.95 × 0.4 × 0.1 × 0.4

P1 ≈ 0.00004809
P2 = 0
P3 ≈ 0.00003078

b. Specifically, which state path is the most likely to annotate the sequence correctly?

State path 1 is the most likely because it has the highest probability.

c. Note that each state path has the 5’ splice site at a different position in the sequence. Which 5’ splice site is likely to be the correct one?

L(P1) = 0.00004809 / (0.00004809 + 0 + 0.00003078) ≈ 0.610
L(P2) = 0 / (0.00004809 + 0 + 0.00003078) = 0
L(P3) = 0.00003078 / (0.00004809 + 0 + 0.00003078) ≈ 0.390

Path 1 is likely to contain the correct annotation for the 5’ splice site because it has the highest likelihood.

d. Given what you know about splice sites in eukaryotic genes, does anything in this model seem problematic?

It assumes that eukaryotic genes have only one splice site. Genes in eukaryotes have two splice sites per intron. It also assumes that splice sites only consist of one base.
Splice sites in eukaryotes typically consist of two bases (GT, AG).

3. HMMs assume that each base is an independent observation. For instance, they assume that the probability of the ith base being in an intron does not affect whether the (i+1)th base is in an intron. Give an example of a genomic feature where this assumption does not hold. Explain your reasoning.

RNA secondary structure cannot be predicted by HMMs. An HMM scores each base using only its own hidden state, so it cannot represent a dependence between two bases that are far apart in the sequence. In reality, a base in an RNA sequence can pair with a complementary base anywhere along the RNA, so the identity of one base constrains the identity of its distant partner. (Students might only answer this correctly if they have read the Eddy paper.)

4. For the next exercise, we will use the Excel spreadsheet provided by Anton Weisstein to explore properties of Hidden Markov Models. Please read the instructions for using this spreadsheet, called “HMM Manual.pdf.” After reading it, open the Excel file HiddenMarkovModel.intron.xls. Before changing any of the sliders on the spreadsheet, save a copy of the file under a different name. Then, answer the following questions:

a. Holding the transition probabilities from exon to splice site constant at 0.4, 0.6, 0.0, 0.0 (first slider), change the transition probabilities from intron to splice site by sliding the second slider. How does this affect the likelihood profile of the 5’ splice site position?

Decreasing the probability of a transition from an intron to another intron decreases the likelihood of the first 5’ splice site and increases the likelihood of the second 5’ splice site. Increasing the probability of a transition from an intron to another intron increases the likelihood of the first 5’ splice site and decreases the likelihood of the second 5’ splice site.

b. Holding the transition probabilities from intron to splice site constant at 0.0, 0.0, 0.9 and 0.1 (second slider), change the transition probabilities from exon to splice site by sliding the first slider. How does this affect the likelihood profile of the 5’ splice site position?

Decreasing the probability of a transition from an exon to a splice site increases the likelihood of the first 5’ splice site. Increasing the probability of a transition from an exon to a splice site decreases the likelihood of the first 5’ splice site and increases the likelihood of the second 5’ splice site.

c. If your HMM predicted that both splice sites were equally likely, how could you use RNA-Seq data to reconcile the discrepancy between splice sites?

Since RNA-Seq allows one to map cDNA back to the genome, gaps between exons would be represented by lower peaks between annotated exons. The splice junction supported by both the RNA-Seq data and the HMM is more reliable than the prediction itself because there is RNA-Seq data to support it. (This is supposed to get the students to realize that sometimes external information is necessary to make sense of gene predictions.)

5. Consider the notion that transition probabilities in an evidence-based gene prediction algorithm can be calculated based on the length of a genetic feature (exons, introns, etc.). That is, transition probabilities are a function of average gene length. If an HMM were trained using the genome of an organism with many short genes and few long genes, would you expect this HMM to predict more long genes or more short genes on an un-annotated genome? Why?

We would expect it to predict mostly short genes because the transition probabilities reflect gene length. If most of the genes are short, then this will skew the transition probabilities such that long genes will almost never be emitted by the HMM.
Remember, evidence-based gene prediction methods do not “know” anything about the genome they are annotating. They can only make judgments about states using information gleaned from the training data (in this case, the transition probabilities).
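
The path arithmetic in Question 2a can be checked with a short script. The sketch below is not part of the original homework. The transition and emission values are assumptions read off the factors in the worked products for P1, P2 and P3 (the state machine figure itself is not reproduced here), and the names TRANSITIONS, EMISSIONS, path_probability and normalize are hypothetical. It multiplies the transition and emission probabilities along each state path for the sequence ACGCG, then divides each result by the sum over the three paths, as in Question 2c.

```python
# Assumed transition probabilities, inferred from the worked products in Question 2a.
TRANSITIONS = {
    ("Start", "E"): 1.0,
    ("E", "E"): 0.9,
    ("E", "5"): 0.1,
    ("5", "I"): 1.0,
    ("I", "I"): 0.9,
    ("I", "End"): 0.1,
}

# Assumed emission probabilities; only the entries that appear in the
# worked products are listed (E = exon, 5 = 5' splice site, I = intron).
EMISSIONS = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25},
    "5": {"C": 0.95, "G": 0.0},
    "I": {"C": 0.1, "G": 0.4},
}


def path_probability(sequence, states):
    """Joint probability of the sequence and one state path, Start to End."""
    if len(sequence) != len(states):
        raise ValueError("sequence and state path must have the same length")
    prob = 1.0
    previous = "Start"
    for base, state in zip(sequence, states):
        prob *= TRANSITIONS[(previous, state)]  # move into the state
        prob *= EMISSIONS[state][base]          # emit the observed base
        previous = state
    return prob * TRANSITIONS[(previous, "End")]


def normalize(probs):
    """Divide each probability by the total, as in Question 2c."""
    total = sum(probs.values())
    return {name: p / total for name, p in probs.items()}


SEQUENCE = "ACGCG"
PATHS = {"Path 1": "EEE5I", "Path 2": "EE5II", "Path 3": "E5III"}

path_probs = {name: path_probability(SEQUENCE, s) for name, s in PATHS.items()}
likelihoods = normalize(path_probs)

for name in PATHS:
    print(f"{name}: P = {path_probs[name]:.4e}, share = {likelihoods[name]:.3f}")
```

Under these assumed values, Path 1 receives the largest share and Path 2 drops out because its splice-site emission probability is 0. Changing an entry in TRANSITIONS or EMISSIONS and rerunning shows how the ranking responds, much like moving the sliders in Question 4.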