APAM 1080 Extended Syllabus, Spring 2015
Professor Charles Lawrence
Office hours: Tuesday 3:30-4:30 and Wednesday 11 AM to noon; also posted on the AM web site.

I. Introduction (week 1)
   a. Assignment: Review probability and mathematical statistics from AM165
   b. The genome and DNA
      i. Sequencing of the genome as the 1st killer application
         1. A new research paradigm in biology
         2. The hypothesis-driven paradigm
         3. The genome as the 800-pound gorilla
      ii. Fundamentals of DNA structure and function
         1. The genome is not enough
            a. Variables of greatest interest still unknown
            b. Inference of unknown variables
         2. The central dogma
         3. Fundamentals of DNA structure
         4. Heterogeneous organization of the genome
         5. Repeated sequences in genomes
   c. Fundamental concepts of inference
      i. Definition of inference
      ii. Inference as a daily activity
      iii. Inference of unobserved variables is inherently uncertain
         1. Probability: the language for describing uncertainty
         2. A coins example
Quiz 5%

II. Basics of probability and statistics
   a. Probabilistic models (week 2; may vary depending on quiz results)
      i. Review of the fundamentals of probability theory
         1. Probabilities of events
            a. Sample space of events
            b. Mutually exclusive events
            c. Conditional probability and independence
            d. Two laws of probability
            e. Total probability and Bayes rule
         2. Random variables (RVs)
            a. Functional assignment of real numbers to events
            b. Multivariate distributions of RVs
               i. Joint, conditional
               ii. Marginalization
         3. Expected values
      ii. Review: Some basic probabilistic models
         1. Binomial/multinomial
         2. Beta/Dirichlet
         3. Poisson
         4. Gamma
            a. Chi-square
            b. Inverse chi-square
         5. Normal
            a. Multivariate normal
   b. Statistical inference
      i. Review: Fundamentals of statistical inference
         1. Definitions
         2. Probability theory to deal with the inherent uncertainty in inference
            a. The likelihood
            b. Inference as reverse engineering
               i. Two major paradigms
                  1. The sampling distribution and frequentist concepts of behavior on repeated samples
                  2. The Bayesian concept of a probability distribution of the unknowns for given data
      ii. Review: Frequentist estimation (week 3)
         1. Example: estimating the proportion with blue eyes
         2. Data: a random sample from a population
            a. One of many possible samples
         3. An estimator as a procedure to find point estimates
            a. Statistics as functions of the data
            b. Distribution of a statistic
               i. Statistic: a function of the data
               ii. Given an independent random sample of N students, derive the probability that K = k of them will have blue eyes if the proportion with blue eyes is p
               iii. Which of the numbers in your formula is an RV?
               iv. What is your estimate of p, p^hat?
                  1. Is p^hat the same as p?
                  2. Is p^hat an RV?
                  3. How would p^hat behave on repeated samples?
      iii. Review: Maximum likelihood estimation (MLE)
         1. Maximum likelihood principle
            a. Define a maximum likelihood estimate
               i. In your own words
               ii. Mathematically, for the binomial used for blue eyes
               iii. In general, for any likelihood
               iv. Show that the MLE of p in the binomial is p^hat
                  1. Hint: take logs
         2. Characteristics of estimators
            a. Unbiasedness
               i. Minimum variance unbiased estimators
            b. Efficiency
            c. Consistency
            d. Minimum squared error loss estimators
               i. Squared bias and variance
               ii. Bias/variance tradeoff
         3. Properties of MLEs
            a. Consistent, asymptotically unbiased and normally distributed
               i. Asymptotics depend on the sample growing large compared to the number of unknowns
      iv. Bayesian inference (weeks 4-5)
         1. Likelihood * prior: P(K=k|p,n) f(p|alpha,beta) = f(K,p|n,alpha,beta)
            a. For the binomial
            b. What are the RVs?
         2. Bayes rule and the posterior
            a. What are we trying to infer?
            b. Binomial likelihood
            c. Beta prior
               i. Conjugate priors, general case
         3. Derive the posterior distribution for p
            a. Bayesian inference for binomial emissions
               i. Binomial likelihood
               ii. Beta prior
               iii. Derive the posterior
            b. Hierarchical model
               i. Capturing similarities and differences in multiple coins
               ii. Validation example
         4. MCMC algorithms
            a. Gibbs sampling
            b. Metropolis-Hastings algorithm
In-class exam 20%

III. Modeling sequences (weeks 6-8)
   a. Probabilistic models for sequences
      i. Markov models
         1. DNA composition example
         2. Markov chain
            a. Conditional independence
            b. Recursion
      ii. Hidden Markov models (HMMs)
         1. Heterogeneity in DNA composition example
            a. Yeast promoter example
            b. Two dice example
         2. Generative hidden Markov models
            a. Hidden state model
               i. Markov transition model
                  1. Geometric length assumption
               ii. Emission models
                  1. Categorical
                  2. Discrete
                  3. Continuous
   b. Inference with hidden variables and HMMs
      i. HMM algorithms
         1. Derive: forward algorithm
         2. Derive: back-sampling algorithm
         3. Backward algorithm
      ii. Other examples of hidden variables
         1. Alignment
            a. Indices in each sequence that correspond
         2. RNA secondary structure
            a. Indices of bases that form base pairs
      iii. Estimation of unknown parameters
         1. States known
            a. MLEs
            b. Bayesian inference
         2. States unknown
            a. Gibbs sampling
               i. Sample states
               ii. Sample from parameter distributions
            b. EM algorithm for HMMs
               i. Expectation step
               ii. Maximization step
               iii. EM theory
      iv. Other emission models
         1. Normal emissions
            a. Geoscience proxy example
         2. Poisson emissions
            a. High-throughput sequencing example
Project report 25%
   c. Change point (CP) algorithm (week 9)
      i. The challenge
      ii. Examples
         1. Coins
         2. Sequence composition
         3. Paleoclimate problem
      iii. Identify the conditional independence nature of CP
      iv. Number of change points and the combinatorial prior
      v. Forward algorithm
      vi. Inference of the number of states
      vii. Back sampling
      viii. Inference of change points
   d. Tree models (week 10)
      i. Phylogeny
      ii. Trees and conditional independence
      iii. Generative model
      iv. Upward algorithm
      v. Sampling algorithm
      vi. Gibbs sampling and EM
   e. The two sides of genomic inference (week 11)
      i. Discrete high-D unknowns
         1. Curse of dimensionality
         2. Application of decision theory in genomics
      ii. Population parameters
         1. Asymptotics come to bear
         2. With so much data, the frequentist model of repeated samples is attainable
            a. Bootstrap and other resampling approaches
Take-home exam 15% (combine with review paper?)

IV. Hypothesis testing (weeks 12-14)
   a. Basic concepts
      i. Frequentist error types
         1. Type I and p-values
         2. Type II and power
      ii. Bayesian evidence
   b. Multiple comparisons in high-D settings
      i. Family-wise p-values
      ii. FDR
         1. Expecting some false positives
            a. Controlling the proportion above the critical value
         2. q-values
         3. Storey & Tibshirani, PNAS (2003) 100: 9440-9445
      iii. Local fdr, a Bayesian approach
         1. Using high-D data to create an empirical density distribution of p-values
         2. Robust assessment of departures from the uniform distribution of p-values
            a. Z-transform
            b. Estimating mu and sigma from the core distribution
            c. Local fdr and the Bayesian posterior
         3. Genomics as a discovery science
            a. An observational science
            b. Untoward effects in observational science
               i. Confounding
               ii. Unseen correlation
         4. Efron, JASA (2004) 99: 96-104
      iv. Ridge regression and LASSO
         1. Regularizer
            a. Bayesian prior
            b. Penalty on high-D estimates
         2. Optimization
            a. Max{likelihood - regularizer (penalty)}
         3. Tibshirani, JRSS-B (1996) 58: 267-288
Review paper 15% (combine with take-home exam?)
Class participation 20%

Course reading materials
1) Biological Sequence Analysis, Durbin et al.
2) Mathematical Statistics, Wackerly et al.
3) Bayesian Data Analysis, Gelman A., et al. (Brown Library e-book)
4) Papers by Storey & Tibshirani, Efron, and Tibshirani

Computer questions: Prof. Bill Thompson: [email protected]
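The binomial MLE and conjugate beta posterior from section II can be sketched numerically. This is a minimal illustration, not course material: the sample counts and Beta(2, 2) hyperparameters below are invented for the blue-eyes example.

```python
from math import comb

# Invented example data: k students with blue eyes in a sample of n.
n, k = 40, 12

# Binomial likelihood L(p) = C(n, k) p^k (1 - p)^(n - k)
def likelihood(p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# MLE: p_hat = k / n; confirm it against a grid search over p.
p_hat = k / n
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=likelihood)

# Conjugate Beta(alpha, beta) prior gives a Beta(alpha + k, beta + n - k)
# posterior; its mean shrinks p_hat toward the prior mean of 1/2.
alpha, beta = 2.0, 2.0
post_mean = (alpha + k) / (alpha + beta + n)

print(p_hat, p_grid, post_mean)
```

The grid search finds the same maximizer as the closed-form k/n, which is the content of the "show that the MLE of p is p^hat" exercise.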
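The Metropolis-Hastings algorithm listed under section II's MCMC topic can also be sketched on the same binomial model. In the conjugate case the posterior is known exactly, so this sampler is purely illustrative; the data, prior, step size, and chain length are all assumptions of this sketch.

```python
import math
import random

random.seed(1)
n, k = 40, 12            # invented data: k successes in n trials
alpha, beta = 2.0, 2.0   # Beta prior hyperparameters

def log_post(p):
    """Log of the unnormalized Beta(alpha + k, beta + n - k) posterior density."""
    return (alpha + k - 1) * math.log(p) + (beta + n - k - 1) * math.log(1 - p)

p, burn, draws = 0.5, 2000, []
for step in range(22000):
    prop = p + random.gauss(0.0, 0.1)   # symmetric random-walk proposal
    # Accept with probability min(1, posterior ratio); reject proposals
    # outside (0, 1), where the posterior density is zero.
    if 0.0 < prop < 1.0 and math.log(random.random()) < log_post(prop) - log_post(p):
        p = prop
    if step >= burn:
        draws.append(p)

mcmc_mean = sum(draws) / len(draws)
exact_mean = (alpha + k) / (alpha + beta + n)   # conjugate result, 14/44
print(mcmc_mean, exact_mean)
```

Because the proposal is symmetric, the Hastings correction cancels and the acceptance ratio reduces to the posterior ratio; the Monte Carlo mean should land close to the exact conjugate mean.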
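Section III's forward algorithm can be sketched on the syllabus's "two dice" HMM example. The transition and emission probabilities below are invented for illustration (a fair die vs. a die loaded toward six).

```python
# Forward algorithm for a two-state HMM: state 0 = fair die, state 1 = loaded die.
states = [0, 1]
start = [0.5, 0.5]                        # initial state probabilities
trans = [[0.95, 0.05],                    # trans[i][j] = P(next state j | state i)
         [0.10, 0.90]]
emit = [[1 / 6] * 6,                      # fair die: uniform over faces 1..6
        [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]]  # loaded die: favors a six

def forward(obs):
    """Return P(obs) by recursing on alpha_t(s) = P(o_1..o_t, S_t = s)."""
    alpha = [start[s] * emit[s][obs[0] - 1] for s in states]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in states) * emit[j][o - 1]
                 for j in states]
    return sum(alpha)

rolls = [6, 6, 6, 1, 2, 3]
print(forward(rolls))
```

The recursion marginalizes over the hidden state path in O(T * S^2) time, which is the point of the "derive the forward algorithm" item; summing over all paths directly would cost O(S^T).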
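Section IV's idea of controlling the expected proportion of false positives can be sketched with the Benjamini-Hochberg step-up rule (the course's cited Storey & Tibshirani q-value approach builds on it). The p-values below are invented for illustration.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank r with p_(r) <= (r / m) * q, then reject the
    # hypotheses with the r smallest p-values.
    r_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            r_max = rank
    return sorted(order[:r_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5, 0.7, 0.9]
print(benjamini_hochberg(pvals, q=0.05))
```

Unlike a family-wise (Bonferroni-style) correction, the rule compares each ordered p-value to a rank-dependent threshold, trading a controlled proportion of expected false positives for more discoveries in high-D settings.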