Essential CS & Statistics
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Aug. 30, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Essential CS Concepts
• Programming languages: Languages that we use to communicate with a computer
  – Machine language (010101110111…)
  – Assembly language (move a, b; add c, b; …)
  – High-level language (x = a + 2*b …), e.g., C++, Perl, Java
  – Different languages are designed for different applications
• System software: Software “assistants” to help a computer
  – Understand high-level programming languages (compilers)
  – Manage all kinds of devices (operating systems)
  – Communicate with users (GUI or command line)
• Application software: Software for various kinds of applications
  – Standing alone (running on a local computer, e.g., Excel, Word)
  – Client-server applications (running on a network, e.g., web browser)
Intelligence/Capacity of a Computer
• The intelligence of a computer is determined
by the intelligence of the software it can run
• Capacities of a computer for running software
are mainly determined by its
– Speed
– Memory
– Disk space
• Given a particular computer, we would like to
write software that is highly intelligent, that
can run fast, and that doesn’t need much
memory (contradictory goals)
Algorithms vs. Software
• An algorithm is a procedure for solving a problem
– Input: description of a problem
– Output: solution(s)
– Step 1: We first do this
– Step 2: ….
– ….
– Step n: here’s the solution!
• Software implements an algorithm (with a
particular programming language)
Example: Change Problem
• Input:
  – M (total amount of money)
  – c1 > c2 > … > cd (coin denominations)
• Output:
  – i1, i2, …, id (number of coins of each denomination), such that
    i1*c1 + i2*c2 + … + id*cd = M and i1 + i2 + … + id is as small as possible
Algorithm Example: BetterChange
Input variables: M, c = (c1, …, cd)

BetterChange(M, c, d)
1   r = M
2   for k = 1 to d {
3      ik = r/ck          (take only the integer part, i.e., the floor)
4      r = r - ik*ck
    }
5   return (i1, i2, …, id)    (output variables)

Properties of an algorithm:
- Correct vs. incorrect algorithms (Is BetterChange correct?)
- Fast vs. slow algorithms (How do we quantify it?)
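A minimal Python sketch of this greedy procedure; the function name and the example denominations below are illustrative choices, picked to show why the correctness question matters.

# Greedy change-making in the spirit of BetterChange (illustrative sketch).
def better_change(M, denominations):
    """Pay amount M greedily, using coin values from largest to smallest."""
    counts = []
    r = M
    for c in sorted(denominations, reverse=True):
        counts.append(r // c)   # take only the integer part (floor)
        r -= (r // c) * c       # remaining amount still to pay
    return counts

# Greedy change is NOT optimal for every denomination set, which is exactly
# the "Is BetterChange correct?" question: here greedy uses 3 coins (25+10+5),
# while 2 coins (20+20) would suffice.
print(better_change(40, [25, 20, 10, 5, 1]))   # [1, 0, 1, 1, 0]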
Big-O Notation
• How can we compare the running time of two
algorithms in a computer-independent way?
• Observations:
– In general, as the problem size grows, the running
time increases (sorting 500 numbers would take
more time than sorting 5 elements)
– Running time is more critical for large problem size
(think about sorting 5 numbers vs. sorting 50000
numbers)
• How about measuring the growth rate of
running time?
Big-O Notation (cont.)
• Define the problem size (e.g., the length of a sequence, n)
• Define “basic steps” (e.g., addition, division, …)
• Express the running time as a function of the problem size (e.g., 3*n*log(n) + n)
• As the problem size approaches positive infinity, only the highest-order term “counts”
• Big-O indicates the highest-order term
• E.g., the algorithm has O(n*log(n)) time complexity
• Polynomial (e.g., O(n^2)) vs. exponential (e.g., O(2^n))
• NP-complete
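To make the growth-rate idea concrete, here is a small Python sketch (with hypothetical step-count formulas) comparing an O(n^2) algorithm with an O(n*log(n)) algorithm as n grows.

import math

def quadratic_steps(n):
    """E.g., comparing every pair of n elements: about n*(n-1)/2 basic steps."""
    return n * (n - 1) // 2

def nlogn_steps(n):
    """E.g., a good sorting algorithm: about n*log2(n) basic steps."""
    return int(n * math.log2(n))

for n in [10, 100, 1000, 10000]:
    print(n, quadratic_steps(n), nlogn_steps(n))
# As n grows, only the highest-order term matters: the O(n^2) count quickly
# dwarfs the O(n*log(n)) count, regardless of constant factors.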
Basic Probability & Statistics
Purpose of Prob. & Statistics
• Deductive vs. Plausible reasoning
• Incomplete knowledge -> uncertainty
• How do we quantify inference under
uncertainty?
– Probability: models of random
process/experiments (how data are generated)
– Statistics: draw conclusions on the whole
population based on samples (inference on data)
Basic Concepts in Probability
• Sample space: all possible outcomes, e.g.,
  – Tossing 2 coins, S = {HH, HT, TH, TT}
• Event: E ⊆ S; E happens iff the outcome is in E, e.g.,
  – E = {HH} (all heads)
  – E = {HH, TT} (same face)
• Probability of an event: 0 ≤ P(E) ≤ 1, s.t.
  – P(S) = 1 (the outcome is always in S)
  – P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅
Basic Concepts of Prob. (cont.)
• Conditional probability: P(B|A) = P(A ∩ B)/P(A)
  – P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
  – So, P(A|B) = P(B|A)P(A)/P(B)
  – For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)
• Total probability: If A1, …, An form a partition of S, then
  – P(B) = P(B ∩ S) = P(B ∩ A1) + … + P(B ∩ An)
  – So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) (Bayes’ Rule)
Interpretation of Bayes’ Rule
Hypothesis space: H = {H1, …, Hn};   Evidence: E

   P(Hi|E) = P(E|Hi) P(Hi) / P(E)

- P(Hi|E): posterior probability of Hi
- P(Hi): prior probability of Hi
- P(E|Hi): likelihood of the data/evidence if Hi is true

If we want to pick the most likely hypothesis H*, we can drop P(E):

   P(Hi|E) ∝ P(E|Hi) P(Hi)
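A small Python sketch of this computation; the priors and likelihoods below are made-up numbers for two hypothetical hypotheses.

# Bayes' rule over a small hypothesis space (illustrative numbers).
priors = {"H1": 0.7, "H2": 0.3}          # P(Hi): prior probability of Hi
likelihoods = {"H1": 0.01, "H2": 0.10}   # P(E|Hi): likelihood of evidence E under Hi

# Unnormalized posteriors P(E|Hi)*P(Hi); dividing by P(E) just normalizes them.
unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
p_evidence = sum(unnormalized.values())              # P(E), by total probability
posteriors = {h: u / p_evidence for h, u in unnormalized.items()}

print(posteriors)                                    # P(Hi|E)
print(max(posteriors, key=posteriors.get))           # the most likely hypothesis H*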
Random Variable
• X: S → ℝ (a “measure” of the outcome)
• Events can be defined according to X
  – E(X=a) = {si | X(si) = a}
  – E(X≤a) = {si | X(si) ≤ a}
• So, probabilities can be defined on X
  – P(X=a) = P(E(X=a))
  – P(X≤a) = P(E(X≤a))   (F(a) = P(X≤a): cumulative distribution function)
• Discrete vs. continuous random variables (think of “partitioning the sample space”)
An Example
• Think of a DNA sequence as the result of tossing a 4-face die many times independently
• P(AATGC) = p(A)p(A)p(T)p(G)p(C)
• A model specifies {p(A), p(C), p(G), p(T)}, e.g., all 0.25 (random model M0)
• P(AATGC|M0) = 0.25*0.25*0.25*0.25*0.25
• Comparing 2 models:
  – M1: coding regions
  – M2: non-coding regions
  – Decide if AATGC is more likely a coding region
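A short Python sketch of this comparison; the probabilities assumed for M1 and M2 below are invented for illustration.

import math

# Score a DNA sequence under two multinomial ("die-tossing") models.
M1 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}   # hypothetical coding model
M2 = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}   # hypothetical non-coding model

def log_likelihood(seq, model):
    """log P(seq | model), assuming independent tosses of a 4-face die."""
    return sum(math.log(model[base]) for base in seq)

seq = "AATGC"
ll1, ll2 = log_likelihood(seq, M1), log_likelihood(seq, M2)
# With equal prior probability on the two models, the sequence is "more likely
# coding" exactly when log P(seq|M1) > log P(seq|M2).
print("more likely coding" if ll1 > ll2 else "more likely non-coding")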
Probability Distributions
• Binomial: number of successes out of N trials

   p(k|N) = \binom{N}{k} p^k (1-p)^{N-k}

• Gaussian: sum of N independent R.V.’s

   f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}

• Multinomial: getting ni occurrences of outcome i

   p(n_1, \ldots, n_k \mid N) = \binom{N}{n_1 \, \ldots \, n_k} \prod_{i=1}^{k} p_i^{n_i}
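These distributions can be evaluated directly from their formulas; the Python sketch below (with illustrative parameter values) uses only the standard library.

import math

def binomial_pmf(k, N, p):
    """P(k successes out of N trials)."""
    return math.comb(N, k) * p**k * (1 - p)**(N - k)

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def multinomial_pmf(counts, probs):
    """P(n_1, ..., n_k occurrences), where N = sum of the counts."""
    coeff = math.factorial(sum(counts))
    for n in counts:
        coeff //= math.factorial(n)
    return coeff * math.prod(p**n for p, n in zip(probs, counts))

print(binomial_pmf(3, 10, 0.5))
print(gaussian_pdf(0.0, 0.0, 1.0))
print(multinomial_pmf([2, 1, 1, 1], [0.25, 0.25, 0.25, 0.25]))   # e.g., counts of A, T, G, C in AATGC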
Parameter Estimation
• General setting:
  – Given a (hypothesized & probabilistic) model that governs the random experiment
  – The model gives a probability of any data, p(D|θ), that depends on the parameter θ
  – Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?
• Intuitively, take your best guess of θ – “best” means “best explaining/fitting the data”
• Generally an optimization problem
Maximum Likelihood Estimator
Data: a sequence d with counts c(w1), …, c(wN), and length |d|
Model: multinomial M with parameters {p(wi)}, where θi = p(wi)
Likelihood: p(d|M)
Maximum likelihood estimator: M* = argmax_M p(d|M)

   p(d|M) = \binom{|d|}{c(w_1) \, \ldots \, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \;\propto\; \prod_{i=1}^{N} \theta_i^{c(w_i)}

   l(d|M) = \log p(d|M) = \sum_{i=1}^{N} c(w_i) \log \theta_i

We’ll tune the p(wi) to maximize l(d|M), subject to the constraint that the θi sum to 1. Use the Lagrange multiplier approach:

   l'(d|M) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)

Set the partial derivatives to zero:

   \frac{\partial l'}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \quad\Rightarrow\quad \theta_i = -\frac{c(w_i)}{\lambda}

Since \sum_{i=1}^{N} \theta_i = 1, we get \lambda = -\sum_{i=1}^{N} c(w_i) = -|d|

So,  \theta_i = p(w_i) = \frac{c(w_i)}{|d|}     (the ML estimate)
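In code, the ML estimate is just the relative frequency c(wi)/|d|; a minimal Python sketch:

from collections import Counter

def mle_multinomial(sequence):
    """Maximum likelihood estimate of multinomial parameters: p(w) = c(w)/|d|."""
    counts = Counter(sequence)              # c(w) for each symbol w
    total = len(sequence)                   # |d|
    return {w: c / total for w, c in counts.items()}

print(mle_multinomial("AATGC"))   # {'A': 0.4, 'T': 0.2, 'G': 0.2, 'C': 0.2}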
Maximum Likelihood vs. Bayesian
• Maximum likelihood estimation
  – “Best” means “data likelihood reaches maximum”

     \hat{\theta} = \arg\max_{\theta} P(X \mid \theta)

  – Problem: small sample
• Bayesian estimation
  – “Best” means being consistent with our “prior” knowledge and explaining the data well

     \hat{\theta} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta) P(\theta)

  – Problem: how to define the prior?
Bayesian Estimator
• ML estimator: M* = argmax_M p(d|M)
• Bayesian estimator:
  – First consider the posterior: p(M|d) = p(d|M)p(M)/p(d)
  – Then, consider the mean or mode of the posterior distribution
• p(d|M): sampling distribution (of the data)
• p(M) = p(θ1, …, θN): our prior on the model parameters
• Conjugate prior = the prior can be interpreted as “extra”/“pseudo” data
• The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution:

   Dir(\theta \mid \alpha_1, \ldots, \alpha_N) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_N)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_N)} \prod_{i=1}^{N} \theta_i^{\alpha_i - 1}

  – The αi act as “extra”/“pseudo” counts, e.g., αi = μ p(wi|REF)
Dirichlet Prior Smoothing (cont.)
Posterior distribution of the parameters:

   p(\theta \mid d) = Dir(\theta \mid c(w_1) + \alpha_1, \ldots, c(w_N) + \alpha_N)

Property: if \theta \sim Dir(\theta \mid \vec{\alpha}), then E(\theta_i) = \frac{\alpha_i}{\sum_j \alpha_j}

The predictive distribution is the same as the mean:

   p(w_i \mid \hat{\theta}) = \int p(w_i \mid \theta) \, p(\theta \mid d) \, d\theta
                            = \frac{c(w_i) + \alpha_i}{|d| + \sum_{j=1}^{N} \alpha_j}
                            = \frac{c(w_i) + \mu \, p(w_i \mid REF)}{|d| + \mu}

(the Bayesian estimate; what happens as |d| → ∞?)
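A minimal Python sketch of this smoothed estimate, assuming a uniform reference distribution p(w|REF) and an illustrative value of μ:

from collections import Counter

def dirichlet_smoothed(sequence, p_ref, mu=4.0):
    """Bayesian (Dirichlet-smoothed) estimate: (c(w) + mu*p(w|REF)) / (|d| + mu)."""
    counts = Counter(sequence)
    total = len(sequence)
    return {w: (counts.get(w, 0) + mu * p_ref[w]) / (total + mu) for w in p_ref}

p_ref = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # illustrative p(w|REF)
print(dirichlet_smoothed("AATGC", p_ref))
# Unlike the ML estimate, no symbol gets probability 0; and as |d| grows,
# the smoothed estimate approaches the ML estimate c(w)/|d|.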
Illustration of Bayesian Estimation
[Figure: the prior p(θ), the likelihood p(D|θ) with D = (c1, …, cN), and the posterior p(θ|D) ∝ p(D|θ)p(θ) plotted over θ, with the prior mode, the posterior mode, and the ML estimate θml marked on the θ axis.]
Basic Concepts in Information Theory
• Entropy: Measuring uncertainty of a random
variable
• Kullback-Leibler divergence: comparing two
distributions
• Mutual Information: measuring the correlation
of two random variables
Entropy
Entropy H(X) measures the average uncertainty of a random variable X:

   H(X) = H(p) = -\sum_{x} p(x) \log p(x)     (sum over all possible values x)

   Define 0 \log 0 = 0, and \log = \log_2

Example:
- fair coin, p(H) = 0.5:  H(X) = 1
- biased coin, p(H) = 0.8:  H(X) is between 0 and 1
- completely biased coin, p(H) = 1:  H(X) = 0

Properties: H(X) ≥ 0; min = 0; max = log M, where M is the total number of values
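A quick Python sketch reproducing the coin examples:

import math

def coin_entropy(p):
    """Entropy (in bits) of a coin with P(heads) = p; 0*log(0) is defined as 0."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(coin_entropy(0.5))   # fair coin: 1 bit (maximum uncertainty)
print(coin_entropy(0.8))   # biased coin: about 0.72 bits (between 0 and 1)
print(coin_entropy(1.0))   # completely biased coin: 0 bits (no uncertainty)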
Interpretations of H(X)
• Measures the “amount of information” in X
  – Think of each value of X as a “message”
  – Think of X as a random experiment (20 questions)
• Minimum average number of bits to compress values of X
  – The more random X is, the harder it is to compress
  – A fair coin has the maximum information, and is hardest to compress
  – A biased coin has some information, and can be compressed to <1 bit on average
  – A completely biased coin has no information, and needs only 0 bits

   “Information of x” = “# bits to code x” = -\log p(x),   so   H(X) = E_p[-\log p(x)]
Cross Entropy H(p,q)
What if we encode X with a code optimized for a wrong distribution q?
Expected # of bits:

   H(p,q) = E_p[-\log q(x)] = -\sum_{x} p(x) \log q(x)

Intuitively, H(p,q) ≥ H(p), and mathematically,

   H(p,q) - H(p) = \sum_{x} p(x) \left[ -\log \frac{q(x)}{p(x)} \right]
                 \ge -\log \sum_{x} p(x) \frac{q(x)}{p(x)} = 0

By Jensen’s inequality:  \sum_i p_i f(x_i) \ge f\!\left( \sum_i p_i x_i \right),
where f is a convex function and \sum_i p_i = 1.
Kullback-Leibler Divergence D(p||q)
What if we encode X with a code optimized for a wrong distribution q?
How many bits would we waste?

   D(p \| q) = H(p,q) - H(p) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}     (relative entropy)

Properties:
- D(p||q) ≥ 0
- D(p||q) ≠ D(q||p)
- D(p||q) = 0 iff p = q

KL-divergence is often used to measure the distance between two distributions.
Interpretation:
- Fix p; then D(p||q) and H(p,q) vary in the same way
- If p is an empirical distribution, minimizing D(p||q) or H(p,q) is equivalent to maximizing likelihood
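A small Python sketch computing H(p), H(p,q), and D(p||q) for two made-up distributions:

import math

def cross_entropy(p, q):
    """H(p,q) in bits: expected code length when encoding p with a code for q."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)     # D(p||q) = H(p,q) - H(p)

p = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
print(kl_divergence(p, q) >= 0)                          # always True
print(kl_divergence(p, q) == kl_divergence(q, p))        # in general False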
Cross Entropy, KL-Div, and Likelihood
Data/sample for X: Y = (y_1, \ldots, y_N)

Empirical distribution:  \tilde{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(y_i, x),   where \delta(y, x) = 1 if x = y, and 0 otherwise

Likelihood:  L(Y) = \prod_{i=1}^{N} p(X = y_i)

Log-likelihood:  \log L(Y) = \sum_{i=1}^{N} \log p(X = y_i) = \sum_{x} c(x) \log p(X = x) = N \sum_{x} \tilde{p}(x) \log p(x)

   \frac{1}{N} \log L(Y) = -H(\tilde{p}, p) = -D(\tilde{p} \| p) - H(\tilde{p})

Fix the data; then  \arg\max_{p} \frac{1}{N} \log L(Y) = \arg\min_{p} H(\tilde{p}, p) = \arg\min_{p} D(\tilde{p} \| p)

This is a criterion for estimating a good model.
Mutual Information I(X;Y)
Comparing two distributions: p(x,y) vs. p(x)p(y)

   I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x) p(y)} = H(X) - H(X|Y) = H(Y) - H(Y|X)

Conditional entropy H(Y|X):

   H(Y|X) = E_{p(x,y)}[-\log p(y|x)] = -\sum_{x,y} p(x,y) \log p(y|x) = -\sum_{x} p(x) \sum_{y} p(y|x) \log p(y|x)

Properties: I(X;Y) ≥ 0; I(X;Y) = I(Y;X); I(X;Y) = 0 iff X and Y are independent
Interpretations:
- Measures how much reduction in the uncertainty of X we get given information about Y
- Measures the correlation between X and Y
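A small Python sketch computing I(X;Y) from a made-up joint distribution p(x,y):

import math

# Joint distribution p(x,y) as a dict keyed by (x, y); the numbers are illustrative.
joint = {("a", "c"): 0.3, ("a", "d"): 0.2, ("b", "c"): 0.1, ("b", "d"): 0.4}

def mutual_information(p_xy):
    """I(X;Y) in bits, computed by comparing p(x,y) with p(x)p(y)."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():          # marginals p(x) and p(y)
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

print(mutual_information(joint))   # > 0 here, since X and Y are correlated
# For an independent joint distribution (p(x,y) = p(x)p(y)) the result is 0.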
What You Should Know
• Computational complexity, big-O notation
• Probability concepts:
  – sample space, event, random variable, conditional prob., multinomial distribution, etc.
• Bayes’ formula and its interpretation
• Statistics: know how to compute the maximum likelihood estimate
• Information theory concepts:
  – entropy, cross entropy, relative entropy, conditional entropy, KL-divergence, mutual information, and their relationship