Statistics

Sampled from Morris H. DeGroot & Mark J. Schervish, “Probability and Statistics”, 3rd Edition, Addison Wesley.

Probability: continuous distributions and variables
• Continuous distributions
  – Random variables
  – Probability density function
  – Uniform, normal and exponential distributions
  – Expectations and variance
  – Law of large numbers
  – Central limit theorem
  – Probability density functions of more than one variable
  – Rejection and transformation methods for sampling distributions

Statistics
• Estimators: mean, standard deviation
• Maximum likelihood
• Confidence intervals
• \chi^2 statistics
• Regression
• Goodness of fit

Continuous random variables
• A random variable X is a real-valued function defined on a sample space S.
• X is a continuous random variable if there exists a nonnegative function f, defined on the real line, such that the integral of f over a domain A is the probability that X takes a value in A (A is, for example, the interval [a, b]):

  \Pr(a < X < b) = \int_a^b f(x)\,dx

Probability density function
• f is called the probability density function (p.d.f.). Note that the units of the p.d.f. above are 1/length; only after multiplication by a length element do we obtain a probability.
• Every p.d.f. satisfies

  f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1

Examples of p.d.f.s
• A car drives in a circle at constant speed. What is the probability that it will be found in the interval between 1 and 2 radians?
• A computer generates random numbers between 0 and 1 with equal probability density. What is the probability of obtaining exactly 0.75?
• A protein folds at a constant rate (the probability that a protein folds in the time interval [t, t+dt] is a constant \alpha\,dt). If we have N_0 protein molecules at time zero, what is the probability that all of them have folded after time t'?

Uniform distribution on an interval
• Consider an experiment in which a point X is selected from an interval S = {x : a \le x \le b} in such a way that the probability of finding X in a given subinterval is proportional to that subinterval's length (hence the p.d.f. is a constant). This is called the uniform distribution. For this distribution we must have

  \int_{-\infty}^{\infty} f(x)\,dx = \int_a^b f(x)\,dx = 1

which gives

  f(x) = \begin{cases} \dfrac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}

[Figure: constant density of height 1/(b-a) between x = a and x = b.]

Examples of simple distributions
• X is a random variable distributed uniformly on a circle of radius a. Find f(x).
• Check that the following function satisfies the conditions required of a p.d.f.:

  f(x) = \begin{cases} \tfrac{2}{3}\,x^{-1/3} & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}

Exponential distribution

  f(x) = a \exp(-a x), \qquad 0 < x < \infty

[Figure: f(x) decays exponentially with x.]

Normal distribution

  f(x) = \left(\frac{a}{\pi}\right)^{1/2} \exp\!\left(-a (x - x_0)^2\right), \qquad -\infty < x < \infty

[Figure: bell-shaped curve centered at x_0; the parameter a sets the width.]

Continuous distribution functions
• The distribution function is defined as F(x) = \Pr(X \le x) for -\infty < x < \infty. F(x) is a monotonic, non-decreasing function of x (can you show it?) that can be written in terms of its corresponding p.d.f.:

  F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(x')\,dx', \qquad \frac{dF}{dx} = f(x)

Distribution function: example

  f(x) = \begin{cases} a \exp(-a x) & 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}

  F(x) = \int_{-\infty}^{x} f(x')\,dx' = \int_0^x a \exp(-a x')\,dx' = \left[-\exp(-a x')\right]_0^x = 1 - \exp(-a x)
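The normalization condition and the closed-form F(x) above are easy to check numerically. The following is a minimal sketch, not part of the original slides, using numpy and scipy; the rate a = 2.0 is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.integrate import quad

a = 2.0  # arbitrary rate parameter for the exponential p.d.f.

def pdf(x):
    """Exponential p.d.f.: f(x) = a*exp(-a*x) for x > 0."""
    return a * np.exp(-a * x)

# Check normalization: the p.d.f. must integrate to 1 over its support.
total, _ = quad(pdf, 0.0, np.inf)
print(f"integral of f over [0, inf) = {total:.6f}")  # ~1.0

# Check F(x) = 1 - exp(-a*x) against direct numerical integration.
for x in (0.5, 1.0, 3.0):
    F_numeric, _ = quad(pdf, 0.0, x)
    F_closed = 1.0 - np.exp(-a * x)
    print(f"x={x}: numeric F={F_numeric:.6f}, closed form={F_closed:.6f}")
```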
Expectation
• For a random variable X with p.d.f. f(x) the expectation E(X) is defined as

  E(X) = \int_{-\infty}^{\infty} x\,f(x)\,dx

• The expectation exists if and only if the integral converges absolutely, i.e.

  \int_{-\infty}^{\infty} |x|\,f(x)\,dx < \infty

Expectation (example)

  f(x) = \begin{cases} 2x & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}

  E(X) = \int_0^1 x \cdot 2x\,dx = 2\left[\frac{x^3}{3}\right]_0^1 = \frac{2}{3}

• Even if the p.d.f. satisfies the requirements, it is not obvious that the expectation exists (next slide).

The Cauchy p.d.f.

  f(x) = \frac{1}{\pi (1 + x^2)}, \qquad -\infty < x < \infty \qquad (f(x) \ge 0)

  F(x) = \int_{-\infty}^{x} \frac{dx'}{\pi (1 + x'^2)} = \frac{1}{\pi}\left(\arctan(x) + \frac{\pi}{2}\right)

  F(\infty) = \frac{1}{\pi}\left(\frac{\pi}{2} + \frac{\pi}{2}\right) = 1, \qquad \text{i.e.} \quad \int_{-\infty}^{\infty} \frac{dx}{\pi (1 + x^2)} = 1

Cauchy distribution: expectation
• Test for the existence of the expectation:

  \int_{-\infty}^{\infty} |x|\,f(x)\,dx = \int_{-\infty}^{\infty} \frac{|x|}{\pi (1 + x^2)}\,dx \to \infty

• The expectation does not exist for the Cauchy distribution.
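A quick simulation makes the nonexistence of the Cauchy expectation tangible: sample means of Cauchy draws never settle down, while those of a normal sample do. This is a minimal sketch, not from the original slides; the seed and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

for n in (10**2, 10**4, 10**6):
    cauchy_mean = rng.standard_cauchy(n).mean()
    normal_mean = rng.normal(0.0, 1.0, n).mean()
    # The normal sample mean approaches 0 as n grows; the Cauchy one keeps
    # jumping around, because E(X) does not exist for the Cauchy distribution.
    print(f"n={n:>8}: Cauchy sample mean {cauchy_mean:+8.3f}, "
          f"normal sample mean {normal_mean:+8.3f}")
```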
Some properties of expectations
• Expectation is linear:

  E(aX + bY) = a\,E(X) + b\,E(Y)

• If the random variables X and Y are independent (f(x, y) = f(x)\,f(y)), then

  E(X \cdot Y) = E(X) \cdot E(Y)

Expectation of a function
• This is essentially the same as the expectation of a variable:

  E(r(X)) = \int_{-\infty}^{\infty} r\,g(r)\,dr = \int_{-\infty}^{\infty} r(x)\,f(x)\,dx

• Of special interest are expectations of moments, e.g. the variance:

  \mathrm{var}(X) \equiv E(X^2) - \left[E(X)\right]^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \left[\int_{-\infty}^{\infty} x f(x)\,dx\right]^2

• Can you show that the variance is always non-negative?

Functions of several random variables
• We consider a p.d.f. f(x_1, \ldots, x_n) of several random variables X_1, \ldots, X_n. The p.d.f. satisfies (of course)

  f(x_1, \ldots, x_n) \ge 0, \qquad \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n = 1

  where the integral runs over the full domain of the variables.

Expectation of a function of several variables
• Similarly to the one-variable case, expectations of functions of several variables are computed as

  E\big(Y = r(X_1, \ldots, X_n)\big) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} r(x_1, \ldots, x_n)\,f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n

Example: expectation of more than one variable

  f(x, y) = \begin{cases} 1 & (x, y) \in S \\ 0 & \text{otherwise} \end{cases} \qquad S \text{ is the square } 0 < x < 1,\; 0 < y < 1

  E(X^2 + Y^2) = \int_0^1 \int_0^1 (x^2 + y^2)\,f(x, y)\,dx\,dy = \int_0^1 \int_0^1 (x^2 + y^2)\,dx\,dy = \frac{2}{3}

Markov inequality
• X is a random variable such that \Pr(X \ge 0) = 1. For every t > 0,

  \Pr(X \ge t) \le \frac{E(X)}{t}

• Prove it.
• Why is the case E(X) > t not interesting?

Chebyshev inequality (a special case of the Markov inequality)
• X is a random variable for which the variance exists. For t > 0,

  \Pr\!\left(\left[X - E(X)\right]^2 \ge t^2\right) \le \frac{\mathrm{var}(X)}{t^2}

• Substitute Y = [X - E(X)]^2, so that E(Y) = \mathrm{var}(X), and use t^2 as the threshold in the Markov inequality to obtain this result.

The law of large numbers I
• Consider a set of n random variables X_1, \ldots, X_n, i.i.d. Each of the random variables has mean (expectation value) \mu and variance \sigma^2.
• The arithmetic average of n samples is defined as

  \bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)

  It defines a new random variable that we call the sample mean.
• The expectation value of the sample mean is

  E(\bar{X}_n) = \frac{1}{n} \sum_i E(X_i) = \frac{1}{n} \cdot n\mu = \mu

The law of large numbers II
• The variance of \bar{X}_n:

  \mathrm{var}(\bar{X}_n) = E(\bar{X}_n^2) - E^2(\bar{X}_n) = \frac{1}{n^2} \sum_{i,j} E(X_i X_j) - \frac{1}{n^2} \left[\sum_i E(X_i)\right]^2

• Since X_i and X_j are independent for i \ne j, E(X_i X_j) = E(X_i)\,E(X_j), so

  \mathrm{var}(\bar{X}_n) = \frac{1}{n^2} \left[n\,E(X^2) + (n^2 - n)\,E^2(X) - n^2 E^2(X)\right] = \frac{E(X^2) - E^2(X)}{n} = \frac{\mathrm{var}(X)}{n}

• This means that the variance of the sample mean decreases linearly with the number of sampled points.

The law of large numbers III
• Chebyshev inequality:

  \Pr\!\left((\bar{X}_n - \mu)^2 < \varepsilon^2\right) = 1 - \Pr\!\left((\bar{X}_n - \mu)^2 \ge \varepsilon^2\right) \ge 1 - \frac{\mathrm{var}(X)}{n \varepsilon^2} \qquad \text{for } \varepsilon > 0

  \Rightarrow \bar{X}_n \to \mu

Central limit theorem
• Statement without proof: given a set of random variables X_1, \ldots, X_n with means \mu_i and variances \sigma_i^2, we define a new random variable

  Y = \frac{\sum_{i=1,\ldots,n} X_i}{\left(\sum_{i=1,\ldots,n} \sigma_i^2\right)^{1/2}}

• For very large n, the distribution of \sum_{i=1,\ldots,n} X_i is normal with mean \sum_{i=1,\ldots,n} \mu_i and variance \sum_{i=1,\ldots,n} \sigma_i^2.
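Both results, the 1/n shrinkage of var(X̄n) and the normality promised by the central limit theorem, can be seen in a short simulation. A minimal sketch, not part of the original slides; the exponential source distribution, seed, and sample sizes are arbitrary choices, and the sums are centered by their mean before scaling, as the standard normal limit requires.

```python
import numpy as np

rng = np.random.default_rng(1)     # arbitrary seed
scale = 2.0                        # exponential with mean 2, variance 4

for n in (1, 10, 100, 1000):
    # 20000 independent sample means, each averaged over n draws.
    means = rng.exponential(scale, size=(20_000, n)).mean(axis=1)
    # Law of large numbers: var(sample mean) should be var(X)/n = 4/n.
    print(f"n={n:>5}: var of sample mean = {means.var():.4f} "
          f"(theory {scale**2 / n:.4f})")

# Central limit theorem: centered and scaled sums of n = 1000 draws
# should look standard normal.
n = 1000
sums = rng.exponential(scale, size=(20_000, n)).sum(axis=1)
z = (sums - n * scale) / np.sqrt(n * scale**2)
print(f"standardized sums: mean {z.mean():+.3f}, var {z.var():.3f}")  # ~0, ~1
```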
Statistical inference
• Data are generated from an unknown probability distribution, and statements about the unknown distribution are warranted. Determine parameters (e.g. \beta for the exponential distribution; \mu and \sigma for the normal distribution).
• Prediction of new experiments.

Estimation of parameters
• Notation: f(x|\theta) is the probability density of sampling x given (conditioned on) the parameters \theta.
• For a set of n independent and identically distributed samples the probability density is

  f(x_1, \ldots, x_n | \theta) = \prod_{i=1,\ldots,n} f(x_i | \theta) \equiv f_n(\mathbf{x} | \theta)

• However, what we want to determine now are the parameters. For example, assuming the distribution is normal, we seek the mean \mu and the variance \sigma^2:

  f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Bayesian arguments
• What we want is the function f(\theta | \mathbf{x}): given a set of observations \mathbf{x}, what is the probability that the set of parameters is \theta?
• Bayesian statistics: think of the parameters as random variables like any others, with probability \xi(\theta). The joint probability is

  f(\mathbf{x}, \theta) \equiv f(\mathbf{x} | \theta)\,\xi(\theta) \qquad \text{and also} \qquad f(\mathbf{x}, \theta) \equiv f(\theta | \mathbf{x})\,g(\mathbf{x})

The likelihood function
• We can formally write

  \xi(\theta | \mathbf{x}) = \frac{f(\mathbf{x} | \theta)\,\xi(\theta)}{g(\mathbf{x})}

  which is the probability of having a particular set of parameters for the p.d.f. given a set of observations (what we wanted). Note that our prime interest here is in the parameter set \theta; the samples \mathbf{x} are given. Since g(\mathbf{x}) is independent of \theta, we can write the likelihood function as

  \xi(\theta | \mathbf{x}) \propto f(\mathbf{x} | \theta)\,\xi(\theta)

Example: likelihood function I
• Consider the exponential distribution:

  f(x | \beta) = \begin{cases} \beta \exp(-\beta x) & x > 0 \\ 0 & \text{otherwise} \end{cases}

  f_n(\mathbf{x} | \beta) = \begin{cases} \beta^n \exp\!\left(-\beta \sum_{i=1,\ldots,n} x_i\right) & x_i > 0 \\ 0 & \text{otherwise} \end{cases}

• And assume the p.d.f. of the parameter \beta is a Gaussian with zero mean and unit variance (as the formula shows):

  \xi(\beta) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{\beta^2}{2}\right)

Example: likelihood function II

  \xi(\beta | \mathbf{x}) \propto \frac{1}{(2\pi)^{1/2}} \exp\!\left(-\frac{\beta^2}{2}\right) \beta^n \exp\!\left(-\beta \sum_{i=1,\ldots,n} x_i\right)

Maximum likelihood
• We look for a maximum of the function

  L(\theta) = \log f_n(\mathbf{x} | \theta)

  as a function of the parameters \theta.
• As a concrete example we consider the normal distribution:

  L(\theta) = \log f_n(\mathbf{x} | \mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1,\ldots,n} (x_i - \mu)^2

• To find the most likely set of parameters we determine the maximum of L(\theta).

Maximum of L(\theta) for the normal distribution

  \frac{dL}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1,\ldots,n} (x_i - \mu) = \frac{1}{\sigma^2} \left(\sum_{i=1,\ldots,n} x_i - n\mu\right) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1,\ldots,n} x_i

  \frac{dL}{d\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1,\ldots,n} (x_i - \mu)^2 = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1,\ldots,n} (x_i - \hat{\mu})^2

Determining the most likely parameter for the uniform distribution

  f(x | \theta) = \begin{cases} \dfrac{1}{\theta} & 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}

  f_n(\mathbf{x} | \theta) = \begin{cases} \dfrac{1}{\theta^n} & 0 \le x_i \le \theta \; (i = 1, \ldots, n) \\ 0 & \text{otherwise} \end{cases}

• It is clear that \theta must be larger than all the x_i while at the same time maximizing 1/\theta^n, which decreases monotonically in \theta. Hence

  \hat{\theta} = \max[x_1, \ldots, x_n]

Potential problems in the maximum likelihood procedure
• The value of \theta is underestimated (note that \theta should be larger than all possible x, not only the ones we have sampled so far); see the simulation sketch below.
• There is no guarantee that a solution exists. For the distribution below, \theta must be larger than any sampled x but at the same time equal to the maximal x; this is not possible, and hence there is no solution:

  f(x | \theta) = \begin{cases} \dfrac{1}{\theta} & 0 < x < \theta \\ 0 & \text{otherwise} \end{cases}

• The solution is not necessarily unique:

  f_n(\mathbf{x} | \theta) = \begin{cases} 1 & \theta \le x_i \le \theta + 1 \; (i = 1, \ldots, n) \\ 0 & \text{otherwise} \end{cases}

  \Rightarrow \text{any } \theta \text{ with } \max(x_1, \ldots, x_n) - 1 \le \theta \le \min(x_1, \ldots, x_n) \text{ maximizes the likelihood.}
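The closed-form maximum likelihood estimators above are one line each in code, and a simulation exposes the underestimation of \hat{\theta} = max(x_i) in the uniform case. A minimal sketch under arbitrary choices of true parameters, seed, and sample size; it is not from the original slides.

```python
import numpy as np

rng = np.random.default_rng(2)          # arbitrary seed
n = 50                                  # arbitrary sample size

# Normal: the MLEs are the sample mean and the (1/n) sample variance.
x = rng.normal(3.0, 1.5, n)             # true mu = 3.0, sigma = 1.5
mu_hat = x.mean()
var_hat = ((x - mu_hat) ** 2).mean()    # divides by n, the MLE form
print(f"normal:  mu_hat={mu_hat:.3f}, sigma2_hat={var_hat:.3f}")

# Uniform on [0, theta]: the MLE is the sample maximum, which always
# underestimates theta since every sample lies strictly below it.
theta = 2.0
theta_hats = rng.uniform(0.0, theta, size=(10_000, n)).max(axis=1)
print(f"uniform: mean of theta_hat = {theta_hats.mean():.4f} "
      f"(true theta = {theta}, E[max] = {n / (n + 1) * theta:.4f})")
```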
The \chi^2 distribution with n degrees of freedom

  \Gamma(n) = \int_0^{\infty} t^{n-1} e^{-t}\,dt, \qquad n > 0

  f(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,x^{(n/2)-1} \exp(-x/2), \qquad x > 0

  E(X) = n, \qquad \mathrm{var}(X) = 2n

• There is a useful relation between the \chi^2 and the normal distributions.

Theorem connecting the \chi^2 and normal distributions
• If the random variables X_1, \ldots, X_n are i.i.d. and each of these variables has a standard normal distribution, then the sum of squares

  Y = X_1^2 + \cdots + X_n^2

  has a \chi^2 distribution with n degrees of freedom.
• For a single variable, consider the distribution function

  F(y) = \Pr(Y \le y) = \Pr(X^2 \le y) = \Pr(-y^{1/2} \le X \le y^{1/2}) = \Phi(y^{1/2}) - \Phi(-y^{1/2})

• The p.d.f. is obtained by differentiating both sides, f(y) = F'(y), with \phi(y) = \Phi'(y). Note \phi(y^{1/2}) = (2\pi)^{-1/2} \exp(-y/2). We have

  f(y) = \phi(y^{1/2}) \cdot \tfrac{1}{2} y^{-1/2} + \phi(-y^{1/2}) \cdot \tfrac{1}{2} y^{-1/2} = (2\pi)^{-1/2}\,y^{-1/2} \exp(-y/2)

  which is the \chi^2 distribution with one degree of freedom.

Normal distribution: parameters
• Let X_1, \ldots, X_n be a random sample from a normal distribution having mean \mu and variance \sigma^2. Then the sample mean (the hat denotes the M.L.E.)

  \hat{\mu} = \bar{X}_n = \frac{1}{n} \sum_{i=1,\ldots,n} X_i

  and the sample variance

  \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1,\ldots,n} (X_i - \bar{X}_n)^2

  are independent random variables.
• \hat{\mu} has a normal distribution with mean \mu and variance \sigma^2/n.
• n\hat{\sigma}^2 / \sigma^2 has a chi-square distribution with n-1 degrees of freedom. Why n-1? (next slides)

Parameters of the normal distribution: note 1
• Let x_1, \ldots, x_n be a vector of random numbers of length n sampled from the normal distribution.
• Let y_1, \ldots, y_n be another vector of n random numbers, related to the previous vector by a linear transformation A with A A^t = I:

  \mathbf{y} = A \mathbf{x}

• Consider now the calculation of the variance (next slide).

Variance
• The formula we should use for the variance is

  \mathrm{var}(X) = \frac{1}{n} \sum_{i=1,\ldots,n} (X_i - \mu)^2

• However, we do not know the exact mean, and therefore we use

  \mathrm{var}(X) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2

• What are the consequences of using this approximation?

The variance does not change under orthogonal linear transformations
• Consider the expression

  \sum_{i=1,\ldots,n} (Y_i - \bar{Y}_n)^2 = \sum_{i=1,\ldots,n} (A X_i - A \bar{X}_n)^t (A X_i - A \bar{X}_n) = \sum_{i=1,\ldots,n} (X_i - \bar{X}_n)^t A^t A (X_i - \bar{X}_n) = \sum_{i=1,\ldots,n} (X_i - \bar{X}_n)^2

• The analysis is based on the orthogonality (unitarity) of A. Hence an orthogonal linear transformation does not change the variance of the distribution. This makes it possible to exploit the difference between \bar{X}_n and \mu.

The n-1 (versus n) factor
• Since A is arbitrary (as long as it is orthogonal), we can choose one of the transformation vectors \mathbf{a} to be (1, \ldots, 1)/n^{1/2}.
• The corresponding component of the transformed, centered sample,

  \mathbf{a}^t (\mathbf{x} - \bar{X}_n \mathbf{1}) = \frac{1}{n^{1/2}} \sum_{i=1,\ldots,n} (x_i - \bar{X}_n) = 0

  is identically zero (remember how we compute the mean?).
• Hence, since we computed the average from the same sample from which we compute the variance, the variance lost one degree of freedom.

The n-1 factor II
• Note that the n-1 makes sense. Consider only a single sample point, which is of course very poor and leaves a high degree of uncertainty regarding the values of the parameters. If we use n, the estimated variance becomes zero; if we use n-1, we obtain an undefined (0/0) result, which is more appropriate to the problem at hand, for which we have no information to determine the variance.
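The lost degree of freedom shows up directly in simulation: dividing the sum of squared deviations by n biases the variance estimate low by the factor (n-1)/n, while dividing by n-1 removes the bias. A minimal sketch, not from the slides; the true parameters, seed, and replication count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
sigma2 = 4.0                     # true variance
n = 5                            # small n makes the bias obvious

samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)   # sum of squared deviations

print(f"divide by n:   {(ss / n).mean():.3f}  "
      f"(biased; theory {(n - 1) / n * sigma2:.3f})")
print(f"divide by n-1: {(ss / (n - 1)).mean():.3f}  (unbiased; theory {sigma2:.3f})")
```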
The t distribution (in preparation for confidence intervals)
• Consider two random variables Y and Z, such that Y has a \chi^2 distribution with n degrees of freedom and Z has a standard normal distribution. The variable X defined by

  X = \frac{Z}{(Y/n)^{1/2}}

  then has the t distribution with n degrees of freedom.

The t distribution
• The function is tabulated and can be written in terms of the \Gamma function, \Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} \exp(-x)\,dx:

  t_n(x) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{(n\pi)^{1/2}\,\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}, \qquad -\infty < x < \infty

• The t distribution approaches the normal distribution as n \to \infty. It has the same mean but longer tails.

Confidence interval
• Confidence intervals provide an alternative to using a point estimator in place of the actual value of an unknown parameter. We can find an interval (A, B) that we think has a high probability of containing the desired parameter. The length of the interval gives us an idea of how well we can estimate the parameter value.

Confidence interval: example
• The sample distribution is normal with mean \mu and standard deviation \sigma. We expect to find a sample S in the intervals

  \mu_S \pm \sigma, \qquad \mu_S \pm 2\sigma, \qquad \mu_S \pm 3\sigma

  about 68.27%, 95.45% and 99.73% of the time, respectively.

Confidence interval for means
• If the statistic S is the sample mean \bar{X}, then the 95% and 99% confidence limits for estimating the population mean are given by \bar{X} \pm 1.96\,\sigma_{\bar{X}} and \bar{X} \pm 2.58\,\sigma_{\bar{X}}, respectively.
• For large samples (n \ge 30) we can write (depending on the level of confidence we are interested in)

  \bar{X} \pm z_c \frac{\sigma}{\sqrt{n}}

• For small samples we need the t distribution.

Confidence interval for the mean of the normal distribution
• Let X_1, \ldots, X_n be a random sample from a normal distribution with unknown mean and unknown variance. Let t_{n-1}(x) denote the p.d.f. of the t distribution with n-1 degrees of freedom, and let c be a constant such that

  \int_{-c}^{c} t_{n-1}(x)\,dx = \gamma

• For every value of n, the value of c can be found from a table of the t distribution to fit the confidence level (probability) \gamma.

Confidence interval for means: small samples (n < 30)
• We use the t distribution (instead of the direct normal distribution expression) since the variance is also estimated from the sample. For 95% confidence in the mean:

  -t_{0.975} < \frac{\bar{X} - \mu}{\hat{\sigma}/\sqrt{n}} < t_{0.975}

  and the mean can be estimated as (with 95% confidence)

  \bar{X} - t_{0.975} \frac{\hat{\sigma}}{\sqrt{n}} < \mu < \bar{X} + t_{0.975} \frac{\hat{\sigma}}{\sqrt{n}}
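As a closing illustration, the small-sample recipe above takes only a few lines with scipy's tabulated t quantiles. A minimal sketch, not from the slides; the data-generating parameters and seed are arbitrary, and note it uses the n-1 form of the sample standard deviation, which is the version that pairs exactly with the t_{n-1} quantile.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)       # arbitrary seed
x = rng.normal(10.0, 2.0, size=12)   # small sample, true mu = 10

n = len(x)
xbar = x.mean()
s = x.std(ddof=1)                    # sample std with the n-1 factor
c = t.ppf(0.975, df=n - 1)           # two-sided 95% quantile of t_{n-1}

half_width = c * s / np.sqrt(n)
print(f"95% CI for mu: [{xbar - half_width:.3f}, {xbar + half_width:.3f}]")
```

Repeating this over many simulated samples, the interval should cover the true mean in about 95% of the repetitions, which is exactly what the confidence level promises.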