The normal distribution is extremely common and useful for one reason: the normal distribution approximates a lot of other distributions. This is the result of one of the most fundamental theorems in mathematics:

0.1 Central Limit Theorem (CLT)

Theorem 0.1.1 (Central Limit Theorem) If X1, X2, ..., Xn are n independent, identically distributed random variables with E[Xi] = µ and Var[Xi] = σ², then the sample mean X̄ := (1/n) Σᵢ₌₁ⁿ Xi is approximately normally distributed with E[X̄] = µ and Var[X̄] = σ²/n, i.e.

X̄ ~ N(µ, σ²/n)    or    Σᵢ Xi ~ N(nµ, nσ²).

Corollary 0.1.2
(a) For large n the binomial distribution B_{n,p} is approximately normal N_{np, np(1−p)}.
(b) For large λ the Poisson distribution Po_λ is approximately normal N_{λ, λ}.
(c) For large k the Erlang distribution Erlang_{k,λ} is approximately normal N_{k/λ, k/λ²}.

Why?
(a) Let X be a variable with a B_{n,p} distribution. We know that X is the result of repeating the same Bernoulli experiment n times and counting the overall number of successes. We can therefore write X as the sum of n B_{1,p} variables Xi:
X := X1 + X2 + ... + Xn.
X is then the sum of n independent, identically distributed random variables, so the Central Limit Theorem states that X has an approximate normal distribution with E[X] = nE[Xi] = np and Var[X] = nVar[Xi] = np(1 − p).
(b) It is enough to show the statement for the case that λ is a large integer. Let Y be a Poisson variable with rate λ. Then we can think of Y as the number of occurrences in an experiment that runs for time λ; that is the same as observing λ experiments that each run independently for time 1 and adding their results:
Y = Y1 + Y2 + ... + Yλ, with Yi ~ Po_1.
Again, Y is the sum of λ independent, identically distributed random variables, so the Central Limit Theorem states that Y has an approximate normal distribution with E[Y] = λ · 1 = λ and Var[Y] = λ Var[Yi] = λ.
(c) This statement is the easiest to prove, since an Erlang_{k,λ} distributed variable Z is by definition the sum of k independent exponentially distributed variables Z1, ..., Zk. For Z the CLT holds, and we get that Z is approximately normally distributed with E[Z] = kE[Zi] = k/λ and Var[Z] = kVar[Zi] = k/λ².

Why do we need the Central Limit Theorem at all? First of all, the CLT gives us the distribution of the sample mean in a very general setting: the only thing we need to know is that all the observed values come from the same distribution and that the variance of this distribution is not infinite. A second reason is that most tables only contain probabilities up to a certain limit; the Poisson table, for example, only has values for λ ≤ 10, and the binomial distribution is tabled only for n ≤ 20. Beyond that, we can use the normal approximation to get probabilities.

Example 0.1.1 Hits on a Webpage
Hits occur with a rate of 2 per min. What is the probability of waiting more than 20 min for the 50th hit?
Let Y be the waiting time until the 50th hit. We know: Y has an Erlang_{50,2} distribution. Therefore:

P(Y > 20) = 1 − Erlang_{50,2}(20) = 1 − (1 − Po_{2·20}(50 − 1)) = Po_40(49) ≈(CLT) N_{40,40}(49) = Φ((49 − 40)/√40) = Φ(1.42) = 0.9222 (from the table).

Example 0.1.2 Mean of Uniform Variables
Let U1, U2, U3, U4, and U5 be standard uniform variables, i.e. Ui ~ U(0,1). Without the CLT we would have no idea what distribution the sample mean Ū = (1/5) Σᵢ₌₁⁵ Ui has! With it, we know: Ū ~(approx.) N(0.5, 1/60).

Issue: accuracy of the approximation
• increases with n
• increases with the amount of symmetry in the distribution of Xi

Rule of thumb for the binomial distribution: use the normal approximation for B_{n,p} if np > 5 (when p ≤ 0.5) or nq > 5 (when p ≥ 0.5), where q = 1 − p.

From now on, we will use probability theory only to find answers to the questions arising from specific problems we are working on.
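To see the normal approximation from Corollary 0.1.2(b) in action, here is a small Python sketch (the helper function names are my own) that compares the exact Poisson probability Po_40(49) from Example 0.1.1 with its normal approximation N_{40,40}(49):

```python
import math

def phi(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def poisson_cdf(k, lam):
    """Exact Poisson CDF P(X <= k) by summing the pmf."""
    total = 0.0
    for i in range(k + 1):
        total += math.exp(-lam) * lam**i / math.factorial(i)
    return total

# Example 0.1.1: P(Y > 20) for Y ~ Erlang(50, 2) equals Po_40(49).
exact = poisson_cdf(49, 40)
approx = phi((49 - 40) / math.sqrt(40))  # the N(40, 40) approximation
print(exact, approx)  # both close to 0.92
```

Both numbers land close together, which is exactly the point of using the CLT when the Poisson table runs out.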
In this chapter we want to draw inferences about some characteristic of an underlying population, e.g. the average height of a person. Instead of measuring this characteristic for each individual, we will draw a sample, i.e. choose a "suitable" subset of the population, and measure the characteristic only for those individuals. Using some probabilistic arguments we can then extend the information we got from that sample and make an estimate of the characteristic for the whole population. Probability theory will give us the means to find those estimates and to measure how "probable" our estimates are.
Of course, choosing the sample is crucial. We will demand two properties from a sample:
• the sample should be representative: taking only basketball players into the sample would change our estimate of a person's height drastically.
• with a large number of individuals in the sample we should come close to the "true" value of the characteristic.
The three main areas of statistics are
• estimation of parameters, as point or interval estimates: "my best guess for value x is ...", "my guess is that value x is in interval (a, b)"
• evaluation of the plausibility of values: hypothesis testing
• prediction of future (individual) values

0.2 Parameter Estimation

Statistics are all around us: scores in sports, prices at the grocer's, weather reports (and how often they turn out to be close to the actual weather), taxes, evaluations ... The most basic form of statistics are descriptive statistics. But what exactly is a statistic? Here is the formal definition:
Definition 0.2.1 (Statistic) Any function W(x1, ..., xk) of observed values x1, ..., xk is called a statistic.
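Definition 0.2.1 is very permissive: any function of the observed values qualifies. As a quick sketch (function names and data are made up for illustration):

```python
def sample_mean(xs):
    """The statistic W(x1,...,xk) = (x1 + ... + xk) / k."""
    return sum(xs) / len(xs)

def sample_range(xs):
    """The statistic W(x1,...,xk) = max(xi) - min(xi)."""
    return max(xs) - min(xs)

# Hypothetical observed values x1, ..., x5:
data = [2.0, 3.5, 1.0, 4.0, 2.5]
print(sample_mean(data))   # 2.6
print(sample_range(data))  # 3.0
```

Both functions map the observed values to a single number, which is all the definition asks for.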
Some statistics you already know are:

Mean (average): X̄ = (1/n) Σᵢ Xi
Minimum: X(1)  (parentheses indicate that the values are sorted)
Maximum: X(n)
Range: X(n) − X(1)
Mode: value(s) that appear(s) most often
Median: "middle value", i.e. that value for which one half of the data is larger and the other half is smaller. If n is odd, the median is X((n+1)/2); if n is even, the median is the average of the two middle values: 0.5 · X(n/2) + 0.5 · X(n/2+1).

For this section it is important to distinguish between xi and Xi properly. If not stated otherwise, any capital letter denotes some random variable, while a small letter describes a realization of this random variable, i.e. what we have observed. xi therefore is a real number; Xi is a function that assigns a real number to an event from the sample space.

Definition 0.2.2 (Estimator) Let X1, ..., Xk be k i.i.d. random variables with distribution Fθ with (unknown) parameter θ. A statistic Θ̂ = Θ̂(X1, ..., Xk) used to estimate the value of θ is called an estimator of θ. θ̂ = Θ̂(x1, ..., xk) is called an estimate of θ.

Desirable properties of estimators:
• Unbiasedness: the expected value of the estimator is the true parameter, E[Θ̂] = θ.
• Efficiency: for two estimators Θ̂1 and Θ̂2 of the same parameter θ, Θ̂1 is said to be more efficient than Θ̂2 if Var[Θ̂1] < Var[Θ̂2].
• Consistency: if we have a larger sample size n, we want the estimate θ̂ to be closer to the true parameter θ:
lim_{n→∞} P(|Θ̂ − θ| > ε) = 0.

[Diagrams omitted: dot plots of estimated values around the true value x, illustrating unbiasedness, efficiency (estimator 1 is better than estimator 2), and consistency (same estimator for n = 100 and n = 10000).]

Example 0.2.1 Let X1, ..., Xn be n i.i.d. random variables with E[Xi] = µ. Then X̄ = (1/n) Σᵢ₌₁ⁿ Xi is an unbiased estimator of µ, because

E[X̄] = (1/n) Σᵢ₌₁ⁿ E[Xi] = (1/n) · n · µ = µ.

OK, so once we have an estimator, we can decide whether it has these properties. But how do we find estimators in the first place?

0.2.1 Maximum Likelihood Estimation

Situation: We have n data values x1, ..., xn. The assumption is that these data values are realizations of n i.i.d. random variables X1, ..., Xn with distribution Fθ. Unfortunately, the value of θ is unknown.

[Diagram omitted: observed values x1, x2, x3, ... on the x-axis, with three candidate densities fθ for θ = 0, θ = −1.8, and θ = 1.]

By changing the value of θ we can "move the density function fθ around"; in the diagram, the third density function fits the data best.

Principle: since we do not know the true value θ of the distribution, we take that value θ̂ that most likely produced the observed values, i.e. we maximize something like

P(X1 = x1 ∩ X2 = x2 ∩ ... ∩ Xn = xn) = P(X1 = x1) · P(X2 = x2) · ... · P(Xn = xn) = ∏ᵢ₌₁ⁿ P(Xi = xi)   (*)

(the product form holds because the Xi are independent). This is not quite the right way to write the probability if X1, ..., Xn are continuous variables. (Remember: P(X = x) = 0 for a continuous variable X; this is still valid.) We use the above "probability" just as a plausibility argument. To get around the problem that P(X = x) = 0 for a continuous variable, we will write (*) as

∏ᵢ₌₁ⁿ pθ(xi) for discrete Xi    and    ∏ᵢ₌₁ⁿ fθ(xi) for continuous Xi,

where pθ is the probability mass function of the discrete Xi (all Xi have the same, since they are identically distributed) and fθ is the density function of the continuous Xi. Both these functions depend on θ. In fact, we can regard the above expressions as a function of θ. This function, which we will denote by L(θ), is called the likelihood function of X1, ..., Xn.

The goal is now to find a value θ̂ that maximizes the likelihood function. (This is what "moves" the density to the right spot, so it fits the observed values well.) How do we find a maximum of L(θ)? By the usual way we maximize a function: differentiate it and set the derivative to zero!
(After that, we ought to check with the second derivative whether we have actually found a maximum, but we won't do that unless we have found more than one candidate value for θ̂.)
Most of the time it is difficult to differentiate L(θ) directly; instead we use another trick and find a maximum of log L(θ), the log-likelihood function. Note: though its name is "log", we use the natural logarithm ln.
The plan to find an ML estimator is:
1. Find the likelihood function L(θ).
2. Take the natural log of the likelihood function, log L(θ).
3. Differentiate the log-likelihood function with respect to θ.
4. Set the derivative to zero.
5. Solve for θ.

Example 0.2.2 Roll a Die
A die is rolled until its face shows a 6. Repeating this experiment 100 times gave the results summarized in a frequency table and histogram.
[Histogram and frequency table omitted: number of runs vs. k = number of rolls until the first 6; the counts add up to 100 runs, with Σᵢ ki = 568 rolls in total.]
We know that k, the number of rolls until a 6 shows up, has a geometric distribution Geo_p. For a fair die, p is 1/6. The geometric distribution has probability mass function p(k) = (1 − p)^{k−1} · p. What is the ML estimate p̂ for p?
1. Likelihood function L(p): Since we have observed 100 outcomes k1, ..., k100, the likelihood function is
L(p) = ∏ᵢ₌₁¹⁰⁰ p(ki) = ∏ᵢ₌₁¹⁰⁰ (1 − p)^{ki−1} p = p¹⁰⁰ · (1 − p)^{Σᵢ₌₁¹⁰⁰ (ki−1)} = p¹⁰⁰ · (1 − p)^{Σᵢ₌₁¹⁰⁰ ki − 100}.
2. Log of the likelihood function:
log L(p) = log p¹⁰⁰ + log (1 − p)^{Σᵢ ki − 100} = 100 log p + (Σᵢ₌₁¹⁰⁰ ki − 100) log(1 − p).
3. Differentiate the log-likelihood with respect to p:
d/dp log L(p) = 100 · (1/p) + (Σᵢ₌₁¹⁰⁰ ki − 100) · (−1/(1 − p)) = (1/(p(1 − p))) · (100(1 − p) − (Σᵢ ki − 100) p) = (1/(p(1 − p))) · (100 − p Σᵢ₌₁¹⁰⁰ ki).
4. Set the derivative to zero. For the estimate p̂ the derivative must be zero:
d/dp log L(p̂) = 0 ⟺ (1/(p̂(1 − p̂))) · (100 − p̂ Σᵢ₌₁¹⁰⁰ ki) = 0.
5. Solve for p̂:
100 − p̂ Σᵢ₌₁¹⁰⁰ ki = 0 ⟺ p̂ = 100 / Σᵢ₌₁¹⁰⁰ ki.
In total, we have an estimate p̂ = 100/568 ≈ 0.1761.

Example 0.2.3 Red Cars in the Parking Lot
The values 3, 2, 3, 3, 4, 1, 4, 2, 4, 3 have been observed while counting the number of red cars pulling into parking lot #22 between 8:30 and 8:40 am, Mo to Fr, during two weeks. The assumption is that these values are realizations of ten independent Poisson variables with (the same) rate λ. What is the maximum likelihood estimate of λ?
The probability mass function of a Poisson distribution is pλ(x) = e^{−λ} · λˣ/x!. We have ten values xi; this gives the likelihood function
L(λ) = ∏ᵢ₌₁¹⁰ e^{−λ} · λ^{xi}/xi! = e^{−10λ} · λ^{Σᵢ₌₁¹⁰ xi} · ∏ᵢ₌₁¹⁰ 1/xi!.
The log-likelihood then is
log L(λ) = −10λ + ln(λ) · Σᵢ₌₁¹⁰ xi − Σᵢ ln(xi!).
Differentiating the log-likelihood with respect to λ gives
d/dλ log L(λ) = −10 + (1/λ) · Σᵢ₌₁¹⁰ xi.
Setting it to zero:
(1/λ̂) · Σᵢ₌₁¹⁰ xi = 10 ⟺ λ̂ = (1/10) Σᵢ₌₁¹⁰ xi = 29/10 = 2.9.
This gives us an estimate for λ, and since λ is also the expected value of the Poisson distribution, we can say that on average the number of red cars pulling into the parking lot each morning between 8:30 and 8:40 am is 2.9.

ML estimators for µ and σ² of a normal distribution
Let X1, ..., Xn be n independent, identically distributed normal variables with E[Xi] = µ and Var[Xi] = σ². Both µ and σ² are unknown.
The normal density function is
f_{µ,σ²}(x) = (1/√(2πσ²)) · e^{−(x−µ)²/(2σ²)}.
Since we have n independent variables, the likelihood function is a product of n densities:
L(µ, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) · e^{−(xi−µ)²/(2σ²)} = (2πσ²)^{−n/2} · e^{−Σᵢ₌₁ⁿ (xi−µ)²/(2σ²)}.
Log-likelihood:
log L(µ, σ²) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σᵢ₌₁ⁿ (xi − µ)².
Since we now have two parameters, µ and σ², we need two partial derivatives of the log-likelihood:
∂/∂µ log L(µ, σ²) = 0 − (1/(2σ²)) · Σᵢ₌₁ⁿ 2(xi − µ) · (−1) = (1/σ²) Σᵢ₌₁ⁿ (xi − µ),
∂/∂σ² log L(µ, σ²) = −(n/2) · (1/σ²) + (1/(2(σ²)²)) Σᵢ₌₁ⁿ (xi − µ)².
We now must find values for µ and σ² that yield zeros for both derivatives at the same time. Setting ∂/∂µ log L(µ, σ²) = 0 gives
µ̂ = (1/n) Σᵢ₌₁ⁿ xi;
plugging this value into the derivative with respect to σ² and setting ∂/∂σ² log L(µ̂, σ²) = 0 gives
σ̂² = (1/n) Σᵢ₌₁ⁿ (xi − µ̂)².

0.3 Confidence Intervals

The previous section has provided a way to compute point estimates for parameters. Based on that, our next question is: how good is this point estimate? Or: how close is the estimate to the true value of the parameter? Instead of just looking at the point estimate, we will now try to compute an interval around the estimated parameter value in which the true parameter is "likely" to fall. An interval like that is called a confidence interval.
Definition 0.3.1 (Confidence Interval) Let θ̂ be an estimate of θ. If P(|θ̂ − θ| < e) > α, we say that the interval (θ̂ − e, θ̂ + e) is an α · 100% confidence interval for θ (cf. fig. 1).
Usually, α is a value near 1, such as 0.9, 0.95, 0.99, 0.999, etc.
Note:
• For any given set of values x1, ..., xn the value of θ̂ is fixed, as is the interval (θ̂ − e, θ̂ + e).
• The true value θ is either within the confidence interval or not.
[Figure 1: The probability that x̄ falls into an e-interval around µ is α; the probability in each tail outside (µ − e, µ + e) is at most (1 − α)/2.]
Vice versa, we know that for all of those x̄, µ is within an e-interval around x̄. That is the idea of a confidence interval.
!!DON'T DO!! A lot of people are tempted to reformulate the above probability as
P(θ̂ − e < θ < θ̂ + e) > α.
Though it looks OK, it's not. Repeat: IT IS NOT OK. θ is a fixed value; therefore it does not have a probability of falling into some interval. The only probability that we have here is P(θ − e < θ̂ < θ + e) > α; we can therefore say that θ̂ has a probability of at least α of falling into an e-interval around θ. Unfortunately, that alone doesn't help, since we do not know θ!
How do we compute confidence intervals, then? That is different for each estimator. First, we look at estimates of the mean of a distribution.

0.3.1 Large-sample C.I. for µ
Situation: we have a large set of observed values (n > 30, usually). The assumption is that these values are realizations of n i.i.d. random variables X1, ..., Xn with E[Xi] = µ and Var[Xi] = σ².
We already know from the previous section that X̄ is an unbiased ML estimator for µ. But we know more! The CLT tells us that in exactly this situation X̄ is an approximately normally distributed random variable with E[X̄] = µ and Var[X̄] = σ²/n. We can therefore find the boundary e by using the standard normal distribution. Remember: if X̄ ~ N(µ, σ²/n), then Z := (X̄ − µ)/(σ/√n) ~ N(0, 1), with cdf Φ:

P(|X̄ − µ| ≤ e) ≥ α
⟺ P(|X̄ − µ|/(σ/√n) ≤ e/(σ/√n)) ≥ α   (standardization)
⟺ P(|Z| ≤ e/(σ/√n)) ≥ α
⟺ P(−e/(σ/√n) < Z < e/(σ/√n)) ≥ α
⟺ Φ(e/(σ/√n)) − Φ(−e/(σ/√n)) ≥ α
⟺ Φ(e/(σ/√n)) − (1 − Φ(e/(σ/√n))) ≥ α
⟺ 2Φ(e/(σ/√n)) − 1 ≥ α
⟺ Φ(e/(σ/√n)) ≥ (1 + α)/2
⟺ e/(σ/√n) ≥ Φ⁻¹((1 + α)/2)
⟺ e ≥ Φ⁻¹((1 + α)/2) · σ/√n,   where we set z := Φ⁻¹((1 + α)/2).

This computation gives an α · 100% confidence interval around µ as
(X̄ − z · σ/√n, X̄ + z · σ/√n).
Now we can do an example:
Example 0.3.1 Suppose we want to find a 95% confidence interval for the mean salary of an ISU employee.
A random sample of 100 ISU employees gives us a sample mean salary of x̄ = $21543. Suppose the standard deviation of salaries is known to be $3000. By using the above expression, we get a 95% confidence interval as
21543 ± Φ⁻¹((1 + 0.95)/2) · 3000/√100 = 21543 ± Φ⁻¹(0.975) · 300.
How do we read Φ⁻¹(0.975) from the standard normal table? We look for the z for which Φ(z) ≥ 0.975. This gives us z = 1.96; the 95% confidence interval is then 21543 ± 588, i.e. if we repeat this study 100 times (with 100 different employees each time), we can say: in about 95 out of the 100 studies, the true parameter µ falls into a $588 range around x̄.
Critical values for z, depending on α, are:

α:                 0.90  0.95  0.98  0.99
z = Φ⁻¹((1+α)/2):  1.65  1.96  2.33  2.58

Problem: usually we do not know σ.
Slight generalization: use s = √((1/(n−1)) Σᵢ₌₁ⁿ (Xi − X̄)²) instead of σ! An α · 100% confidence interval for µ is then given as
(X̄ − z · s/√n, X̄ + z · s/√n),
where z = Φ⁻¹((1+α)/2).

Example 0.3.2 Suppose we want to analyze some complicated queueing system for which we have no formulas and theory. We are interested in the mean queue length of the system after reaching steady state. The only thing possible for us is to run simulations of this system and look at the queue length at some large time t, e.g. t = 1000 hrs. After 50 simulations, we have the data:
X1 = number in queue at time 1000 hrs in 1st simulation
X2 = number in queue at time 1000 hrs in 2nd simulation
...
X50 = number in queue at time 1000 hrs in 50th simulation
Our observations yield an average queue length of x̄ = 21.5 and s = √((1/(n−1)) Σᵢ₌₁ⁿ (xi − x̄)²) = 15. A 90% confidence interval is given as
(x̄ − z · s/√n, x̄ + z · s/√n) = (21.5 − 1.65 · 15/√50, 21.5 + 1.65 · 15/√50) = (17.9998, 25.0002).

Example 0.3.3 The graphs show a set of 80 experiments. The values from each experiment are shown in one of the green framed boxes.
Each experiment consists of simulating 20 values from a standard normal distribution (these are drawn as the small blue lines). For each of the experiments, the average of the 20 values is computed (that's x̄) as well as a confidence interval for µ: for parts a) and b) it is the 95% confidence interval, for part c) it is the 90% confidence interval, and for part d) it is the 99% confidence interval. The upper and lower confidence bounds, together with the sample mean, are drawn in red next to the sampled observations.
[Four panels omitted: a) 95% confidence intervals, b) 95% confidence intervals, c) 90% confidence intervals, d) 99% confidence intervals.]
There are several things to see from this diagram. First of all, in this example we know the "true" value of the parameter µ: since the observations are sampled from a standard normal distribution, µ = 0. The true parameter is represented by the straight horizontal line through 0.
We see that each sample yields a different confidence interval, all of them centered around the sample mean. The different sizes of the intervals tell us another thing: in computing these confidence intervals, we had to use the estimate s instead of the true standard deviation σ = 1, and each sample gave a slightly different standard deviation. Overall, though, the intervals do not differ much in length between parts a) and b). The intervals in c) tend to be slightly smaller; these are 90% confidence intervals. The intervals in part d) are on average larger than the first ones; they are 99% confidence intervals.
Almost all the confidence intervals contain 0, but not all. And that is what we expect: for a 90% confidence interval we expect that in 10 out of 100 times the confidence interval does not contain the true parameter.
When we check that, we see that in part c) 4 out of the 20 confidence intervals don't contain the true parameter µ. That's 20%; on average we would expect 10% of the confidence intervals not to contain µ.
Official use of confidence intervals: on average, in 90 out of 100 times the 90% confidence interval of θ does contain the true value of θ.

0.3.2 Large-sample confidence intervals for a proportion p
Let p be a proportion of a large population, or a probability. In order to get an estimate for this proportion, we can take a sample of n individuals from the population and check each one of them for whether or not they fulfill the criterion to be in the proportion of interest. Mathematically, this corresponds to a Bernoulli-n-sequence, where we are only interested in the number of "successes" X, which in our case corresponds to the number of individuals that qualify for the interesting subgroup. X then has a binomial distribution with parameters n and p.
Now think: for a binomial variable X, the expected value is E[X] = n · p. Therefore we get an estimate p̂ for p as p̂ = X/n. Furthermore, we even have a distribution for p̂ for large n: since X is, by the CLT, approximately a normal variable with E[X] = np and Var[X] = np(1 − p), we get that for large n, p̂ is approximately normally distributed with E[p̂] = p and Var[p̂] = p(1 − p)/n.
BTW: this tells us that p̂ is an unbiased estimator of p.
Prepared with the distribution of p̂, we can set up an α · 100% confidence interval as (p̂ − e, p̂ + e), where e is some positive real number with P(|p̂ − p| ≤ e) ≥ α. We can derive the expression for e in the same way as in the previous section, and we come up with
e = z · √(p(1 − p)/n),
where z = Φ⁻¹((1+α)/2). We again run into the problem that e in this form is not ready for use, since we do not know the value of p. In this situation, we have different options.
We can either replace p(1 − p) by a value that is guaranteed to be at least as large, or we can substitute an appropriate value for p.

0.3.2.1 Conservative Method: replace p(1 − p) by something that is guaranteed to be at least as large. The function p(1 − p) has its maximum at p = 0.5, where p(1 − p) = 0.25. The conservative α · 100% confidence interval for p is
p̂ ± z · 1/(2√n),
where z = Φ⁻¹((1+α)/2).

0.3.2.2 Substitution Method: substitute p̂ for p. The α · 100% confidence interval for p by substitution is
p̂ ± z · √(p̂(1 − p̂)/n),
where z = Φ⁻¹((1+α)/2).

Where is the difference between the two methods?
• for large n there is almost no difference at all
• if p̂ is close to 0.5, there is also almost no difference
Apart from that, conservative confidence intervals are (as the name says) larger than confidence intervals found by substitution. However, they are at the same time easier to compute.

Example 0.3.4 Complicated queueing system, continued
Suppose that now we are interested in the large-t probability p that a server is available. Doing 100 simulations has shown that in 60 of them a server was available at time t = 1000 hrs. What is a 95% confidence interval for this probability?
If 60 out of 100 simulations showed a free server, we can use p̂ = 60/100 = 0.6 as an estimate for p. For a 95% confidence interval, z = Φ⁻¹(0.975) = 1.96. The conservative confidence interval is
p̂ ± z · 1/(2√n) = 0.6 ± 1.96 · 1/(2√100) = 0.6 ± 0.098.
For the confidence interval using substitution we get
p̂ ± z · √(p̂(1 − p̂)/n) = 0.6 ± 1.96 · √(0.6 · 0.4/100) = 0.6 ± 0.096.

Example 0.3.5 Batting Average
In the 2002 season the baseball player Sammy Sosa had a batting average of 0.288. (The batting average is the ratio of the number of hits to the times at bat.) Sammy Sosa was at bat 555 times in the 2002 season. Could the "true" batting average still be 0.300? Compute a 95% confidence interval for the true batting average.
The conservative method gives:
0.288 ± 1.96 · 1/(2√555) = 0.288 ± 0.042.
The substitution method gives:
0.288 ± 1.96 · √(0.288(1 − 0.288)/555) = 0.288 ± 0.038.
The substitution method gives a slightly smaller confidence interval, but both intervals contain 0.3. There is not enough evidence to conclude that the true average is not 0.3.
Confidence intervals give us a way to measure the precision we get from simulations intended to evaluate probabilities. Besides that, they also give us a way to plan how large a sample size has to be to get a desired precision.

Example 0.3.6 Suppose we want to estimate the fraction of records in the 2000 IRS database that have a taxable income over $35K. We want to get a 98% confidence interval and wish to estimate the quantity to within 0.01. This means that our boundary e needs to be at most 0.01 (we choose a conservative confidence interval for ease of computation):
e ≤ 0.01
⟺ z · 1/(2√n) ≤ 0.01   (for α = 0.98, z is 2.33)
⟺ 2.33 · 1/(2√n) ≤ 0.01
⟺ √n ≥ 2.33/(2 · 0.01) = 116.5
⟹ n ≥ 13573.

0.3.3 Related C.I. Methods
Related to the previous confidence intervals are confidence intervals for the difference between two means, µ1 − µ2, or the difference between two proportions, p1 − p2. Confidence intervals for these differences are given as follows.
Large-n confidence interval for µ1 − µ2 (based on independent X̄1 and X̄2):
X̄1 − X̄2 ± z · √(s1²/n1 + s2²/n2).
Large-n confidence interval for p1 − p2 (based on independent p̂1 and p̂2):
p̂1 − p̂2 ± z · (1/2) · √(1/n1 + 1/n2)   (conservative), or
p̂1 − p̂2 ± z · √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)   (substitution).
Why? The argumentation in both cases is very similar; we will only discuss the confidence interval for the difference between means. X̄1 − X̄2 is approximately normal, since X̄1 and X̄2 are approximately normal, with (X̄1, X̄2 independent)
E[X̄1 − X̄2] = E[X̄1] − E[X̄2] = µ1 − µ2,
Var[X̄1 − X̄2] = Var[X̄1] + (−1)² Var[X̄2] = σ1²/n1 + σ2²/n2.
Then we can use the same arguments as before and get a C.I. for µ1 − µ2 as shown above.
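The interval formulas for differences translate directly into code. Here is a sketch (function names and the sample figures are my own, purely for illustration) of the large-n interval for µ1 − µ2 and the substitution interval for p1 − p2:

```python
import math

def ci_mean_diff(xbar1, s1, n1, xbar2, s2, n2, z):
    """Large-n CI for mu1 - mu2: (xbar1 - xbar2) +/- z*sqrt(s1^2/n1 + s2^2/n2)."""
    d = xbar1 - xbar2
    e = z * math.sqrt(s1**2 / n1 + s2**2 / n2)
    return (d - e, d + e)

def ci_prop_diff(p1, n1, p2, n2, z):
    """Large-n substitution CI for p1 - p2."""
    d = p1 - p2
    e = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (d - e, d + e)

# Hypothetical samples; z = 1.96 for a 95% interval:
lo, hi = ci_mean_diff(100, 15, 50, 110, 20, 60, 1.96)
print(lo, hi)  # an interval of negative numbers around -10
```

If the printed interval contains only negative numbers, as here, the data point fairly clearly at µ1 < µ2; an interval straddling 0 would leave even the sign of the difference open.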
Example 0.3.7 Assume we have two parts of the IRS database: East Coast and West Coast. We want to compare the mean taxable income reported from the two regions in 2000.

                       East Coast     West Coast
# of sampled records:  n1 = 1000      n2 = 2000
mean taxable income:   x̄1 = $37200   x̄2 = $42000
standard deviation:    s1 = $10100    s2 = $15600

We can, for example, compute a two-sided 95% confidence interval for µ1 − µ2 = difference in mean taxable income as reported on 2000 tax returns between East and West Coast:
37200 − 42000 ± 1.96 · √(10100²/1000 + 15600²/2000) = −4800 ± 927.
Note: this shows pretty conclusively that the mean West Coast taxable income is higher than the mean East Coast taxable income (in the reports from 2000). The interval contains only negative numbers; if it contained 0, the message wouldn't be so clear.

One-sided intervals
Idea: use only one of the end points x̄ ± z · s/√n. This yields confidence intervals for µ of the form (#, ∞) (lower bound) or (−∞, #) (upper bound). However, now we need to adjust z to the new situation: instead of worrying about two tails of the normal distribution, we use only one tail for a one-sided confidence interval.
[Figure 2: One-sided (upper-bounded) confidence interval for µ (in red): P(µ < x̄ + e) ≥ α, with probability at most 1 − α beyond x̄ + e.]

Example 0.3.8 Complicated queueing system, continued
What is a 95% upper confidence bound for µ, the parameter for the length of the queue?
x̄ + z · s/√n is the upper confidence bound. Instead of z = Φ⁻¹((α+1)/2) we use z = Φ⁻¹(α) (see fig. 2). This gives 21.5 + 1.65 · 15/√50 = 25.0 as the upper confidence bound. Therefore the one-sided, upper-bounded confidence interval is (−∞, 25.0).
Critical values z = Φ⁻¹(α) for one-sided confidence intervals are:

α:           0.90  0.95  0.98  0.99
z = Φ⁻¹(α):  1.29  1.65  2.06  2.33

Example 0.3.9 Two different digital communication systems send 100 large messages via each system, and we determine how many are corrupted in transmission.
p̂1 = 0.05 and p̂2 = 0.10. What is the difference in the corruption rates? Find a 98% confidence interval, using substitution:
0.05 − 0.10 ± 2.33 · √(0.05 · 0.95/100 + 0.10 · 0.90/100) = −0.05 ± 0.086.
This calculation tells us that, based on these sample sizes, we don't even have a solid idea about the sign of p1 − p2, i.e. we can't tell which of the pi is larger.
So far, we have only considered large-sample confidence intervals. The problem with smaller sample sizes is that the normal approximation by the CLT doesn't work if the variance σ² is unknown. What you need to know is that there exist different methods to compute C.I.s for smaller sample sizes.

0.4 Hypothesis Testing

Example 0.4.1 Tea-Tasting Lady
It is claimed that a certain lady is able to tell, by tasting a cup of tea with milk, whether the milk was put in first or the tea was put in first. To put the claim to the test, the lady is given 10 cups of tea to taste and is asked to state in each case whether the milk went in first or the tea went in first. To guard against deliberate or accidental communication of information, before pouring each cup of tea a coin is tossed to decide whether the milk goes in first or the tea goes in first. The person who brings the cup of tea to the lady does not know the outcome of the coin toss.
Either the lady has some skill (she can tell to some extent the difference) or she has not, in which case she is simply guessing. Suppose the lady tested 10 cups of tea in this manner and got 9 of them right. This looks rather suspicious; the lady seems to have some skill. But how can we check it?
We start with the sceptical assumption that the lady does not have any skill. If the lady has no skill at all, the probability that she gives a correct answer for any single cup of tea is 1/2. The number of cups she gets right therefore has a binomial distribution with parameters n = 10 and p = 0.5.
[Diagram omitted: probability mass function p(x) of the B_{10,0.5} distribution, with the observed value x = 9 marked.]
Events that are as unlikely as or less likely than the observed one are that the lady got all 10 cups right, or (very different, but nevertheless very rare) that she got only 1 cup or none right (note: this would be evidence of some "anti-skill", but it would certainly be evidence against her guessing). The total probability of these events is (remember, the binomial probability mass function is p(x) = (n choose x) · pˣ(1 − p)^{n−x}):
p(0) + p(1) + p(9) + p(10) = 0.5¹⁰ + 10 · 0.5¹⁰ + 10 · 0.5¹⁰ + 0.5¹⁰ = 0.021,
i.e. what we have just observed is a fairly rare event under the assumption that the lady is only guessing. This suggests that the lady may have some skill in detecting which was poured first into the cup.
Jargon: 0.021 is called the p-value for testing the hypothesis p = 0.5. The fact that the p-value is small is evidence against the hypothesis.
Hypothesis testing is a formal procedure to check whether or not some previously made assumption can be rejected based on the data. We are going to abstract the main elements of the previous example and set up a standard series of steps for hypothesis testing:

Example 0.4.2 University CC administrators have historical records indicating that between August and October 2002 the mean time between hits on the ISU homepage was 2 min. They suspect that in fact the mean time between hits has decreased (i.e. traffic is up). Sampling 50 inter-arrival times from records for November 2002 gives x̄ = 1.7 min and s = 1.9 min. Is this strong evidence for an increase in traffic?
Formal procedure, with the application to the example:

1. State a "null hypothesis" of the form
   H0: function of parameter(s) = #
   meant to embody a status quo / pre-data view.
   In the example: H0: µ = 2.0 min between hits.

2. State an "alternative hypothesis" of the form
   Ha: function of parameter(s) >, < or ≠ #
   meant to identify a departure from H0.
   In the example: Ha: µ < 2 (traffic is up).

3. State the test criteria. These consist of a test statistic, a "reference distribution" giving the behavior of the test statistic if H0 is true, and the kinds of values of the test statistic that count as evidence against H0.
   In the example: the test statistic will be Z = (X̄ − 2.0)/(s/√n); the reference distribution is the standard normal; large negative values of Z count as evidence against H0 in favor of Ha.

4. Show computations.
   In the example: the sample gives z = (1.7 − 2.0)/(1.9/√50) = −1.12.

5. Report and interpret a p-value, the "observed level of significance with which H0 can be rejected". This is the probability of a value of the test statistic at least as extreme as the one at hand; the smaller this value, the stronger the evidence against H0.
   In the example: the p-value is P(Z ≤ −1.12) = Φ(−1.12) = 0.1314. This value is not terribly small; the evidence of a decrease in the mean time between hits is somewhat weak.

Note aside: a 90% confidence interval for µ is

x̄ ± 1.65 · s/√n = 1.7 ± 0.44

This interval contains the hypothesized value µ = 2.0.

There are four basic hypothesis tests of this form, testing a mean, a proportion, or the difference between two means or two proportions. Depending on the hypothesis, the test statistic differs. Here is an overview of the tests we are going to use:

Hypothesis          Test statistic                                   Reference distribution
H0: µ = #           Z = (X̄ − #)/(s/√n)                               standard normal
H0: p = #           Z = (p̂ − #)/√(#(1 − #)/n)                        standard normal
H0: µ1 − µ2 = #     Z = (X̄1 − X̄2 − #)/√(s1²/n1 + s2²/n2)            standard normal
H0: p1 − p2 = #     Z = (p̂1 − p̂2 − #)/√(p̂(1 − p̂)·(1/n1 + 1/n2))     standard normal

where p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2).
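The five-step calculation for the ISU homepage example can be checked in a few lines of Python (a sketch using only the standard library; the standard normal CDF Φ is obtained from math.erf):

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z), computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# ISU homepage example: H0: mu = 2.0 vs Ha: mu < 2.0 (one-sided, lower tail)
xbar, s, n, mu0 = 1.7, 1.9, 50, 2.0
z = (xbar - mu0) / (s / math.sqrt(n))
p_value = phi(z)          # P(Z <= z), since large negative z counts against H0
print(round(z, 2))        # -1.12
print(round(p_value, 4))  # about 0.132, matching 0.1314 from the table up to rounding
```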
Example 0.4.3 Tax fraud
Historically, IRS taxpayer compliance audits have revealed that about 5% of individuals do things on their tax returns that invite criminal prosecution. A sample of n = 1000 tax returns produces p̂ = 0.061 as an estimate of the fraction of fraudulent returns. Does this provide a clear signal of a change in taxpayer behavior?

1. State the null hypothesis: H0: p = 0.05
2. Alternative hypothesis: Ha: p ≠ 0.05
3. Test statistic: Z = (p̂ − 0.05)/√(0.05 · 0.95/n). Under the null hypothesis, Z has a standard normal distribution; large values of Z, positive or negative, count as evidence against H0.
4. Computation: z = (0.061 − 0.05)/√(0.05 · 0.95/1000) = 1.59
5. p-value: P(|Z| ≥ 1.59) = P(Z ≤ −1.59) + P(Z ≥ 1.59) = 0.11

This is not a very small value; we therefore have only very weak evidence against H0.

Example 0.4.4 Lifetime of disk drives
n1 = 30 and n2 = 40 disk drives of two different designs were tested under conditions of "accelerated" stress and the times to failure recorded:

                  Standard Design    New Design
sample size       n1 = 30            n2 = 40
mean              x̄1 = 1205 hr       x̄2 = 1400 hr
std. deviation    s1 = 1000 hr       s2 = 900 hr

Does this provide conclusive evidence that the new design has a larger mean time to failure under "accelerated" stress conditions?

1. State the null hypothesis: H0: µ1 = µ2 (µ1 − µ2 = 0)
2. Alternative hypothesis: Ha: µ1 < µ2 (µ1 − µ2 < 0)
3. The test statistic is Z = (x̄1 − x̄2 − 0)/√(s1²/n1 + s2²/n2). Under the null hypothesis, Z has a standard normal distribution; we will consider large negative values of Z as evidence against H0.
4. Computation: z = (1205 − 1400 − 0)/√(1000²/30 + 900²/40) = −0.84
5. p-value: P(Z < −0.84) = 0.2005

This is not a very small value; we therefore have only very weak evidence against H0.

Example 0.4.5 Queueing systems
Consider two very complicated queueing systems. We'd like to know whether there is a difference in the large-t probabilities of there being an available server.
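The disk-drive comparison follows the same pattern; here is a short Python check of steps 3 to 5 (a sketch, standard library only):

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# H0: mu1 - mu2 = 0 vs Ha: mu1 - mu2 < 0
n1, x1, s1 = 30, 1205.0, 1000.0   # standard design
n2, x2, s2 = 40, 1400.0, 900.0    # new design
z = (x1 - x2 - 0) / math.sqrt(s1**2 / n1 + s2**2 / n2)
p_value = phi(z)                  # one-sided, lower-tail p-value
print(round(z, 2), round(p_value, 4))   # z = -0.84, p about 0.20
```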
We run simulations of each system and check whether at time t = 2000 a server is available:

                                      System 1               System 2
number of runs (each with a
different random seed)                n1 = 1000              n2 = 500
server available at t = 2000?         551 (p̂1 = 551/1000)   303 (p̂2 = 303/500)

How strong is the evidence of a difference between the t = 2000 availability of a server for the two systems?

1. State the null hypothesis: H0: p1 = p2 (p1 − p2 = 0)
2. Alternative hypothesis: Ha: p1 ≠ p2 (p1 − p2 ≠ 0)
3. Preliminary: note that if there were no difference between the two systems, a plausible estimate of the availability of a server would be the pooled estimate

   p̂ = (n1 p̂1 + n2 p̂2)/(n1 + n2) = (551 + 303)/(1000 + 500) = 0.569

   The test statistic is

   Z = (p̂1 − p̂2 − 0)/√(p̂(1 − p̂) · (1/n1 + 1/n2))

   Under the null hypothesis, Z has a standard normal distribution; we will consider large values of Z, positive or negative, as evidence against H0.
4. Computation: z = (0.551 − 0.606)/√(0.569 · (1 − 0.569) · (1/1000 + 1/500)) = −2.03
5. p-value: P(|Z| > 2.03) = 0.04

This is fairly strong evidence of a real difference in the t = 2000 availability of a server between the two systems.

0.5 Goodness of Fit Tests

The basic situation is still the same as in the previous section: we have n realizations x1, . . . , xn (observed data) of independent, identically distributed random variables X1, . . . , Xn. A goodness-of-fit test is different from the previous ones: here we don't test a single parameter, but the whole distribution underlying our observations. Basically, the null hypothesis will be H0: the data follow a specified distribution F, vs. Ha: the data do not follow the specified distribution. There are different approaches to this problem depending on whether the specified distribution is continuous or discrete. For simplicity, we will only consider the case of a finite discrete distribution, i.e. we are dealing with a finite sample space Ω = {1, . . . , k} on which we have a probability mass function p.
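As a quick computational check of the queueing-system comparison in Example 0.4.5, the following Python sketch (standard library only) reproduces the pooled estimate, the test statistic and the two-sided p-value:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# H0: p1 = p2 vs Ha: p1 != p2, using the pooled estimate p_hat under H0
n1, n2 = 1000, 500
p1_hat, p2_hat = 551 / n1, 303 / n2
p_hat = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)   # (551 + 303) / 1500 = 0.569...
z = (p1_hat - p2_hat - 0) / math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
p_value = 2 * phi(-abs(z))                        # two-sided p-value
print(round(z, 2), round(p_value, 2))             # -2.03 0.04
```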
The above null hypothesis then becomes

H0: pX(i) = p(i) for i = 1, . . . , k
vs. Ha: pX(i) ≠ p(i) for at least one i ∈ {1, . . . , k}

Example 0.5.1 M&Ms
The web page of the M&M/Mars company, www.m-ms.com, gives the percentages of each color in a bag of peanut M&Ms: 20% each for brown, yellow, red, and blue, and 10% each for orange and green. These percentages form a probability mass function for the colors in an M&Ms bag. A count for two different bags gave the following numbers for each color:

bag          brown  yellow  red   blue  orange  green   sum
129 GM 12     43     28     26    25     33      24     179
129 GM 22     40     38     36    20     24      16     174

How do we check whether these numbers come from the distribution given on the web site? To get an answer to that question, we need to think about what kind of results we expected to get. Think: for each color we have a certain probability pcolor of drawing an M&M of that color out of the bag, vs. 1 − pcolor for a different color. We can therefore think of the number of M&Ms of each color as a random variable with a Binomial distribution; Nbr, the number of brown M&Ms, has parameters n and pbr. Under the null hypothesis

H0: pbr = pr = pye = pbl = 0.2, por = pgr = 0.1
vs. Ha: at least one of the pi is different from the above specification,

Nbr has a Bn,0.2 distribution. For the first bag we therefore expect Nbr to be 0.2 · 179 = 35.8. In the same manner we can compute the expected values for all the other colors in each bag:

bag          brown  yellow  red   blue  orange  green
129 GM 12    35.8   35.8   35.8   35.8   17.9   17.9
129 GM 22    34.8   34.8   34.8   34.8   17.4   17.4

Now we need a test statistic that measures the difference between what we observed and what we expected. As a test statistic we will use

Q = Σ_{j=1}^{k} (obsj − expj)² / expj

where obsj is the number of times j is observed among the xi, i = 1, . . . , n, and expj = n · p(j) is the expected number of js.

Theorem 0.5.1 Under H0, the test statistic Q defined as above has (approximately) a χ² distribution with k − 1 degrees of freedom.
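The expected counts and the Q statistic for the two bags can be computed directly; a Python sketch (standard library only, with the counts from the tables above):

```python
# Hypothesized color distribution and observed counts from the example
probs = {"brown": 0.2, "yellow": 0.2, "red": 0.2, "blue": 0.2,
         "orange": 0.1, "green": 0.1}
bags = {
    "129 GM 12": {"brown": 43, "yellow": 28, "red": 26, "blue": 25,
                  "orange": 33, "green": 24},
    "129 GM 22": {"brown": 40, "yellow": 38, "red": 36, "blue": 20,
                  "orange": 24, "green": 16},
}

qs = {}
for name, counts in bags.items():
    n = sum(counts.values())              # 179 and 174 M&Ms, respectively
    # Q = sum over colors of (observed - expected)^2 / expected, expected = n * p
    qs[name] = sum((counts[c] - n * p) ** 2 / (n * p) for c, p in probs.items())
    print(name, round(qs[name], 2))       # 23.91 and 10.02
```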
In order to use this, we obviously need some more information about the χ² distribution.

The χ² distribution
Given a set of independent standard normal random variables Z1, . . . , Zr, the distribution of their sum of squares

X := Σ_{i=1}^{r} Zi²

is called the χ² distribution with r degrees of freedom. The density function itself is a bit complicated (it is a special case of a Gamma distribution); all we need to know about the distribution at this stage is:

E[X] = r,  Var[X] = 2r,

and that, roughly, P(X ≥ 2(r + 1)) ≤ 0.05. For large r this probability is far smaller than 0.05.

Why does the above test statistic Q have a χ² distribution with k − 1 degrees of freedom? This is difficult to prove, but it is at least plausible: the parts from which Q is put together look almost like squared standard normal variables. Since obsj has a Binomial distribution, for large n we may approximate its distribution by N(np(j), np(j)(1 − p(j))). A standardization of obsj therefore looks like

(obsj − np(j)) / √(np(j)(1 − p(j))) = 1/√(1 − p(j)) · (obsj − expj)/√expj

so each summand of Q is (1 − p(j)) times the square of an approximately standard normal variable.

The degrees of freedom are reduced by one because the random variables are dependent: once we know the counts of five colors in the bag, we get the sixth by subtracting the other counts from the total number of M&Ms in the bag. A more formal justification for the degrees of freedom comes from computing the expected value of Q, using E[X²] = Var[X] + (E[X])²:

E[Q] = Σ_{j=1}^{k} 1/(np(j)) · E[(obsj − np(j))²]
     = Σ_{j=1}^{k} 1/(np(j)) · (Var[obsj − np(j)] + (E[obsj − np(j)])²)
     = Σ_{j=1}^{k} 1/(np(j)) · Var[obsj]            (since E[obsj − np(j)] = 0)
     = Σ_{j=1}^{k} 1/(np(j)) · np(j)(1 − p(j))
     = Σ_{j=1}^{k} (1 − p(j)) = k − Σ_{j=1}^{k} p(j) = k − 1.

Now that we have a reference distribution for the test, we need to identify which values count as evidence against H0.
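The facts E[X] = r, Var[X] = 2r and the rough tail bound are easy to check by simulation (a Python sketch; 100,000 draws with a fixed seed so the run is reproducible):

```python
import random

random.seed(0)                     # fixed seed for reproducibility
r = 5                              # degrees of freedom (k - 1 = 5 in the M&M example)
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(r)) for _ in range(100_000)]

mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
tail = sum(x >= 2 * (r + 1) for x in draws) / len(draws)
print(round(mean, 2), round(var, 1), round(tail, 3))
# mean close to r = 5, variance close to 2r = 10, tail probability below 0.05
```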
Since we have squared the differences between expected and observed values, only large positive values of Q can count as evidence against H0. Now let's get back to our example:

Example 0.5.2 M&Ms, continued
The value of Q under the above null hypothesis about the color distribution is 23.91 for the first bag (129 GM 12) and 10.02 for the second bag (129 GM 22). The p-values for these results are 0.00023 and 0.075, respectively. For the first bag it is highly unlikely that the M&Ms have the color distribution posted on the web site; for the second bag we can't reject the null hypothesis with quite the same vigor, though the p-value is still fairly small. Maybe there is something wrong with the filling routine at M&Ms' . . .

We can also look at which colors contribute most to the Q statistic. The unsquared contributions (obsj − expj)/√expj are called the residuals:

bag          brown  yellow   red    blue  orange  green
129 GM 12     1.20  −1.30   −1.64  −1.81   3.57    1.44
129 GM 22     0.88   0.54    0.20  −2.51   1.58   −0.34

The largest residuals are too many orange M&Ms in the first bag and too few blue in the second. If we combine the results from the two bags, the result for Q is even more extreme:

color        brown  yellow   red    blue  orange  green   sum
# in bags     83     66      62     45     57      40     353
expected     70.6   70.6    70.6   70.6   35.3    35.3    353
residuals    1.48  −0.55   −1.02  −3.05   3.65    0.79    Q = 26.77

Here the two largest residuals are for the blue and orange M&Ms. It seems as if the missing blue M&Ms have been replaced by orange ones.
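The residuals and the combined-bag statistic can be reproduced as follows (a Python sketch, standard library only; the counts are the two bags above pooled together):

```python
import math

colors = ["brown", "yellow", "red", "blue", "orange", "green"]
probs = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]
combined = [83, 66, 62, 45, 57, 40]      # counts from both bags together, 353 M&Ms

n = sum(combined)
expected = [n * p for p in probs]
# residual for color j: (observed - expected) / sqrt(expected)
residuals = [(o - e) / math.sqrt(e) for o, e in zip(combined, expected)]
q = sum(res ** 2 for res in residuals)   # Q is the sum of squared residuals

print([round(res, 2) for res in residuals])   # blue (-3.05) and orange (3.65) stand out
print(round(q, 2))                            # 26.77
```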