15.Math-Review Statistics

Central Limit Theorem

Let us consider \(X_1, X_2, \ldots, X_n\), \(n\) independent, identically distributed random variables with mean \(\mu\) and standard deviation \(\sigma\), and define:

\[ S_n = \sum_{i=1}^{n} X_i \]

The Central Limit Theorem (CLT) states:

If \(n\) is large (say \(n \ge 30\)), then \(S_n\) follows approximately a normal distribution with mean \(n\mu\) and standard deviation \(\sigma\sqrt{n}\).

If \(n\) is large (say \(n \ge 30\)), then \(\frac{1}{n} S_n\) follows approximately a normal distribution with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).

Example: sums of a Bernoulli random variable

[Figure: four frequency charts of the sum, 10,000 trials each, for n = 1, n = 10, n = 30 and n = 50.]

Example: averages of a Bernoulli random variable

[Figure: four frequency charts of the mean, 10,000 trials each, for n = 1, n = 10, n = 30 and n = 50.]

Example: Compare a binomial random variable \(X \sim B(40, 0.2)\) with its normal approximation. What is the normal approximation? Compare \(P(X \le 10)\), \(P(X \le 20)\) and \(P(X \le 30)\) for the binomial and the normal approximation.
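One way to carry out the comparison asked for in this example is sketched below in Python. It assumes the normal approximation uses the binomial mean \(np\) and standard deviation \(\sqrt{np(1-p)}\), with no continuity correction; the helper names `binom_cdf` and `normal_cdf` are illustrative and do not come from the slides.

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """P(Z <= x) for Z ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

n, p = 40, 0.2
mu = n * p                      # mean of the binomial
sigma = sqrt(n * p * (1 - p))   # standard deviation of the binomial

for k in (5, 10, 20, 30):
    exact = binom_cdf(k, n, p)
    approx = normal_cdf(k, mu, sigma)
    print(f"k={k:2d}  binomial={exact:.5f}  normal={approx:.5f}")
```

Averaging \(P(X \le k)\) and \(P(X < k)\) before comparing would be one simple way to account for the discreteness of the binomial.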
                        k = 5     k = 10    k = 20    k = 30
Binomial, P(X <= k)     0.16133   0.83923   0.99999   1.00000
Binomial, P(X < k)      0.07591   0.73178   0.99998   1.00000
Average of the two      0.11862   0.78550   0.99999   1.00000
Normal approximation    0.11784   0.78540   1.00000   1.00000

Sampling

Let us consider the following example. We work at a phone company and would like to estimate the shape of the demand. We assume that monthly household telephone bills follow a certain (continuous) probability distribution. We have obtained the following data on monthly household telephone bills by interviewing 70 randomly chosen households (or rather their inhabitants) for the month of October.

Table: October phone bills of the 70 respondents

Respondent  October       Respondent  October       Respondent  October
Number      Phone Bill    Number      Phone Bill    Number      Phone Bill
 1          $ 95.67       25          $ 79.32       49          $ 90.02
 2          $ 82.69       26          $ 89.12       50          $ 61.06
 3          $ 75.27       27          $ 63.12       51          $ 51.00
 4          $145.20       28          $145.62       52          $ 97.71
 5          $155.20       29          $ 37.53       53          $ 95.44
 6          $ 80.53       30          $ 97.06       54          $ 31.89
 7          $ 80.81       31          $ 86.33       55          $ 82.35
 8          $ 60.93       32          $ 69.83       56          $ 60.20
 9          $ 86.67       33          $ 77.26       57          $ 92.28
10          $ 56.31       34          $ 64.99       58          $120.89
11          $151.27       35          $ 57.78       59          $ 35.09
12          $ 96.93       36          $ 61.82       60          $ 69.53
13          $ 65.60       37          $ 74.07       61          $ 49.85
14          $ 53.43       38          $141.17       62          $ 42.33
15          $ 63.03       39          $ 48.57       63          $ 50.09
16          $139.45       40          $ 76.77       64          $ 62.69
17          $ 58.51       41          $ 78.78       65          $ 58.69
18          $ 81.22       42          $ 62.20       66          $127.82
19          $ 98.14       43          $ 80.78       67          $ 62.47
20          $ 79.75       44          $ 84.51       68          $ 79.25
21          $ 72.74       45          $ 93.38       69          $ 76.53
22          $ 75.99       46          $139.23       70          $ 74.13
23          $ 80.35       47          $ 48.06
24          $ 49.42       48          $ 44.51

From this information we would like to be able to estimate, for example:
What is an estimate of the shape of the distribution of October household telephone bills?
What is an estimate of the percentage of households whose October telephone bill is below $45.00?
What is an estimate of the percentage of households whose October telephone bill is between $60.00 and $100.00?
What is an estimate of the mean of the distribution of October household telephone bills?
What is an estimate of the standard deviation of the distribution of October household telephone bills?

A population (or “universe”) is the set of all units of interest.
A sample is a subset of the units of a population.
A random sample is a sample collected in such a way that every unit in the population is equally likely to be selected.
It is hard to ensure that a sample will be random.

In our example the population corresponds to all the households in our area of coverage. The random sample consisted of the 70 households (or their inhabitants) interviewed. For the random variables \(X_1, X_2, \ldots, X_n\) corresponding to households \(1, 2, \ldots, n\) we observed \(x_1 = \$95.67\), \(x_2 = \$82.69\), ..., \(x_n = \$74.13\). Note that if we had chosen a different random set of households we would have observed a different collection of values.

To fix notation:
\(n\) will be our random sample size.
\(X_1, X_2, \ldots, X_n\) are the random variables of unknown distribution \(f(x)\), which is common to our population and is what we want to study.
\(x_1, x_2, \ldots, x_n\) are the observations obtained by observing the outcome of our random sample. These are numbers!
We try to use these numbers to estimate the characteristics of \(f(x)\): for example, what the distribution is, what its mean and variance are, etc.

To “look” at the shape of the distribution of \(X\) it is useful to create a frequency table and histogram of the sample values \(x_1, x_2, \ldots, x_n\).

Interval        Frequency (number   %         Cumulative %
(Oct. bill, $)  of households)
Under 30         0                   0.00%      0.00%
30-40            3                   4.29%      4.29%
40-50            6                   8.57%     12.86%
50-60            7                  10.00%     22.86%
60-70           13                  18.57%     41.43%
70-80           12                  17.14%     58.57%
80-90           11                  15.71%     74.29%
90-100           9                  12.86%     87.14%
100-110          0                   0.00%     87.14%
110-120          0                   0.00%     87.14%
120-130          2                   2.86%     90.00%
130-140          2                   2.86%     92.86%
140-150          3                   4.29%     97.14%
150-160          2                   2.86%    100.00%
160 and over     0                   0.00%    100.00%

[Figure: Histogram of Sample of October Telephone Bills. Horizontal axis: range for October bill; vertical axis: number of households.]

A histogram can be obtained from Excel. The output looks something like this (left: bins in order; right: bins sorted by frequency):

Bin    Frequency  Cumulative %     Bin    Frequency  Cumulative %
30      0           0.00%          70     13          18.57%
40      3           4.29%          80     12          35.71%
50      6          12.86%          90     11          51.43%
60      7          22.86%          100     9          64.29%
70     13          41.43%          60      7          74.29%
80     12          58.57%          50      6          82.86%
90     11          74.29%          40      3          87.14%
100     9          87.14%          150     3          91.43%
110     0          87.14%          130     2          94.29%
120     0          87.14%          140     2          97.14%
130     2          90.00%          160     2         100.00%
140     2          92.86%          30      0         100.00%
150     3          97.14%          110     0         100.00%
160     2         100.00%          120     0         100.00%
More    0         100.00%          More    0         100.00%

From this analysis we can give the following (qualitative) description of the shape of this distribution:

An estimate of the shape of the distribution of October telephone bills in the site area is that it is shaped like a Normal distribution, with a peak near $65.00, except for a small but significant group in the range between $125.00 and $155.00.
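The frequency counts behind the histogram can be reproduced from the 70 observations. The following is a minimal Python sketch, assuming bins of width $10 that include their lower edge and exclude their upper edge; the slides do not state the edge convention, but no observation falls exactly on a bin edge, so the counts do not depend on it. The names `bills` and `frequency_table` are illustrative.

```python
from collections import Counter

# The 70 October phone bills, in respondent order (1 to 70).
bills = [
    95.67, 82.69, 75.27, 145.20, 155.20, 80.53, 80.81, 60.93, 86.67, 56.31,
    151.27, 96.93, 65.60, 53.43, 63.03, 139.45, 58.51, 81.22, 98.14, 79.75,
    72.74, 75.99, 80.35, 49.42, 79.32, 89.12, 63.12, 145.62, 37.53, 97.06,
    86.33, 69.83, 77.26, 64.99, 57.78, 61.82, 74.07, 141.17, 48.57, 76.77,
    78.78, 62.20, 80.78, 84.51, 93.38, 139.23, 48.06, 44.51, 90.02, 61.06,
    51.00, 97.71, 95.44, 31.89, 82.35, 60.20, 92.28, 120.89, 35.09, 69.53,
    49.85, 42.33, 50.09, 62.69, 58.69, 127.82, 62.47, 79.25, 76.53, 74.13,
]

def frequency_table(values, width=10, start=30, stop=160):
    """Return {bin lower edge: count} for bins [lower, lower + width).

    Values outside [start, stop) are not reported; none of the 70 bills
    fall outside that range.
    """
    counts = Counter(int(v // width) * width for v in values)
    return {lower: counts.get(lower, 0) for lower in range(start, stop, width)}

table = frequency_table(bills)
cumulative = 0
for lower, count in table.items():
    cumulative += count
    share = count / len(bills)
    print(f"{lower:>3}-{lower + 10:<3} {count:>3} {share:7.2%} {cumulative / len(bills):8.2%}")
```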
In order to answer the other relevant questions we can use the original data: count the favorable outcomes and divide by the total number of possible outcomes (70):

\[ P(X \le 45.00) = 5/70 = 0.07 \]
\[ P(60.00 \le X \le 100.00) = 45/70 = 0.64 \]

Here we are approximating the continuous unknown distribution by the discrete distribution given by the outcomes of the sample.

Sample mean, variance and standard deviation

From our observed values \(x_1, x_2, \ldots, x_n\), we can compute:

The observed sample mean,
\[ \bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The observed sample variance,
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The observed sample standard deviation,
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

In our example we have:

The observed sample mean,
\[ \bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{95.67 + 82.69 + \cdots + 74.13}{70} = \$79.40 \]

and the observed sample standard deviation,
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{(95.67 - 79.40)^2 + \cdots + (74.13 - 79.40)^2}{69}} = \$28.79 \]

We will use these observed values to estimate the unknown mean \(\mu\) and standard deviation \(\sigma\) of our unknown underlying distribution. In other words:
\(\bar{x}\) will estimate \(\mu\)
\(s\) will estimate \(\sigma\)

Also note that if we pick a different sample of the population, our observed values will be different. We can define the random variables sample mean and sample standard deviation, of which \(\bar{x}\) and \(s\) are observations.

Before the sample is collected, the random variables \(X_1, X_2, \ldots, X_n\) can be used to define:

The sample mean,
\[ \bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

The sample variance,
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

The sample standard deviation,
\[ S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \]

\(\bar{X}\) and \(S\) are random variables. We distinguish between the sample mean \(\bar{X}\), which is a random variable, and the observed sample mean \(\bar{x}\), which is a number. Similarly, the sample standard deviation \(S\) is a random variable, and the observed sample standard deviation \(s\) is a number.
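The observed quantities for the phone-bill sample can be checked numerically. The following is a minimal Python sketch using the standard library's `statistics` module, whose `stdev` divides by \(n - 1\) as in the formula for \(s\); the variable names are illustrative.

```python
from statistics import mean, stdev

# The 70 October phone bills, in respondent order (1 to 70).
bills = [
    95.67, 82.69, 75.27, 145.20, 155.20, 80.53, 80.81, 60.93, 86.67, 56.31,
    151.27, 96.93, 65.60, 53.43, 63.03, 139.45, 58.51, 81.22, 98.14, 79.75,
    72.74, 75.99, 80.35, 49.42, 79.32, 89.12, 63.12, 145.62, 37.53, 97.06,
    86.33, 69.83, 77.26, 64.99, 57.78, 61.82, 74.07, 141.17, 48.57, 76.77,
    78.78, 62.20, 80.78, 84.51, 93.38, 139.23, 48.06, 44.51, 90.02, 61.06,
    51.00, 97.71, 95.44, 31.89, 82.35, 60.20, 92.28, 120.89, 35.09, 69.53,
    49.85, 42.33, 50.09, 62.69, 58.69, 127.82, 62.47, 79.25, 76.53, 74.13,
]

n = len(bills)
x_bar = mean(bills)   # observed sample mean
s = stdev(bills)      # observed sample standard deviation (n - 1 denominator)

# Empirical proportions: favorable outcomes divided by the sample size.
p_below_45 = sum(x <= 45.00 for x in bills) / n
p_60_to_100 = sum(60.00 <= x <= 100.00 for x in bills) / n

print(f"n = {n}, mean = {x_bar:.2f}, std dev = {s:.2f}")
print(f"P(X <= 45) ~ {p_below_45:.2f}, P(60 <= X <= 100) ~ {p_60_to_100:.2f}")
```

A different random sample of 70 households would give different values of all four quantities, which is the sense in which \(\bar{X}\) and \(S\) are random variables.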
Distribution of \(\bar{X}\)

From the formula that defines the sample mean we see that, according to the CLT, it should follow approximately a normal distribution (if \(n \ge 30\)).
The mean is \(E(\bar{X}) = \mu\).
The standard deviation is \(\sigma_{\bar{X}} = \sigma/\sqrt{n}\).

In summary, writing the variance as the second parameter:
\[ \bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \approx N\!\left(\bar{x}, \frac{s^2}{n}\right) \]

Example: At two different branches of the G-Mart department store, 100 customers were randomly sampled on August 13. At Store 1, the average amount purchased was $41.25 per customer, with a sample standard deviation of $24.00. At Store 2, the average amount purchased was $45.75, with a sample standard deviation of $34.00.
Let X denote the amount of a random purchase by a single customer at Store 1 and let Y denote the amount of a random purchase by a single customer at Store 2. Assuming that X and Y satisfy a joint normal distribution, what is the distribution of X-Y?
What is the probability that the mean of X exceeds the mean of Y?

Example: In the quality control department of our company, knobs are inspected to make sure that they meet quality standards. Since it is not practical to test every knob, we draw a random sample to test. It is essential that our knobs weigh at least 0.45 pounds. If we know that the average weight is less than 0.45 pounds, we stop the production line and reset all the machines. In a day we produce 300,000 knobs, and draw a random sample of 1,000 knobs to test.
If yesterday (Wednesday) the observed sample mean was 0.42 pounds, and the observed sample standard deviation was 0.2, how confident are you that the average weight of knobs is less than 0.45 pounds?
If the average weight of knobs produced is 0.45 pounds, with a standard deviation of 0.2, what is the probability that the average weight of the sample will be 0.42 or lower?
Are these questions the same?
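The second knob question, where the population mean and standard deviation are taken as given, can be worked through with the sampling distribution of the sample mean. The following is a minimal Python sketch under two assumptions that the slides do not state: the CLT normal approximation is used for the sample mean, and the finite size of the day's production (300,000 knobs) is ignored. The helper `normal_cdf` is illustrative.

```python
from math import erfc, sqrt

def normal_cdf(x, mu, sigma):
    """P(Z <= x) for Z ~ N(mu, sigma^2); erfc keeps precision in the far tail."""
    return 0.5 * erfc(-(x - mu) / (sigma * sqrt(2)))

mu = 0.45        # assumed true average knob weight (pounds)
sigma = 0.2      # assumed true standard deviation (pounds)
n = 1000         # sample size
threshold = 0.42

std_error = sigma / sqrt(n)           # standard deviation of the sample mean
z = (threshold - mu) / std_error      # standardized distance from the mean
p = normal_cdf(threshold, mu, std_error)

print(f"standard error = {std_error:.5f}")
print(f"z = {z:.3f}")
print(f"P(sample mean <= {threshold}) = {p:.2e}")
```

The sketch addresses only the question in which \(\mu\) and \(\sigma\) are known; the first question reasons in the opposite direction, from the observed \(\bar{x}\) and \(s\) to the unknown \(\mu\).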