EC Polikar Lecture 8
Engineering Statistics Part II: Estimation
© 2003 All Rights Reserved, Robi Polikar, Rowan University, Dept. of Electrical and Computer Engineering

Review of Basic Concepts of Statistics

Statistics is used to make generalized decisions about a population by analyzing only a small sample drawn from that population.

Parameter vs. statistic

Important statistical quantities: mean, median, mode, standard deviation, variance

Population mean (population of size M):
$$\mu = \frac{x_1 + x_2 + \cdots + x_M}{M} = \frac{1}{M}\sum_{i=1}^{M} x_i$$

Sample mean (sample of size N):
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Population variance:
$$\sigma^2 = \frac{1}{M}\sum_{i=1}^{M} (x_i - \mu)^2$$

Sample variance:
$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$

Sample standard deviation (large sample):
$$s = \left[\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2\right]^{1/2}$$

Sample standard deviation (small sample):
$$s = \left[\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2\right]^{1/2}$$

Statistical Distributions

Normal (Gaussian) distribution function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

[Figure: the Gaussian curve, normalized distribution function f(x) versus distribution variable x. The value of x at the peak is the mean μ, and the inflection point marks the standard deviation σ. Marked areas: 68.2% within μ ± σ, 95.4% within μ ± 2σ, 99.7% within μ ± 3σ.]

The Gaussian Curve

Area under the curve:
$$A = \int_{x_1}^{x_2} f(x)\,dx$$

Area from μ−σ ≤ x ≤ μ+σ: 68.2% of the total area (x1 = μ−σ; x2 = μ+σ)
Area from μ−2σ ≤ x ≤ μ+2σ: 95.4% of the total area (x1 = μ−2σ; x2 = μ+2σ)
Area from μ−3σ ≤ x ≤ μ+3σ: 99.7% of the total area (x1 = μ−3σ; x2 = μ+3σ)

The analytical computation of the area under the Gaussian curve is difficult, since the integral has no closed-form expression. Therefore, standardized tables generated for this particular purpose are used. The standardization assumes a mean of zero and a variance of 1.

Using Gaussian Tables

Area under the curve on each side of zero is 0.5.
The curve is symmetric, so the total area is 1.

Normalization to use standard tables:
$$z = \frac{x - \mu}{\sigma}$$

Example: if z = 0.82
Area under the curve for [0, 0.82]: 0.294
Total area for [−∞, 0.82] = 0.5 + 0.294 = 0.794
This value is the probability that z < 0.82.

Example

The chip manufacturing company Lentil® produces its much anticipated chip Pantsium© XIX, running at 66.666 THz. The rival company DAM© manufactures its chip Craplon© 66++, also running at 66.666 THz. DAM claims, however, that Lentil's chip is flawed and cannot run any faster than 63 THz. Lentil, which manufactures 100,000 chips every day, decides to test its chips. They take a sample of 1% (1000 chips) and find that the mean speed of these chips is 65.980 THz with a std. dev. of 1.2 THz. Assuming that the chip speed is normally distributed, is Lentil's speed claim justifiable?

Assume that the claim is justifiable if 95% of the chips lie within the speed limits of 65 to 67 THz.

$$z_1 = \frac{65 - 65.980}{1.2} = -0.82, \qquad z_2 = \frac{67 - 65.980}{1.2} = 0.85$$

From the table, the area for [−0.82, 0] is 0.294 and the area for [0, +0.85] is 0.302.

The probability that a Lentil chip has a speed in the range [65, 67] THz is 0.294 + 0.302 = 0.596. Thus only 59.6% of the chips satisfy the criterion, well short of the required 95%.

Now assume that the claim is justifiable if 90% of the chips run faster than 65.0 THz.

$$z = \frac{65.0 - 65.980}{1.2} = -0.82$$

The probability that a Lentil chip has a speed larger than 65 THz is 0.294 + 0.5 = 0.794. That means roughly 80% of the chips run faster than 65 THz, which again falls short of the required 90%.

In any case, however, Lentil does better than DAM's claim of 63 THz. What % of Lentil chips run over 63 THz? (Ans. 99.3%)

Estimation Theory & Confidence Intervals

Point estimate vs. interval estimate
Bulb wattage: 60 W vs. 60 ± 5 W → 55 W ~ 65 W
Part length: 5.28 cm vs. 5.28 ± 0.03 cm → 5.25 ~ 5.31 cm
Flight time: 11 h vs. 11 h ± 15 min → 10 h 45 min ~ 11 h 15 min
Scientific polls: 59% will vote for XYZ (margin of error 4%)

How confident can we be about such interval estimates? Are we 75% sure? 90% sure? … 95% sure? What does it mean to be 95% sure?

Confidence level: the percentage of confidence.
Confidence interval: the interval in which we have a certain confidence that a value lies.

Confidence Intervals

Recall: for a normal distribution, values lie within one, two, or three sigma of the mean 68.27%, 95.45%, and 99.73% of the time, respectively.

Example: Let's assume that the average height at Rowan is 176 cm, with a standard deviation of 5 cm.
68.27% of Rowan students are 176 ± 5 cm → 171 ~ 181 cm
95.45% of Rowan students are 176 ± 2×5 cm → 166 ~ 186 cm
99.73% of Rowan students are 176 ± 3×5 cm → 161 ~ 191 cm
Thus, we are 95.45% sure that a Rowan student's height is between 166 and 186 cm.

Note that these numbers hold for variables that are normally distributed. In most practical scenarios, a statistic computed from a sample of size greater than 30 is approximately normally distributed.

How to Compute Confidence Intervals

If the statistic is the sample mean, then the confidence limits (end points of the interval) are given by

$$\bar{x} \pm z_c \frac{\sigma}{\sqrt{N}} \tag{1}$$

$$\bar{x} \pm z_c \frac{\sigma}{\sqrt{N}} \sqrt{\frac{M-N}{M-1}} \tag{2}$$

where $\bar{x}$ is the sample mean, $\sigma$ is the population* std. dev., $N$ is the sample size, and $z_c$ is the critical value obtained from normal distribution tables based on the desired confidence level. Use Eq. (2) for finite populations of size M, and use Eq. (1) for infinite (very large) populations.

Confidence level (%)    Critical value z_c
99.73                   3.00
99                      2.58
98                      2.33
96                      2.05
95.45                   2.00
95                      1.96
90                      1.645
80                      1.28
68.27                   1.00
50                      0.675

* Since the population std. dev. is usually unknown, it is estimated by the sample std. dev.
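The critical values in the table and the two confidence-limit formulas can be checked numerically. The following is a minimal Python sketch, not part of the lecture: it assumes Python 3.8+ so that `statistics.NormalDist` can stand in for the printed normal tables, and the function names `critical_value` and `mean_confidence_limits` are hypothetical names chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+


def critical_value(confidence):
    """Two-sided critical value z_c for a confidence level given as a fraction."""
    # The excluded area (1 - confidence) is split equally between the two
    # tails, so z_c is the point with cumulative area 0.5 + confidence / 2.
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)


def mean_confidence_limits(xbar, sigma, n, confidence, population_size=None):
    """Confidence limits for the mean: Eq. (1), or Eq. (2) if population_size is given."""
    half_width = critical_value(confidence) * sigma / sqrt(n)
    if population_size is not None:
        # Finite-population correction factor from Eq. (2).
        half_width *= sqrt((population_size - n) / (population_size - 1))
    return xbar - half_width, xbar + half_width


# Reproduce a few rows of the critical-value table.
for level in (0.9973, 0.99, 0.98, 0.95, 0.90, 0.50):
    print(f"{100 * level:6.2f}%  z_c = {critical_value(level):.3f}")
```

Passing `population_size=None` corresponds to the infinite-population case of Eq. (1); supplying M applies the correction of Eq. (2), which always narrows the interval.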
How To…

Ex: 98% confidence means that the value we estimate must be within the specified limits 98% of the time. Thus the total area under the curve, taken on both sides of the mean, must be 0.98. Since the curve is symmetric, that is 0.49 on each side of the mean. The z_c value corresponding to 0.49 is 2.33.

For 93% confidence, z_c = 1.81.

Example

Measurements of the diameters of a random sample of 200 ball bearings made by a certain machine have a mean of 0.824 in and a std. dev. of 0.042 in. What are the 95% and 99% confidence limits for the mean diameter of the ball bearings?

95% confidence limits: half the area under the curve = 0.475 → z_c = 1.96. The confidence limits are therefore

$$0.824 \pm z_c \frac{\sigma}{\sqrt{N}} = 0.824 \pm 1.96 \cdot \frac{0.042}{\sqrt{200}} = 0.824 \pm 0.0058 \text{ in.}$$

99% confidence limits: half the area under the curve = 0.495 → z_c = 2.58. The confidence limits are therefore

$$0.824 \pm z_c \frac{\sigma}{\sqrt{N}} = 0.824 \pm 2.58 \cdot \frac{0.042}{\sqrt{200}} = 0.824 \pm 0.0077 \text{ in.}$$

Note 1: We use the sample std. dev. as an estimate of the population std. dev.

Note 2: The confidence interval of width 0.0116 in for 95% confidence is narrower than the 0.0154 in for 99% confidence. This makes sense, because the interval that must contain the true value becomes wider as we demand a higher confidence.

For Proportions

If the statistic to be estimated is a proportion of "successes", then the confidence limits for p, the proportion of success (the probability of success), are

$$P \pm z_c \sqrt{\frac{p(1-p)}{N}}$$

for infinite (very large) populations, and

$$P \pm z_c \sqrt{\frac{p(1-p)}{N}} \sqrt{\frac{M-N}{M-1}}$$

for finite populations of size M. Both assume a sample size of N > 30.

P is the sample probability of success, and p is the population probability of success. We will use the sample estimate P for the population value p in our calculations.
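The two proportion formulas differ only by the finite-population factor, so one function can cover both. The sketch below is illustrative only: the function name `proportion_confidence_limits` and the inspection numbers in the usage lines (40 defective parts in a sample of 400 from a lot of 2000) are hypothetical and do not come from the lecture. As in the slides, the sample proportion P is substituted for the unknown p.

```python
from math import sqrt


def proportion_confidence_limits(p_sample, n, z_c, population_size=None):
    """Confidence limits for a proportion, using the sample proportion P in place of p."""
    standard_error = sqrt(p_sample * (1.0 - p_sample) / n)
    if population_size is not None:
        # Finite-population correction for a population of size M.
        standard_error *= sqrt((population_size - n) / (population_size - 1))
    half_width = z_c * standard_error
    return p_sample - half_width, p_sample + half_width


# Hypothetical data: 40 of 400 sampled parts are defective (P = 0.10).
p_hat = 40 / 400
print(proportion_confidence_limits(p_hat, 400, 1.96))                        # very large lot
print(proportion_confidence_limits(p_hat, 400, 1.96, population_size=2000))  # lot of 2000
```

When the sample is a sizeable fraction of the population, as in the second call, the correction factor noticeably narrows the interval; for N much smaller than M it is close to 1 and the two results nearly coincide.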
Example

In an exit poll, a news network asks 300 people (from a state of 9M) for whom they voted, and 55% say they voted for XYZ. Can the network declare candidate XYZ the winner with 95% confidence?

For 95% confidence, the confidence interval is

$$P \pm 1.96 \sqrt{\frac{p(1-p)}{N}} = 0.55 \pm 1.96 \sqrt{\frac{0.55 \cdot 0.45}{300}} = 0.55 \pm 0.056 \;\rightarrow\; 0.494 \sim 0.606$$

This means that the network can, at best, be 95% confident that the actual vote the candidate received is between 0.494 and 0.606. In other words, if 55% of 300 people said they voted for XYZ, then there is a 95% probability (or we can be 95% sure) that the actual vote the candidate received lies between 49.4% and 60.6%. Since at least 50% is required to win the election and the interval extends below 50%, the network cannot declare XYZ the winner.

The natural question to ask is then: how many people do they need to ask so that they can claim XYZ's success with 95% confidence? Assuming again that 55% of N people (N is now unknown) said they voted for XYZ, and considering that XYZ needs at least 50% of the votes:

$$0.55 - 1.96 \sqrt{\frac{p(1-p)}{N}} \ge 0.5 \;\Rightarrow\; N > 380$$

Thus if 55% of 380 people say they voted for XYZ, then the confidence interval will be

$$0.55 \pm 1.96 \sqrt{\frac{p(1-p)}{380}} = 0.55 \pm 1.96 \sqrt{\frac{0.55 \cdot 0.45}{380}} \approx 0.55 \pm 0.05 \;\rightarrow\; 0.50 \sim 0.60$$
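The sample-size step in the exit poll example can be made explicit by solving the inequality for N, which gives N ≥ z_c² P(1−P) / (P − p₀)², where p₀ is the share needed to win. The following Python sketch assumes the infinite-population formula (reasonable for 9M voters) and uses a hypothetical function name, `min_sample_size`; it is an illustration of the rearrangement, not the lecture's own procedure.

```python
from math import ceil


def min_sample_size(p_sample, threshold, z_c):
    """Smallest N for which the lower confidence limit reaches the threshold.

    Solves p_sample - z_c * sqrt(p_sample * (1 - p_sample) / N) >= threshold
    for N, assuming a very large population (no finite-population correction).
    """
    margin = p_sample - threshold
    if margin <= 0:
        raise ValueError("sample proportion must exceed the threshold")
    return ceil(z_c ** 2 * p_sample * (1.0 - p_sample) / margin ** 2)


# Exit poll: 55% for XYZ, 50% needed to win, 95% confidence (z_c = 1.96).
print(min_sample_size(0.55, 0.50, 1.96))
```

The unrounded solution is about 380.3, consistent with the N > 380 quoted in the example; rounding up gives the first whole number of respondents for which the lower limit is at or above 50%.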