Lecture 8
Engineering Statistics
Part II: Estimation
© 2003 All Rights Reserved, Robi Polikar, Rowan University, Dept. of Electrical and Computer Engineering
Review of Basic Concepts of Statistics
• Statistics is used to make generalized decisions about a population by analyzing only a small sample drawn from that population.
• Parameter vs. statistic
• Important statistical quantities: mean, median, mode, standard deviation, variance
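Population mean (population of size M):
$$\mu = \frac{x_1 + x_2 + \cdots + x_M}{M} = \frac{1}{M}\sum_{i=1}^{M} x_i$$
Sample mean (sample of size N):
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{M} (x_i - \mu)^2}{M - 1}$$
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$
Sample standard deviation (large sample):
$$s = \left[\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}\right]^{1/2}$$
Sample standard deviation (small sample):
$$s = \left[\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}\right]^{1/2}$$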
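A minimal Python sketch of the sample statistics listed above. The function names and the `measurements` values are made up for illustration and are not part of the lecture; `large_sample=True` selects the N denominator, otherwise N - 1 is used.

```python
import math

def sample_mean(values):
    """Arithmetic mean of the sample."""
    return sum(values) / len(values)

def sample_variance(values):
    """Sample variance with the N - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    x_bar = sample_mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std(values, large_sample=False):
    """Sample standard deviation; divides by N if large_sample, else by N - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    x_bar = sample_mean(values)
    sum_sq = sum((x - x_bar) ** 2 for x in values)
    denominator = n if large_sample else n - 1
    return math.sqrt(sum_sq / denominator)

# Hypothetical measurements, chosen only to keep the arithmetic easy to follow.
measurements = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("mean              :", sample_mean(measurements))
print("variance (N - 1)  :", sample_variance(measurements))
print("std. dev. (N - 1) :", sample_std(measurements))
print("std. dev. (N)     :", sample_std(measurements, large_sample=True))
```

For small N the two standard deviations differ noticeably; as N grows the difference between dividing by N and by N - 1 becomes negligible.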
Statistical Distributions
Normal (Gaussian) Distribution Function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
[Figure: the Gaussian curve. Horizontal axis: distribution variable, x. Vertical axis: distribution function, normalized. The value of x at the peak is the mean μ; the inflection point marks the standard deviation σ. Marked areas: 68.2% within μ ± σ, 95.4% within μ ± 2σ, 99.7% within μ ± 3σ.]
The Gaussian Curve
Distribution function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
Area under the curve between x_1 and x_2:
$$A = \int_{x_1}^{x_2} f(x)\,dx$$
Area from μ − σ ≤ x ≤ μ + σ → 68.2% of the total area (x1 = μ − σ; x2 = μ + σ)
Area from μ − 2σ ≤ x ≤ μ + 2σ → 95.4% of the total area (x1 = μ − 2σ; x2 = μ + 2σ)
Area from μ − 3σ ≤ x ≤ μ + 3σ → 99.7% of the total area (x1 = μ − 3σ; x2 = μ + 3σ)
The area under the Gaussian curve has no closed-form expression in elementary functions, so computing it analytically is difficult.
Therefore, standardized tables generated for this particular purpose are used.
The standardization assumes a mean of zero and a variance of 1.
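As a rough check of the percentages quoted above, the area can be evaluated numerically instead of being read from a table. The sketch below is not from the lecture: it integrates f(x) with a simple trapezoidal rule and compares the result with the error function from Python's standard `math` module. The helper names and the step count are illustrative choices.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area_numeric(x1, x2, mu=0.0, sigma=1.0, steps=10_000):
    """Trapezoidal approximation of the integral of f(x) from x1 to x2."""
    h = (x2 - x1) / steps
    total = 0.5 * (gaussian_pdf(x1, mu, sigma) + gaussian_pdf(x2, mu, sigma))
    for i in range(1, steps):
        total += gaussian_pdf(x1 + i * h, mu, sigma)
    return total * h

def area_within(k):
    """Area within k standard deviations of the mean, via the error function."""
    return math.erf(k / math.sqrt(2.0))

# Standardized curve: mean 0, standard deviation 1.
for k in (1, 2, 3):
    print(f"within {k} sigma: erf = {area_within(k):.4f}, "
          f"trapezoid = {area_numeric(-k, k):.4f}")
```

Both routes should agree with the 68.2%, 95.4% and 99.7% figures to the precision quoted on the slide.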
Using Gaussian Tables
The curve is symmetric and the total area is 1, so the area under the curve on each side of zero is 0.5.
Normalization to use standard tables:
$$z = \frac{x - \mu}{\sigma}$$
Example: if z = 0.82, then
Area under the curve for [0, 0.82]: 0.294
Total area for (−∞, 0.82] = 0.5 + 0.294 = 0.794
This value is the probability that z < 0.82.
Example
The chip manufacturing company Lentil® produces its much anticipated chip
Pantsium© XIX running at 66.666 THz. The rival company DAM© manufactures its chip
Craplon© 66++, also running at 66.666 THz. DAM claims, however, that Lentil’s chip is flawed
and cannot run any faster than 63 THz. Lentil, which manufactures 100,000 chips every day,
decides to test its chips. They take a sample of 1% (1000 chips) and find that the mean speed of
these chips is 65.980 THz with a std. dev. of 1.2 THz. Assuming that the chip speed is normally
distributed, is Lentil’s speed claim justifiable?
Assume that the claim is justifiable if 95% of the chips fall within the speed range of 65 to 67 THz.
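$$z_1 = \frac{x - \mu}{\sigma} = \frac{65 - 65.980}{1.2} = -0.82, \qquad z_2 = \frac{x - \mu}{\sigma} = \frac{67 - 65.980}{1.2} = 0.85$$
Area between z = −0.82 and 0: 0.294. Area between 0 and z = +0.85: 0.302.
The probability that a Lentil chip has a speed in the [65, 67] THz range is 0.294 + 0.302 = 0.596.
Thus only 59.6% of the chips satisfy the criterion.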
Now assume that the claim is justifiable, if 90% of the chips run faster than 65.0 THz.
$$z = \frac{x - \mu}{\sigma} = \frac{65.0 - 65.980}{1.2} = -0.82$$
The probability that a Lentil chip has a speed larger than 65 THz is
0.294 + 0.5 = 0.794. That means roughly 80% of the chips satisfy the
criterion. In any case, however, Lentil does better than DAM’s claim of 63 THz.
What % of Lentil chips run over 63 THz? (Ans. 99.3%)
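The table lookups in this example can be cross-checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a sketch: the helper names are invented here, and the sample mean and std. dev. are used as stand-ins for μ and σ, as in the example.

```python
from statistics import NormalDist

# Sample mean and std. dev. from the example, in THz.
chip_speed = NormalDist(mu=65.980, sigma=1.2)

def prob_between(dist, low, high):
    """Probability that a normally distributed value lies in [low, high]."""
    return dist.cdf(high) - dist.cdf(low)

def prob_above(dist, threshold):
    """Probability that a normally distributed value exceeds threshold."""
    return 1.0 - dist.cdf(threshold)

p_65_67 = prob_between(chip_speed, 65.0, 67.0)
p_over_65 = prob_above(chip_speed, 65.0)
p_over_63 = prob_above(chip_speed, 63.0)

print(f"P(65 <= speed <= 67) = {p_65_67:.3f}")
print(f"P(speed > 65)        = {p_over_65:.3f}")
print(f"P(speed > 63)        = {p_over_63:.3f}")
```

The results should match the table-based answers closely; small differences in the third decimal are expected because the tables use z rounded to two decimals.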
Estimation Theory & Confidence Intervals
• Point estimate vs. interval estimate
  - Bulb wattage: 60 W vs. 60 ± 5 W → 55 W ~ 65 W
  - Part length: 5.28 cm vs. 5.28 ± 0.03 cm → 5.25 ~ 5.31 cm
  - Flight time: 11 h vs. 11 h ± 15 min → 10 h 45 min ~ 11 h 15 min
  - Scientific polls: 59% will vote for XYZ (margin of error 4%)
• How confident can we be about such interval estimates?
  - Are we 75% sure? 90% sure? … 95% sure? What does it mean to be 95% sure?
• Confidence level: the percentage of confidence we place in an interval estimate
• Confidence interval: the interval within which we have a certain confidence that a value lies
Confidence Intervals
• Recall: for a normal distribution, a value lies within one, two, or three standard deviations (sigma) of the mean 68.27%, 95.45%, and 99.73% of the time, respectively.
• Example: let’s assume that the average height at Rowan is 176 cm, with a standard deviation of 5 cm →
  - 68.27% of Rowan students are 176 ± 5 cm → 171 ~ 181 cm
  - 95.45% of Rowan students are 176 ± 2×5 cm → 166 ~ 186 cm
  - 99.73% of Rowan students are 176 ± 3×5 cm → 161 ~ 191 cm
• Thus, we are 95.45% sure that a Rowan student’s height is 166 ~ 186 cm.
• Note that these numbers hold for variables that are normally distributed. In most practical scenarios, a statistic computed from a sample of size greater than 30 is approximately normally distributed!
How to Compute Confidence Intervals
• If the statistic is the sample mean, then the confidence limits (end points of the interval) are given by
$$\bar{x} \pm z_c \frac{\sigma}{\sqrt{N}} \qquad (1)$$
$$\bar{x} \pm z_c \frac{\sigma}{\sqrt{N}} \sqrt{\frac{M - N}{M - 1}} \qquad (2)$$
where $\bar{x}$ is the sample mean, $\sigma$ is the population* std. dev., N is the sample size, and $z_c$ is the critical value obtained from normal distribution tables based on the desired confidence.
Use Eq. (2) for finite populations of size M, and use Eq. (1) for infinite (very large) populations.
Confidence Level (%)    Critical Value zc
99.73                   3.00
99                      2.58
98                      2.33
96                      2.05
95.45                   2.00
95                      1.96
90                      1.645
80                      1.28
68.27                   1.00
50                      0.675
* Since the population std. dev. is usually unknown, it is estimated by the sample std. dev.
How To…
Ex: 98% confidence means that the value we estimate must lie within the specified limits 98%
of the time. Thus the total area under the curve, counted on both sides of the
mean, must be 0.98. Since the curve is symmetric, the area is 0.49 on each side
of the mean. The zc value corresponding to 0.49 is 2.33.
For 93% confidence, zc = 1.81.
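The lookup described above (halve the confidence level, then find the z value whose one-sided area matches) can be sketched with the inverse normal CDF from Python's standard library. The function name `critical_value` is an illustrative choice, not something defined in the lecture.

```python
from statistics import NormalDist

def critical_value(confidence_level):
    """Return z_c such that the area between -z_c and +z_c equals confidence_level."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    one_sided_area = confidence_level / 2.0  # e.g. 0.49 for 98% confidence
    # inv_cdf expects the cumulative area from -infinity, hence the extra 0.5.
    return NormalDist().inv_cdf(0.5 + one_sided_area)

for level in (0.98, 0.95, 0.93):
    print(f"{level:.0%} confidence -> z_c = {critical_value(level):.2f}")
```

The same function can be used for confidence levels that are not listed in the table.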
Example
• Measurements of the diameters of a random sample of 200 ball bearings made by a certain machine have a mean of 0.824 in and a std. dev. of 0.042 in. What are the 95% and 99% confidence limits for the mean diameter of the ball bearings?
  - 95% confidence limits → half the area under the curve = 0.475 → zc = 1.96. Confidence limits are therefore 0.824 ± zc × σ/√N = 0.824 ± 1.96 × 0.042/√200 = 0.824 ± 0.0058 in.
  - 99% confidence limits → half the area under the curve = 0.495 → zc = 2.58. Confidence limits are therefore 0.824 ± zc × σ/√N = 0.824 ± 2.58 × 0.042/√200 = 0.824 ± 0.0077 in.
• Note 1: We use the sample std. dev. as an estimate of the population std. dev. σ.
• Note 2: Our confidence interval of width 0.0116 in for 95% confidence is narrower than the 0.0154 in for 99% confidence. This makes sense, because the interval in which the true value lies becomes wider as we demand a higher confidence.
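A short sketch of the confidence-limit calculation for a sample mean, reproducing the ball-bearing numbers. The function name and its `population_size` parameter are assumptions made for this illustration; when `population_size` is given, the finite-population factor √((M − N)/(M − 1)) is applied, otherwise the population is treated as very large.

```python
import math

def mean_confidence_limits(sample_mean, std_dev, n, z_c, population_size=None):
    """Return (lower, upper) confidence limits for the mean.

    std_dev is the sample std. dev., used as an estimate of the population one.
    If population_size (M) is given, the finite-population correction is applied.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    margin = z_c * std_dev / math.sqrt(n)
    if population_size is not None:
        if population_size < 2 or n > population_size:
            raise ValueError("population_size must be at least 2 and at least n")
        margin *= math.sqrt((population_size - n) / (population_size - 1))
    return sample_mean - margin, sample_mean + margin

low95, high95 = mean_confidence_limits(0.824, 0.042, 200, 1.96)
low99, high99 = mean_confidence_limits(0.824, 0.042, 200, 2.58)

print(f"95%: {low95:.4f} to {high95:.4f} in (width {high95 - low95:.4f})")
print(f"99%: {low99:.4f} to {high99:.4f} in (width {high99 - low99:.4f})")
```

If the whole population is sampled (N = M), the correction factor becomes zero and the interval collapses to the sample mean, as one would expect.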
For Populations
• If the statistic to be estimated is a proportion of “successes”, then the confidence limits for p, the proportion of successes (the probability of success), are
$$P \pm z_c \sqrt{\frac{p(1-p)}{N}}$$
for infinite (very large) populations, and
$$P \pm z_c \sqrt{\frac{p(1-p)}{N}} \sqrt{\frac{M - N}{M - 1}}$$
for a finite population of size M. Both assume a sample size of N > 30.
P is the sample probability of success, and p is the population probability of success.
We will use the sample estimate P for the population value p in our calculations.
Example
• In an exit poll, a news network asks 300 people (from a state of 9M) for whom they voted, and 55% say they voted for XYZ. Can the network declare candidate XYZ the winner with 95% confidence?
  - For 95% confidence, the confidence interval is
$$P \pm 1.96\sqrt{\frac{p(1-p)}{N}} = 0.55 \pm 1.96\sqrt{\frac{0.55 \times 0.45}{300}} = 0.55 \pm 0.056 \;\rightarrow\; 0.494 \sim 0.606$$
  - This means that the network can, at best, be 95% confident that the actual vote the candidate received is between 0.494 and 0.606. In other words, if 55% of 300 people said they voted for XYZ, then there is a 95% probability (or we can be 95% sure) that the actual vote the candidate received lies between 49.4% and 60.6%. Since at least 50% is required to win the election, the network cannot claim XYZ as the winner.
  - The natural question to ask is then: how many people do they need to ask so that they can claim XYZ’s success with 95% confidence? Assuming again that 55% of N people (N is now unknown) said they voted for XYZ, and considering that XYZ needs at least 50% of the votes:
$$0.55 - 1.96\sqrt{\frac{p(1-p)}{N}} > 0.5 \;\Rightarrow\; N > 380$$
Thus if 55% of 380 people say they voted for XYZ, then the confidence interval will be
$$0.55 \pm 1.96\sqrt{\frac{p(1-p)}{380}} = 0.55 \pm 1.96\sqrt{\frac{0.55 \times 0.45}{380}} = 0.55 \pm 0.05 \;\rightarrow\; 0.50 \sim 0.60$$
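The exit-poll arithmetic can be sketched in a few lines of Python. The function names are illustrative, the very-large-population formula is assumed (300 voters out of 9M), and the sample proportion P stands in for p, as in the example.

```python
import math

def proportion_margin(p_hat, n, z_c):
    """Half-width of the confidence interval for a proportion (large population)."""
    return z_c * math.sqrt(p_hat * (1.0 - p_hat) / n)

def min_sample_size(p_hat, threshold, z_c):
    """Smallest integer N whose lower confidence limit strictly exceeds threshold."""
    gap = p_hat - threshold
    if gap <= 0:
        raise ValueError("p_hat must be larger than threshold")
    # Solve z_c * sqrt(p(1 - p) / N) < gap for N.
    bound = (z_c / gap) ** 2 * p_hat * (1.0 - p_hat)
    return math.floor(bound) + 1

margin_300 = proportion_margin(0.55, 300, 1.96)
print(f"N = 300: 0.55 +/- {margin_300:.3f} "
      f"-> {0.55 - margin_300:.3f} ~ {0.55 + margin_300:.3f}")

n_needed = min_sample_size(0.55, 0.50, 1.96)
margin_needed = proportion_margin(0.55, n_needed, 1.96)
print(f"smallest N: {n_needed}, lower limit {0.55 - margin_needed:.4f}")
```

Under a strict inequality the smallest integer sample size comes out as 381, one more than the rounded figure of 380 used in the example, where the lower limit sits essentially at 0.50.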