Statistics : Statistical Inference
Krishna V. Palem
Kenneth and Audrey Kennedy Professor of Computing
Department of Computer Science, Rice University
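Sampling distribution of X̄
[Figure: k samples (Sample 1, Sample 2, Sample 3, Sample 4, ..., Sample k) are drawn from a population with mean μ and standard deviation σ. Their sample means x̄1, x̄2, x̄3, x̄4, ..., x̄k together form the sampling distribution.]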
Central Limit Theorem
(4) The mean of the sampling distribution of X̄ is equal to the
population mean, i.e.
$\mu_{\bar{X}} = \mu$
(5) The standard deviation of the sampling distribution of X̄ is the
population standard deviation divided by the square root of the
sample size, i.e.
$\sigma_{\bar{X}} = \sigma / \sqrt{n}$
Sampling distribution of X̄ for a Normal population
[Figure: four histograms of the sample mean on an axis running from about 1.0 to 1.9, with panel titles:
N=1: X̄ = 1.41, SD = 0.145
N=5: X̄ = 1.40, SD = 0.065
N=10: X̄ = 1.40, SD = 0.047
N=50: X̄ = 1.40, SD = 0.020]
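As a rough check, the SD values quoted in the panel titles for the Normal population can be compared with the σ/√n rule from the Central Limit Theorem slide. The sketch below assumes the N=1 panel's SD (0.145) is a fair stand-in for the population σ; the function name `predicted_sd` is illustrative only.

```python
import math

# Values read from the panel titles for the Normal population.
sigma_single = 0.145                          # SD when each sample is one observation (N=1)
reported_sd = {5: 0.065, 10: 0.047, 50: 0.020}

def predicted_sd(sigma, n):
    """Standard deviation of the sample mean predicted by sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Compare the prediction with the SD reported on each panel.
for n, sd in reported_sd.items():
    print(f"N={n:>2}: predicted {predicted_sd(sigma_single, n):.3f}, reported {sd:.3f}")
```

The predicted and reported values agree to within about 0.001 to 0.002, which is what one would expect from histograms built on a finite number of samples.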
Sampling dist. of X̄ for a non-Normal population
[Figure: four histograms of the sample mean on an axis running from about 1.0 to 2.0, with panel titles:
N=1: X̄ = 1.40, SD = 0.147
N=5: X̄ = 1.40, SD = 0.066
N=50: X̄ = 1.41, SD = 0.021
N=100: X̄ = 1.41, SD = 0.015]
Computer simulation of the sampling
distribution of the sample mean
 Pick any probability distribution and specify a mean and standard
deviation.
 Tell the computer to randomly generate 1000 observations from
that probability distribution
 E.g., the computer is more likely to spit out values with high probabilities
 Plot the “observed” values in a histogram.
 Next, tell the computer to randomly generate 1000 averages-of-2
(randomly pick 2 and take their average) from that probability
distribution.
 Plot “observed” averages in histograms.
 Repeat for averages-of-10, and averages-of-100.
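The simulation steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the code used to produce the slides: the helper names (`simulate_means`, `uniform`, `exponential`), the seed, and the exponential rate of 1 are all assumptions made for the example.

```python
import random
import statistics

def simulate_means(draw, n, reps=1000, seed=0):
    """Return `reps` averages, each over `n` observations produced by `draw(rng)`."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]

def uniform(rng):
    return rng.random()            # Uniform on [0, 1): mean 0.5, SD about 0.289

def exponential(rng):
    return rng.expovariate(1.0)    # Exponential with rate 1 (assumed): mean 1, SD 1

# Averages-of-1 reproduce the original distribution; larger n narrows the spread.
for name, draw in [("uniform", uniform), ("exponential", exponential)]:
    for n in (1, 2, 10, 100):
        means = simulate_means(draw, n)
        print(f"{name:<11} n={n:>3}: mean={statistics.fmean(means):.3f}, "
              f"SD={statistics.stdev(means):.3f}")
```

The printed SD should shrink roughly like 1/√n while the mean stays put; feeding each list of averages to a histogram routine gives pictures like the ones on the following slides.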
[Figure: Uniform Distribution on [0,1]: average of 1 sample (original distribution)]
[Figure: Uniform Distribution: 1000 averages of 2 samples]
[Figure: Uniform Distribution: 1000 averages of 5 samples]
[Figure: Uniform Distribution: 1000 averages of 100 samples]
[Figure: Exponential Distribution: 1000 averages of 2 samples]
[Figure: Exponential Distribution: average of 1 sample (original distribution)]
[Figure: Exponential Distribution: 1000 averages of 5 samples]
[Figure: Exponential Distribution: 1000 averages of 100 samples]
Contents
 Summary of Statistics Learnt so Far
 Statistical Inference
 Central Limit Theorem and its implications
 Estimation theory
 Interval Estimation
 What is Confidence Interval?
 Tutorial
Estimation Theory
 In statistics, estimation refers to the process by which one makes
inferences about a population, based on information obtained
from a sample.
 Statisticians use sample statistics to estimate population
parameters.
 For example, sample means are used to estimate population means; sample
proportions, to estimate population proportions.
Two types of Estimates
 Point estimate. A point estimate of a population parameter is a
single value of a statistic.
 For example, the sample mean x̄ is a point estimate of the population mean
μ.
 When we estimate the mean (μ) by x̄, the probability that we are
exactly correct is close to zero, i.e. P(x̄ = μ) ≈ 0
 This assumes the population is heterogeneous and the sample size n <<
population size N
 Hence, we cannot be very “confident” about estimates made
using point estimates alone
Two Types of Estimates (contd.)
 How can we be more confident about our estimates?
 we want the probability that our estimate is correct to be well above zero
 We can increase our confidence by using a less precise
estimate instead of a point estimate
 estimate an interval instead of a point
 Interval estimate. An interval estimate is defined by two
numbers, between which a population parameter is said to lie.
 For example, a < μ < b is an interval estimate of the population mean μ. It
indicates that the population mean is greater than a but less than b.
History of Interval Estimation
 Neyman (1937) identified interval estimation ("estimation by
interval") as distinct from point estimation ("estimation by
unique estimate").
 He was the first to recognize and formulate interval estimation
 Work that quoted results in the form of an estimate plus-or-minus a
standard deviation was, in effect, interval estimation
 His paper on this was titled "On the Two Different Aspects of the
Representative Method: The Method of Stratified Sampling and the
Method of Purposive Selection"
 It was read to the Royal Statistical Society on 19 June 1934
You can download the paper from:
http://stevereads.com/papers_to_read/on_the_two_different_aspects_of_the_representative_method.pdf
What is an Interval Estimate?
 In statistics, interval estimation is the use of sample data to
calculate an interval of possible (or probable) values of an
unknown population parameter
 in contrast to point estimation, which is a single number.
 Interval estimate. An interval estimate is defined by two
numbers, between which a population parameter is said to
lie.
 For example, a < μ < b is an interval estimate of the population mean μ.
It indicates that the population mean is greater than a but less than b.
 we use x̄ to estimate this interval
 Interval estimates provide
 a "best estimate" of a parameter
 an indication of the precision with which the parameter is known.
Types of Interval Estimation
 The most prevalent forms of interval estimation are:
 confidence intervals
 a frequentist method
 credible intervals
 a Bayesian method
 Other common approaches to interval estimation, which are
encompassed by statistical theory, are:
 Tolerance intervals
 Prediction intervals
 used mainly in Regression Analysis
Of these, confidence intervals are the most common and widely used
and hence will be covered in more detail in this class
What is a Confidence Interval?
 In statistics, a confidence interval (CI) is an interval estimate
of a population parameter.
 instead of estimating the parameter by a single value, an interval
likely to include the parameter is given.
 confidence intervals are used to indicate the reliability of an
estimate.
 How likely the interval is to contain the parameter is
determined by the confidence level
 increasing the desired confidence level will widen the confidence
interval.
 Confidence intervals and interval estimates more generally have
applications across the whole range of quantitative studies.
Example of Confidence Interval
 For example, a confidence interval can be used to describe how
reliable some opinion survey results are.
 In a survey of election voting-intentions, the result might be that
40% of respondents intend to vote for a certain party.
 A 95% confidence interval for the proportion in the whole population
having the same intention on the survey date might be 36% to 44%.
 From the same survey data one may calculate a narrower 90% confidence
interval for the proportion in the whole population, for instance
roughly 37% to 43%.
All other things being equal, a survey result with a narrow confidence
interval at a high confidence level is more desirable
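To see how the confidence level changes the width of a survey interval, here is a small sketch using the usual Normal approximation for a proportion. The survey example does not give a sample size, so `n_respondents = 576` is a hypothetical value chosen so that the 95% interval around 40% comes out close to 36% to 44%; the function name `proportion_ci` is also just illustrative.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, level):
    """Normal-approximation confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)        # two-sided critical value
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)  # z times the standard error
    return p_hat - margin, p_hat + margin

n_respondents = 576   # hypothetical sample size, not given in the survey example
for level in (0.95, 0.90):
    lo, hi = proportion_ci(0.40, n_respondents, level)
    print(f"{level:.0%} CI: {lo:.1%} to {hi:.1%}")
```

With the same data, lowering the confidence level from 95% to 90% shrinks the margin of error by the ratio of the critical values, about 1.645/1.96.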
Example
 In the whole of Houston, what percentage of adults do you think
will want to watch a movie sometime in the next 10 days?
 assume a variance of 0.0625 for the whole population
 Choose a random sample of 10 adults and ask their opinion
Will this be anywhere close to the actual percentage?
 Let X be the random variable denoting the percentage of adults
attending the movies out of the sample.
 Let Xi be the value from the i-th sample
How can we be sure to be closer to the actual mean?
 Take a very large number of samples
Example (contd.)
 But taking a large number of samples is generally not feasible.
 We want to arrive at an estimate based on fewer samples.
 For example, in the previous example, if you take only 1
sample of 10 people and find that 5 of the 10 people would
like to go to a movie, then you can say
 We are pretty sure that 50% of the adult population would
want to go to a movie in the next 10 days.
Isn’t this ambiguous? How sure is pretty sure?
 We need to be more definitive
Example (contd.)
 We use a confidence interval to remove the ambiguity
What if we want to be 100% sure?
 The only statement we can make with 100% certainty is that
between 0% and 100% of the adult population would want to watch a movie in
the next 10 days.
What if we want to be 50% sure?
 Such a statement is of little use, as you would be wrong
half the time
Then, what kind of statements make sense?
 90% sure or 95% sure or 98% sure or 99% sure
 These are called confidence levels
Calculating Confidence Level
 The general norm is to vary the interval by multiples of σ
(here, the standard deviation of x̄) and compute the confidence level
 The interval extends equally on either side of the mean
 The probability that μ is covered by the interval [x̄ − σ, x̄ + σ]
can be calculated as
$P(\mu \in [\bar{x} - \sigma, \bar{x} + \sigma]) = P(\bar{x} - \sigma \le \mu \le \bar{x} + \sigma)$
$P(\mu \in [\bar{x} - \sigma, \bar{x} + \sigma]) = P(\mu - \sigma \le \bar{x} \le \mu + \sigma)$
 Assuming a Normal distribution, we get
$P(\mu \in [\bar{x} - \sigma, \bar{x} + \sigma]) \approx 0.6827$
What if we increase the width of the interval from 2σ to 4σ?
$P(\mu \in [\bar{x} - 2\sigma, \bar{x} + 2\sigma]) \approx 0.9545$
Source for calculations: http://www.analyzemath.com/statistics/normal_calculator.html
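The same probabilities can be computed directly rather than looked up. The sketch below assumes a standard Normal distribution and uses the identity P(|Z| ≤ k) = erf(k/√2); the function name `coverage` is illustrative.

```python
import math

def coverage(k):
    """P(|Z| <= k) for a standard Normal Z, i.e. the confidence level of mean ± k·σ."""
    return math.erf(k / math.sqrt(2))

# Confidence level for intervals of one, two and three standard deviations.
for k in (1, 2, 3):
    print(f"±{k}σ covers {coverage(k):.4f}")
```

For k = 1 and k = 2 this gives about 0.6827 and 0.9545 respectively.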
Confidence Level Table
 Some of the most commonly used confidence levels in
statistics are given in the table below:
Confidence Level    Number of σs away from mean
90%                 1.64
95%                 1.96
98%                 2.33
99%                 2.575
 Less than 90% is generally not considered a strong
enough confidence level to make a statement
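The multipliers in the table can be recovered from the inverse Normal CDF. This is a minimal sketch using Python's standard `statistics.NormalDist`; the helper name `z_for_level` is an assumption for illustration.

```python
from statistics import NormalDist

def z_for_level(level):
    """Number of standard deviations either side of the mean for a two-sided level."""
    # A two-sided level leaves (1 - level) / 2 in each tail.
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: {z_for_level(level):.3f}")
```

The results (about 1.645, 1.960, 2.326 and 2.576) match the table up to rounding.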
Example (Contd.)
 Let us continue with computing the confidence interval
for our movie example
 Assume that we took a random sample of 10 adults.
 Among them, 5 adults said that they would like to go for
the movie in the next 10 days
 Hence, we get mean (x̄) = 0.5 (denotes 50%)
and standard deviation = $\sqrt{\sigma^2 / n} = \sqrt{0.0625 / 10} \approx 0.0791$ (since Var(x̄) = σ²/n)
 Say we want to be 95% confident about our estimate.
Example (Contd.)
 From the table we can see that we have to be within 1.96 standard
deviations of the mean.
 Hence, we need to be 1.96 * 0.0791 = 0.155 away from the
mean
 Summarizing, we can now say with 95% confidence that the
mean of the actual population lies in [0.5 - 0.155,
0.5 + 0.155] = [0.345, 0.655], i.e. between 34.5% and 65.5% of the
total population
What if you want to be 98% confident?
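One way to answer this is to redo the calculation with a different multiplier. The sketch below uses the values stated in the movie example (sample mean 0.5, assumed population variance 0.0625, n = 10) and takes the standard deviation of the sample mean as √(variance/n). The function name `mean_ci` is illustrative, and with only 10 respondents the Normal approximation is rough at best.

```python
import math
from statistics import NormalDist

def mean_ci(x_bar, variance, n, level):
    """Two-sided interval x_bar ± z·sqrt(variance/n), assuming a Normal sampling distribution."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # multiplier for the chosen level
    margin = z * math.sqrt(variance / n)        # z times the standard error
    return x_bar - margin, x_bar + margin

# Movie example: x̄ = 0.5, variance 0.0625, sample of 10 adults.
for level in (0.95, 0.98):
    lo, hi = mean_ci(0.5, 0.0625, 10, level)
    print(f"{level:.0%}: [{lo:.3f}, {hi:.3f}]")
```

Asking for 98% instead of 95% confidence widens the interval, since the multiplier grows from about 1.96 to about 2.33.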
Graphical Representation of Confidence Intervals
[Figure: example plot of a normal distribution (or bell curve). Each colored band has a width of one standard deviation.]
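Confidence Interval for μ when σ is known
 A 95% confidence interval for μ if σ is known is given by:
$\bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$
 95% of the x̄'s lie within $\mu \pm 1.96 \frac{\sigma}{\sqrt{n}}$
[Figure: overlay plot of the Normal density of X̄, with standard deviation $\sigma_{\bar{X}} = \sigma / \sqrt{n}$; the central 95% of the area lies between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$.]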
Rationale for Confidence Interval
 From the sampling distribution of X̄, conclude that μ and x̄
are within 1.96 standard errors ($\sigma / \sqrt{n}$) of each other 95% of
the time
 Otherwise stated, 95% of the intervals contain μ
 So, the interval $\bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$ can be taken as an interval that typically
would include μ
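The claim that about 95% of such intervals contain μ can be checked by simulation. The following is a minimal sketch under assumed values: a Normal population with μ = 1.4 and σ = 0.145 (borrowed from the earlier histogram slide purely for illustration), samples of size 10, and a fixed seed. The function name `coverage_rate` is hypothetical.

```python
import math
import random
import statistics

def coverage_rate(mu, sigma, n, trials=10_000, z=1.96, seed=1):
    """Fraction of intervals x̄ ± z·σ/√n, over repeated samples, that contain mu."""
    rng = random.Random(seed)
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and compute its mean.
        x_bar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / trials

print(coverage_rate(mu=1.4, sigma=0.145, n=10))
```

The printed fraction should be close to 0.95. Any single interval either contains μ or it does not; the 95% describes the long-run behaviour of the procedure.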
Example
 A random sample of 80 tablets had an average potency of
15mg. Assume σ is known to be 4mg.
 x̄ = 15, σ = 4, n = 80
 A 95% confidence interval for μ is
$15 \pm 1.96 \times \frac{4}{\sqrt{80}} = (14.12, 15.88)$
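The tablet calculation can be reproduced in a couple of lines. This sketch simply evaluates x̄ ± 1.96·σ/√n with the numbers from the example; the function name `ci_known_sigma` is illustrative.

```python
import math

def ci_known_sigma(x_bar, sigma, n, z=1.96):
    """Confidence interval for the mean when the population sigma is known."""
    margin = z * sigma / math.sqrt(n)   # z times the standard error
    return x_bar - margin, x_bar + margin

# Tablet example: x̄ = 15 mg, σ = 4 mg, n = 80.
lo, hi = ci_known_sigma(15, 4, 80)
print(f"({lo:.2f}, {hi:.2f})")   # (14.12, 15.88)
```

Passing a different `z` (for example 2.575 for 99%) gives the interval at another confidence level.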