PROBABILITY & STATISTICAL
INFERENCE LECTURE 4
MSc in Computing (Data Analytics)
Lecture Outline




Recap to Statistical Inference
Central Limit Theorem
Confidence Intervals
Section Takeaways
Statistical Analysis Process
Population → Representative Sample → Describe Sample (Statistic) → Make Inference about the Population
Populations vs. Samples

How do Irish voters intend voting in the next election?
The voting population of Ireland: 2,680,000 [1]
A sample of 1,008 adults was taken and surveyed for their voting intention in the next election [2]
1. Source: http://www.nationmaster.com/graph/dem_pre_ele_vot_age_pop-presidential-elections-voting-agepopulation
2. Source: http://redcresearch.ie/wp-content/uploads/2012/01/Report.pdf
Populations vs. Samples

How do Irish voters intend voting in the next election?
1,008 voters were asked how they intended to vote in the next election:
• Fine Gael: 30%
• Labour: 14%
• Fianna Fail: 18%
• Sinn Fein: 17%
• Other: 21%
Populations vs. Samples

The term population is used in statistics to represent
all possible measurements or outcomes that are of
interest to us in a particular study or piece of
analysis
• In the example, the population of interest was the voting
intentions of all voters in Ireland

The term sample refers to a subset of the
population that is selected for analysis
• In the example, the polling company selected a sample
of 1,008 voters
Sampling




In choosing a sample it is important that it is
representative of the population
No bias should exist in the sample
There are a number of sampling methods available
to ensure that your data is representative
A simple random sample is the most straightforward
of these methods
Statistical Inference



The statistical methods used to draw conclusions
about populations based on the statistics describing
a sample are known as statistical inference
We want to make decisions based on evidence from
a sample i.e. extrapolate from sample evidence to
a general population
To make such decisions we need to be able to
quantify our (un)certainty about how good or bad
our sample information is
Statistical Inference

Statistical Inference is divided into two major areas:
• Parameter Estimation: This is where sample statistics
are used to estimate population parameters
• Hypothesis Testing: A statistical hypothesis is a
statement about the parameters of one or more
populations. Hypothesis testing tests whether a
hypothesis is supported by the data collected
Population Statistics – Point Estimation


The population mean is denoted by μ (mu)
• In general, given a sufficiently large sample,
we use the sample mean $\bar{x}$ as a point
estimate of μ
The population variance is denoted by σ² (sigma squared)
• In general, given a sufficiently large sample,
we use the sample variance s² as a point
estimate of σ²
Population Statistics – Point Estimation

An estimate of the proportion, p, of items in a population
that belong to a class of interest is calculated as:
$\hat{p} = \dfrac{c}{n}$
• where c is the number of items in a random sample of size
n that belong to the class of interest

This is known as the sample proportion
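The three point estimates above can be sketched in a few lines of Python. The data are made-up lifetimes used purely for illustration (far too few for the "sufficiently large sample" condition), and the 6,000-hour threshold defining the class of interest is an arbitrary assumption.

```python
import statistics

# Hypothetical sample of component lifetimes in hours (made-up numbers).
lifetimes = [5120.0, 6340.0, 4875.0, 7010.0, 5590.0, 6125.0]

x_bar = statistics.mean(lifetimes)     # point estimate of the population mean
s2 = statistics.variance(lifetimes)    # sample variance, divides by n - 1

# Sample proportion: c items in the class of interest out of n sampled.
# Here the class of interest is "lifetime above 6,000 hours" (arbitrary choice).
c = sum(1 for x in lifetimes if x > 6000)
n = len(lifetimes)
p_hat = c / n

print(f"x_bar = {x_bar:.1f}, s^2 = {s2:.1f}, p_hat = {p_hat:.2f}")
```

Note that `statistics.variance` uses the n − 1 divisor, which is the usual choice when the variance is estimated from a sample.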

Central Limit Theorem
Demonstration
Central Limit Theorem Explained by
Example


The distribution shown
is a Poisson
distribution with λ=3
This could represent
the distribution of the
number of clicks on a
particular link in one
second


Draw 200 samples, each with a large sample size
Calculate the mean of each sample
Central Limit Theorem



Explain what has happened?
As the sample sizes increased the shape of
the histogram of means tended towards a
normal distribution
As the sample sizes increased the spread
(standard deviation) between the sample
means decreased
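The demonstration can be reproduced with a small simulation. This is a minimal sketch, assuming a Poisson population with λ = 3 as in the example; the helper names, the seed and the three sample sizes are illustrative choices, and the Poisson draws use Knuth's multiplication method because the standard library has no Poisson sampler.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_means(sample_size, n_samples=200, lam=3.0, seed=1):
    """Return the means of n_samples independent samples of the given size."""
    rng = random.Random(seed)
    return [
        statistics.mean(poisson_draw(lam, rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

# Columns: sample size, mean of the 200 means, their spread, and sigma/sqrt(n).
for size in (5, 30, 100):
    means = sample_means(size)
    print(size,
          round(statistics.mean(means), 3),
          round(statistics.stdev(means), 3),
          round(math.sqrt(3.0 / size), 3))
```

For a Poisson distribution the variance equals λ, so σ = √3 here. The observed spread of the sample means should sit close to √3/√n and shrink as the sample size grows.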
Central Limit Theorem



These histograms are pictures of The Sampling
Distribution of the Mean
This phenomenon occurs for any population with a finite mean and variance
The result is called the Central Limit
Theorem (CLT); its proof involves some fairly non-trivial mathematics
Definition: Central Limit Theorem
continued…

The sampling distribution of the mean has an average value = μ
(the population mean).

The sampling distribution of the mean has a standard deviation
$\dfrac{\sigma}{\sqrt{n}}$
where σ is the population standard deviation, and n is the
sample size taken.

This value is called the standard error of the mean.
The Sampling Distribution of the Mean will be approximately a Normal
distribution if the sample size is large.
Central Limit Theorem - Definition

If a random sample is taken from a population,
where:




The members of the sample can be considered to be
independent of each other
They are all members of the same population
That population has a mean value μ and a standard
deviation σ
Then.......
Central Limit Theorem - Definition

.........
The central limit theorem states that
given a distribution with a mean μ and
variance σ², the sampling distribution
of the mean approaches a normal
distribution with a mean μ and a
variance σ²/n as n, the sample size,
increases

This is a non-mathematical definition of the Central
Limit Theorem (CLT)
The Distribution of the Sample Means
[Figure: distribution of the sample means $\bar{x}$, with standard deviation $\sigma / \sqrt{n}$]
Confidence Intervals
How can we use the CLT



The Central Limit Theorem avoids the necessity of
specifying a complete statistical model for all the
sampled data.
All we have to do is specify a probability model for
the sample mean.
For any sample mean, calculated from a large
independent random sample taken from any
population with a mean μ and standard deviation σ,
we know from the CLT that this sample mean is
approximately a random variable from a Normal distribution with
mean = μ and standard deviation = $\sigma / \sqrt{n}$
Practical use for the CLT continued…
Take a single sample and calculate $\bar{x}$
This is an estimate of μ, the true (but unknown)
population mean.
But, how good is this estimate?
We assume that $\bar{x}$ is not exactly μ, but that μ is
somewhere near - but how near is it likely to be?
Confidence Intervals Introduction

We would like to make probability statements as to
how close $\bar{x}$ is likely to be to μ.

If the sample size is sufficiently large, then the estimate $\bar{x}$
can be considered as:
• a random variable from a Normal distribution,
• so probability statements are possible.


This is how we use the CLT in practical data analysis.
Confidence Intervals Introduction


For a Normal distribution, we know that 95% of
values will be within 1.96 standard deviations of μ
So, given one estimate $\bar{x}$, we can say that this estimate
is within 1.96 standard errors of the actual
population mean μ, with 95% confidence
[Figure: Normal curve with the central 95% of the area shaded]
• We can turn this knowledge on its
head: given $\bar{x}$, we can be 95%
confident that the true mean μ is
within 1.96 standard errors of it.
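The step of turning the statement on its head is only a rearrangement of one inequality. A short derivation, writing SE for the standard error σ/√n (the abbreviation is introduced here for convenience):

```latex
\begin{align*}
0.95 &= P\left(\mu - 1.96\,\mathrm{SE} \le \bar{x} \le \mu + 1.96\,\mathrm{SE}\right) \\
     &= P\left(-1.96\,\mathrm{SE} \le \bar{x} - \mu \le 1.96\,\mathrm{SE}\right) \\
     &= P\left(\bar{x} - 1.96\,\mathrm{SE} \le \mu \le \bar{x} + 1.96\,\mathrm{SE}\right),
\qquad \mathrm{SE} = \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

The first line says the sample mean falls near μ; the last line says the same event puts μ inside an interval centred on the sample mean.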
Confidence Interval
From this we can specify a range of values within which we
are 95% confident that the population mean (μ) lies
• This is called a confidence interval
• 95% Confidence Interval for a population mean
(from a large enough sample):

$\bar{x} \pm 1.96 \times \text{standard error}$
$= \bar{x} \pm 1.96 \times \dfrac{\sigma}{\sqrt{n}}$

Remarkably, this result holds for samples of size 30 or
more. So, a large sample, in this context, is a sample of 30
or more.
Example
One sample of size 30 from the electronic components
yields a sample mean $\bar{x}$ = 5,873 hours. We know σ =
3,959 so a 95% confidence interval would be:

$\bar{x} \pm 1.96 \times \text{standard error}$
$= \bar{x} \pm 1.96 \times \dfrac{\sigma}{\sqrt{n}} = 5873 \pm 1.96 \times \dfrac{3959}{\sqrt{30}}$
$= 5873 \pm 1417 = 4456 \text{ to } 7290$

Interpretation: we would say that the average lifetime of all
components (μ) is between 4,456 and 7,290 hours with 95%
confidence
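The same calculation can be sketched in Python. The function name `z_interval` is an illustrative choice; `NormalDist().inv_cdf` from the standard library supplies the z value instead of a printed table.

```python
import math
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Large-sample CI for a mean when the population sigma is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96 for 95%
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Electronic components example: n = 30, sample mean 5,873, sigma 3,959.
low, high = z_interval(x_bar=5873, sigma=3959, n=30)
print(f"95% CI: {low:.0f} to {high:.0f} hours")
```

Run as written, this prints an interval of 4456 to 7290 hours, in agreement with the hand calculation.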
Confidence Intervals





Why is this any good?
Before: one estimate, $\bar{x}$ = 5,873, but no idea of how
good or bad it was, i.e. how close to μ it was likely to
be.
Now: 95% confident that μ is between 4,456 and
7,290 hours.
So, using the CLT leads to Confidence Intervals that
enable us to estimate a parameter with a certain level of
confidence.
In other words, it gives us an objective measure of the
actual amount of information contained in our sample
about the likely location of μ.
Problem with σ


All of the above assumes that the population
standard deviation (i.e. σ) is known.
In practice this is not known (just like μ).
So, we need to estimate σ as well as μ
• We get this estimate from the standard deviation of the
sample, given that the sample is large enough.

The sample standard deviation is called ‘s’:
$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Estimate σ by s
General Confidence Interval for μ
(Large Samples)

The general formula is:
$CI_{1-\alpha} = \bar{x} \pm z_{1-\alpha/2} \times \dfrac{s}{\sqrt{n}}$
Where:
• α is a value between 0 and 1,
• (1−α)×100% is the confidence level you want
• $z_{1-\alpha/2}$ is a value from the Normal distribution table.
• Example: for a 95% CI, α = 0.05
  (1−α)×100% = 95%
  $z_{1-\alpha/2}$ = 1.96
Z-Values

The values of $z_{1-\alpha/2}$ for other confidence levels
are given in standard tables.

Confidence Level    α/2               z(1−α/2)
90%                 0.05 (5%)         1.6449
95%                 0.025 (2.5%)      1.96
99%                 0.005 (0.5%)      2.5758
99.9%               0.0005 (0.05%)    3.2905
Example

Using these we get the following results for the electronic
component example:

Confidence Level    z(1−α/2)    CI
90%                 1.6449      4684 to 7062
95%                 1.96        4456 to 7290
99%                 2.5758      4011 to 7735
99.9%               3.2905      3495 to 8251

Note: as α gets smaller the CI gets wider.
Also, as n gets bigger the CI narrows. So big
samples lead to more precise estimates (i.e. narrower
confidence intervals)
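Both effects can be seen by computing the half-width z × s/√n for a few settings. A minimal sketch, reusing the standard deviation of 3,959 from the component example; the function name and the larger sample sizes (120 and 480) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def ci_half_width(s, n, confidence):
    """Half-width z * s / sqrt(n) of a large-sample CI for the mean."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return z * s / math.sqrt(n)

s = 3959  # standard deviation from the component example

# Same n, increasing confidence level: the interval widens.
for confidence in (0.90, 0.95, 0.99, 0.999):
    print(f"n=30, {confidence:.1%}: +/- {ci_half_width(s, 30, confidence):.0f}")

# Same confidence level, increasing n: the interval narrows.
for n in (30, 120, 480):
    print(f"95%, n={n}: +/- {ci_half_width(s, n, 0.95):.0f}")
```

Because n enters through √n, quadrupling the sample size only halves the width of the interval.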
What CI’s and sample sizes should I
use?





You can’t control s – it is inherent in the data
(population).
You can’t control $\bar{x}$ either.
You can control $z_{1-\alpha/2}$, but in practice scientific
convention sets this to reflect 90%, 95% or 99%
confidence, with 95% being the accepted default.
You can choose n – but resources may limit you.
There is a whole topic called sample size
determination which you may want to review before
collecting data or starting research
Confidence Interval Assumptions

Sample size 40 or greater

Experimental units are independent of each other

Experimental units were randomly sampled


The independence assumption requires that the value of the
variable for one experimental unit should not tell us
anything about the value for another.
Randomness is required to avoid systematic bias in
selection.
Exercise

Complete Exercises 1 & 2
Calculation of CIs for small samples



What about small samples?
In the case of CIs about a mean we can use the
Student-t distribution.
The process turns out to be very similar, but the CLT
can no longer be relied upon
History of the Student t test

William Gosset used the publishing pseudonym ‘Student’.
He derived the correct sampling distribution for the mean
of samples < 40 – and called it the ‘t distribution’.

In his honour, it is often called the ‘Student t’ distribution.

Gosset was a chief brewer for Guinness.

The mathematical details are complicated, but, it turns out
that we perform exactly the same calculations as before,
with the one change that the t distribution instead of the
normal distribution is used.
Assumptions



Student’s result applies only to the mean of a sample
drawn from a population that is normally
distributed with some mean μ and finite standard
deviation σ.
This is in contrast to the CLT for large samples that
required no such assumption about normality.
The t-test also requires the assumption regarding
independence in the sample.
Statistical Model for mean from small
samples



The experimental units are independently sampled from
a population with mean=μ and standard deviation = σ
The population is normally distributed (we don’t need
this with large samples)
So, to use the t-test for a small sample, you need to
establish that data is sampled from a population that is
normally distributed – you could look at the histogram
of the sample and see if it is symmetric and bell shaped
– or use other methods.
The t - Statistic

If the assumptions are met, the statistic:

$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

can be shown to be distributed according to a
(Student) t-distribution.

The t-distribution has one parameter, called ‘degrees
of freedom’ (df).
The t-Distribution




The t-distribution itself is bell shaped and symmetric,
just like the normal distribution, but is ‘flatter’, with heavier tails.
There are many t distributions – one for each sample
size.
The rule used is: for a sample of size n – use the t
distribution with degrees of freedom = n−1
Example: if the sample size is 15, then use a t
distribution with degrees of freedom 15 − 1=14.
Note: degrees of freedom is often abbreviated to df.
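The ‘flatter’ shape can be checked numerically. The sketch below evaluates the t density, using its standard closed form with the gamma function, next to the standard normal density; the function name `t_pdf` and the evaluation points are illustrative choices.

```python
import math
from statistics import NormalDist

def t_pdf(x, k):
    """Density of the t distribution with k degrees of freedom."""
    coeff = math.gamma((k + 1) / 2) / (math.sqrt(math.pi * k) * math.gamma(k / 2))
    return coeff * (1 + x * x / k) ** (-(k + 1) / 2)

standard_normal = NormalDist()

# Columns: x, Normal(0,1) density, t(df=4) density, t(df=1) density.
for x in (0.0, 2.0, 4.0):
    print(x,
          round(standard_normal.pdf(x), 4),
          round(t_pdf(x, 4), 4),
          round(t_pdf(x, 1), 4))
```

At the centre the t densities are lower than the normal density, and far from the centre they are higher. The fewer the degrees of freedom, the more pronounced the difference.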
The t-Distribution
[Figure: density curves for Normal(0,1), t(df=4) and t(df=1), plotted for x from −4 to 4]
The t probability density function with k degrees of freedom:
$f(x) = \dfrac{\Gamma\left((k+1)/2\right)}{\sqrt{\pi k}\,\Gamma(k/2)} \cdot \dfrac{1}{\left[(x^2/k) + 1\right]^{(k+1)/2}}$


General Confidence Interval for μ
(small Samples)

The general formula is:
$CI_{1-\alpha} = \bar{x} \pm t_{n-1,\,1-\alpha/2} \times \dfrac{s}{\sqrt{n}}$
Where (1−α)×100% is the confidence level you want and
$t_{n-1,\,1-\alpha/2}$ is a value from the t distribution with df = n−1 and the
specified α level.

What is $t_{n-1,\,1-\alpha/2}$?
A value from the t distribution with n−1 df such that
100(1−α)% of values lie within that range around the mean.

How do you find $t_{n-1,\,1-\alpha/2}$?
From a table specifically designed to give it to you, or
use a computer
Confidence Level    α/2               t(df=1)    t(df=10)    t(df=30)
90%                 0.05 (5%)         6.314      1.812       1.697
95%                 0.025 (2.5%)      12.71      2.228       2.042
99%                 0.005 (0.5%)      63.66      3.169       2.750
99.9%               0.0005 (0.05%)    636.6      4.587       3.646

Note: as α gets smaller the CI gets wider;
as df gets smaller the CI gets wider
Example


Internal temperature of autoclaved aerated
concrete used in building. An engineer recorded the
following data:
23.01, 22.22, 22.04, 22.62, 22.59
95% CI for the population mean?
$CI_{1-\alpha} = \bar{x} \pm t_{n-1,\,1-\alpha/2} \times \dfrac{s}{\sqrt{n}}$
$= 22.5 \pm 2.776 \times \dfrac{0.3783}{\sqrt{5}}$
$= 22.5 \pm 0.4696 = (22.03, 22.97)$
Exercise

Answer Questions 3-6
Confidence Intervals for Proportions
(Large Samples)



Proportions (including %) are often a statistic of interest
Think of the proportion of defective items on a
production line, the proportion of people who respond
favourably to a survey question, or the proportion of
successes versus failures in some experiment
Proportions are also covered by the CLT - remember
that a proportion is just the average of a 0/1 (no/yes) variable
Confidence Intervals for Proportions
(Large Samples)


Take a sample of size n of electronic components coming
off a production line, and test each one for defects. The
statistic of interest is the proportion of defectives
produced by the production process.
The estimated proportion from the sample is
$\hat{p} = \dfrac{\text{number of defectives in the sample}}{n \text{ (the total sample size)}}$

where $\hat{p}$ (p-hat) is the symbol used for the estimated
proportion from the sample
Confidence Intervals for Proportions
(Large Samples)

If the sample size is sufficiently large and we repeat
the experiment a large number of times, then:

• The sampling distribution of the proportion will be
approximately normally distributed, by the CLT
• The mean of this distribution will be p, i.e. the 'true'
population proportion
• The standard deviation of the sampling distribution of the
proportion, called the standard error of the proportion, is
estimated by
$\text{S.E. of proportion} = \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
Example:

A pharmaceutical company produces 400,000
capsules per day of a particular drug. They test
200 of the capsules for defects (too much/little
active compound). If the population p = 0.05, and
they take 10,000 repeated samples this is the
histogram they would get
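The repeated-sampling experiment described here can be simulated. This is a sketch under the assumption that each capsule is defective independently with probability p = 0.05, which is reasonable when 200 capsules are drawn from 400,000; the function name and the seed are illustrative.

```python
import math
import random
import statistics

def simulate_sample_proportions(p=0.05, n=200, repeats=10_000, seed=42):
    """Draw `repeats` samples of n capsules and return each sample's defect rate."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p for _ in range(n)) / n
        for _ in range(repeats)
    ]

proportions = simulate_sample_proportions()
print("mean of p-hat:", round(statistics.fmean(proportions), 4))
print("sd of p-hat:  ", round(statistics.stdev(proportions), 4))
print("theory SE:    ", round(math.sqrt(0.05 * 0.95 / 200), 4))
```

The mean of the simulated proportions should sit close to p, and their standard deviation close to the standard error formula. Plotting `proportions` as a histogram gives the roughly bell-shaped picture the CLT predicts.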
Sample Size



How big does the sample have to be for the CLT to
work with proportions?
The rule is different from the rule for means. Do the
following test.
A rule of thumb: the sample size is big enough if
1. np > 5 and
2. n(1−p) > 5
General Confidence Interval Formula for a
Population Proportion (large Sample)
$CI_{1-\alpha} = \hat{p} \pm z_{1-\alpha/2} \times \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
where (1−α)×100% = the confidence level and $z_{1-\alpha/2}$ = a value from the
standard normal distribution such that 100(1−α)% of values of
a standard normal distribution lie within that range around the
mean
So the $z_{1-\alpha/2}$ values used for a population proportion are the
same as those used for a population mean
Example

How many voters will give F.F. a first preference in the next
general election? There are 2 different estimates:

• Researcher A (10 people) => 40%
• Researcher B (100 people) => 25%

How much 'better' is estimate B than estimate A?
Step one: can we use the large-sample formula?
1. Researcher A: np = 10 × 0.4 = 4 => 4 is not greater than 5,
therefore you cannot use the large-sample method
2. Researcher B: np = 100 × 0.25 = 25
n(1−p) = 100 × (1 − 0.25) = 75
both figures are greater than 5, therefore you can use the
large-sample method
Example Continued

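Researcher B - 95% Confidence Interval
$CI_{1-\alpha} = \hat{p} \pm z_{1-\alpha/2} \times \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$
$CI_{95} = 0.25 \pm 1.96 \times \sqrt{\dfrac{0.25 \times 0.75}{100}}$
$CI_{95} = 0.25 \pm 1.96 \times 0.0433$
$CI_{95} = 0.25 \pm 0.0849$
$CI_{95} = 0.17 \text{ to } 0.33$ (to two decimal places)

So, the 95% CI is 17% to 33%.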
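The two steps, the rule-of-thumb check and the interval itself, can be sketched as follows. The function names are illustrative, and the check uses the sample proportion in place of the unknown p.

```python
import math
from statistics import NormalDist

def large_sample_ok(p_hat, n, threshold=5):
    """Rule of thumb from the lecture: both n*p and n*(1-p) must exceed 5."""
    return n * p_hat > threshold and n * (1 - p_hat) > threshold

def proportion_interval(p_hat, n, confidence=0.95):
    """Large-sample (CLT-based) CI for a population proportion."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

print("Researcher A passes the check:", large_sample_ok(0.40, 10))   # n*p = 4
print("Researcher B passes the check:", large_sample_ok(0.25, 100))  # 25 and 75

low, high = proportion_interval(0.25, 100)
print(f"Researcher B 95% CI: {low:.3f} to {high:.3f}")
```

Rounded to two decimal places, the interval for Researcher B is 0.17 to 0.33.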
Example Continued





NB: In fact we can get a 95% CI for researcher A's
findings using small sample theory (exact CI) - this is
available in SAS and other software.
Exact CIs are often based on direct use of probability
models.
The method is based directly on calculations for the
binomial distribution (see lecture 3)
What do we have to do?
Using the CLT, we found that the 95% CI was composed
of the set of values for the mean such that a
hypothesis test would not reject the null hypothesis for
any of those values in the set at the α = 0.05 level.

Using SAS we can calculate a 95% CI for
Researcher A:
• $CI_{95}$ for Researcher A = 12% to 74%
• which is too wide to be informative anyway!

If we use the same technique for researcher B we
get:
• $CI_{95}$ for Researcher B = 17% to 35%
which is virtually the same as before using the CLT.
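For readers without SAS, an exact interval can be sketched directly from the binomial distribution. This sketch assumes the exact method meant here is the Clopper-Pearson interval (a common choice for exact binomial limits) and finds each limit by bisection; the function names are illustrative.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def solve_increasing(f, target, tol=1e-10):
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(x, n, confidence=0.95):
    """Clopper-Pearson interval for x successes in n trials."""
    alpha = 1.0 - confidence
    # Lower limit: the p at which P(X >= x) equals alpha/2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper limit: the p at which P(X <= x) equals alpha/2,
    # i.e. 1 - P(X <= x) equals 1 - alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2)
    return lower, upper

# Researcher A: 4 of 10; Researcher B: 25 of 100.
for label, x, n in (("A", 4, 10), ("B", 25, 100)):
    low, high = exact_interval(x, n)
    print(f"Researcher {label}: {low:.1%} to {high:.1%}")
```

The interval for Researcher A spans most of the plausible range, while the interval for Researcher B is close to the CLT-based one.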
Exact CI and tests for population
proportions


These work for small samples as well as large
samples
With large samples they will give essentially the same
results as the CLT

Must be used for small samples, however

Based on the binomial probability distribution.
Difference between Exact and CLT
based methods



When sample sizes are ‘large’ they will give the
same results – but exact tests can be very hard to
compute even with modern PCs
When sample sizes are small exact methods must
be used
The CIs from small samples tend to be very wide, so
there is no substitute for collecting as much
high-quality data as you can manage.
Exercise

Answer Questions 7-9