Download Inferential Statistics - DBS Applicant Gateway

Inferential Statistics
Part 1 – Sampling Distributions, Point Estimates & Confidence Intervals
Inferential statistics are used to draw inferences (make conclusions/judgements) about a
population from a sample. Consider an experiment in which 10 students who sat an exam
after 24 hours of sleep deprivation scored 12% lower than 10 students who sat the same exam
after a normal night's sleep. Is the difference real or could it be due to chance? How much
larger could the real difference be than the 12% found in the sample? These are the types of
questions answered by inferential statistics.
There are two main methods used in inferential statistics: estimation and hypothesis testing.
In estimation, the sample is used to estimate a parameter and a confidence interval about the
estimate is constructed.
In the most common use of hypothesis testing, a null hypothesis is put forward and it is
determined whether the data is strong enough to reject it. For the sleep deprivation study, the
null hypothesis would be that sleep deprivation has no effect on performance.
Sampling Error
When we looked at primary data collection and sampling methods, we stressed the
importance of selecting a random sample so that every item or individual in the population
had a known chance of being selected.
To accomplish this, we could choose a simple random sample, a systematic sample, a
stratified sample, a cluster sample, or a combination of these methods. However, it is unlikely
that the mean of a sample would be identical to the population mean. Likewise, the sample
standard deviation or other measure computed from a sample would probably not be exactly
equal to the corresponding population value.
We can therefore expect some difference between a sample statistic, such as the sample mean
or sample standard deviation, and the corresponding population parameter. The difference
between a sample statistic and a population parameter is called sampling error.
Example:
Suppose that a population of five students had exam results of 68, 72, 67, 69 and 74. Suppose
that a sample of two results – 68 and 74 – is selected to estimate the population mean result.
The mean of that sample would be 71. Another sample of two is selected – 72 and 67 – with
a sample mean of 69.5.
The mean of all the results (the population mean) is 70.
The sampling error for the first sample is 1, determined by X̄ - µ = 71 - 70.
The second sample has a sampling error of -0.5, determined by X̄ - µ = 69.5 - 70.
Each of these differences, 1 and -0.5, is the error made in estimating the population mean
based on a sample mean, and these sampling errors are due to chance. The size of these
errors will vary from one sample to the next.
So given the possibility of a sampling error when sample results are used to estimate a
population parameter, how can we make accurate inferences/conclusions about the
population based only on sample results? To begin with we develop a sampling distribution
of the sample means.
Sampling Distribution of the Sample Means
The exam results example showed that the means for samples of a specified size vary from
sample to sample. The mean exam result of the first sample of two students was 71, and the
second sample mean was 69.5. A third sample would probably result in a different mean.
The population mean was 70. If we organised the means of all possible samples of two results
into a probability distribution, we would obtain the sampling distribution of the sample
means.
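As a rough illustration, the Python sketch below lists every possible sample of two from the five exam results in the example, with each sample mean and its sampling error computed directly from those results. The variable names are illustrative and are not part of the original notes.

```python
from itertools import combinations
from statistics import mean

# Population of five exam results from the example
population = [68, 72, 67, 69, 74]
mu = mean(population)  # population mean

# Every possible sample of two results (order does not matter)
samples = list(combinations(population, 2))

# Sampling error = sample mean - population mean
sampling_errors = {s: mean(s) - mu for s in samples}

for s, err in sampling_errors.items():
    print(f"sample {s}: mean = {mean(s):.1f}, sampling error = {err:+.1f}")

# Averaging the means of all possible samples recovers the population mean
mean_of_means = mean(mean(s) for s in samples)
print(f"population mean = {mu}, mean of sample means = {mean_of_means}")
```

The individual sampling errors differ from sample to sample, but they cancel out when every possible sample is considered.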
Example:
A firm has seven production workers (considered the population). The hourly earnings of
each worker are given below.
Employee No.    Hourly Earnings (€)
10001           7
10002           7
10003           8
10004           8
10005           7
10006           8
10007           9
1. What is the population mean?
2. What is the sampling distribution of the sample means for samples with a size of 2?
3. What is the mean of the sampling distribution?
4. What observations can be made about the population and the sampling distribution?
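The population mean is found by:
$$\mu = \frac{\sum X}{N} = \frac{7 + 7 + 8 + 8 + 7 + 8 + 9}{7} = \frac{54}{7} \approx 7.71$$
that is, approximately €7.71 per hour.
To get the sampling distribution of the sample means, all possible samples of 2 are selected
without replacement from the population, and their means are computed. There are 21
possible samples, found by:
$$_{N}C_{n} = \frac{N!}{n!\,(N-n)!} = \frac{7!}{2!\,(7-2)!} = 21$$
where N is the number of observations in the population and n is the number of observations
in the sample.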
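The two calculations above can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative.

```python
from math import comb, factorial

# Hourly earnings (euro) of employees 10001 to 10007
hourly_earnings = [7, 7, 8, 8, 7, 8, 9]

N = len(hourly_earnings)  # number of observations in the population
n = 2                     # number of observations in each sample

population_mean = sum(hourly_earnings) / N
print(f"Population mean: {population_mean:.2f}")

# Number of different samples of size n drawn without replacement
num_samples = factorial(N) // (factorial(n) * factorial(N - n))
print(f"Possible samples: {num_samples}")

# math.comb computes the same quantity directly
assert num_samples == comb(N, n)
```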
The 21 sample means from all possible samples of 2 that can be drawn from the
population are shown below.
Sample   Employees        Hourly Earnings (€)   Sum (€)   Mean (€)
1        10001 + 10002    7 + 7                 14        7.00
2        10001 + 10003    7 + 8                 15        7.50
3        10001 + 10004    7 + 8                 15        7.50
4        10001 + 10005    7 + 7                 14        7.00
5        10001 + 10006    7 + 8                 15        7.50
6        10001 + 10007    7 + 9                 16        8.00
7        10002 + 10003    7 + 8                 15        7.50
8        10002 + 10004    7 + 8                 15        7.50
9        10002 + 10005    7 + 7                 14        7.00
10       10002 + 10006    7 + 8                 15        7.50
11       10002 + 10007    7 + 9                 16        8.00
12       10003 + 10004    8 + 8                 16        8.00
13       10003 + 10005    8 + 7                 15        7.50
14       10003 + 10006    8 + 8                 16        8.00
15       10003 + 10007    8 + 9                 17        8.50
16       10004 + 10005    8 + 7                 15        7.50
17       10004 + 10006    8 + 8                 16        8.00
18       10004 + 10007    8 + 9                 17        8.50
19       10005 + 10006    7 + 8                 15        7.50
20       10005 + 10007    7 + 9                 16        8.00
21       10006 + 10007    8 + 9                 17        8.50
This probability distribution is the sampling distribution of the sample means and can be
summarised as follows.
Sampling Distribution of the Sample Mean for n = 2
Sample Mean (€)   No. of Means   Probability
7.00              3              0.1429
7.50              9              0.4286
8.00              6              0.2857
8.50              3              0.1429
Total             21             1.0000
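The same sampling distribution can be built programmatically. The sketch below enumerates every pair of employees, computes each pair's mean hourly earnings and tallies how often each mean occurs; the data structures used are an illustrative choice, not part of the original notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

# Hourly earnings (euro) keyed by employee number
earnings = {10001: 7, 10002: 7, 10003: 8, 10004: 8,
            10005: 7, 10006: 8, 10007: 9}

# Mean hourly earnings for every possible pair of employees
sample_means = [
    Fraction(earnings[a] + earnings[b], 2)
    for a, b in combinations(earnings, 2)
]

counts = Counter(sample_means)  # how many samples give each mean
total = len(sample_means)       # number of possible samples

# Probability of each sample mean = count / total number of samples
distribution = {m: Fraction(c, total) for m, c in sorted(counts.items())}

for m, p in distribution.items():
    print(f"{float(m):.2f}  {counts[m]:>2}  {float(p):.4f}")
```

Exact fractions are used so that the probabilities sum to exactly 1 before they are rounded for display.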
The mean of the sampling distribution of the sample mean is obtained by summing the
various sample means and dividing the sum by the number of samples. The mean of all the
sample means is usually written $\mu_{\bar{X}}$. The µ reminds us that it is a population value because we
have considered all possible samples. The subscript X̄ indicates that it is the sampling
distribution of the mean. For the 21 samples above:
$$\mu_{\bar{X}} = \frac{\text{sum of all sample means}}{\text{number of samples}} = \frac{162}{21} \approx 7.71$$
These observations can be made:
a. The mean of the sample means (€7.71) is equal to the mean of the population: $\mu_{\bar{X}} = \mu$.
b. The spread in the distribution of the sample means is less than the spread in the
population values. The sample means range from €7 to €8.50, while the population
values vary from €7 to €9. In general, when the population is large relative to the
sample (or sampling is done with replacement), the standard deviation of the
distribution of the sample means is equal to the population standard deviation divided
by the square root of the sample size. So the formula for the standard deviation of the
distribution of sample means is:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Therefore, as we increase the size of the sample, the spread of the distribution of the
sample means becomes smaller.
c. The shape of the sampling distribution of the sample means and the shape of the
frequency distribution of the population values are different. The distribution of
sample means tends to be more bell-shaped and to approximate the normal
probability distribution.
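Observations (a) and (b) can be checked numerically for the seven-worker example. The sketch below is illustrative only. One addition that is not in the original notes: because this tiny population is sampled without replacement, the exact standard deviation of the sample means equals σ/√n multiplied by a finite population correction, √((N − n)/(N − 1)). For large populations that factor is close to 1.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

population = [7, 7, 8, 8, 7, 8, 9]  # hourly earnings (euro)
N, n = len(population), 2

# Means of all 21 possible samples of size 2
sample_means = [mean(pair) for pair in combinations(population, n)]

# (a) the mean of the sample means equals the population mean
mu = mean(population)
mu_xbar = mean(sample_means)

# (b) the sample means are less spread out than the population values
sigma = pstdev(population)         # population standard deviation
sigma_xbar = pstdev(sample_means)  # standard deviation of the sample means

# sigma / sqrt(n) is exact for large populations or sampling with replacement;
# with N = 7 and no replacement the finite population correction is needed.
approx = sigma / sqrt(n)
corrected = approx * sqrt((N - n) / (N - 1))

print(f"mu = {mu:.4f}, mean of sample means = {mu_xbar:.4f}")
print(f"sigma = {sigma:.4f}, sd of sample means = {sigma_xbar:.4f}")
print(f"sigma/sqrt(n) = {approx:.4f}, with correction = {corrected:.4f}")
```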
In summary, we took all possible random samples from a population and for each sample
calculated a sample statistic (the mean). Because each possible sample has a chance of being
selected, the probability that the mean amount earned will take values such as €7.00,
€7.50, €8.50 etc. can be determined. The distribution of the mean amounts earned is called
the sampling distribution of the sample means.
Even though in practice we see only one particular random sample, in theory any of the
samples could arise. Consequently, we view the sampling process as repeated sampling of the
statistic from its sampling distribution. This sampling distribution is then used to measure
how likely a particular outcome might be.
The Central Limit Theorem
Applying the central limit theorem to the sampling distribution of the sample means allows us
to use the normal probability distribution to create confidence intervals for the population
mean.
The central limit theorem states that, for large random samples, the shape of the sampling
distribution of the sample means is close to a normal probability distribution. The
approximation is more accurate for large samples than for small samples (most statisticians
consider a sample of 30 or more large enough for the central limit theorem to be employed).
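A small simulation can make this concrete. The sketch below is an illustration under stated assumptions, not part of the original notes: it draws repeated samples from a strongly right-skewed population (an exponential distribution with mean 1, chosen arbitrarily) and measures the skewness of the resulting sample means for n = 2 and n = 30. A skewness near 0 indicates a roughly symmetric, bell-like shape.

```python
import random
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: the average cubed z-score (0 for a symmetric shape)."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

def simulate_sample_means(n, repeats, rng):
    """Means of `repeats` random samples of size n from an exponential(1) population."""
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(repeats)]

rng = random.Random(42)  # fixed seed so the run is repeatable

results = {}
for n in (2, 30):
    means = simulate_sample_means(n, 5000, rng)
    results[n] = (fmean(means), skewness(means))
    print(f"n = {n:>2}: mean of sample means = {results[n][0]:.3f}, "
          f"skewness = {results[n][1]:.3f}")
```

The sample means for n = 30 should be noticeably less skewed than those for n = 2, while both sets centre on the population mean of 1.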
General Procedure
Sampling requires that we draw successive samples from a defined population. The samples
must be randomly selected and of the same size.
Calculate the mean for each sample and plot the sample means. This produces a distribution
of sample means. A plot of an "infinite" number of sample means is called the sampling
distribution of the mean.
Successive Sampling
Frequency distributions of sample means quickly approach the shape of a normal distribution,
even if we are taking relatively few, small samples from a population that is not normally
distributed.
As we randomly select more and more samples from the population, the distribution of
sample means becomes more normally distributed and looks smoother.
With "infinite" numbers of successive random samples, the sampling distributions all have a
normal distribution with a mean that is equal to the population mean (μ).
Increasing Sample Size
As sample sizes increase, the sampling distributions approach a normal distribution. With
"infinite" numbers of successive random samples, the mean of the sampling distribution is
equal to the population mean (μ).
As the sample size increases, the variability of each sampling distribution decreases. The
range of the sampling distribution is smaller than the range of the original population.
Taken together, these distributions suggest that the sample mean provides a good estimate of
μ and that errors in our estimates (indicated by the variability of scores in the distribution)
decrease as the size of the samples we draw from the population increases.
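The shrinking variability can be seen in a short simulation. This sketch assumes, purely for illustration, a population that is uniform on the interval 0 to 1 (standard deviation √(1/12)); it compares the observed standard deviation of the sample means with σ/√n for three sample sizes.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(0)   # fixed seed so the run is repeatable
sigma = sqrt(1 / 12)     # standard deviation of a uniform(0, 1) population

spread = {}
for n in (4, 16, 64):
    # 4000 successive random samples, each of size n
    means = [fmean(rng.random() for _ in range(n)) for _ in range(4000)]
    spread[n] = pstdev(means)
    print(f"n = {n:>2}: sd of sample means = {spread[n]:.4f}, "
          f"sigma/sqrt(n) = {sigma / sqrt(n):.4f}")
```

Each fourfold increase in the sample size should roughly halve the spread of the sample means.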
Population Distributions
The principles of successive sampling and increasing sample size work for all distributions.
We can count on the sampling distribution of the mean being approximately normally
distributed, no matter what the original population distribution looks like as long as the
sample size is relatively large.
The central limit theorem states that when an infinite number of successive random samples
are taken from a population, the distribution of sample means calculated for each sample will
become approximately normally distributed with mean μ and standard deviation σ/√n as the
sample size (n) becomes larger, irrespective of the shape of the population distribution.
This is one of the most useful conclusions in statistics. We can reason about the distribution
of the sample means with no information about the shape of the original
distribution from which the sample is taken. In other words, the central limit theorem holds
whatever the shape of the population distribution, provided its variance is finite.
The central limit theorem as stated here applies to sample means (and, equivalently, to
sample sums); it does not by itself describe the sampling distribution of other statistics,
such as the sample median.
The central limit theorem tells us what to expect of the distribution of sample means when
we take an infinite number of relatively large samples of a given size from a population.
The central limit theorem works no matter how the population distribution is shaped.
The central limit theorem helps us test hypotheses about means because it tells us what to
expect when we draw samples from a population.
Point Estimates & Confidence Intervals
In statistics, point estimation involves the use of sample data to calculate a single value
(known as a statistic) which serves as a "best guess" for an unknown population parameter.
For example, the sample meanX, is a statistic, and is a point estimate of the population
mean, a parameter, μ.
While we expect the point estimate (statistic) to be close to the population parameter, we
would like to measure how close (accurate) it is. A confidence interval serves this purpose.
In contrast to point estimation, which is a single number, with confidence intervals we use
sample data to construct an interval (range) of possible (or probable) values of an unknown
population parameter, so that the parameter occurs within that range at a specified
probability. The specified probability is called the level of confidence.
The information developed about the shape of a sampling distribution of the sample means,
that is the sampling distribution of X̄, allows us to locate an interval that has a specified
probability of containing the population mean μ. For reasonably large samples, we can use the
central limit theorem and state the following:
1. 95% of the sample means selected from a population will be within 1.96 standard
deviations of the population mean μ.
2. 99% of the sample means will lie within 2.58 standard deviations of the population
mean.
How are the values of 1.96 and 2.58 obtained?
The 95% and 99% refer to the percent of the time that similarly constructed intervals would
include the parameter being estimated. For example, 95% refers to the middle 95% of the
observations. Therefore, the remaining 5% is equally divided between the two tails, leaving
2.5% (an area of 0.025) in each tail.
The central limit theorem states that, for large random samples, the shape of the sampling
distribution of the sample means is approximately normal. Therefore, we use z-tables to look
at areas under the normal curve (see z-table). The z value with an area of 0.975 to its left
(0.95 + 0.025) is 1.96. For 99%, each tail holds 0.5% and the corresponding z value is
approximately 2.58.
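Instead of a printed z-table, the same values can be obtained from the inverse of the standard normal cumulative distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_for_confidence` is hypothetical.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """z value such that the middle `level` of the standard normal lies within +/- z."""
    tail = (1 - level) / 2                 # area left in each tail
    return NormalDist().inv_cdf(1 - tail)  # z with area (1 - tail) to its left

z95 = z_for_confidence(0.95)
z99 = z_for_confidence(0.99)
print(f"95%: z = {z95:.2f}")
print(f"99%: z = {z99:.2f}")
```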
When the sample size, n, is at least 30, it is generally accepted that the central limit theorem
will ensure an approximately normal distribution of the sample means. This is an important
consideration. If the sample means are normally distributed, we can use the standard normal
distribution, that is, z, in our calculations. When n ≥ 30 the formula for getting the
confidence interval for a mean is:
$$\bar{X} \pm z \frac{s}{\sqrt{n}}$$
where:
X̄ = the sample mean
z = appropriate z value for level of confidence
s = sample standard deviation
n = sample size
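As a minimal sketch, the confidence interval calculation can be wrapped in a small function that works from raw sample data. The function name and the sample data are hypothetical; the data are randomly generated for illustration only.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """z-based confidence interval for a population mean (intended for n >= 30)."""
    n = len(sample)
    if n < 30:
        raise ValueError("the z-based interval assumes a sample size of at least 30")
    x_bar = mean(sample)   # sample mean
    s = stdev(sample)      # sample standard deviation
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical data: 100 randomly generated values, for illustration only
rng = random.Random(1)
sample = [rng.gauss(50, 10) for _ in range(100)]

low, high = mean_confidence_interval(sample)
print(f"95% confidence interval: {low:.2f} to {high:.2f}")
```

The function rejects samples smaller than 30 because the z-based formula relies on the central limit theorem.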
Example:
An experiment involves selecting a random sample of 256 managers. One item of interest is
annual income. The sample mean is €45,420, and the sample standard deviation is €2,050.
1. What is the estimated mean income of all managers (the population) i.e. what is the
point estimate?
2. What is the 95% confidence interval for the population mean (rounded to the nearest
€10)?
3. Interpret the findings.
1. The point estimate of the population mean is €45,420.
2. The confidence interval is:
Confidence Interval for a Mean:
$$\bar{X} \pm z \frac{s}{\sqrt{n}}$$
X̄ = €45,420
z = 1.96
s = €2,050
n = 256
$$45{,}420 \pm 1.96 \times \frac{2{,}050}{\sqrt{256}} = 45{,}420 \pm 251.125$$
The endpoints are €45,168.875 and €45,671.125, which round to €45,170 and €45,670:
€45,170 ≤ μ ≤ €45,670
3. We can say that we are 95% confident that the unknown population mean income (μ) is
between €45,170 and €45,670.
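The arithmetic for the managers example can be reproduced from the summary figures given in the question. This is only an illustrative check; the variable names are not from the original notes.

```python
from math import sqrt

x_bar = 45_420  # sample mean annual income (euro)
s = 2_050       # sample standard deviation (euro)
n = 256         # sample size
z = 1.96        # z value for a 95% level of confidence

standard_error = s / sqrt(n)  # 2,050 / 16
margin = z * standard_error   # half-width of the interval

lower = x_bar - margin
upper = x_bar + margin
print(f"Exact endpoints: {lower:,.3f} and {upper:,.3f}")

# Rounded to the nearest 10 euro, as the question asks
lower_10 = round(lower, -1)
upper_10 = round(upper, -1)
print(f"Rounded: {lower_10:,.0f} and {upper_10:,.0f}")
```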
If we had time to select many samples of size 256 from the population of managers
and compute sample means and confidence intervals, the population mean annual
income would be in about 95 of every 100 confidence intervals. Either an interval
contains the population mean or it does not. About 5 out of every 100 confidence intervals
would not contain the population mean annual income, μ.
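This repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a hypothetical normally distributed population of incomes with mean €45,000 and standard deviation €2,000 (these figures are invented for the illustration). It draws 1,000 samples of size 256 and counts how many of the resulting 95% confidence intervals contain the population mean.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(2024)  # fixed seed so the run is repeatable
mu, sigma = 45_000, 2_000  # hypothetical population mean and standard deviation
n, trials, z = 256, 1000, 1.96

hits = 0
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar, s = fmean(sample), stdev(sample)
    margin = z * s / sqrt(n)
    # Does this sample's interval contain the true population mean?
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(f"{hits} of {trials} intervals contain the population mean ({coverage:.1%})")
```

The proportion should come out close to 95%, though it will vary slightly with the seed and the number of trials.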