Chapter 4
Statistics
4.1 – What is Statistics?
Definition 4.1.1 Data are observed values of
random variables. The field of statistics is a
collection of methods for estimating distributions
and parameters of random variables through the
collection and analysis of data.
Definition 4.1.2 The population is the set of all
objects of interest in a statistical study. A sample
is a subset of the population.
Definition 4.1.3 Data are information that has
been collected. The field of statistics is a
collection of methods for drawing conclusions
about a population by collecting and analyzing
data from a sample.
Types of Data
Definition 4.1.4 A parameter is a number
calculated using information from every member
of a population. A statistic is calculated using
information from a sample.
Definition 4.1.5 Quantitative data consist of
numbers. Qualitative data are nonnumeric
information that can be separated into different
categories.
Definition 4.1.6 Discrete data are observed
values of a discrete random variable. They are
numbers that have a finite or countable set of
values. Continuous data are observed values of a
continuous random variable. They are numbers
that can take any value within some range.
Levels of Measurement
Definition 4.1.7
– Data are at the nominal level of measurement if they consist
of only names, labels, or categories. They cannot be
ordered (such as smallest to largest) in a meaningful way.
– Data are at the ordinal level of measurement if they can be
ordered in a meaningful way, but differences between data
values cannot be calculated or are meaningless.
– Data are at the interval level of measurement if they can be
ordered in a meaningful way and differences between data
values are meaningful.
– Data are at the ratio level of measurement if they are at the
interval level, ratios of data values are meaningful, and
there is a meaningful zero starting point.
Types of Studies
Definition 4.1.8
– In an observational study, data are obtained in a way
such that the members of the sample are not
changed, modified, or altered in any way.
– In an experiment, something is done to the
members of the sample and the resulting effects
are recorded. The “something” that is done is
called a treatment.
Types of Observational Studies
Definition 4.1.9
– In a cross-sectional study, data are collected at one
specific point in time.
– In a retrospective study, data are collected from
studies done in the past.
– In a prospective study, data are collected by
observing a sample for some time into the future.
Blocks
Definition 4.1.10 A block is a subset of the
population with a similar characteristic. Different
blocks of a population have different
characteristics that may affect the variable of
interest differently. A randomized block design is
a type of experiment where:
1. The population is divided into blocks.
2. Members from each block are randomly chosen
to receive the treatment.
Sampling Techniques
Definition 4.1.11
– A convenience sample is a sample that is very easy
to get.
– A voluntary response sample is obtained when
members of the sample decide whether to
participate or not.
– A systematic sample is obtained by arranging the
population in some order, then selecting a starting
point, and then selecting every kth member (such as
every 20th).
– A cluster sample is obtained by dividing the
population into subsets (or clusters) where the
members of each cluster have a common
characteristic, then randomly choosing some of the
clusters, and surveying every member of the chosen
clusters.
– A stratified sample is obtained by dividing the
population into subsets and then randomly choosing
some members from each of the subsets.
– A multistage sample is obtained by successively
applying a variety of sampling techniques. At each
stage the sample becomes smaller, and at the last stage,
a cluster sample is chosen.
Random Samples
Definition 4.1.12
– A random sample is chosen in a way such that
every individual member of the population has the
same probability of being chosen.
– A simple random sample of size n is chosen in a
way such that every group of size n has the same
probability of being chosen.
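A few of the sampling techniques defined above can be sketched in Python. This is only an illustration: the population (members labeled 0 to 99), the two strata, and the function names are all hypothetical and not part of the course material.

```python
import random

def simple_random_sample(population, n, rng):
    """Choose n members so that every group of size n is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, start):
    """Take every k-th member of the ordered population, starting at index `start`."""
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Randomly choose n_per_stratum members from each subset of the population."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
population = list(range(100))          # hypothetical population labeled 0..99
srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, k=20, start=3)
strata = {"low": population[:50], "high": population[50:]}  # hypothetical subsets
strat = stratified_sample(strata, 5, rng)
print(srs)
print(sys_sample)
print(strat)
```

The systematic sample depends only on the starting point and k, while the other two depend on the random number generator.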
4.2 – Summarizing Data
Example 4.2.3 Shown below are the waiting
times of 30 customers at a supermarket checkout stand
Relative frequency distribution
Histograms
The “shape” of a relative frequency histogram is
an approximation of the graph of the p.d.f. (or
p.m.f.) of the underlying random variable.
Summary Statistics
Definition 4.2.1 Let $\{x_1, x_2, \ldots, x_n\}$ be a set of
quantitative data collected from a sample of the
population.
1. mean of the data: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
2. variance of the data: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
3. standard deviation of the data: $s = \sqrt{s^2}$
4. range of the data: (max value) – (min value)
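The four summary statistics in Definition 4.2.1 can be computed directly from their formulas. The sketch below is a minimal illustration; the function name and the four data values are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import math

def summary_statistics(data):
    """Mean, sample variance (n - 1 divisor), standard deviation, and range."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return {
        "mean": mean,
        "variance": variance,
        "std_dev": math.sqrt(variance),
        "range": max(data) - min(data),
    }

stats = summary_statistics([2.0, 4.0, 4.0, 6.0])  # hypothetical data
print(stats)
```

Note the divisor n − 1 in the variance, which matches the definition of $s^2$ above.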
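Example 4.2.4
$\bar{x} = \frac{1}{30}(0.0 + 0.0 + \cdots + 5.1 + 7.3) = 2$ min
$s^2 = \frac{1}{30-1}\left[(0.0-2)^2 + (0.0-2)^2 + \cdots + (5.1-2)^2 + (7.3-2)^2\right] = 2.946$ min$^2$
$s = \sqrt{2.946} \approx 1.72$ min
Range: $7.3 - 0.0 = 7.3$ min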
Percentiles
Definition 4.2.2 Let p be a number between 0 and 1. The
(100p)th percentile of a set of quantitative data is a
number, denoted $\pi_p$, that is greater than (100p)% of the
data values.
– The 25th, 50th, and 75th percentiles are called the first,
second, and third quartiles and are denoted $p_1 = \pi_{0.25}$,
$p_2 = \pi_{0.50}$, and $p_3 = \pi_{0.75}$, respectively.
– The 50th percentile is also called the median of the data
and is denoted $m = p_2$.
– The mode of the data is the data value that occurs most
frequently.
– The 5-number summary of a set of data consists of the
minimum value, p1, p2, p3, and the maximum value.
Calculating Percentiles
1. Arrange the data in increasing order:
$x_1 \le x_2 \le \cdots \le x_n$
2. Calculate $L = np$
3. If $L$ is not an integer, then round it up to the
next larger integer and $\pi_p = x_L$
4. If $L$ is an integer, then $\pi_p = \frac{1}{2}(x_L + x_{L+1})$
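The four steps above translate almost line for line into code. The sketch below assumes 0 < p < 1 and uses a small tolerance to decide whether L = np is an integer, since floating-point products are not always exact; the function name and the data are hypothetical.

```python
import math

def percentile(data, p):
    """(100p)th percentile using the L = n*p rule; assumes 0 < p < 1."""
    xs = sorted(data)                      # step 1: increasing order
    n = len(xs)
    L = n * p                              # step 2
    if abs(L - round(L)) < 1e-9:           # step 4: L is an integer
        L = int(round(L))
        return 0.5 * (xs[L - 1] + xs[L])   # average of x_L and x_{L+1} (1-indexed)
    L = math.ceil(L)                       # step 3: round up
    return xs[L - 1]                       # x_L (1-indexed)

data = [3, 1, 4, 1, 5, 9, 2, 6]            # hypothetical data, n = 8
q1 = percentile(data, 0.25)                # L = 2, so (x_2 + x_3) / 2
median = percentile(data, 0.5)             # L = 4, so (x_4 + x_5) / 2
p30 = percentile(data, 0.3)                # L = 2.4, rounds up to 3, so x_3
print(q1, median, p30)
```

Python lists are 0-indexed, so $x_L$ in the slides corresponds to `xs[L - 1]` in the code.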
Example 4.2.5
• Calculate the first quartile, $p_1 = \pi_{0.25}$
$L = 0.25(30) = 7.5$, so $p_1 = x_8 = 0.5$
• Calculate the median $m = p_2 = \pi_{0.5}$
$L = 0.5(30) = 15$, so $p_2 = \frac{1}{2}(x_{15} + x_{16}) = \frac{1}{2}(1.7 + 1.9) = 1.8$
• 5-number summary
0, 0.5, 1.8, 2.9, 7.3
• Box Plot
4.3 – Sampling Distributions
Definition 4.3.1 A random variable $\hat{\Theta}$ whose
values are used to estimate the value of a
parameter $\theta$ is called an estimator of $\theta$. A value
of $\hat{\Theta}$, $\hat{\theta}$, is called an estimate of $\theta$. An estimator
$\hat{\Theta}$ is called an unbiased estimator of $\theta$ if
$E(\hat{\Theta}) = \theta$
If this equation is not true, then $\hat{\Theta}$ is called a
biased estimator.
Sample Proportion
Suppose we want to know the proportion p of a
population who support a particular political
candidate
– p is a parameter
We survey 735 voters and find 383 that support
the candidate
– The sample proportion is $\hat{p} = \frac{383}{735} \approx 0.521$
– This is an estimate of p
Let $X$ denote the number who support the
candidate in a sample of $n$
– Define the random variable $\hat{P} = X/n$
– Called the “sample proportion”
– $\hat{p}$ is an observed value of $\hat{P}$
– $\hat{p}$ is an estimate of $p$
– $\hat{P}$ is an estimator of $p$
Sampling Distribution of the Proportion
Theorem 4.3.1 Let $X$ be $b(n, p)$. Then as $n \to \infty$
the distribution of the sample proportion
$\hat{P} = X/n$
approaches
$N\left(p, \frac{p(1-p)}{n}\right)$
– Meaning: $\hat{P}$ is approximately $N\left(p, \frac{p(1-p)}{n}\right)$ for n
“large enough”
– “Large enough”: $np \ge 5$ and $n(1 - p) \ge 5$
Example 4.3.3
By examining the spending habits of one
particular consumer, a credit card company
observes that during the course of normal
transactions 37% of the charges exceed $150.
Out of 50 charges made in one particular month,
27 exceeded $150. Does it appear that these
charges were made in the course of normal
transactions?
Sample proportion of charges that exceed $150:
$\hat{p} = \frac{27}{50} = 0.54$
– Is this unusually large?
– Assume normal transactions: $\hat{P}$ is approximately
$N(0.37,\ 0.37 \cdot 0.63 / 50) = N(0.37,\ 0.004662)$
$P(\hat{P} \ge 0.54) \approx P\left(Z \ge \frac{0.54 - 0.37}{\sqrt{0.004662}}\right) = P(Z \ge 2.49) = 0.0064$
– This probability is small (< 0.05)
• Reject the assumption
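The tail probability in Example 4.3.3 can be checked numerically. The sketch below uses the normal approximation from Theorem 4.3.1 and the identity P(Z ≥ z) = ½·erfc(z/√2) for the standard normal distribution; the function name is hypothetical.

```python
import math

def upper_tail_probability(p_hat, p, n):
    """Approximate P(P-hat >= p_hat) when P-hat is roughly N(p, p(1-p)/n)."""
    variance = p * (1 - p) / n
    z = (p_hat - p) / math.sqrt(variance)
    tail = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z) for standard normal Z
    return z, tail

z, tail = upper_tail_probability(27 / 50, 0.37, 50)
print(round(z, 2), round(tail, 4))
```

The code keeps the unrounded z-score, so its result can differ from a table lookup at z = 2.49 in the last digit or so.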
Sample Mean
Suppose we want to know the mean IQ score of
all college students in the US, 𝜇
– Estimate it with a sample mean $\bar{x}$
– Let $X$ denote the IQ of a randomly selected student
– $E(X) = \mu$
– $\bar{x}$ is an observed value of the sample mean $\bar{X}_n$
– $\bar{x}$ is an estimate of $\mu$
– $\bar{X}_n$ is an estimator of $\mu$
Sampling Distribution of the Mean
• By the Central Limit Theorem,
$\bar{X}_n$ is approximately $N\left(\mu, \frac{\sigma^2}{n}\right)$
– where $\sigma^2 = Var(X)$
Idea
4.4 – Confidence Intervals for a Proportion
Definition 4.4.1 Let Z be $N(0, 1)$ and p be a
number between 0 and 0.5. A critical z-value $z_p$
is a positive number such that
$P(Z \le z_p) = 1 - p$
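Critical Values
Let $\alpha$ be between 0 and 1. Then $p = \alpha/2$ is
between 0 and 0.5, so that the critical z-value
$z_{\alpha/2}$ is a positive number such that
$P(Z \le z_{\alpha/2}) = 1 - \alpha/2$, so $P(Z > z_{\alpha/2}) = \alpha/2$
$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha$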
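Critical z-values are usually read from a table, but they can also be computed from the inverse of the standard normal c.d.f. The sketch below uses `statistics.NormalDist` from the Python standard library (available since Python 3.8); the function name `critical_z` is hypothetical.

```python
from statistics import NormalDist

def critical_z(alpha):
    """Critical value z_{alpha/2}: the point with P(Z <= z) = 1 - alpha/2."""
    return NormalDist().inv_cdf(1 - alpha / 2)

z = critical_z(0.05)
# Probability that Z falls between -z and z; should equal 1 - alpha.
coverage = NormalDist().cdf(z) - NormalDist().cdf(-z)
print(round(z, 2), round(coverage, 4))
```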
Confidence Interval
Definition 4.4.2 Let $0 < \alpha < 1$ and let x be a
number of successes in n observed trials of a
Bernoulli experiment with unknown probability
of a success p. Define $\hat{p} = x/n$ and let $z_{\alpha/2}$ be a
critical z-value. The interval
$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$
is called a 100(1 − α)% confidence interval
estimate for p.
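Margin of error: $E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Standard error of the proportion: $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
Significance level: $\alpha$
Confidence level: $100(1 - \alpha)\%$
Different forms:
$(\hat{p} - E,\ \hat{p} + E)$,
$\hat{p} - E < p < \hat{p} + E$,
or
$\hat{p} \pm E$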
Requirements
1. The sample must be random.
2. The conditions for a binomial distribution
must be satisfied (at least approximately).
3. There are at least 5 successes and at least 5
failures observed in the n trials.
Example 4.4.2
Suppose 383 out of 735 surveyed voters support
a particular political candidate. Calculate a 95%
confidence interval estimate for the proportion of
all voters who support the candidate.
1. Define the population proportion being estimated:
p = The proportion of all voters who support the candidate
2. Calculate the sample proportion
$\hat{p} = \frac{383}{735} \approx 0.521$
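3. Find the critical value: $\alpha = 0.05$
$z_{\alpha/2} = z_{0.05/2} = z_{0.025} = 1.96$
4. Calculate the margin of error
$E = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 1.96\sqrt{\frac{0.521(1 - 0.521)}{735}} = 0.0361$
5. Calculate the confidence interval
$\hat{p} - E < p < \hat{p} + E \Rightarrow 0.521 - 0.0361 < p < 0.521 + 0.0361$
$0.485 < p < 0.557$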
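The steps of Example 4.4.2 can be reproduced in a few lines. This is a minimal sketch of the interval in Definition 4.4.2; the function name is hypothetical, and the critical value is computed with the standard library's `statistics.NormalDist` rather than read from a table.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(x, n, confidence=0.95):
    """Confidence interval for p from x successes in n trials (normal approximation)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)            # critical value z_{alpha/2}
    p_hat = x / n                                      # sample proportion
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)    # margin of error E
    return p_hat - margin, p_hat + margin, margin

low, high, margin = proportion_confidence_interval(383, 735)
print(round(low, 3), round(high, 3), round(margin, 4))
```

The code does not round $\hat{p}$ before computing E, so intermediate values can differ slightly from the hand calculation.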
Correct interpretation
– We are 95% confident that the value of p is
between 0.485 and 0.557.
Meaning
– If we were to survey many different samples of
voters and calculate the corresponding 95%
confidence interval using the statistics from each
sample, then about 95% of the intervals would
contain the true value of p.
Incorrect Meaning
• There is a 95% chance that the actual value of
p is between 0.485 and 0.557
• Why is this incorrect?
– p is a number that we don’t know. It has a value. It
is between 0.485 and 0.557 or not. There is no
probability involved.
What does this mean?
• The confidence level refers to the process of
constructing intervals, not to any individual interval
– If we constructed many intervals, then about 95%
of them would contain the true value of p.
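The "many intervals" interpretation can be illustrated with a small simulation: draw repeated samples from a population whose true p is known, build a 95% interval from each, and count how often the interval contains p. Everything here (the true p = 0.5, the sample size 200, the 2000 repetitions, the seed, and the function name) is a hypothetical setup chosen for illustration.

```python
import math
import random
from statistics import NormalDist

def coverage_rate(p, n, repetitions, confidence=0.95, seed=0):
    """Fraction of simulated confidence intervals that contain the true p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(repetitions):
        # Simulate n Bernoulli trials with success probability p.
        x = sum(rng.random() < p for _ in range(n))
        p_hat = x / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / repetitions

rate = coverage_rate(p=0.5, n=200, repetitions=2000)
print(rate)
```

The observed rate should land near 0.95, though not exactly on it, because the interval is based on a normal approximation and the simulation itself is random.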
4.5 – Confidence Intervals for a Mean
Definition 4.5.1 Let $\bar{x}$ be the mean of a sample
of size n taken from a population with known
variance $\sigma^2$ and unknown mean μ. The interval
$\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$
is called a 100(1 − α)% confidence interval
estimate for μ.
Z-Interval
Margin of error: $E = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
3 different forms:
$(\bar{x} - E,\ \bar{x} + E)$,
$\bar{x} - E < \mu < \bar{x} + E$,
or
$\bar{x} \pm E$
Requirements
1. The sample is random.
2. The population variance 𝜎 2 is known.
3. The population is normally distributed or 𝑛 > 30.
T-Interval
Definition 4.5.2 Let $\bar{x}$ be the mean and $s^2$ be the
variance of a sample of size n taken from a population
with unknown variance $\sigma^2$ and mean μ, and let $t_{\alpha/2}$ be a
critical Student-t value with $(n - 1)$ degrees of
freedom. The interval
$\left(\bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}\right)$
is called a 100(1 − α)% confidence interval estimate for
μ when $\sigma^2$ is unknown.
Margin of error: $E = t_{\alpha/2}\frac{s}{\sqrt{n}}$
Standard error of the mean: $\frac{s}{\sqrt{n}}$
Requirements
1. The sample is random.
2. The population is normally distributed or n > 30.
Which Type of Interval?
Suggestions
1. If n > 30 or 𝜎 2 is known, then use a Z-interval.
2. If 𝜎 2 is unknown, and the population is normally
distributed (at least approximately), then use a T-interval.
3. If n ≤ 30, 𝜎 2 is unknown, and the population is
not normally distributed, then see Chapter 7.
Example 4.5.3
A random sample of 15 “1-pound” packages of
shredded cheddar cheese has a mean weight of
$\bar{x} = 1.05$ lb. and standard deviation of s = 0.02
lb. Calculate a 99% confidence interval estimate
for the mean weight of all such packages.
1. Define the population mean being estimated:
𝜇 = The mean weight of all “1-pound” packages of shredded
cheddar cheese.
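2. Find the critical value: α = 0.01 and n = 15
$t_{\alpha/2}(15 - 1) = t_{0.005}(14) = 2.977$
3. Calculate the margin of error:
$E = t_{\alpha/2}\frac{s}{\sqrt{n}} = 2.977 \cdot \frac{0.02}{\sqrt{15}} = 0.0154$
4. Calculate the confidence interval:
$\bar{x} - E < \mu < \bar{x} + E \Rightarrow 1.05 - 0.0154 < \mu < 1.05 + 0.0154$
$1.0346 < \mu < 1.0654$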
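The arithmetic in Example 4.5.3 can be checked with a short function. The Python standard library has no Student-t quantile function, so this sketch takes the critical value 2.977 from the example as an input rather than computing it; the function name is hypothetical.

```python
import math

def t_interval(x_bar, s, n, t_critical):
    """T-interval for a mean; t_critical is t_{alpha/2} with n - 1 degrees of freedom."""
    margin = t_critical * s / math.sqrt(n)   # margin of error E
    return x_bar - margin, x_bar + margin, margin

# Values from Example 4.5.3; t_{0.005}(14) = 2.977 is taken from the example.
low, high, margin = t_interval(1.05, 0.02, 15, t_critical=2.977)
print(round(low, 4), round(high, 4), round(margin, 4))
```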
4.6 – Confidence Intervals for a Variance
Definition 4.6.1 Let $s^2$ be the variance of a sample of
size n taken from a normally distributed population with
unknown variance $\sigma^2$ and let
$a = \chi^2_{1-\alpha/2}(n-1)$ and $b = \chi^2_{\alpha/2}(n-1)$
be critical $\chi^2$ values. The interval
$\left(\frac{(n-1)s^2}{b},\ \frac{(n-1)s^2}{a}\right)$
is a 100(1 − α)% confidence interval estimate for $\sigma^2$.
Confidence Intervals for a Variance
Alternate form: $\frac{(n-1)s^2}{b} < \sigma^2 < \frac{(n-1)s^2}{a}$
Requirements
1. The sample is random.
2. The population is normally distributed.
Example 4.6.2
The proportion of butterfat in each of 20 batches of
butter was measured. The resulting data have a
sample variance of $s^2 = 0.001102$. Construct a
95% confidence interval estimate of the variance
in the proportion of butterfat of all batches.
1. Define the population variance being estimated:
𝜎 2 = The variance in the proportion of butterfat of
all batches of butter
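2. Find the critical values: $\alpha = 0.05$ and $n = 20$
$a = \chi^2_{1-0.05/2}(19) = \chi^2_{0.975}(19) = 8.907$
$b = \chi^2_{0.05/2}(19) = \chi^2_{0.025}(19) = 32.852$
3. Calculate the confidence interval:
$\frac{(n-1)s^2}{b} < \sigma^2 < \frac{(n-1)s^2}{a} \Rightarrow \frac{(19)0.001102}{32.852} < \sigma^2 < \frac{(19)0.001102}{8.907}$
$0.000637 < \sigma^2 < 0.00235$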
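The interval in Example 4.6.2 can be checked the same way. As with the t-interval, the standard library has no chi-square quantile function, so this sketch takes the two critical values from the example as inputs; the function and parameter names are hypothetical.

```python
def variance_interval(s_squared, n, chi2_lower, chi2_upper):
    """Confidence interval for a variance.

    chi2_lower is a = chi^2_{1-alpha/2}(n-1) and chi2_upper is b = chi^2_{alpha/2}(n-1).
    """
    low = (n - 1) * s_squared / chi2_upper    # divide by the larger critical value b
    high = (n - 1) * s_squared / chi2_lower   # divide by the smaller critical value a
    return low, high

# Values from Example 4.6.2.
low, high = variance_interval(0.001102, 20, chi2_lower=8.907, chi2_upper=32.852)
print(round(low, 6), round(high, 5))
```

Dividing by the larger critical value gives the lower endpoint, which is why b appears on the left of the interval.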
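4.7 – Confidence Intervals for Differences
Definition 4.7.1 Consider two populations with
respective proportions $p_1$ and $p_2$. Let
– $n_1$ and $n_2$ be the sample sizes
– $\hat{p}_1$ and $\hat{p}_2$ be the sample proportions
Then
$(\hat{p}_1 - \hat{p}_2) - \sigma_{\hat{p}}\, z_{\alpha/2} < p_1 - p_2 < (\hat{p}_1 - \hat{p}_2) + \sigma_{\hat{p}}\, z_{\alpha/2}$,
where $\sigma_{\hat{p}} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$,
is a 100(1 − α)% confidence interval estimate for
$p_1 - p_2$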
2-Proportion Z-Interval
Margin of error: $E = \sigma_{\hat{p}}\, z_{\alpha/2}$
Requirements
1. Both samples are random and independent.
2. Each sample contains at least 5 successes and 5
failures.
2-Sample T-Interval
If two populations are (approximately) normally
distributed and their variances are unknown, then
an approximate 100(1 − α)% confidence interval
for the difference of their means 𝜇1 − 𝜇2 using
data from two independent samples of the
respective populations is
$(\bar{x}_1 - \bar{x}_2) - E < \mu_1 - \mu_2 < (\bar{x}_1 - \bar{x}_2) + E$
Equal Variances
$E = s_p\, t_{\alpha/2}\sqrt{1/n_1 + 1/n_2}$, $\quad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$
• 𝑠𝑝 - pooled standard deviation
• 𝑡𝛼/2 - critical t-value with 𝑛1 + 𝑛2 − 2
degrees of freedom
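Non-equal Variances
$E = t_{\alpha/2}\sqrt{s_1^2/n_1 + s_2^2/n_2}$
where $t_{\alpha/2}$ is a critical t-value with r degrees of
freedom where
$r = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$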
If r is not an integer, then round it down to the
nearest whole number.
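The pooled standard deviation and the degrees-of-freedom formula are easy to get wrong by hand, so a short sketch may help. The sample variances and sizes below are hypothetical values chosen only to exercise the formulas, and the function names are not from the course material.

```python
import math

def pooled_std_dev(s1_sq, n1, s2_sq, n2):
    """Pooled standard deviation s_p for the equal-variances case."""
    return math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))

def unequal_variance_df(s1_sq, n1, s2_sq, n2):
    """Degrees of freedom r for the non-equal-variances case, and r rounded down."""
    a, b = s1_sq / n1, s2_sq / n2
    r = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return r, math.floor(r)

# Hypothetical inputs: s1^2 = 4 with n1 = 10, s2^2 = 16 with n2 = 12.
r, df = unequal_variance_df(4.0, 10, 16.0, 12)
sp = pooled_std_dev(4.0, 10, 4.0, 10)   # equal sample variances, so s_p equals s
print(round(r, 2), df, sp)
```

With r in hand, the critical value $t_{\alpha/2}$ would be looked up with `df` degrees of freedom.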
Requirements
1. Both samples are random and independent.
2. Both populations are normally distributed or both
sample sizes are greater than 30.
4.8 – Sample Size
Sample size for estimating a population
proportion
$n = \frac{z_{\alpha/2}^2\, \hat{p}(1 - \hat{p})}{E^2}$
– $\hat{p}$ - an estimate of the population proportion
– E - desired margin of error
Mean
Sample size for estimating a population mean
$n = \frac{z_{\alpha/2}^2\, \sigma^2}{E^2}$
– $\sigma^2$ - an estimate of the population variance
– E - desired margin of error
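Both sample-size formulas can be evaluated directly. The sketch below rounds the result up to the next whole number, which is a common convention but an assumption here, since the slides do not state a rounding rule. The inputs ($\hat{p}$ = 0.5 with E = 0.03, and σ = 15 with E = 2) and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_proportion(p_hat, margin, confidence=0.95):
    """n = z^2 * p_hat * (1 - p_hat) / E^2, rounded up (rounding is an assumption)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

def sample_size_mean(sigma_squared, margin, confidence=0.95):
    """n = z^2 * sigma^2 / E^2, rounded up (rounding is an assumption)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * sigma_squared / margin ** 2)

n_prop = sample_size_proportion(0.5, 0.03)   # hypothetical: p_hat = 0.5, E = 0.03
n_mean = sample_size_mean(15.0 ** 2, 2.0)    # hypothetical: sigma = 15, E = 2
print(n_prop, n_mean)
```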
4.9 – Assessing Normality
Constructing a Normal Quantile Plot
1. Arrange the data values in increasing order:
𝑥1 ≤ 𝑥2 ≤ ⋯ ≤ 𝑥𝑛
2. For each $k = 1, 2, \ldots, n$, define
$p_k = \frac{k}{n+1}$
3. Calculate $z_k = \Phi^{-1}(p_k)$ for each $k$, where $\Phi$ is
the standard normal c.d.f.
Normal Quantile Plot
1. Plot the points $(x_k, z_k)$
2. If the points form a straight-line pattern, then
conclude that the population appears to be
normal. If the points do not form a straight line or
exhibit some other type of non-linear pattern,
then conclude that the population is not normal.
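The coordinates for a normal quantile plot can be computed with the standard library's `statistics.NormalDist`, whose `inv_cdf` method plays the role of $\Phi^{-1}$. The sketch below only produces the points $(x_k, z_k)$ and leaves the plotting itself to whatever tool is at hand. The nine data values are hypothetical and are not the temperature data of Example 4.9.2.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return the points (x_k, z_k) with z_k = Phi^{-1}(k / (n + 1))."""
    xs = sorted(data)                      # x_1 <= x_2 <= ... <= x_n
    n = len(xs)
    std_normal = NormalDist()              # standard normal, mean 0 and sd 1
    return [(x, std_normal.inv_cdf(k / (n + 1)))
            for k, x in enumerate(xs, start=1)]

# Hypothetical data values, for illustration only.
points = normal_quantile_points([41.2, 38.5, 44.1, 36.9, 40.3, 42.7, 39.8, 37.6, 43.0])
for x, z in points:
    print(x, round(z, 3))
```

Because $p_k = k/(n+1)$ is symmetric about 0.5, the z-values are symmetric about 0, with the middle data value paired with z = 0 when n is odd.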
Example 4.9.2
The second row of the table below gives the average
daily temperatures in the month of November for the
city of Lincoln, NE for nine different years (data
collected by Brandon Metcalf, 2009). Determine if the
population of all such temperatures is normally
distributed.
Roughly a straight line
– Population is normal
Straight Line
1. Calculate the sample mean and standard deviation
of the data, $\bar{x}$ and $s$.
2. For each k, calculate the following quantity:
$y_k = \frac{x_k - \bar{x}}{s}$
3. Plot the points $(x_k, y_k)$ on the quantile plot and
connect them with a straight line.
Fuzzy Central Limit Theorem
If the population is influenced by many small,
random, unrelated effects, then the population
may be normally distributed.