1 Overview of Statistics/Data Classification
Statistics: the science of collecting, organizing, analyzing, and interpreting data in order to make decisions
Two branches of statistics
Descriptive statistics: organizing, summarizing, and displaying data
Inferential statistics: use of a sample to draw conclusions about a population
To make an inference means to draw a general conclusion about a population from a sample.
Probability plays a central role in inferential statistics but no role in descriptive statistics.
Population: the entire set of individuals of interest
The population is determined by the problem.
Variable: a characteristic of an individual, to be measured or observed
E.g., if the individuals are people, variables might be height, eye color, ...
Sample: a subset of a population
i.e., members of the population which you actually know something about
E.g. What’s the population and what’s the sample?
(a) A 2010 survey of 8000 U.S. adults showed that 42% of respondents considered themselves conservative (Gallup).
(b) A 2010 poll showed that 43% of Texans believed that the country was on the right track.
Can’t generalize beyond TX: number was 31% for the country as a whole (Texas Tribune)
Data: information from counting, measuring, or observing
i.e., what you write down about members of your sample or population
Parameter: a value computed from population data
Statistic: a value computed from sample data
A statistic is an estimate of a parameter.
In real life, you hardly ever know a parameter. But we’re interested in parameters (on average, how much does a baby weigh at birth?), and we use statistics to estimate them, so we need words to distinguish the two.
E.g. Jake measured the diameters of 100 ball bearings chosen from a shipment of 10,000. The average diameter
was 1.1mm.
Population:
Sample:
Variable:
Data:
Statistic:
A statistic changes when the sample changes.
Corwin, STAT 200
©2011-2014 Stephen Corwin — Do not distribute
E.g. Parameter or statistic?
A 2009 survey of 218 law firms with at least 50 lawyers found that 69% of firms had cut personnel in the
previous year. (Altman-Weil)
Data Classification
Just words to describe types of data so we can talk about it sensibly.
Two basic types of data:
quantitative (or numerical): numbers that are the results of measuring or counting
It makes sense to do arithmetic on quantitative data. It usually makes sense to average it.
qualitative (or categorical): everything else
It does not make sense to do arithmetic on qualitative data.
E.g. Qual or quant?
(a) diameters of Eastern White Pines
(b) eye color
(c) numbers on jerseys of starting team
Data is univariate if it is made up of a list of individual values.
→All the data mentioned above is univariate.
Paired or bivariate data is made up of ordered pairs of numbers (often presented in a table).
E.g. John measured the boiling point of water at various altitudes.
Altitude (ft)   Boiling point (°F)
0               212
1000            210.2
2000            208.4
5280            202.45
2 Experiments and Sampling
An experiment is any activity with measurable outcomes, e.g., rolling a die or drawing a card.
Sampling
census: use the entire population (rarely possible)
We want to be able to use information about a sample to infer something about a population characteristic, so we want a sample
that is representative of the population.
A sampling method is biased if it tends to produce samples that are not representative of the population. Sometimes we
refer to such samples as “biased samples.”
What does it mean for a sample to be “not representative”? It means that if you compute statistics based on many samples
chosen by the method, then on average they won’t correctly estimate the parameters they’re supposed to estimate.
sampling error: the difference between a calculation made from population data and the corresponding one made from sample data
A simple random sample is one in which every possible sample of the same size has the same chance of being selected.
This is not quite the same thing as saying that every individual has the same chance of being selected. E.g., a coin is flipped.
If it comes up heads, Alice and Bob are both chosen; if it comes up tails, Carol is chosen. Then each individual has a 50/50
chance of being chosen, but Bob and Carol cannot both be chosen.
You get a simple random sample by assigning a number to every member of the population and then using
a random number generator to choose.
A simple random sample is almost always best, but sometimes you cannot afford a large enough one.
We will assume that all samples are simple random samples unless told otherwise.
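The numbering-and-random-draw procedure above can be sketched in Python. This is an illustrative sketch: the population size (10,000, echoing the ball-bearing example) and the seed are assumptions, not part of the notes.

```python
import random

# Assign a number to every member of a (hypothetical) population of 10,000,
# then let a random number generator pick: every possible sample of size 100
# is equally likely, which is the definition of a simple random sample.
population = list(range(1, 10_001))
random.seed(42)  # fixed seed only so the sketch is reproducible
sample = random.sample(population, k=100)

print(len(sample))       # 100
print(len(set(sample)))  # 100 -- no member is chosen twice
```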
3 Frequency Distributions and Histograms
Frequency distribution tables
Divide a set of data into classes or intervals. A frequency distribution is a table that shows the number of data points in
each class or interval.
→for univariate data
→although these are very useful for qualitative data, we only make them for quantitative data
E.g. Attendance in an Intro Stat section on each day of one semester:
45, 47, 43, 40, 38, 36, 23, 35, 44, 26, 32, 35, 40, 38,
38, 39, 36, 37, 45, 35, 36, 37, 38, 36, 33, 35, 36, 40, 45
Make a frequency distribution using the classes 20–30, 30–40,
40–50.
Class    # days
20–30    2
30–40    18
40–50    9
The frequency of a class is the number of data values in it. Notation: f.
For the class a–b, a is the lower class limit and b is the upper class limit. b − a is the class width.
E.g. For the class 30–40 in the example above, the lower class limit is 30, the upper class limit is 40, and the
class width is 40 − 30 = 10.
If a data value falls on the boundary between two classes, it is always put in the higher class.
How to construct a frequency distribution table
1. Find the range of the data, i.e., (max value) − (min value).
2. The number of classes to use should be approximately the square root of the number of data points; round up to the nearest integer, but if this number is less than 5, use 5, and if it’s greater than 20, use 20.
3. Approximate class width = range / (number of classes); round up to the number of decimal places the data has.
4. Choose a starting value equal to or a little less than the smallest data value.
5. Count the number of data values in each class and make the table.
E.g. Construct a frequency distribution table for the attendance data.
(1) range = 47 − 23 = 24
(2) √29 ≈ 5.4; round up to 6
(3) approx. class width = 24/6 = 4; no need to round
(4) starting value = 21
Class    # days
21–25    1
25–29    1
29–33    1
33–37    10
37–41    10
41–45    2
45–49    4
         Σf = 29

Recall that the midpoint of the interval [a, b] is (a + b)/2.
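The five construction steps above can be sketched in Python and checked against the attendance data. This is a sketch under one assumption: the half-open classes [lo, hi) encode the put-boundary-values-in-the-higher-class rule.

```python
import math

# Attendance data from the example above.
data = [45, 47, 43, 40, 38, 36, 23, 35, 44, 26, 32, 35, 40, 38,
        38, 39, 36, 37, 45, 35, 36, 37, 38, 36, 33, 35, 36, 40, 45]

data_range = max(data) - min(data)                    # step 1: 47 - 23 = 24
k = min(max(math.ceil(math.sqrt(len(data))), 5), 20)  # step 2: sqrt(29) -> 6, clamped to [5, 20]
width = math.ceil(data_range / k)                     # step 3: 24/6 = 4
start = 21                                            # step 4: a little less than min(data) = 23

freq = {}                                             # step 5: count values per class
lo = start
while lo <= max(data):
    hi = lo + width
    # half-open class [lo, hi): a value on the boundary goes in the higher class
    freq[(lo, hi)] = sum(lo <= x < hi for x in data)
    lo = hi

print(list(freq.values()))  # [1, 1, 1, 10, 10, 2, 4]
```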
Frequency Histograms
A frequency histogram is a bar graph of a frequency distribution table.
• horizontal axis: classes
  - class boundaries must coincide: if, in the table, consecutive classes are separated by a gap x, extend each boundary by x/2 (this means that the bars must touch)
  - label the axis with the class midpoints
• vertical axis: frequencies
E.g. Construct a frequency histogram for the table above.
4 Relative Frequency; Cumulative Frequency; Distribution Shapes
Relative frequency distributions and histograms
The relative frequency of a class is (# data points in the class) / (# data points in the whole set).
A relative frequency distribution table shows percentages of the whole for each class instead of the number in each class.
A relative frequency histogram is a histogram for a relative frequency distribution.
Note that the percentages must add up to 100 (up to rounding error).
E.g. Compute the relative frequencies and make a relative frequency histogram for the following data:
Class    Frequency    Relative frequency (%)
21–25    1            3.4
25–29    1            3.4
29–33    2            6.9
33–37    9            31.0
37–41    10           34.5
41–45    2            6.9
45–49    4            13.8
         Σf = 29
Σ f means “sum of all frequencies”
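The relative-frequency column can be reproduced in a couple of lines. One note confirmed by the sketch: the rounded percentages sum to 99.9, not 100, purely because of rounding.

```python
freqs = [1, 1, 2, 9, 10, 2, 4]   # frequency column from the table above
n = sum(freqs)                   # 29

# relative frequency of each class, as a percentage rounded to one place
rel = [round(100 * f / n, 1) for f in freqs]

print(rel)       # [3.4, 3.4, 6.9, 31.0, 34.5, 6.9, 13.8]
print(sum(rel))  # approximately 99.9 -- rounding, not a mistake
```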
Cumulative frequency
The cumulative frequency at any row of a frequency table is the sum of all frequencies up to and including that row.
E.g. Add a column for cumulative frequency to the table just constructed.
Class    Frequency    Cumulative frequency
21–25    1            1
25–29    1            2
29–33    2            4
33–37    9            13
37–41    10           23
41–45    2            25
45–49    4            29
A cumulative frequency graph is a line graph joining the points
(upper class boundary, cumulative frequency of class)
E.g. Make a cumulative frequency graph for the table just constructed.
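Cumulative frequencies are just running totals, so `itertools.accumulate` reproduces the new column directly:

```python
from itertools import accumulate

freqs = [1, 1, 2, 9, 10, 2, 4]   # frequency column from the table
cum = list(accumulate(freqs))    # running totals, row by row

print(cum)  # [1, 2, 4, 13, 23, 25, 29]
```

The last cumulative frequency always equals Σf, the total number of data points.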
Common distribution shapes
Note that (i) real data distributions need not match these shapes exactly, and (ii) many distributions are symmetrical, but only the one that is symmetrical with a single central hump is called symmetric in statistics.
Memorize:
5 Graphs and Displays
These not only present data to others, but help the statistician to understand the data so that he can choose the best ways to
analyze it.
Dot plots
→For univariate, numerical data
Use one dot for each data value. Make your dots all the same size.
E.g. Make a dot plot for the data set shown.
[Dot plot figure: one dot per data value, stacked above a number line running from 1 to 15.]
Bar charts
Pretty simple. The bars should be separated.
Pie charts
Pie charts are widely used for categorical data.
→For univariate data
Pie charts are best for showing the relationship of the size of each category to the whole.
Time series charts
A line graph of a quantity or quantities at regularly spaced times.
Scatter plots
→For bivariate, numerical data
Just graph the ordered pairs.
Very useful for seeing whether the data seem to be correlated.
E.g. Using the pie chart below, answer the following questions:
• Which major has the most students?
• Which major has the fewest students?
• Which major represents more than one third of all majors studied?
E.g. The pie charts below give some information about U.S. government finances in 1999. Use them to answer
the following questions:
(a) About how many dollars were paid in income tax for each dollar paid in corporate tax?
(b) Can you determine whether the U.S. government was taking in enough revenue from Social Security
payments to cover its outlays for that program? If so, what is the answer?
(c) If the government’s outlay was exactly equal to its income, did corporate and excise taxes together generate enough income to cover the cost of defense?
6 Mean, Median, Mode
Suppose you were forced to describe a quantitative data set using just one number. Which number would you use?
Measures of central tendency
A measure of central tendency is an average: a single value intended to be typical of the data set.
→These are for univariate data only
Notation: Number of data points in a population: N
Number of data points in a sample: n
Data: the number of points scored in RU Women’s Basketball games in the 2007–8 season:
52, 52, 54, 57, 58, 63, 63, 65, 67, 67, 70, 71, 72, 73, 75, 76, 77, 93
Note that this is population data.
Mean
→For quantitative data only
If the data are x1, x2, . . . , xk, then the mean is (x1 + · · · + xk)/k (for population or sample).
Notation: population mean: µ
sample mean: x̄
The mean has the same units as the data.
E.g. Find the mean of the basketball data.
Think of x̄ as an estimate of µ. For any one sample, it might happen that x̄ is far from µ, but it can be proved that if many
samples are taken from the same population, then we can expect the average value of x̄ to be very close to µ. This means
that x̄ is an unbiased estimator for µ.
Median
→for quantitative data only
The median Q2 is the number halfway up a sorted list of data.
To find the median:
1. Sort the data in ascending order.
2. For k data points, the middle position in the list is position (k + 1)/2.
3. If k is odd, then the median is the middle data value.
If k is even, then the median is the mean of the two middle values.
E.g. Find the median of the dataset 2, 7, 4, 3, 8.
E.g. Find the median of the basketball data.
Either the mean or the median may be called an average.
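A quick check of both averages for the basketball data, using Python's `statistics` module (a sketch; the module computes exactly the definitions above):

```python
import statistics

# Points scored in each game (population data, per the note above).
scores = [52, 52, 54, 57, 58, 63, 63, 65, 67, 67,
          70, 71, 72, 73, 75, 76, 77, 93]

mu = statistics.mean(scores)     # sum of the 18 values over 18
med = statistics.median(scores)  # even n, so the mean of the two middle values

print(round(mu, 2))  # 66.94
print(med)           # 67.0
```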
Mode
Sometimes you must give a typical value for a qualitative data set.
E.g. A car dealership sold 60 cars in the past week of which 42 were red, 12 were green, and 8 were blue. If
forced to describe the “average” color of car sold using one or two values or the phrase “no typical value,”
what response would you give?
What if 25 were red, 25 were green, and 10 were blue?
What if 20 were red, 20 were green, and 20 were blue?
If there is a data value that occurs most frequently, it is called the mode of the data. If there are two values that occur most
frequently, the data is bimodal and we report both values. If there are more than two values that occur most frequently,
we report no mode.
Note that the mode can be used for qualitative data.
E.g. Find the mode(s) of each dataset.
(a) a, a, b, c, c, c, d, d, e, f, g
(b) a, b, c, d, d, e, e, f, g
(c) a, b, c, c, d, d, e, f, f, g, g, h
(d) a, b, c, d, e
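The reporting rules above (one mode, bimodal, otherwise "no mode") take only a few lines with `statistics.multimode`, which returns every most-frequent value; datasets (a), (b), and (d) are reused as checks.

```python
from statistics import multimode

def describe_mode(data):
    """Report modes using the rules above: 1 mode, 2 modes (bimodal), else none."""
    modes = multimode(data)
    if len(modes) > 2:
        return "no mode"
    return modes

print(describe_mode(list("aabcccddefg")))  # ['c']        -- dataset (a)
print(describe_mode(list("abcddeefg")))    # ['d', 'e']   -- dataset (b), bimodal
print(describe_mode(list("abcde")))        # no mode      -- dataset (d), all tie
```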
Which measure of central tendency?
E.g. Annual compensation for ten RU employees with faculty rank (in $): 47,561, 49,687, 52,375, 53,626,
60,573, 63,716, 73,832, 96,666, 105,719, 508,299. Mean = $111,205.40, median = $62,144.50.
Which average is most representative of the center of this data? (If you were recruiting a prospective
faculty member, which would you feel most honest reporting?)
A data value is an outlier if it is extremely high or low compared to the rest of the data.
We’ll get a proper definition later.
Which measure of central tendency should you use?
• If the data set contains qualitative data, use the mode.
• If there is an outlier (or two) in a set of data, use the median.
• Use the mean in all other situations.
Sigma notation
Σxi means “sum all the numbers xi ”
E.g. ∑ xi for i = 1 to 3 means x1 + x2 + x3.

E.g. If x1 = 4, x2 = 3, and x3 = 1, then ∑ xi for i = 1 to 3 is 4 + 3 + 1 = 8.
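In Python, a sigma-notation sum is just `sum` over a list, which makes the example easy to check:

```python
xs = [4, 3, 1]   # x1, x2, x3 from the example above
print(sum(xs))   # 8 = 4 + 3 + 1
```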
Go to the course website and follow the “Stats on the TI84+” link. Print the whole document and keep it with your notes.
(The PDF is best for printing.)
7 Range, Variance, Standard Deviation
dispersion = spread = variability = how spread out the data is
→for univariate, quantitative data only
E.g. The example datasets below all have the same mean and median, namely, 7. It’s the “spread” or “variability” that’s different: the range of the data points and how far they tend to be from their center.
Data Set A
3, 5, 5, 7, 7, 7, 9, 9, 11
Data Set B
1, 3, 5, 7, 7, 9, 11, 13
Data Set C
−17, −17, −17, −10, −10, 5.0, 5.3, 5.6, 5.7, 5.8, 5.9, 6.1, 6.3, 6.4, 6.5, 6.7, 6.8, 6.9, 7.0, 7.0,
7.1, 7.2, 7.3, 7.5, 7.6, 7.7, 7.9, 8.1, 8.2, 8.3, 8.4, 8.7, 9.0, 24, 24, 31, 31, 31
Range
range = (max data value) – (min data value)
computed the same way for population & sample data
E.g. range of A : 11 − 3 = 8
range of B : 13 − 1 = 12
range of C : 31 − (−17) = 48
Variance
The variance will measure how far the data tends to be from its mean. We need to do a little work before we can define it. We
will define the sample variance s2 and then the population variance σ2 .
If x is a data point in a sample, then the deviation of x (from the mean) is x − x̄.
E.g. For Dataset A, x̄ = 7. Thus the deviation of the data point 3 from the mean is 3 − 7 = −4.
The deviation measures how far x is from the mean. You might think that we could use the average deviation to measure how
far the data tends to be from its mean, but there’s a problem.
Problem: the sum of all the deviations is always 0.
Solution: use the squares of the deviations to measure how far data points are from x̄.
The sample variance is

    s² = (sum of squared deviations)/(number of data points − 1) = ∑(x − x̄)²/(n − 1)
The denominator is n − 1 because it can be shown that if n is used, then for samples that are small compared to the size of the
population, the result is a biased estimator for σ2 , while using n − 1 makes it unbiased. Again, “unbiased” means that if many
samples are taken and s2 is computed for each one, then we can expect the average value of s2 to be very close to the population
variance.
E.g. Find the variance of the sample 1.5, 1.7, 1.9, 2.1, 2.3.

x      x − x̄    (x − x̄)²
1.5    −0.4      0.16
1.7    −0.2      0.04
1.9     0        0
2.1     0.2      0.04
2.3     0.4      0.16

x̄ = 9.5/5 = 1.9 and ∑(x − x̄)² = 0.4, so s² = 0.4/(5 − 1) = 0.1.
The variances of Datasets A, B, and C are 6, 16, and 125.44, respectively.
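`statistics.variance` computes exactly this n − 1 sample variance, so it can confirm the worked example and Datasets A and B:

```python
import statistics

sample = [1.5, 1.7, 1.9, 2.1, 2.3]
print(statistics.variance(sample))  # ~0.1, as in the worked example

A = [3, 5, 5, 7, 7, 7, 9, 9, 11]
B = [1, 3, 5, 7, 7, 9, 11, 13]
print(statistics.variance(A))  # 6.0
print(statistics.variance(B))  # 16.0
```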
The population variance is σ² = ∑(x − µ)²/N.
The units of the variance are the units of the data squared.
Larger variance corresponds beautifully to greater variability, but the units are wrong. E.g., if the data points represent the
number of shoes in a man’s closet, then the units of the variance are “shoes squared.”
We can fix this.
Standard deviation
The standard deviation is the positive square root of the variance:
For population data: σ = √σ²
For sample data: s = √s²
E.g. The standard deviations of Datasets A, B, and C are (approximately) 2.45, 4, and 11.2, respectively.
The units of the std dev are the units of the data.
Interpret the std dev as giving, roughly, the average distance of a data point from the mean.
E.g. The table below shows the numbers of pairs of shoes in four men’s closets. Find the mean, median, range,
and standard deviation. Interpret the standard deviation.
Person           A     B     C     D
Pairs of shoes   12    4     8     7

x = # pairs of shoes

mean:
median:
range:

x     x − x̄    (x − x̄)²
12
4
8
7
               ∑(x − x̄)² =
How to find the mean, median, range, pop. and sample std dev on the TI: all online.
Chebyshev’s Theorem
The standard deviation is computed from the data. If the data is bunched together, the standard deviation will be small; if the
data is spread out, the standard deviation will be large. But even though the standard deviation changes with the data, because
of the way in which it is computed, a certain fraction of the data must be within two standard deviations of the mean, a different
fraction must be within three standard deviations, etc.
Chebyshev’s Theorem: No matter how the data are distributed, the portion of the data lying within k standard deviations of the mean (k > 1) is at least 1 − 1/k².
k    1 − 1/k²
2    1 − 1/2² = 3/4 = 75%     → at least 75% of the data lies within two std devs of the mean
3    1 − 1/3² ≈ 88.9%         → at least 88.9% lies within three std devs of the mean
4    1 − 1/4² = 93.75%
5    1 − 1/5² = 96%
...
→Chebyshev’s Theorem bounds how much of the data can lie far from the mean, with “far” measured in standard deviations.
Illustrations of Chebyshev’s Theorem for a few data sets
Chebyshev’s Theorem is actually very conservative. When we know more about a distribution, we can usually get much better
estimates.
z-score

The z-score of a data point x measures how far x is from the mean in units of the standard deviation:

    z = (x − µ)/σ

To understand this, note that the signed distance from x to µ on the number line is x − µ, and the number of lengths σ in a length x − µ is (x − µ)/σ.
E.g. A data set has mean 17.5 and standard deviation 6. What is the z-score of the data value 21.2?

    z = (x − µ)/σ = (21.2 − 17.5)/6 ≈ 0.6167
z-scores are useful in comparing populations which have similar probability distributions but different means and standard
deviations.
E.g. For a long time, the mean SATV score was 500 with a standard deviation of 100, and the mean score on
the verbal portion of the ACT was 18 with a standard deviation of 6. The distributions are the same for
the two tests. Which is better, a 630 on the SAT or a 25 on the ACT?
z-score of 630 = (630 − 500)/100 = 1.3
z-score of 25 = (25 − 18)/6 ≈ 1.17
The 630 is better because it is farther above average.
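The comparison as a tiny function; the means and standard deviations are the ones quoted in the example above:

```python
def z_score(x, mu, sigma):
    """Signed distance from the mean, in units of the standard deviation."""
    return (x - mu) / sigma

sat = z_score(630, 500, 100)  # 1.3
act = z_score(25, 18, 6)      # ~1.17

print(sat > act)  # True: the SAT score is farther above its average
```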
8 Quantiles, IQR, Five-number Summary, Box Plots
Quantiles are numbers that split an ordered list of numbers into parts each with approximately the same number of data
points.
The simplest is the median, which splits an ordered list into two parts.
Quartiles
About 1/4 of the data points are less than Q1.
About 1/2 of the data points are less than Q2 = the median.
About 3/4 of the data points are less than Q3.
To find these:
• List the data in ascending order.
• Find Q2.
• Q1 is the median of the lower half of the data set.
• Q3 is the median of the upper half of the data set.
If n is odd, don’t include the middle data point in either sublist.
E.g. Find Q1 , Q2 , and Q3 for the data set
2 3 3 4 7 8 9 9 11 11 11 11 11 12 13 14
E.g. If a data point x is chosen at random from any data set, then
Probability(x < Q1 ) =
Probability(Q1 < x < Q3 ) =
Five-number summary of a data set
(min, Q1 , Q2 , Q3 , max)
E.g. Construct the 5-number summary for the data set
2 3 3 4 7 8 9 9 11 11 11 11 11 12 13 14
Box plots (boxplots, box-and-whisker plots)
Just a graph of a 5-number summary:
min    Q1    Q2    Q3    max
E.g. Construct the box plot for the data in the previous example.
Interquartile range
IQR = Q3 − Q1
The values Q1 − 1.5(IQR) and Q3 + 1.5(IQR) are called inner fences.
A data value x is a suspected outlier if it is outside the inner fences, i.e., if it is either less than Q1 − 1.5(IQR) or greater
than Q3 + 1.5(IQR).
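A sketch of the whole pipeline (five-number summary, IQR, and inner fences), using the median-of-halves method described above, with the middle point excluded from both halves when n is odd:

```python
def median(sorted_xs):
    n = len(sorted_xs)
    mid = n // 2
    return sorted_xs[mid] if n % 2 else (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def five_number_summary(xs):
    xs = sorted(xs)
    n = len(xs)
    # lower/upper halves; for odd n the middle point is in neither half
    lower, upper = xs[:n // 2], xs[(n + 1) // 2:]
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

data = [2, 3, 3, 4, 7, 8, 9, 9, 11, 11, 11, 11, 11, 12, 13, 14]
mn, q1, q2, q3, mx = five_number_summary(data)
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print((mn, q1, q2, q3, mx))  # (2, 5.5, 10.0, 11.0, 14)
print(fences)                # (-2.75, 19.25) -- no suspected outliers here
```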
Percentiles
We won’t compute these—it’s rather complicated—but we’ll learn to interpret them.
Percentiles apply to a list of numerical data arranged in ascending order. They split the list into 100 approximately equal
parts.
Notation: P1 , P2 , . . . , P99
About n percent of the data is to the left of Pn .
E.g. Consider the cumulative frequency graph of SAT scores at a particular school.
(a) What test score corresponds to the 70th percentile?
(b) About what percentage of all test-takers got a score higher than 1200?
(c) If a test-taker is chosen at random, what’s the probability that his score is less than 1200?
For each total SAT score x, the corresponding y is the percent of students receiving that score
or less.
9 Probability
When an experiment is performed, an outcome is observed. The sample space for the experiment is the set of all possible
outcomes. A set of outcomes is an event .
An experiment must have an outcome. Nevertheless we regard the empty set as an event—the impossible event.
E.g. Roll a die once, observe the number of pips on the top face. Possible outcomes: {1, 2, 3, 4, 5, 6}
Some possible events: {}, {1}, {5}, {2, 4, 6}, {1, 2, 3, 4, 5, 6}
An event occurs if, when the experiment is performed, the outcome is in that event.
E.g. Roll one die. Let E = {get an even number}, F = {get a number less than 4}.
• If you get a 1, then only F has occurred.
• If you get a 4, then only E has occurred.
• If you get a 2, both E and F have occurred.
• If you get a 5, neither E nor F has occurred.
In order to model what happens in the physical world, we accept the following rules about probability:
• P({}) = 0
• P(sample space) = 1
• If x and y are different outcomes, then P(x or y) = P(x) + P(y)

These rules imply that if x1, x2, . . . , xn are all the possible outcomes, then P(x1) + P(x2) + · · · + P(xn) = 1.
Now roll one fair die. There are six outcomes; the sample space = {1, 2, 3, 4, 5, 6}, and we must have P(1) + P(2) + · · · + P(6) = 1. All the outcomes are equally likely, so if P(1) = p, then P(2) = p, etc. Thus we have 6p = 1, or p = 1/6. That is, each of the outcomes has probability 1/6.

In general, if a sample space consists of n equally likely outcomes, then the probability of any one outcome is 1/n.
E.g. Roll two dice, one red and one green. Then the sample space consists of 36 equally likely outcomes:

(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

Thus each outcome has probability 1/36.
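Enumerating this sample space confirms the count, and theoretical probabilities then reduce to counting outcomes. The event "sum of the dice is 7" is an illustrative addition, not from the notes:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # (red, green) pairs
print(len(outcomes))  # 36

# P(sum of the dice is 7) = (# outcomes in the event) / (total # outcomes)
event = [o for o in outcomes if sum(o) == 7]
print(Fraction(len(event), len(outcomes)))  # 1/6
```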
Theoretical probability
When there are only finitely many outcomes and all are equally likely, we define the theoretical probability of an event
E by:
P(E) = (# outcomes in the event)/(total # possible outcomes)
— Can only be used in ideal/mathematical situations
E.g. A bag holds ten identical marbles numbered 1, 2, 3, . . . , 10. One is drawn out. What’s P(get marble #5)?
• all outcomes equally likely
• one outcome in event: “get #5” = {5}
• ten outcomes total
• P(get #5) = 1⁄10
E.g. A fair die is rolled once. What’s P(get a 5)?
• all outcomes equally likely
• one outcome in event “get a 5”
• six outcomes total
• P(get a 5) = 1⁄6
E.g. A fair die is rolled once. What’s P(get an odd number)?
• all outcomes equally likely
• three outcomes in event: {1, 3, 5}
• six outcomes total
• P(get an odd number) = 3⁄6 = 0.5
E.g. The probability of drawing an ace from a standard deck is 4⁄52.
In the real world, “the probability of rolling a 1 with a fair die is 1⁄6” does not mean that exactly one in every six rolls will be a 1. It means that if we roll enough times, very close to 1⁄6 of the rolls will be 1s.
Empirical probability
Suppose that an experiment is performed repeatedly under very similar conditions. Then we define the empirical probability of an event E by:
P(E) = (number of times E occurs)/(total # of observations)
I.e., an empirical probability is a relative frequency.
E.g. The table below shows the times at which Jill has gotten up on ten different school mornings. What’s the
probability that she’ll get up before 7:05 on any given school morning?
Day     1      2      3      4      5      6      7      8      9      10
Time    7:01   7:03   7:01   7:00   7:10   7:07   7:03   7:01   7:15   7:06
event: Jill gets up before 7:05
# outcomes in which event occurred: 6
total # observations = 10
prob = 6⁄10 = 0.6
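The same relative-frequency computation in code, with the times converted to minutes past 7:00 so they compare numerically:

```python
# Wake-up times from the table, as minutes past 7:00.
times = [1, 3, 1, 0, 10, 7, 3, 1, 15, 6]

before_705 = sum(t < 5 for t in times)  # mornings Jill was up before 7:05
print(before_705, "/", len(times))      # 6 / 10
print(before_705 / len(times))          # 0.6
```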
The Law of Large Numbers says that as an experiment is repeated more and more times, the empirical probability of an event approaches its theoretical probability.
Subjective probability
Subjective probability is a probability judgment.
E.g. The doctor says there’s a 60% chance John will survive the surgery.
Subjective probability is used when an experiment cannot be repeated or in a situation too complicated to be directly compared
to other cases.
Properties of probability
All probabilities are between 0 and 1.
E impossible: P(E) = 0
E just as likely to happen as not: P(E) = 1/2
E certain: P(E) = 1 (or 100%)

If the probability that an event E will occur is p, then the probability that E will not occur is 1 − p.
Mutually exclusive events
Events E and F are mutually exclusive if they cannot both happen at the same time.
E.g. Roll a single die. Let E = “get an even number,” F = “get an odd number.” Then E and F are mutually
exclusive.
E.g. Roll one red die and one green die. Let E = “get an even number on the red,” F = “sum of dice is an odd
number.” Then E and F are not mutually exclusive, because they can both happen at the same time (e.g.,
if you rolled a 2 on the red and a 1 on the green).
If E and F are mutually exclusive events, then
P(E occurs or F occurs) = P(E) + P(F)
“Or” here is the “inclusive or”: it means “one or the other or both.”
Note that mutual exclusivity is not defined in terms of probability. It is a set property: E and F are mutually exclusive if there
is no outcome that is in both.
Independent events
Two events are statistically independent (or just independent) if the occurrence or non-occurrence of either of them does
not affect the probability that the other will occur.
E.g. Dependent or independent?
Roll one red die, one green.
(a) E = get a 1 on red, F = get a 6 on green.
(b) E = get a total of 12, F = get a 6 on green.
True fact: events E, F are independent if and only if
P(both E and F occur) = P(E)P(F)
E.g. Roll one die and flip one fair coin. What’s the probability of getting heads and a 4?

Solution 1. All possible outcomes:
{(1, H), (2, H), (3, H), (4, H), (5, H), (6, H), (1, T), (2, T), (3, T), (4, T), (5, T), (6, T)}
There are 12 possible outcomes, all equally likely. The event “get heads and a 4” contains the single outcome (4, H), so its probability is 1/12.

Solution 2. P(H) = 1/2, P(4) = 1/6, and the two events are independent, so P(E and F) = (1/2) · (1/6) = 1/12.
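Both solutions agree, which can be verified exactly with fractions:

```python
from fractions import Fraction
from itertools import product

# Solution 1: count outcomes in the enumerated sample space.
space = list(product(range(1, 7), "HT"))
p_count = Fraction(sum(o == (4, "H") for o in space), len(space))

# Solution 2: multiply the probabilities of the independent events.
p_multiply = Fraction(1, 2) * Fraction(1, 6)

print(p_count, p_multiply)  # 1/12 1/12
```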
In general, probabilities should be rounded to five decimal places. (We’ll often break this rule.)
Note that independence is defined in terms of probability.
10 Random Variables; Discrete Probability Distributions
A random variable is a variable that takes on numerical values associated with the outcome of an experiment.
E.g. Choose an apple at random and let x = weight of the apple.
Choose a box of blueberries at the supermarket, let x = the number of (whole) blueberries in the box.
Choose a sample of ten apples at random, and let x = the mean weight for that sample.
Continuous vs. Discrete
A r.v. is discrete if all of its (possibly infinitely many) possible values can be listed, at least in principle.
A r.v. is continuous if it can take on any value in some interval of real numbers.
Recall: an interval is the set of all real numbers between two real numbers.
To figure out whether a r.v. is discrete or continuous, think of plotting all its possible values on a number line. If there
would be gaps between values, it’s discrete; otherwise, it’s continuous.
E.g. Discrete or continuous? What are the possible values?
(a) x = number of (whole, unbroken) eggs in one carton
(b) y = weight of one carton of eggs, in ounces
(c) time since the last customer arrived (at some counter), in seconds
(d) number of stocks in the DJIA whose share prices closed higher than they opened yesterday
Probability distributions
If x is a discrete r.v., its probability distribution is a table or function that gives, for each possible value of x, the probability
that x takes on that value.
A discrete probability distribution is usually a table of relative frequencies.
If x is a continuous r.v., its probability distribution is a function from which it is possible to compute the probability that
the value of x is in any given interval. (We’ll come back to these.)
E.g. A psychological test was administered to 150 people. Possible final scores were 1, 2, 3, 4, and 5. Results:
x (score)    1     2     3     4     5
Frequency    24    33    42    30    21
→Note that the r.v. x takes on a numerical value for each person who took the test.
(a) Construct a discrete probability distribution and relative frequency histogram.
(b) What is the probability that a randomly chosen participant’s score was (i) 3? (ii) 4 or 5?
x (score)    1               2       3       4       5
P(x)         24/150 = 0.16   0.22    0.28    0.20    0.14
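The distribution and part (b) of the example can be checked with exact fractions:

```python
from fractions import Fraction

freqs = {1: 24, 2: 33, 3: 42, 4: 30, 5: 21}  # scores -> frequencies
n = sum(freqs.values())                       # 150 people
probs = {x: Fraction(f, n) for x, f in freqs.items()}

print(float(probs[3]))             # (b)(i)  0.28
print(float(probs[4] + probs[5]))  # (b)(ii) 0.34
```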
Mean of a discrete r.v. or probability distribution
“The mean of a discrete probability distribution” and “the mean of a r.v. with that distribution” are the same thing.
If the possible values of a r.v. x are x1 , x2 , . . . , and if the probabilities of these are p1 , p2 , . . . respectively, then the mean
of x is
E[x] = p1 x1 + p2 x2 + · · ·
E[x] is often called the expected value of x.
E.g. Find the mean of the following frequency distribution.
Value    1    2    3    4
Freq    12   10    8    6
Notice that Σ f = 36. Next, get the relative frequencies = the probabilities:
Value                    1       2       3      4
Rel freq = Probability   12/36   10/36   8/36   6/36
Extend the table with a column for (Value × Prob) and sum these:
Value          1       2       3       4
Prob           12/36   10/36   8/36    6/36
Value × Prob   12/36   20/36   24/36   24/36

Σ(Value × Prob) = 80/36
The mean of the probability distribution is 80/36 ≈ 2.22.
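The computation above is easy to check by machine. The sketch below (mine, not part of the notes) builds the probabilities from the frequency table and applies the E[x] formula, using exact fractions so the answer comes out as 80/36.

```python
# A sketch (not from the notes): the mean of a discrete probability
# distribution, computed from the frequency table above.
from fractions import Fraction

values = [1, 2, 3, 4]
freqs = [12, 10, 8, 6]

total = sum(freqs)                            # Σf = 36
probs = [Fraction(f, total) for f in freqs]   # relative frequencies = probabilities

# E[x] = p1*x1 + p2*x2 + ...
mean = sum(p * x for p, x in zip(probs, values))
print(mean)   # → 20/9, i.e. 80/36
```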
Independent random variables
Two r.v.s are independent if neither variable’s probability distribution depends on the value of the other.
The r.v. equivalent of “E occurs” is “x takes on a value”.
Our method of determining independence is once again “educated guess.”
E.g. Choose a person at random.
x = number of atoms in the person’s left hand
y = the person’s SAT score
E.g. Choose a day from the year 2014.
x = temperature on that day
y = snowfall on that day
11 The Binomial Distribution
A binomial experiment is made up of smaller experiments called trials.
An experiment is binomial if
• It consists of a fixed number of independent trials (called Bernoulli trials).
• The only possible outcomes for each trial are success (S) and failure (F).
• The probability p of success is the same in each trial.
When working with binomial experiments, we let n be the number of trials, set q = 1 − p = P(F), and let x = number of
successes.
E.g. An experiment consists of flipping one fair coin once. Success is defined as getting heads. What are all
the possible outcomes? What are n, p, and q?
There’s only one trial, so n = 1, and the possible outcomes of the experiment are just H and T. The coin is fair, so p = 1/2, whence q = 1 − p = 1/2.
E.g. An experiment consists of flipping one fair coin twice. Success is defined as getting heads. What are all
the possible outcomes? What are n, p, and q?
The possible outcomes on any one trial are just H and T, but this is not what is meant. The possible outcomes of the whole experiment are {HH, HT, TH, TT}. There are two trials, so n = 2. The coin is fair, so p = 1/2, whence q = 1 − p = 1/2. (Remember that success is defined per trial, not for the whole experiment, so only the outcomes of one trial are important when computing p.) Finally, the possible values of x are 0, 1, and 2.
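The two-flip experiment is small enough to check by brute-force enumeration. This sketch (mine, not from the notes) lists every outcome of the whole experiment and tallies the number of successes (heads) in each.

```python
# A sketch (not from the notes): enumerate the outcomes of the two-flip
# experiment and compute P(x = k) for each possible number of heads k.
from itertools import product

outcomes = list(product("HT", repeat=2))   # all outcomes of the whole experiment
print(outcomes)                            # ('H','H'), ('H','T'), ('T','H'), ('T','T')

# x = number of successes (heads) in an outcome
for k in range(3):
    matching = [o for o in outcomes if o.count("H") == k]
    print(k, len(matching) / len(outcomes))   # P(x = k) for a fair coin
```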
E.g. Each trial of an experiment consists of flipping a coin and throwing a die. The experiment itself consists
of two trials. List all possible outcomes of one trial, and all possible outcomes of the whole experiment.
Possible outcomes of one trial:
H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6
Possible outcomes of the experiment: every ordered pair of one-trial outcomes, 144 in all:
(H1, H1), (H1, H2), (H1, H3), …, (T6, T5), (T6, T6)
Examples of binomial distributions
Don’t worry about where the probabilities in these examples come from just yet.
E.g. A fair coin is flipped five times. S = get heads.
Here n = 5, p = 1/2, q = 1/2. The possible values of x are 0, 1, 2, 3, 4, and 5. Probability distribution:

x      0         1         2        3        4         5
P(x)   0.03125   0.15625   0.3125   0.3125   0.15625   0.03125

(Experiment: flip coin 5 times. Trial: one flip. p = 1/2, q = 1/2, n = 5. x is a random variable; the value of x = number of successes.)
E.g. Same for 10, 20, 50 flips:
Ain’t that purty.
E.g. A fair, six-sided die is thrown ten times.
Success = get a 1 or 3
Here n = 10, p = 2/6, q = 4/6. Distribution (to four decimal places) and histogram:

x      0        1        2        3        4        5        6        7        8        9        10
P(x)   0.0173   0.0867   0.1951   0.2601   0.2276   0.1366   0.0569   0.0163   0.0030   0.0003   0.0000
Same, for 50 throws:
Now let’s see where those probabilities came from.
E.g. A binomial experiment with n = 3 trials is performed.
Possible outcomes: FFF, FFS, FSF, SFF, FSS, SFS, SSF, SSS
Because the trials are independent, we have, for example,
P(FFS) = P(F)P(F)P(S) = (1 − p)(1 − p)p = (1 − p)²p
In fact:
Outcome   Probability
FFF       (1 − p)³
FFS       p(1 − p)²
FSF       p(1 − p)²
SFF       p(1 − p)²
FSS       p²(1 − p)
SFS       p²(1 − p)
SSF       p²(1 − p)
SSS       p³
Remember that x = the number of successes in this experiment. The probability distribution of x is:
x      0          1            2            3
Prob   (1 − p)³   3p(1 − p)²   3p²(1 − p)   p³
x is said to be binomially distributed, sometimes written x ∼ B(n, p).
What’s E[x]? It’s
E[x] = 0 · (1 − p)³ + 1 · 3p(1 − p)² + 2 · 3p²(1 − p) + 3 · p³ = 3p
It’s possible to show that for a binomial experiment with n trials, E[x] = np. It’s also possible to show that the standard deviation is σ = √(npq).
This explains the locations of the peaks in the histograms.
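Both facts are easy to confirm numerically. This sketch (mine, not from the notes; the value p = 0.4 is an arbitrary choice) builds the n = 3 distribution from the binomial formula and checks that its mean is np and its variance is npq.

```python
# A sketch (not from the notes): check E[x] = np and σ² = npq for a small
# binomial distribution, here n = 3 with an arbitrarily chosen p = 0.4.
import math

n, p = 3, 0.4
q = 1 - p

# P(x = k) = C(n, k) p^k q^(n-k)
probs = [math.comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(probs))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))

print(mean, math.sqrt(var))   # should equal n*p and sqrt(n*p*q)
```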
E.g. A certain surgery is successful in 80% of cases. The surgery is performed four times. What’s the probability of exactly two successes?
Remember that x = 2 when exactly two successes occur. This can happen in any of the following ways:
SSFF, FSSF, FFSS, SFSF, SFFS, FSFS
Each of these outcomes has probability p²q² = (0.8)²(0.2)² = 0.0256. They’re mutually exclusive, so the probability that one of them occurs is
0.0256 + 0.0256 + · · · + 0.0256 = 6 · 0.0256 = 0.1536
Now, what’s the probability of at least two successes?
Using the same technique, we can compute that the probability of exactly three successes is 0.4096 and the probability
of exactly four successes is 0.4096, so the probability of at least two successes is 0.1536 + 0.4096 + 0.4096 = 0.9728.
But what if 40 surgeries were performed and we wanted the probability of at least 35 successes?
Using the TI, we find that the probability of at least 35 successes out of 40 surgeries is about 0.1613.
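The calculator result can be cross-checked by summing the binomial formula directly. This sketch (mine, not part of the notes) adds up P(x = k) for k = 35 through 40 with n = 40 and p = 0.8.

```python
# A sketch (not from the notes): the exact binomial tail probability
# P(x >= 35) for n = 40 surgeries, each with success probability p = 0.8.
import math

n, p = 40, 0.8
q = 1 - p

# P(x >= 35) = sum of C(n, k) p^k q^(n-k) for k = 35..40
tail = sum(math.comb(n, k) * p**k * q**(n - k) for k in range(35, n + 1))
print(round(tail, 4))   # → 0.1613
```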
12 The Normal Distribution, Part I
The image below shows a relative frequency histogram for 2781 numbers associated with the outcome of some experiment. The mean is 1600.
The height of a bar is h/2781 when there are exactly h numbers in the interval spanned by the bar. Suppose the total area of the bars is A.
Divide the bars up into equal-sized boxes, one for each number:
Notice that the area of each box must be A/2781.
Now choose a number x at random. If 101 boxes are to the left of 1300, what’s P(x ≤ 1300)?

P(x ≤ 1300) = (number of numbers no greater than 1300)/(total number of numbers) = 101/2781
Because each little box has the same area, we could also find P(x ≤ 1300) by considering areas:
P(x ≤ 1300) = (area of boxes at or to the left of 1300)/(total area of boxes) = (101 · A/2781)/A = 101/2781

Really, we don’t need the boxes or the rectangles, because the probability that a randomly chosen value of x is less than 1300 is just the fraction of the total area which is to the left of 1300 under the curve made up of the tops of the rectangles.
The curve along the tops of the rectangles serves as a probability distribution function (pdf): it lets us compute the
probability that a randomly chosen value of x is in any given interval. We’ll insist, though, that things have been fixed
up so that the total area is 1, so that we don’t have to multiply and divide by it.
This allows us to answer questions like, “Where are there lots of values and where are there not very many?” and “If I
choose a value at random, what’s the probability that it’s between 13 and 14?”
If, instead of the tops of bars, a pdf is a smooth curve, we treat it in the same way: the total area under the curve is
1, and when we want a probability, we compute an area. Actually finding areas in the continuous case is technically
complicated, but we won’t learn that part; we’ll use the calculator. And we’ll work only with the two most important
families of pdfs.
Don’t forget:
Probabilities are areas
Areas are probabilities
We’ll be most interested in analyzing data that is heaped up around its mean—data that is approximately “normally distributed.”
The normal distribution:

y = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))
Properties of the normal distribution
• mean = median = x-coordinate of highest point
• curve changes shape at inflection points, i.e., above x = µ − σ and x = µ + σ
• total area under curve is 1
• data values/values of random variable on x-axis
• height above any particular point unimportant
• curve is always above x-axis but gets closer and closer as x → ±∞
• almost all the area is above [µ − 3.5σ, µ + 3.5σ]
• std dev controls width/height of central hump
We’ll write x ∼ N(µ, σ) to mean that the r.v. x is normally distributed with mean µ and std dev σ. (Not everyone uses this
in the same way.)
Even though there are infinitely many different normal curves, one for each possible µ and σ, we just talk about “the normal
curve” because they’re all the same in the ways that interest us. Similarly, people say “the normal distribution.”
Influence of std dev on shape of normal curve:
True fact: If a r.v. x is normally distributed, then P(x < c) = the area under the normal curve to the left of c
Area = normalcdf(−10 000, c, µ, σ)   (TI: DISTR menu)
It’s really the area from −∞ to c, but calculators can’t manage −∞. We’ll use −10 000 for −∞ and 10 000 for +∞.
Also, P(c < x < d) = area under curve between x = c and x = d:

Area = normalcdf(c, d, µ, σ)   (TI: DISTR menu)

You could get P(c < x < d) by subtracting areas:

P(c < x < d) = area under curve between x = c and x = d
             = (area left of d) − (area left of c)
             = P(x < d) − P(x < c)
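For readers without the calculator, the same computation is available in Python’s standard library. This sketch (the function name mimics the TI command but is my own) implements normalcdf as a difference of cdf values.

```python
# A sketch (not from the notes): a Python stand-in for the TI's
# normalcdf(a, b, µ, σ), using the standard library's normal distribution.
from statistics import NormalDist

def normalcdf(a, b, mu=0.0, sigma=1.0):
    """Area under N(mu, sigma) between a and b, i.e. P(a < x < b)."""
    d = NormalDist(mu, sigma)
    return d.cdf(b) - d.cdf(a)

print(round(normalcdf(-10_000, 0), 4))   # → 0.5
print(round(normalcdf(0, 1), 4))         # → 0.3413
```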
E.g. Suppose that x is normally distributed with mean 12 and standard deviation 0.3, and that an x-value is
chosen at random. Then
(a) P(x ≤ 12) =
(b) P(x < 12.2) =
(c) P(11.9 < x < 12.1) =
E.g. John has measured the actual amount of soda in many 12-oz bottles and found that it is normally distributed with mean µ = 12 oz and std dev σ = 0.3 oz.
If a bottle is chosen at random, then
(a) P(it contains no more than 12 oz) =
(b) P(it contains less than 12.2 oz)=
(c) P(it contains between 11.9 and 12.1 oz)=
What is the 75th percentile?
Finally, P75 is the point such that, if a bottle is chosen at random, the probability that the bottle contains less than P75
ounces of soda is 0.75.
When you know an area or probability and need x, use the invNorm function:
P75 ≈ invNorm(0.75, 12, 0.3) ≈ 12.2   (TI: DISTR menu)
normalcdf and invNorm
When you know µ and σ and you need a probability:
P(a < x < b) = normalcdf(a, b, µ, σ)
When you know a probability (or area) p and you need an x-value c such that P(x < c) = p:
c = invNorm(p, µ, σ)
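The inverse direction is also in the standard library. This sketch (function name is mine, mirroring the TI command) reproduces the 75th-percentile bottle example with µ = 12 and σ = 0.3.

```python
# A sketch (not from the notes): the TI's invNorm as a Python function,
# checked on the soda-bottle example (µ = 12, σ = 0.3, p = 0.75).
from statistics import NormalDist

def invnorm(p, mu=0.0, sigma=1.0):
    """The x-value c such that P(x < c) = p under N(mu, sigma)."""
    return NormalDist(mu, sigma).inv_cdf(p)

print(round(invnorm(0.75, 12, 0.3), 2))   # → 12.2
```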
The normal can approximate the binomial
When a healthy adult is given the cholera vaccine, the probability that he will contract cholera if exposed is known to be
0.15. Five hundred vaccinated tourists, all healthy adults, were exposed while on a cruise. What is the probability that
more than 25 will contract the disease?
This is clearly a binomial experiment with p = 0.15, q = 0.85, and n = 500. We certainly don’t want to compute P(x > 25) directly. Fortunately, any time both np and nq are greater than 5, the binomial distribution is very close to a normal distribution with µ = np and σ = √(npq).
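This sketch (mine, not from the notes) sets up the approximating normal distribution for the cholera example: n = 500 and p = 0.15 give µ = np = 75 and σ = √(npq) ≈ 7.98, and "more than 25" becomes an area to the right of 25.5 (the 0.5 shift is a continuity correction).

```python
# A sketch (not from the notes): the normal approximation to the binomial
# for the cholera example, n = 500 and p = 0.15.
import math
from statistics import NormalDist

n, p = 500, 0.15
q = 1 - p
mu, sigma = n * p, math.sqrt(n * p * q)   # µ = np, σ = √(npq)

# P(more than 25 contract the disease) ≈ area to the right of 25.5
prob = 1 - NormalDist(mu, sigma).cdf(25.5)
print(mu, round(sigma, 2), prob)
```

Since the mean is 75, a count of more than 25 is nearly certain; the approximation returns a probability essentially equal to 1.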
13 The Normal Distribution, Part II
True fact: the area under any normal curve between µ and µ + σ is the same as under any other.
normalcdf(0, 1, 0, 1) ≈ 0.3413
normalcdf(5, 5.3, 5, 0.3) ≈ 0.3413
This is a mathematical result the proof of which is beyond us.
This is true for any number of σs. E.g., the area between µ and µ − 1.25σ under one normal curve is the same as under
another:
normalcdf(µ1 − 1.25σ1 , µ1 , µ1 , σ1 ) ≈ 0.3944
normalcdf(µ2 − 1.25σ2 , µ2 , µ2 , σ2 ) ≈ 0.3944
This means that if we measure the distance of a point from the mean in units of the standard deviation, we can use any
normal curve to find probabilities.
How do we find the distance from µ to c in units of σ? It’s the z-score of c:
z = (c − µ)/σ
Since we can use any normal curve to find the probability that a point x is in an interval under any other normal curve,
why not use the one with µ = 0 and σ = 1?
Standard Normal Distribution N(0, 1)
Horizontal axis is usually labeled z
Sometimes called the z-distribution
Because σ = 1, the distance from µ = 0 to any point z in std devs is z
Suppose c is any number on the x-axis under the normal curve with mean µ and standard deviation σ.
Then there’s a point w on the z-axis such that P(x < c) = P(z < w).
The point w is w = (c − µ)/σ, the z-score of c.
In fact, w is the z-score of c if and only if P(x < c) = P(z < w):
Another way to say this: if x ∼ N(µ, σ), then the variable (x − µ)/σ is N(0, 1).
Suppose x ∼ N(100, 16) and we want to know P(x < 124). One way to proceed:
1. Measure how many std devs 124 is from the mean: (124 − 100)/16 = 1.5
2. “Transform to z”: on the z-axis, the point that is 1.5 standard deviations from the mean is just 1.5 :
3. Then the area under N(100, 16) between −∞ and 124 is equal to the area under N(0, 1) between −∞ and 1.5 :
P(x < 124) = P(z < 1.5) = normalcdf(−10 000, 1.5, 0, 1) ≈ 0.933
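The equality of the two areas can be confirmed numerically. This sketch (mine, not from the notes) computes both sides of P(x < 124) = P(z < 1.5) for x ∼ N(100, 16).

```python
# A sketch (not from the notes): transforming to z preserves probability.
# P(x < 124) under N(100, 16) equals P(z < 1.5) under N(0, 1).
from statistics import NormalDist

p_x = NormalDist(100, 16).cdf(124)
p_z = NormalDist(0, 1).cdf((124 - 100) / 16)   # the z-score is 1.5

print(round(p_x, 3), round(p_z, 3))   # → 0.933 0.933
```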
E.g. Suppose that x ∼ N(10, σ) and that the z-score of the x-value 8.7 is −0.95. What is P(x > 8.7)?
The Empirical Rule
For normally distributed data,
About 68% of the data will lie within 1 standard deviation of the mean
About 95% of the data will lie within 2 standard deviations of the mean
About 99.7% of the data will lie within 3 standard deviations of the mean
E.g. In a recent study, the heights of American women 20–29 years old were found to be normally distributed
with mean x̄ = 64 inches and std dev s = 2.71 inches. The Empirical Rule says that
About 68% of the women will be between 61.29 and 66.71 inches tall
About 95% of the women will be between 58.58 and 69.42 inches tall
About 99.7% of the women will be between 55.87 and 72.13 inches tall
A rule of thumb
An interval centered at the mean and four standard deviations wide will, in practice, contain nearly all the data (if it’s
approximately normally distributed). Because of this, a common quick calculation takes the range to be four standard deviations, which says that the standard deviation of a data set will be approximately range/4.
Using the Empirical Rule to test whether data is normally distributed
Compute x̄ and s for your data
See whether all three of the following are true:
approximately 68% falls in x̄ ± s
approximately 95% falls in x̄ ± 2s
approximately 99.7% falls in x̄ ± 3s
If not, it’s probably not from a normally distributed population.
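The three Empirical Rule percentages come straight from the normal cdf, and can be reproduced in a few lines. This sketch (mine, not part of the notes) computes the probability of landing within k standard deviations of the mean for k = 1, 2, 3.

```python
# A sketch (not from the notes): where the Empirical Rule numbers come from.
# P(within k std devs of the mean) = cdf(k) - cdf(-k) for the standard normal.
from statistics import NormalDist

z = NormalDist()   # standard normal, µ = 0 and σ = 1
for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)
    print(k, round(within, 4))   # 0.6827, 0.9545, 0.9973
```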
15 The Normal Distribution, Part III
Quantiles for continuous distributions
For a discrete r.v. x, the quartile Q1 was defined as a number such that about 25% of the values of x were less than Q1 .
For a continuous r.v. x, the quartile Q1 is defined to be a number such that P(x < Q1 ) = 0.25.
Similarly for Q2 , Q3 , and percentiles.
E.g. A r.v. x is normally distributed with mean 38 and std dev 10. What is the third quartile Q3 ?
Q3 = invNorm(0.75, 38, 10) ≈ 44.74
E.g. The number of beet seeds in a one-ounce box is normally distributed with mean 1600 and std dev 114.
What is the 33rd percentile P33 ?
P33 = invNorm(0.33, 1600, 114) ≈ 1549.85
E.g. Scores on a civil service exam are normally distributed with mean µ = 75 and std dev σ = 6.5. What is
the lowest score you can earn and still be in the top 5%?
If we let the r.v. x represent scores, we are being told that x ∼ N(75, 6.5). We need the score P95 :
P95 = invNorm(0.95, 75, 6.5) ≈ 85.69
Cumulative Probability Distribution
We think of the cumulative distribution function in terms of its graph.
For a numerical data set, we can graph the cumulative relative frequency, i.e., what fraction of the data is to the left of
each x:
In general, suppose u is a r.v. For each point x on the x-axis, the y-value on the graph of the cumulative distribution
function of u is P(u < x).
That is, for a given x, the y-coordinate on the graph of the cdf is the fraction of the total area which is to the left of x
under the pdf for u.
The graph of a cdf is called an ogive.
E.g. Let z ∼ N(0, 1). Tabulate a few values of the cdf of z. Sketch its graph.
x = point on z-axis   Fraction of area to the left of x
−3.49                 P(z < −3.49) ≈ 0.00024
−2.0                  P(z < −2.0)  ≈ 0.023
−1.0                  P(z < −1.0)  ≈ 0.159
0.0                   P(z < 0.0)   ≈ 0.5
1.0                   P(z < 1.0)   ≈ 0.84
2.0                   P(z < 2.0)   ≈ 0.977
3.49                  P(z < 3.49)  ≈ 0.99976
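The tabulated cdf values can be regenerated directly. This sketch (mine, not from the notes) evaluates the standard normal cdf at the same points as the table above.

```python
# A sketch (not from the notes): reproducing the table of cdf values
# for z ~ N(0, 1) with the standard library.
from statistics import NormalDist

z = NormalDist()
for x in (-3.49, -2.0, -1.0, 0.0, 1.0, 2.0, 3.49):
    print(x, round(z.cdf(x), 5))   # fraction of area to the left of x
```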
Ogive and Quantiles
E.g. Given the ogive below:
(a) Estimate Q2 .
(b) Estimate P33 .
The Normal Probability Plot
We often need a way to tell whether data is approximately normally distributed. A reasonable way to tell is the normal
probability plot.
After following this procedure, if the data seems not to lie approximately on a straight line, then it is probably not normal.
E.g. Some data sampled from a normal distribution and the TI’s normal probability plot for it:
90.4    98.5   100.9   78.5    55.7   108.3   80.2    78.0    81.6    90.8
76.8    92.4    72.2   75.8   108.6    91.6   96.9    96.7   112.5   109.0
94.2   122.2    81.5  113.1   100.7    92.7  105.1    85.8    94.1    88.4
16 Sampling Distributions and the Central Limit Theorem
We’re going to be very concerned with using the mean of a sample to estimate the mean of a population. In particular, we want
some way of estimating how far a sample mean is likely to be from the population mean.
Suppose we want to estimate the average height µ of a twenty-year-old American. The natural thing to do is to take a
reasonably large random sample, compute its mean x̄, and use x̄ as our estimate of µ. But if we took a different sample,
we’d get a different x̄. Which one should we use? Or should we use neither, but take the average of the two instead?
We’re in the position of the man with two watches who doesn’t know what time it is.
What we need to know is what happens if we take a bunch of different samples and compute their means. Do the sample
means mostly fall near the population mean µ, or are they spread out? Is their distribution predictable?
Here’s how we get at this. First, think of taking many samples, all of the same size. Next, instead of thinking of
each sample having its own x̄, think of defining a r.v. named x̄. Each time a new sample is chosen, the value of the
r.v. x̄ becomes the mean of that sample. If we call our samples Sample 1, Sample 2, and so on, then we can label the
corresponding values of x̄ as x̄1 , x̄2 , etc. If we plotted these on an x̄ axis, we’d expect them to cluster around µ:
In fact, not only do they cluster around µ, they’re normally distributed with mean µ, and we can give the standard
deviation exactly. The theorem that does it is called the Central Limit Theorem.
We want a little vocabulary before stating the theorem.
• The mean of the sample means is denoted µx̄
• The std dev of the sample means is denoted σx̄
• σx̄ is called the standard error of the mean
The distribution of sample means is called the sampling distribution of the sample mean.
C ENTRAL L IMIT T HEOREM. Suppose that samples of size n ≥ 30 are repeatedly drawn from a population with mean µ
and standard deviation σ. Then
1. The sampling distribution of the sample mean is approximately normal, and the approximation gets better as the sample size increases.
2. µx̄ = µ and σx̄ = σ/√n
3. If the population is itself normally distributed, then (1) and (2) are true even if n < 30.
You’re going to have to answer this question quite a lot:
Question: When can you apply the CLT?
Answer: When n ≥ 30 OR the population is approximately normal (or both).
We’ll call a sample large if n ≥ 30, small if n < 30.
The CLT rephrased: when n ≥ 30 or the population is approximately normal, then µx̄ = µ and σx̄ = σ/√n.
E.g. For residents of Snowburg, the average phone bill is $64. Bill amounts are normally distributed, and the
std dev is $9.
(a) What is the probability that a randomly chosen bill is less than $58?
If x represents the phone bill amount, then x ∼ N(64, 9), so the probability that a randomly
chosen bill is less than $58 is
P(x < 58) = normalcdf(−10 000, 58, 64, 9) ≈ 0.2525
(b) What is the probability that the mean of a randomly chosen sample of 36 phone bills is less than $58?
If x̄ represents the mean of a randomly chosen sample of size 36, then x̄ ∼ N(64, 9/√36) = N(64, 1.5), so the probability that the mean of a randomly chosen sample is less than $58 is
P(x̄ < 58) = normalcdf(−10 000, 58, 64, 1.5) ≈ 0.00003
E.g. A certain quantity is normally distributed with mean 42 and std dev 12. You take a sample of size n = 9.
What is the probability that the mean of your sample is between 38 and 46?
The sample is small, but because the population is known to be normally distributed, the sample means are normally distributed with mean 42 and std dev 12/√9 = 4. So the probability is normalcdf(38, 46, 42, 4) ≈ 0.68
E.g. To meet a customer’s requirements, the diameters of the ball bearings made by a certain machine must
be normally distributed with an average of 2 mm and a standard deviation of no more than 0.1 mm. An
inspector measures 35 bearings and finds a mean diameter of 1.95 mm. Should he consider the machine
to be operating within specifications?
If the machine is operating as it ought, the sample means of samples of size 35 should be normally distributed around 2 mm with a standard error of 0.1/√35 mm. Thus if the machine is working correctly, the probability of observing a sample mean of 1.95 mm or less is

normalcdf(−10 000, 1.95, 2, 0.1/√35) ≈ 0.0015
Thus, it is very unlikely that the machine is within spec.
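This sketch (mine, not part of the notes) redoes the ball-bearing calculation: under the in-spec assumption, the sample mean for n = 35 is normal with mean 2 mm and standard error 0.1/√35.

```python
# A sketch (not from the notes): the ball-bearing example. If the machine is
# in spec, how likely is a sample mean of 1.95 mm or less for n = 35?
import math
from statistics import NormalDist

mu, sigma, n = 2.0, 0.1, 35
stderr = sigma / math.sqrt(n)   # standard error of the mean, σ/√n

p = NormalDist(mu, stderr).cdf(1.95)
print(round(p, 4))   # → 0.0015
```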
Study the last example carefully. It is a standard way of approaching such problems: assume that the distribution is what
it’s claimed to be and compute the probability of seeing a sample mean at least as extreme as the one actually observed.
We’ll need this next time:
By the CLT, x̄ is normally distributed with mean µ and std dev σ/√n, so
the variable (x̄ − µ)/(σ/√n) is normally distributed with mean 0 and std dev 1.
17 Confidence Intervals, Part I
Interval notation
Recall the various kinds of intervals, e.g., (1, 3), [1, 3], (1, 3], [1, 3), (1, ∞), (−∞, 3).
Any of the finite ones of these may be written 2 ± 1 (center ± radius).
Introduction to confidence intervals
Our estimate for µ is always x̄, the mean of some sample, but we’d like to be able to add something like, “We’re 90% confident
that µ is in the interval (x̄ − E, x̄ + E).”
A point estimate of a population parameter is a one-number estimate (like x̄ for µ).
An interval estimate of a population parameter is an interval supposed to contain the parameter with a given probability
called the level of confidence.
We’ll be concerned primarily with interval estimates for the population mean.
You won’t be tested on the following explanation.
Think of choosing samples of size 50 from a much larger population with standard deviation σ = 7.3. Each sample will
have its own sample mean x̄.
Suppose we want a number E such that, for 90% of the samples, the interval (x̄ − E, x̄ + E) contains the population mean
µ.
Clearly E will depend on the spread of the population, which is to say, on σ. If σ is large, E will have to be large; if σ is
small, E can be small.
Suppose we knew E. As the diagrams above show, µ will be in the interval (x̄ − E, x̄ + E) if and only if x̄ is in the interval
(µ − E, µ + E), so if µ is in (x̄ − E, x̄ + E) 90% of the time, then x̄ must be in (µ − E, µ + E) 90% of the time. Thus, E is
the number such that P(µ − E < x̄ < µ + E) = 0.9.
We can find E by transforming to the standard normal: the z-score of µ + E must be the number z0.9 such that 90% of
the area under the standard normal curve is between −z0.9 and z0.9 :
If 90% of the area is between −z0.9 and z0.9
then 95% of the area is to the left of z0.9 ,
so z0.9 = invNorm(0.95, 0, 1) ≈ 1.6449.
So the z-score of µ + E is z0.9 ≈ 1.6449. We have
z0.9 = ((µ + E) − µ)/σx̄ = E/(σ/√n)

⇒ E = z0.9 · σ/√n ≈ 1.6449 · 7.3/√50 ≈ 1.6982
So, for 90% of samples of size 50, µ will lie in the interval (x̄ − 1.6982, x̄ + 1.6982).
Note that E = 1.6449 · 7.3/√50 says that we need about 1.6449 × the standard error of the mean on either side of x̄ to get a 90% confidence interval.
Generalizing, if we want confidence level c, our interval will be

(x̄ − zc·σ/√n, x̄ + zc·σ/√n)

We call zc a critical value and E = zc·σ/√n a margin of error.
Critical values
The critical values for a level of confidence c are the points −zc and zc such that c is the area under the standard normal
curve between −zc and zc .
From the picture, the total area to the left of zc is 1 − (1 − c)/2 = (1 + c)/2, so zc = invNorm((1 + c)/2).

E.g. z0.9 = invNorm((1 + 0.9)/2) ≈ 1.6449
zc is the number of standard errors needed for confidence level c.
Margin of error
Margin of error E = zc·σ/√n. E is the accuracy needed on the x-axis for confidence level c.
To find a z-confidence interval for the mean with confidence level c
σ must be known
You must be able to apply the CLT
1. Find σ, n, x̄, and c (you may need to compute x̄)
2. Compute the critical value: zc = invNorm((1 + c)/2)
3. Using the critical value, compute the margin of error: E = zc·σ/√n
4. The interval is x̄ ± E
E.g. A sample of size n = 36 from a population with σ = 3.1 has x̄ = 18. Find a 95% confidence interval for
the population mean.
1. From the problem, σ = 3.1, n = 36, x̄ = 18, and c = 0.95
2. zc = z0.95 = invNorm((1 + 0.95)/2) = invNorm(0.975) ≈ 1.96
3. E = 1.96 · 3.1/√36 ≈ 1.01
4. The interval is 18 ± 1.01, or (16.99, 19.01)
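The four-step recipe translates directly into code. This sketch (function name is mine, not from the notes) computes the same interval for n = 36, σ = 3.1, x̄ = 18, c = 0.95.

```python
# A sketch (not from the notes): the four-step z-interval recipe as code,
# checked on the example above (n = 36, σ = 3.1, x̄ = 18, c = 0.95).
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, c):
    zc = NormalDist().inv_cdf((1 + c) / 2)   # step 2: critical value
    E = zc * sigma / math.sqrt(n)            # step 3: margin of error
    return xbar - E, xbar + E                # step 4: x̄ ± E

lo, hi = z_interval(18, 3.1, 36, 0.95)
print(round(lo, 2), round(hi, 2))   # → 16.99 19.01
```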
E.g. Fifty measurements of visibility through the water at one particular location averaged 25 feet. Find a 90%
confidence interval for the mean visibility, assuming that the population standard deviation is 5 ft.
1. From the problem, σ = 5, n = 50, x̄ = 25, and c = 0.9
2. zc = z0.9 = invNorm((1 + 0.90)/2) ≈ 1.64
3. E = 1.64 · 5/√50 ≈ 1.16
4. The interval is 25 ± 1.16, or (23.84, 26.16)
18 Confidence Intervals, Part II
z confidence intervals on the TI (ZInterval)
The TI will let you compute a z-interval directly from sample data. Don’t do it.
Only use ZInterval if
• you know the population standard deviation σ, AND
• you can apply the CLT
Sample size
Suppose x is approximately normally distributed with std dev σ, and we must estimate µ to within a margin of error E
with confidence level c. How large a sample must we use?
Recall: margin of error E = zc·σ/√n

⇒ √n = zc·σ/E
⇒ n = (zc·σ/E)²   ←− formula for sample size
Always round the result UP to the nearest integer.
N.B.: In order to compute n using the formula, you must be able to read c, σ, and E out of the problem.
E.g. The lengths of movies in Jesse’s collection have standard deviation σ = 14.7 minutes. How large a sample
is needed to be 95% confident that the population mean will be within five minutes of the sample mean x̄?
We need to be 95% confident that µ is within 5 of x̄. I.e., E = 5.
z0.95 ≈ 1.96 ⇒ n ≈ (1.96 · 14.7/5)² ≈ 33.2, so the smallest sample has n = 34.
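The sample-size formula, with the round-up step, can be checked on the movie example. This sketch (function name is mine) uses σ = 14.7, E = 5, and c = 0.95.

```python
# A sketch (not from the notes): sample size n = ceil((zc·σ/E)²),
# checked on the movie-length example (σ = 14.7, E = 5, c = 0.95).
import math
from statistics import NormalDist

def sample_size(sigma, E, c):
    zc = NormalDist().inv_cdf((1 + c) / 2)   # critical value for level c
    return math.ceil((zc * sigma / E) ** 2)  # always round UP

print(sample_size(14.7, 5, 0.95))   # → 34
```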
The difficulty in sample size problems is identifying E. Remember that E is how close you have to come to the population
mean.
Confidence intervals: normal population, σ unknown (TInterval)
Suppose that either we have n ≥ 30 or a reason to think that a population is approximately normally distributed (because similar
populations are), but we don’t know σ. What do we do?
Approximate σ by s, of course; but this changes things a bit.
Recall that what makes the z-confidence interval work is that x̄ is approximately normally distributed with mean µx̄ = µ and standard deviation σx̄ ≈ σ/√n. When this is true, the variable z = (x̄ − µ)/(σ/√n) has the standard normal distribution. When you must approximate σ by s, you have to make an adjustment to this. What you get instead is:
If n ≥ 30 or the population is approximately normally distributed, then the statistic
t = (x̄ − µ)/(s/√n)
has what is called a t-distribution. The t-distribution is a lot like the standard normal distribution and can be used in
much the same way.
Properties of the Student t-distribution
• bell-shaped, like the normal curve
• total area under curve is 1
• exact curve depends on the number of degrees of freedom (d.f.)
• for us, d.f. is always n − 1 (where n is the sample size)
• central hump is lower than the normal curve, tails are thicker
• when n ≥ 30, the t and the standard normal are very close
You won’t be tested on properties of the t-distribution.
When to use z and when to use t
If you can use the CLT, then
• if the std dev is from the population, use z;
• if the std dev is from the sample, use t.
If you can’t use the CLT, don’t construct a confidence interval.
E.g. In a random sample of fifteen CD players brought in for repair, the average repair cost was $80 and the
standard deviation was $14. Assuming that repair costs are approximately normally distributed, use your
calculator to construct a 90% confidence interval for µ.
You don’t know σ, so you should use a t interval.
The interval is (73.633, 86.367).
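Python’s standard library has no t-distribution, so this sketch (mine, not from the notes) hard-codes the critical value t* ≈ 1.761 for 14 degrees of freedom at 90% confidence, taken from a t-table; everything else follows the same pattern as the z-interval.

```python
# A sketch (not from the notes): the CD-player t-interval by hand.
# The critical value t* for d.f. = 14, c = 0.90 is hard-coded from a t-table.
import math

xbar, s, n = 80, 14, 15
t_star = 1.761   # t critical value, d.f. = n - 1 = 14, c = 0.90

E = t_star * s / math.sqrt(n)   # margin of error, t* · s/√n
print(round(xbar - E, 2), round(xbar + E, 2))   # ≈ (73.63, 86.37)
```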
E.g. Construct 90% and 95% confidence intervals for the mean of the population from which the following
sample data was taken.
90.4    98.5   100.9  108.3    80.2    78.0   76.8    92.4    72.2    91.6
96.9    96.7    94.2  122.2    81.5    92.7  105.1    85.8    78.5   113.1
108.6   81.6    94.1  109.0    75.8    55.7  100.7   112.5    90.8    88.4

Now use STAT, TESTS, 8:TInterval (with Data, not Stats):

The 90% interval: (87.97, 96.91)
The 95% interval: (87.059, 97.821)
19 Hypothesis Testing, Part I
Suppose that someone claims that the average hourly rate charged by lawyers in your area is $200. To test this claim,
you survey 30 firms. If you find an average of $205 per hour, would you say that you had enough evidence to reject the
claim?
Probably not. But what if you found an average of $300 per hour for your sample? Then, you would almost certainly
conclude that the claim of $200 per hour was wrong. To understand the idea of hypothesis testing, you have to realize
why you would be so sure that the claim was wrong: it’s because your intuition tells you that if the $200 per hour claim
were right, then it would be extremely unlikely ever to find a largish sample with a mean as high as $300.
To make this precise, suppose that hourly rates really have mean $200 and standard deviation $50. Then sample means
for samples of size 30 will be normally distributed with mean $200 and standard deviation 50/√30 ≈ 9.13. That makes a
sample mean of $300 more than ten standard deviations away from where we expect to find it! If we go ahead and calculate
the probability of finding a sample with a mean of $300 or more, we find that it is normalcdf(300, 10 000, 200, 50/√30) ≈
3.4 × 10⁻²⁸ —a probability so small that we normally just think of it as zero.
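That tail probability can be reproduced without the TI. A sketch in Python, expressing the upper tail of a normal distribution through math.erfc (the function name upper_tail is my own; erfc is used because subtracting two CDF values this far in the tail would lose all precision):

```python
import math

def upper_tail(x, mu, sigma):
    """P(X > x) for X ~ Normal(mu, sigma), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

sigma_xbar = 50 / math.sqrt(30)      # std dev of the sample mean, about 9.13
p = upper_tail(300, 200, sigma_xbar)
print(p)                             # on the order of 10**-28, effectively zero
```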
We just did a hypothesis test of a claim about a population mean µ. Specifically, we
1. temporarily accepted the claim about µ and σ (i.e., as a hypothesis);
2. computed the mean of a specific sample from the population; and
3. used the claim to compute the probability of observing a sample mean at least as extreme as the one observed.
The probability of observing a sample mean at least as extreme as the one observed was very small, so we rejected the
claim.
We’ll refine and formalize this process, but we need some other stuff first.
Statistical hypotheses
In statistics, a null hypothesis H0 is a statement about the equality or inequality of two quantities. A null hypothesis
always contains one of the symbols ≤, =, ≥.
E.g. H0 : µ ≤ 1,  H0 : µ = 50,  H0 : σ² ≥ 42
We’ll only be concerned with hypotheses about a single mean.
An alternative hypothesis H1 is the negation of a null hypothesis.
These are the only types of null and alternative hypotheses we will see:

H0 is      if and only if H1 is
µ ≤ µ0     µ > µ0
µ = µ0     µ ≠ µ0
µ ≥ µ0     µ < µ0
Choosing hypotheses
A professional statistician would choose hypotheses based on what types of errors could occur and the seriousness of each
type. We’ll just use a few simple rules.
1. A claim is being made about the relation of the population mean to some number µ0 . Find this claim and use it to
identify µ0 .
2. The alternative hypothesis must be one of µ < µ0 , µ ≠ µ0 , or µ > µ0 . Choose the one that you want to support (or
that the person in the problem wants to support).
3. The null hypothesis is the negation of the alternative hypothesis.
Note that the word “claim” in a problem sometimes corresponds to H0 and sometimes to H1 .
E.g. A certain chain of movie theaters claims that the average price for a movie ticket in its theaters is no more
than $7.25. By querying people on the internet, John has put together a sample of thirty ticket prices with
a mean of $7.50.
(a) What is the population here?
(b) What claim is being made about the population mean µ?
(c) What is µ0 ?
(d) What are the possibilities for the alternative hypothesis?
(e) What is John’s alternative hypothesis?
(f) What is John’s null hypothesis?
E.g. A rendering company claims that on average, each ounce of shmoo oil contains no more than one gram
of saturated fat. Jane does not believe this claim, and wishes to do an experiment to prove her point.
(a) What is the population here?
(b) What claim is being made about the population mean µ?
(c) What is µ0 ?
(d) What are the possibilities for the alternative hypothesis?
(e) What is Jane’s alternative hypothesis?
(f) What is Jane’s null hypothesis?
More specific idea of the tests
Assume that σ is known, that we can apply the CLT, that H0 : µ ≤ 3, and that we have x̄ for a specific sample.
• If H0 is true, then µ is less than or equal to 3.
• x̄ should be near µ, so x̄ should be less than or equal to 3, or at least not much greater than 3
• We'll reject H0 if x̄ is too much greater than 3
  i.e., if x̄ − 3 is too much greater than 0
  i.e., if zx̄ = (x̄ − 3)/(σ/√n) is too much greater than 0
Because zx̄ has the standard normal distribution, we can compute the probability of finding a value of zx̄ that is larger
than some cutoff value. In practice, we won’t ever find the cutoff value explicitly. Instead, we’ll find the area to the right
of our actual zx̄ , and if that is too small, we’ll reject H0 .
For the null hypothesis H0 : µ ≥ 3, we want to reject H0 if zx̄ is too much less than 0:
For H0 : µ = 3, we want to reject H0 if zx̄ is too far from 0 in either direction:
20 Hypothesis Testing, Part II
Procedure for testing H0 : µ ≤ µ0 when σ is known
To use this, you must know σ and be able to apply the CLT.
1. Decide on a level of significance α (in problems, this is given)
• typical values are 0.1, 0.05, 0.01
2. Get x̄ for your sample
• x̄ is called the test statistic
3. Determine the z-score of this x̄
• Formula: zx̄ = (x̄ − µ0)/(σ/√n)
• zx̄ is the standardized test statistic
4. Determine the p-value for this x̄
• Formula: p = normalcdf(zx̄ , 10 000)
• p is the probability of observing a sample mean at least as extreme as this x̄ if the population mean and std dev are really µ0 and σ
5. If p ≤ α, reject H0 . Otherwise, don’t.
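The five steps can be collected into a small function. This is an illustrative Python sketch, not TI syntax; the names z_test_upper and normalcdf (an erfc-based stand-in for the calculator command) are my own:

```python
import math

def normalcdf(lo, hi, mu=0.0, sigma=1.0):
    """P(lo < X < hi) for X ~ Normal(mu, sigma), like the TI command."""
    cdf = lambda x: 0.5 * math.erfc(-(x - mu) / (sigma * math.sqrt(2)))
    return cdf(hi) - cdf(lo)

def z_test_upper(xbar, mu0, sigma, n, alpha):
    """Test H0: mu <= mu0 against H1: mu > mu0 when sigma is known."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))   # standardized test statistic
    p = normalcdf(z, 10_000)                    # right-tail p-value
    return z, p, p <= alpha                     # True means "reject H0"

# The movie-ticket example below: xbar = 7.50, mu0 = 7.25, sigma = 1.30
z, p, reject = z_test_upper(xbar=7.50, mu0=7.25, sigma=1.30, n=30, alpha=0.10)
print(round(z, 2), round(p, 3), reject)
```

Note that with the unrounded z this gives p ≈ 0.146; the notes' value of ≈ 0.147 comes from rounding z to 1.05 before applying normalcdf. Either way, p > α and H0 is not rejected.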
E.g. A certain chain of movie theaters claims that the average price for a movie ticket in its theaters is no more
than $7.25. By querying people on the internet, John has put together a sample of thirty tickets with a
mean price of $7.50. Does he have enough evidence to reject the chain’s claim at the 0.10 significance
level? Assume that the population is normally distributed with standard deviation $1.30.
Clearly, John wants to show that µ > 7.25, so H1 : µ > 7.25, whence H0 : µ ≤ 7.25. Also, n = 30 and
σ = 1.30. Steps:
1. α = 0.10
2. x̄ = 7.50
3. zx̄ = (7.50 − 7.25)/(1.30/√30) ≈ 1.05
4. p = normalcdf(1.05, 10 000) ≈ 0.147
5. It's not true that p ≤ α, so don't reject H0
Note that when we reject H0 , we don’t say that we have proved H1 . Rather, we have found H0 unlikely, and accept H1 as more
likely.
Procedure for testing H0 : µ ≥ µ0 when σ is known
This is the same as the previous case except for:
4. Determine the p-value for this x̄
• p = normalcdf(−10 000, zx̄ )
Procedure for testing H0 : µ = µ0 when σ is known
This is the same as the previous case except for:
4. Determine the p-value for this x̄
• If x̄ < µ0 , p = 2 · normalcdf(−10 000, zx̄ )
• If x̄ > µ0 , p = 2 · normalcdf(zx̄ , 10 000)
p is always computed in such a way that we can reject H0 if p ≤ α.
When p is low, reject H0 .
E.g. A company claims that the amount of cereal in its 24-ounce boxes is normally distributed with mean 24
oz and standard deviation 1 oz. You have checked 30 boxes and found a mean of 24.5 oz. Do you have
enough evidence to reject the company’s claim at the 0.01 significance level?
You want to reject the claim that µ = 24, so your alternative hypothesis should be µ ≠ 24. Thus, H0 : µ =
24.
1. α = 0.01 is given
2. x̄ = 24.5
3. zx̄ = (24.5 − 24)/(1/√30) ≈ 2.7386
4. p = 2 · normalcdf(2.7386, 10 000) ≈ 0.00617
5. p ≤ α, so reject H0 —i.e., reject the company's claim
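The two-tailed p-value here can be checked with a short sketch; upper_tail is my own erfc-based stand-in for the right-tail normalcdf:

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

xbar, mu0, sigma, n, alpha = 24.5, 24.0, 1.0, 30, 0.01
z = (xbar - mu0) / (sigma / math.sqrt(n))
p = 2 * upper_tail(abs(z))        # two-tailed, since H1 is mu != mu0
print(round(z, 4), round(p, 5), p <= alpha)   # 2.7386 0.00617 True (reject H0)
```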
21 Hypothesis Testing, Part III
Type I and Type II errors
In the American justice system, a defendant is innocent until proven guilty. Possible circumstances and outcomes of a
trial:
                          Man is really innocent   Man is really guilty
Jury finds man guilty     Error                    Justice
Jury finds man innocent   Justice                  Error
Note the two very different kinds of possible error. Note also that either error might be made by a completely honest and
scrupulous jury: the conclusion they reach depends on the evidence they are given, so if the evidence does not correctly
represent the situation, they will be led into error even with perfect attention and judgement.
If we let H0 be “The man is innocent” and H1 be “The man is guilty,” then we can translate the table to:
                H0 is actually true   H0 is actually false
We reject H0    Type I Error          Correct
We accept H0    Correct               Type II Error
For a trial, we regard a Type I error as the more serious type, so we require a number of people to agree that guilt is proved
“beyond a reasonable doubt”.
Usually, the harder we try to prevent a Type I error, the more likely we are to make a Type II error.
The names we used for the two different kinds of errors are standard. Roughly:
Type I error: accept H1 even though it is false (false positive)
Type II error: accept H0 even though it is false (false negative)
The trial situation is perfectly general. Even with the best techniques, an unusual sample can be chosen, yielding a
sample mean that is not near the population mean (the equivalent of bad evidence).
To figure out what a Type I error would mean in the context of a particular problem, ask yourself:
(i) What would it mean to accept H1 ?
(ii) What would it mean for H1 to be false?
Use similar questions for Type II errors.
Answers to “Describe the practical consequences of making a Type X error in the context of the problem” questions
should have the form
Type I : Accept H1 when in fact H1 is false.
Type II: Accept H0 when in fact H0 is false.
E.g. You are in charge of drug testing for your company. Write (in words) an H0 and an H1 suitable for showing
the presence of a drug, and describe the practical consequences of making Type I and Type II errors.
H0 : no drug present
H1 : drug present
Type I : Accept that a drug is present when in fact there is no drug
Type II: Accept that no drug is present when in fact a drug is present
Note that “false positive” and “false negative” are perfect descriptions in this case.
E.g. A regulation requires an average bacteria count µ of 70 as the maximum acceptable for fishing waters. If
the average is above 70, the site is considered unsafe and is closed.
(a) Suppose that you work for the agency which monitors these waters, and you are interested in showing
that they are unsafe. Write appropriate null and alternative hypotheses for testing the waters.
H0 : (                    )
H1 : (                    )
(b) Describe the practical consequences of making Type I and Type II errors in this situation.
Type I:
Type II:
It is better and more satisfactory to acquit a thousand guilty persons than to put a single innocent
man to death once in a way.
—Maimonides, The Commandments (Negative Commandment 290)
22 Hypothesis Testing, Part IV
Z-Test on the TI
Only use Z-Test if
• you know the population standard deviation σ AND
• you can apply the CLT
E.g. A company claims that the amount of cereal in its 24-ounce boxes is normally distributed with mean 24
oz and standard deviation 1 oz. You have checked 30 boxes and found a mean of 24.5 oz. Use Z-Test on
the TI calculator to test the company’s claim at the 0.05 significance level.
Note. We found earlier that µ0 = 24, σ = 1, x̄ = 24.5, n = 30, H0 : µ = 24, and H1 : µ ≠ 24.
Note that it is the alternative hypothesis H1 that you indicate on the penultimate line.
When σ is unknown
If you are able to apply the CLT, use T-Test on the TI (see below for instructions).
E.g. The XYZ Corporation has told its staff that the average salary of a secretary is normally distributed with
a mean of $22,000. Janet believes that the average is really less than $22,000 and has found that the ten
secretaries in her area have the salaries shown below. Write her null and alternative hypotheses and test
her claim at the 0.05 significance level.
21178  21569  22424  21806  22814
20834  20509  22259  20727  21555
We don’t know σ, so we can’t use a z-test, but because we expect the salaries to be normally distributed,
we can use a t-test.
We have H0 : µ ≥ 22 000 and H1 : µ < 22 000.
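For illustration, here is the t statistic computed by hand in Python. The one-tailed critical value −1.833 for df = 9, α = 0.05 is an assumption taken from a t table (the TI's T-Test reports a p-value instead):

```python
import math
import statistics

salaries = [21178, 21569, 22424, 21806, 22814,
            20834, 20509, 22259, 20727, 21555]

mu0 = 22000
n = len(salaries)
xbar = statistics.mean(salaries)
s = statistics.stdev(salaries)          # sample standard deviation

t = (xbar - mu0) / (s / math.sqrt(n))   # standardized test statistic
print(round(t, 3))                      # about -1.776

# Left-tail critical value for df = 9, alpha = 0.05 (from a t table)
t_crit = -1.833
print("reject H0" if t < t_crit else "don't reject H0")
```

Since t ≈ −1.776 does not fall below −1.833, the p-value is just above 0.05 and Janet cannot reject H0 at this level.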
The p-value for your x̄ can be read from the T-Test results.
Which test?
Just like confidence intervals:
• If σ is known and you can apply the CLT, use a z-test.
• If σ is unknown and you can apply the CLT, use a t-test.
• Otherwise, don't use either—call a professional.
In problems, ask yourself: where does the standard deviation come from—from the sample or from the population?
23 Linear Regression
Linear models
Hooke’s Law tells us that if we stretch a spring a distance x, then the force with which the spring pulls against the stretch
is kx for some number k (which depends on the spring, of course). But suppose we take a real spring, stretch it various
distances x, and measure the force y. Will we see an exact relationship y = kx? Of course not; there are bound to be
errors. Instead, we would see something like this:
Our measurements should have given us points (xi , yi ) where each yi = kxi , but they didn’t quite do so. Instead, we have
yi = kxi + εi , where εi is a random error. A relationship like this is a type of linear (statistical) model.
Notice that if we subtract the linear part (the kxi ) from the data, we are left with just the error. In many practical cases,
this error is approximately normally distributed with mean 0. The correlation and regression techniques we will study
depend on this, so we will assume it.
When the data points don’t lie very close to a line, it can still make sense to use a linear model, as the following example
shows.
Instead of thinking of two variables x and y as simply being related, we often think of using a model to explain y by x
or to predict y from x. For example, if we wanted to explain patients’ responses to medication by the dosages they take,
we would let x be the dosage and y be the response. Then, if responses really do depend only on dosages, and if the
dependence is linear, the data points (xi , yi ) from a study should lie very close to a straight line.
In many situations, however, more than one variable must be used to explain what is observed. If we give patients various
doses of a drug, for example, their responses may well depend on both the dosage and the patients’ ages, so we might
have yi = k1 xi + k2 zi + εi where xi is the dosage amount and zi is the patient’s age.
Imagine that you have been given just the dosages and responses from such an experiment, as in the following picture.
If you try to explain the response y using only the dosages, you will find that your formula yi = k1 xi isn’t very accurate.
As the scatter plot shows, there’s a linear trend, but with quite a lot of variation. If you add the line y = k1 x to the plot, it
will track the trend, but many data points will be fairly far from it.
This makes sense, because the dosage only partially explains the response. It also makes sense to try to model the
response using a linear function of only the dosage, because the linear dependence is really there. It’s just not the only
dependence.
Statisticians have developed methods for developing multi-variable models, but we won’t study them in this course; we
will be content with finding one line. Ultimately, we will need to answer three questions:
• Supposing that a linear model can be used, which line is best?
• When is it reasonable to use a linear model?
• How good is the model?
The answer to the first question is the regression line.
The regression line
If there is a linear relationship between x and y, the line that best
fits the (x, y) points is called the regression line or trend line.
How can we find this line? If, truly, yi = axi +b+εi , then for each
xi , the difference between each yi and axi + b is just a random
error εi . (See the picture). The total error is the sum ∑ |yi −
(axi + b)|, and the best line should be the one that minimizes the
total error.
Graphically, this means picking the line for which the sum of the
vertical distances from each data point to the line is as small as
possible.
In practice, we minimize ∑[yi − (axi + b)]², so the method is
called least squares. This seems reasonable, as x² is always positive and varies like |x| (i.e., is small when |x| is small
and large when |x| is large).
In fact, it can be proved that the method of least squares gives the best linear fit in most practical situations.
Specifically: if, for each x, the y-values are approximately normally distributed and, for each y, the x-values are approximately
normally distributed, then least squares will give the best fit.
E.g. The table shows the ages and salaries of ten secretaries at the XYZ corporation. Give the equation of the
trend line.
Age     20     21     26     32     25     21     21     22     20     24
Salary  21178  21569  22424  21806  22814  20834  20509  22259  20727  21555
The calculator gives a ≈ 109.68 and b ≈ 19 022.89 (see the LinReg output below), so the equation of the
line is y = 109.68x + 19 022.89.
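The calculator's LinReg output comes from the closed-form least-squares formulas a = Sxy/Sxx and b = ȳ − a·x̄. A sketch reproducing it:

```python
import statistics

ages = [20, 21, 26, 32, 25, 21, 21, 22, 20, 24]
salaries = [21178, 21569, 22424, 21806, 22814,
            20834, 20509, 22259, 20727, 21555]

xbar = statistics.mean(ages)
ybar = statistics.mean(salaries)

# Least squares: slope = S_xy / S_xx, intercept = ybar - slope * xbar
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, salaries))
s_xx = sum((x - xbar) ** 2 for x in ages)
a = s_xy / s_xx
b = ybar - a * xbar
print(round(a, 2), round(b, 2))   # 109.68 19022.89
```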
E.g. Make a scatter plot of the data in the previous example.
Using Linear Regression
We use the regression line to predict y-values: simply plug a number in for x and compute ŷ = ax + b. (The notation ŷ is
traditional for predicted values of y.)
Strictly speaking, we should use the regression line only when the correlation between an independent variable x and a dependent variable y is significant (see the section on correlation). We will simply assume that this is true unless told otherwise. (If
the correlation is insignificant, then the best linear predictor for y is simply y = ȳ.)
Only use the regression equation with values of x that are within, or very near, the range of x values used to compute the
trend line coefficients.
Finding ŷ for an x within the range of x-values used to compute the regression line coefficients is called interpolation;
finding ŷ for an x outside that range is called extrapolation.
Interpolation Example
The table below shows the density of pure liquid water at various temperatures.
Temp (°C)        0       4      20      40      60      80
Density (g/cm³)  0.9999  1.000  0.9982  0.9922  0.9832  0.9718
Use linear interpolation to estimate the density at 30◦ C.
To do this problem, we use the TI, putting the temperature values in the L1 list and the corresponding density values in
the L2 list. We get a ≈ −0.000345, b ≈ 1.00261, so the regression line is D = −0.000345T + 1.00261, where T is the
temperature and D is the density. Thus the density at T = 30◦ C is estimated to be −0.000345(30) + 1.00261 = 0.99226.
The TI’s scatter plot and regression line with the interpolated value
added.
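The same closed-form least-squares formulas reproduce this interpolation; the coefficient names a and b match the TI's LinReg output:

```python
import statistics

temps = [0, 4, 20, 40, 60, 80]                                # deg C
densities = [0.9999, 1.000, 0.9982, 0.9922, 0.9832, 0.9718]   # g/cm^3

tbar = statistics.mean(temps)
dbar = statistics.mean(densities)
a = sum((t - tbar) * (d - dbar) for t, d in zip(temps, densities)) \
    / sum((t - tbar) ** 2 for t in temps)
b = dbar - a * tbar

# Interpolate: 30 deg C lies inside the 0-80 range of the data
print(round(a * 30 + b, 5))   # 0.99226
```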
Extrapolation Example
The table below shows the global temperature anomaly (difference between annual average global temperature and the
mean of that quantity for the period 1901–2000) for the years 1950–1969.
Year        1950    1951    1952    1953    1954    1955    1956
Difference  −0.207  −0.196  −0.186  −0.184  −0.186  −0.186  −0.176

Year        1957    1958    1959    1960    1961    1962    1963
Difference  −0.155  −0.131  −0.111  −0.100  −0.101  −0.113  −0.131

Year        1964    1965    1966    1967    1968    1969
Difference  −0.148  −0.157  −0.153  −0.138  −0.120  −0.103
Find the coefficients for the trend line. Use the trend line to predict the anomaly for 2009.
Let x = 0 correspond to 1950, so that List L1 in the TI contains the numbers 0, 1, 2, . . . 19. With this we find that
a ≈ 0.00427 and b ≈ −0.1897, so the regression equation is y = 0.00427x − 0.1897. Plugging in 59 for x yields
ŷ2009 = 0.06223.
However, the real y2009 was 0.415 — a huge difference.
Moral: Beware of extrapolation!
24 Correlation
Remember the questions from the previous section:
• If a linear model can be used, which line is best?
• When is it reasonable to use a linear model?
• How good is the model?
We want to answer the second and third of these.
Given some (x, y) data, we would like to try to find out whether there is a linear model that fits the data reasonably well.
That is, we’d like to detect when there are numbers a and b such that, to a large extent, y = ax + b.
Now, we can draw a line through any bunch of points, but that doesn’t make the line a reasonable model. When is there
a reasonable linear model? The answer is given by defining a quantity called correlation which measures how close the
(x, y) points are to being on a straight line. The closer they are, the more reasonable it is to assume that a linear model
can be used.
For a population, it’s possible to define a quantity called the Pearson correlation coefficient ρ which varies between −1
and 1 and to prove that an exact linear relationship y = ax + b exists if and only if ρ is −1 or 1. This is very important
theoretically, but in practice we almost never have a population, only samples. For sample data, we compute r, an
estimate for ρ.
Properties of r
• r is always between −1 and 1.
• When r is near −1, there is a strong negative linear correlation between x and y. That is, the (x, y) points lie near a straight line with negative slope.
• When r is near 0, there is no linear correlation between x and y; when r is near zero but not equal to zero, we say that the correlation is weak.
• When r is near 1, there is a strong positive linear correlation between x and y. That is, the (x, y) points lie near a straight line with positive slope.
r is the answer to the second question. We’ll compute it with the calculator:
Note that r near zero does not mean that x and y are not correlated; it means that there is no linear relationship.
In the social sciences, small, medium, and large ‘effect sizes’ are about 0.1, 0.3, 0.5.
r²
Think of the regression line as predicting the value of y, given the value of x. What we’d like to know is: for a given case,
how good is this prediction? If the data is very nearly on a line, then the prediction should be very good: the predicted
values of y shouldn’t be too different from the actual values. If the data doesn’t lie very near a line, then the prediction
won’t be very good. We know that the correlation coefficient r measures how close the data comes to being on a line, so
if r is near ±1, the prediction will be good, and if r is near zero, it won’t be.
We don’t use r directly for this, however. To understand the language that is used for this, consider the set of (x, y) data
pictured below, in which there’s an obvious positive linear correlation.
If we think of moving from left to right along the x-axis and watching what happens to the y values, we see that the
changes in y are really made up of two parts: an overall linear trend and something else that makes the y value jump up
and down from the trend line. If we subtract out the linear part, all that is left is the “something else”—i.e., the rest of
the variation in y:
If this remaining variation is small, then the trend line explains most of the variation in y, and so is a good fit. If it is
large, then there’s a large amount of variation in y which is not explained by the trend line, i.e., the line is not a good fit.
It turns out that the value of r² is exactly the fraction of the variation in y which is explained by the trend line.
r², called the coefficient of determination, is the answer to our third question. When r² is small (near 0), the linear model
does not explain much of the variation, and so is not very good; when it is large (near 1), the model explains most of the
variation and so is a good model.
Note that r² is always between 0 and 1.
E.g. A study of the correlation between high school GPA and first-year college GPA for 63,482 students found
that r = 0.39 at the 0.05 significance level. How much of the variation in first-year college GPA is
explained by high school GPA?
We have r = 0.39, so r² ≈ 0.15. We say that about 15% of the variation in first-year college GPA is
explained by high school GPA.
E.g. The table shows the ages and salaries of ten secretaries at the XYZ corporation. Determine the correlation
coefficient. About what percentage of the variation in salary is explained by age?
Age     20     21     26     32     25     21     21     22     20     24
Salary  21178  21569  22424  21806  22814  20834  20509  22259  20727  21555
As the TI screen shows, r ≈ 0.53 and r² ≈ 0.28. We say that about 28% of the variation in salary is
explained by age.
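r can also be computed directly from the formula r = Sxy/√(Sxx·Syy). A sketch using the same data:

```python
import math
import statistics

ages = [20, 21, 26, 32, 25, 21, 21, 22, 20, 24]
salaries = [21178, 21569, 22424, 21806, 22814,
            20834, 20509, 22259, 20727, 21555]

xbar, ybar = statistics.mean(ages), statistics.mean(salaries)
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(ages, salaries))
s_xx = sum((x - xbar) ** 2 for x in ages)
s_yy = sum((y - ybar) ** 2 for y in salaries)

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 2), round(r * r, 2))   # 0.53 0.28
```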
A bad mistake
Correlation does not imply causation!
For example, the number of metropolitan readers of the weekday New York Times between 1993 and 2007 and the
percentage of American adults who smoked cigarettes regularly in those years are very highly correlated (r = 0.96, my
calculation). Did either of these cause the other?
This stuff is used everywhere...
Figure 6: a) Scatter graph of E2F4 and RBL2 expression levels. The linear correlation coefficient is −0.36. Clearly, there
is little relationship between the two sets of expression data.
b) Scatter graph of the predicted E2F-4 and p130 TF activities.
The linear correlation coefficient is found to be
−0.80. The training sets of E2F-4 and p130 included 12 and 43
interactions, respectively. Only three of the genes were coregulated by both TFs.
Taken from Transcriptional regulatory networks via gene ontology and expression data, In Silico Biology 7 (2006)
Significance of a correlation
If a correlation coefficient r is determined from a small sample, then even if |r| is large, the correlation detected may
not be significant. (If n = 2, for example, then r = ±1 no matter what.) If r is determined from a large sample, the
correlation may be significant even if it is fairly weak. Significance is determined by a hypothesis test. We won’t do the
test, just use the results.
Let n be the number of ordered pairs used to determine r. To determine significance, we use a table of what are called
the critical values for r. Note that the critical value depends on both n and α.
n    α = 0.05  α = 0.01       n    α = 0.05  α = 0.01
4    0.950     0.990          13   0.553     0.684
5    0.878     0.959          14   0.532     0.661
6    0.811     0.917          15   0.514     0.641
7    0.754     0.875          16   0.497     0.623
8    0.707     0.834          17   0.482     0.606
9    0.666     0.798          18   0.468     0.590
10   0.632     0.765          19   0.456     0.575
11   0.602     0.735          20   0.444     0.561
12   0.576     0.708          21   0.433     0.549
Testing for significance
1) Using the table, determine the critical value c for your α and n.
2) If |r| > c, the correlation is significant; otherwise, it isn’t.
On a test, you will have to be given a critical value or a table from which to determine it.
E.g. A correlation coefficient of r = −0.5 was computed using a sample of size n = 20. Test the significance
of the correlation at the α = 0.05 level.
1) From the table, the critical value for n = 20, α = 0.05 is 0.444.
2) | − 0.5| = 0.5 and 0.5 > 0.444, so the correlation is significant.
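The two-step check is easy to mechanize. A sketch with a fragment of the α = 0.05 column of the critical-value table hard-coded (the names CRIT_05 and is_significant are my own):

```python
# Critical values for r at alpha = 0.05, keyed by sample size n
# (a fragment copied from the table above)
CRIT_05 = {10: 0.632, 15: 0.514, 20: 0.444, 21: 0.433}

def is_significant(r, n, crit_table=CRIT_05):
    """Return True if |r| exceeds the critical value for this n."""
    c = crit_table[n]
    return abs(r) > c

print(is_significant(-0.5, 20))   # True, since |-0.5| > 0.444
```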