What Is Statistics?
STATISTICAL METHODS I
Ou Zhao
University of South Carolina
First Encounter with Statistics
In order to talk about statistics, or to formally give a definition, one has to mention data. I hope everybody in this room has seen data of some sort, either from a newspaper or from your school coursework. One possible dataset I can collect right away is by asking you whether you like President Obama or not, and apparently I would be interested in the proportion of those people who indeed like the current president.
Definition: Statistics is the science of data. This involves
collecting, classifying, summarizing, organizing, analyzing,
and interpreting numerical information.
Remark: You may instantly argue with me that if your answer to my previous question is "yes", that is not numerical information. But how about coding it by 1?
STAT 515 – p.1
STAT 515 – p.2
More About The Definition
• There are, in general, two ‘kinds’ of statistics:
Descriptive Statistics uses a variety of means to look for
patterns in a data set, to summarize the information revealed
in a data set, and to present that information in a convenient
way.
Inferential Statistics utilizes something, called sample
data, to form estimates, decisions, predictions, or other
generalizations about a larger set of data, commonly referred
to as a population.
Sample Matters
• Suppose one is interested in the average age of viewers of ABC World News Tonight. How do you best address this question?
• Apparently this involves a big population in America (or
even worldwide), so one cannot ask everybody. The idea
of drawing ‘small’ sub-populations comes up naturally.
• A Statistical Inference is an estimate, prediction, or
some other generalization about a population based on
information contained in a sample.
Remark: As you can best guess, a population is a set of units
(say, people, transactions, or events) that we are interested in
studying.
STAT 515 – p.3
STAT 515 – p.4
Random Sample
• By sample, we mean a subset of the units of a
population. This subset can be big, or small, depending
on situations.
• Selection bias, nonresponse bias, measurement error.
• what is a good sample? Intuitively, we want something
representative of the population. In statistics, it is
formalized as a random sample: a sample selected from
the population in such a way that every different sample
of size n has an equal chance of selection.
• Random samples sometimes can be hard to get.
EPA Car Mileage Rating Data
• However, one can easily get samples like the EPA car mileage data. See the R output for a histogram produced using hist().
• This is a good time to introduce R, a free statistical
package, which is available from
http://cran.r-project.org/
on which, you can also find introductions, both quick
and comprehensive.
STAT 515 – p.5
STAT 515 – p.6
Software Comparison
• Advantages of R over Minitab:
(1) free, so good for students;
(2) written by research statisticians who are working at the frontiers, which means more built-in modern statistical packages;
(3) interactive interface, and many other features.
• However, it is not as commercialized as Minitab, so it is less popular in industry.
STAT 515 – p.7
Stem-and-Leaf Plot
The EPA car mileage data consist of 100 observations; you may find this data set in the file ‘epagas.xls’ on your textbook CD. To read them into R, you can first export them into a txt file and try x<-scan("epagas.txt"); then x should acquire all those observations.
Stem-and-leaf plots are very similar in purpose to histograms: they show how the data are distributed. You may try the command stem(x) in R; the output is displayed on the next slide.
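A minimal sketch of these steps (assuming epagas.txt holds one mileage value per line):
x <- scan("epagas.txt")                  # read the 100 mileage ratings
hist(x, main="car mileage histogram")    # histogram
stem(x)                                  # stem-and-leaf plot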
STAT 515 – p.8
Stem() output
> stem(x)
The decimal point is at the |
  30 | 08
  32 | 5799126899
  34 | 02458801235667899
  36 | 01233445566777888999000011122334456677899
  38 | 012234567800345789
  40 | 0123557002
  42 | 1
  44 | 9
STAT 515 – p.9
Output of hist()
[Figure: histogram of the EPA car mileage data, titled “car mileage histogram”.]
Patterns
STAT 515 – p.10
Exercise
• For a picture like a histogram, one may look for interesting features.
• For instance, where do the observations center around? Is the picture symmetric? If it is not symmetric, then it is called skewed (right or left).
• We shall contrive an example by adding more observations to the previous example.
Would you expect the following data sets to possess
histograms that are symmetric, skewed to the right, or skewed
to the left?
a. The ages of automobiles on a used-car lot
b. Repeated measurements of the same chemical constant
c. The grades on a difficult test
STAT 515 – p.11
STAT 515 – p.12
Data Type
• As we can see, data can come in different ways; they may be easily identified as two types.
• Quantitative data are measurements that are recorded on a naturally occurring numerical scale.
1. The current unemployment rate in each of the 50
states.
2. The scores of a sample of 150 law school applicants
on the LSAT, a standardized law school entrance
exam.
• Qualitative data are measurements that cannot be measured on a natural numerical scale; they can only be
classified into one of a group of categories.
1. The political party affiliation in a sample of 100
voters.
Mean
• Numerical measures of central tendency: One obvious choice is the mean, which is defined as
      x̄ = (Σ_{i=1}^n x_i) / n,
where the x_i are the data points.
• Look at the EPA data: one can get the sample mean by using mean(EPA). You can check that with sum(EPA)/100. The mean tells you where most of the observations tend to center.
STAT 515 – p.13
STAT 515 – p.14
Mean and Median
• The other competitive notion is the median: suppose you have an odd number of data points; then the median is defined to be the value right in the middle of the sorted data. But if your sample has an even number of points, the median is the average of the two values in the center of your sorted data.
• Compare the median and mean for the data: 2.3, 4.5, 6.4, 8.4, 3.4, 5.3, 4.7, 3.8 (see the quick R check after this list). Claim: the median is robust to outliers. In this regard, the median is more accurate in measuring the center.
• Indeed one may have skewed data due to measurement
error, which may bring in outliers. We will see some of
those datasets later on, so be careful when measuring
the center.
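A quick check in R (the eight values above, plus an added outlier to illustrate the claim):
d <- c(2.3, 4.5, 6.4, 8.4, 3.4, 5.3, 4.7, 3.8)
mean(d); median(d)                 # compare the two measures of center
mean(c(d, 80)); median(c(d, 80))   # the outlier moves the mean much more than the median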
STAT 515 – p.15
Quick Facts
Describe how the mean compares with the median for a histogram that is:
• Skewed to the left
• Skewed to the right
• Symmetric
STAT 515 – p.16
Variability of Your Data
• You may think of the range, i.e., max − min. But what if there are outliers due to measurement error? Will the range reflect the true spread?
• Statisticians tend to use the so-called sample variance. By formula it is given by
      s² = Σ_{i=1}^n (x_i − x̄)² / (n − 1).
• Alternatively, a commonly used related quantity is the sample standard deviation, which is the square root of the sample variance: s = √(s²).
• As you can imagine, if the whole population is observed, the population variance and its standard deviation would be defined in a similar way. Statisticians tend to denote them by σ² and σ. But keep in mind, these are usually not available, because the population is unmanageable. So they are parameters (or characteristics, as you may call them) that need to be estimated. Look at the EPA data.
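A quick sketch in R (assuming the EPA mileages are stored in a vector EPA):
var(EPA)         # sample variance, dividing by n - 1
sd(EPA)          # sample standard deviation
sqrt(var(EPA))   # the same as sd(EPA)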
STAT 515 – p.18
STAT 515 – p.17
Standard Deviation
• (a) Approximately 68% of the measurements will fall within one standard deviation of the mean [i.e., within the interval (x̄ − s, x̄ + s) for samples and (µ − σ, µ + σ) for populations].
• (b) Approximately 95% of the measurements will fall within two standard deviations of the mean [i.e., within the interval (x̄ − 2s, x̄ + 2s) for samples and (µ − 2σ, µ + 2σ) for populations].
• (c) Analogously, the 3-standard-deviation rule, which covers about 99.7% of the measurements.
STAT 515 – p.19
Relative Standing
• Percentile ranking: For example, suppose you scored an 80 on a test and you want to know how you compare with the other students in the class. If the instructor tells you that you scored at the 90th percentile, what does that mean?
• Definition: For any set of n measurements, the pth
percentile is a number such that p% of the
measurements fall below that number and (100-p)% fall
above it.
• It is a useful summary for a single observation if the
dataset is particularly large.
STAT 515 – p.20
• Another measure of relative standing is the z-score. The sample z-score for a measurement y in a sample is defined as
      z = (y − x̄) / s,
and the population z-score for a measurement y is
      z = (y − µ) / σ.
• So it is really measuring how many standard deviations away that particular measurement is from the mean.
Outliers
• By outliers we mean observations that are either unusually big or small compared to the other measurements in the sample.
• How do they appear? (1) The measurement is observed incorrectly, recorded incorrectly, or even entered into the computer incorrectly. (2) The measurement comes from a different population. (3) The measurement is correct but represents a rare event, like Albert Einstein's IQ score, which is so incredibly high that it hardly seems to belong to any population!
• Two useful methods for detecting outliers: boxplots and
z-scores.
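A small sketch of both detection methods in R (assuming the data are in a vector x):
z <- (x - mean(x)) / sd(x)   # sample z-scores
x[abs(z) > 3]                # candidate outliers by the 3-standard-deviation rule
boxplot(x)                   # points beyond the whiskers are also potential outliers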
STAT 515 – p.21
STAT 515 – p.22
• Method 1 (Boxplot): The lower quartile QL is the 25th percentile of a data set. The middle quartile M is the median. The upper quartile QU is the 75th percentile. A box plot is based on the interquartile range IQR = QU − QL. Now look at boxplot(EPA): do you see any potential outliers?
• Method 2 (z-scores): A more accurate way of detecting outliers is through z-scores. Observations with z-scores greater than 3 in absolute value are considered outliers. For some highly skewed data sets, observations with z-scores greater than 2 in absolute value may be outliers.
STAT 515 – p.23
Data in CD
To make a boxplot and see whether there are potential outliers, we shall use a data file called LM2_126 contained in your text CD.
• This file contains two columns, and it comes in different formats, but not in .txt.
• So one may have to copy-paste the contents into a .txt file.
• Because there are two columns, the command scan() no longer works; we may try read.table().
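A minimal sketch of reading it (assuming the two columns were saved as LM2_126.txt without a header row):
d <- read.table("LM2_126.txt")   # the two columns become d$V1 and d$V2
boxplot(d$V2)                    # boxplot of the second column (adjust to whichever column holds the variable of interest)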
STAT 515 – p.24
Exercise
Suppose a female bank employee believes that her salary is low as a result of sex discrimination. To substantiate her belief, she collects information on the salaries of her male colleagues in the banking business. She finds that their salaries have a mean of $54,000 and a standard deviation of $2,000. Her salary is $47,000. Does this information support her claim of sex discrimination?
      z = (47,000 − 54,000) / 2,000 = −3.5
STAT 515 – p.25
Graphing Bivariate Relationship
• Contrived example: x<-seq(1,3,by=.1), y<-x^2. What is the relationship? After adding some background noise, did the relationship change? For real data with two variables, you can do exactly the same plot, which is called a scatterplot in statistics. It gives you quick information about two variables.
• Add lines and colors: plot(x,y,t="l", col="aquamarine4")
Quadratic Relationship
[Figure: plot of y = x² against x for x between 1 and 3.]
STAT 515 – p.26
STAT 515 – p.27
Elementary Probability
Probability is essential for understanding statistical inference:
• Definition 1: An Experiment is a process of
observation that leads to a single outcome that cannot
be predicted with certainty.
• Take coin tossing: the outcome may be a head or a tail. The likelihood may depend on the way you toss it and the nature of the coin; in a perfect situation, the two should be equally likely. So this is an experiment, because its outcome cannot be predicted with certainty.
• Definition 2: A Sample point is the most basic
outcome of an experiment.
• Definition 3: The Sample space of an experiment is
the collection of all its sample points.
Discrete Probability Model
• Let p_i represent the probability of sample point i. Then (1) all sample point probabilities must lie between 0 and 1; (2) the probabilities of all the sample points within a sample space must sum to 1 (i.e., Σ p_i = 1).
• Example (Rolling a fair die): All the outcomes are equally likely, so each sample point has probability 1/6, since S = {1, 2, 3, 4, 5, 6}. You may ask yourself the following question: what is the probability of seeing an even number? This introduces the notion of an event, which is more complicated than one particular outcome.
STAT 515 – p.28
STAT 515 – p.29
• An event is a specific collection of sample points.
Generically, it can be denoted by A = {2, 4, 6}. The
probability of an event A is calculated by summing the
probabilities of the sample points in the sample space for
A.
• In order to calculate the probability of an event, it is
important to know the sample space by listing the
sample points, their respective probabilities. You should
also determine the collection of sample points contained
in that event.
STAT 515 – p.30
Combinatorial Analysis
• Since calculating probabilities usually involves counting, let us review some combinatorial analysis, the art of counting in mathematics. Suppose a sample of n elements is to be drawn from a set of N elements. Then the number of different samples possible is denoted by C(N, n) and is equal to
      C(N, n) = N! / (n!(N − n)!),
where n! = 1 · 2 · · · n.
• If a fair die is rolled 5 times, the probability of getting exactly 1 spot on the first and last rolls and more than 1 spot on the other three rolls is ______. What if, additionally, we require that all those rolls show different numbers of spots?
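A quick numerical check in R for the first question (the blank above is left as in the original):
(1/6)^2 * (5/6)^3   # 1 on the first and last rolls, more than 1 on the middle three; about 0.016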
STAT 515 – p.31
Unions and Intersections
•
The union of two events A and B is the event that
occurs if either A or B (or both) occurs on a single
performance of the experiment. We denote the union of
events A and B by A ∪ B . The intersection is defined
to be the event that occurs if both A and B occur on a
single performance, denoted by A ∩ B .
• Using Venn diagrams, you can see more easily.
• Problem: Consider a die-toss experiment in which the
following events are defined: A=Toss an even number.
B=toss a number less than or equal to 3.
(a) Describe A ∪ B for this experiment
(b) what about A ∩ B ?
(c) Calculate P (A ∩ B), P (A ∪ B), assuming the die is
balanced.
• The Complement of an event A is the event that A does not occur; that is, the event consisting of all sample points that are not in event A. We denote the complement by Aᶜ. So P(A ∪ Aᶜ) = 1.
• It is easy to see that P(A ∩ B) = 0 if A and B are mutually exclusive, i.e., if A ∩ B contains no sample points. Using a Venn diagram, one can show that P(A ∪ B) = P(A) + P(B) − P(A ∩ B). It follows easily that P(A) + P(Aᶜ) = 1.
STAT 515 – p.33
STAT 515 – p.32
Conditional Probability
• Given that today is a rainy day, what is the chance that it is going to rain tomorrow? This likelihood presumably should be higher than if we asked without any conditioning knowledge. Given two events A and B in a sample space S, what is the probability that A occurs given that B has already occurred? It is defined to be
      P(A|B) = P(A ∩ B) / P(B).
• When rolling a fair die, what is the likelihood of seeing a 2? It is certainly different from the probability of seeing a 2 given that we already know the outcome is an even number. Here A = {2}, B = {2, 4, 6}; P(A) = 1/6, but P(A|B) = 1/3.
Independent Events
• There are situations where two events A and B are both empirically believed, or theoretically calculated, to be independent of each other. By definition, if P(A) = P(A|B), then A and B are said to be independent; otherwise, they are dependent.
• Consequently, for independent events, the multiplicative rule holds, i.e., P(A ∩ B) = P(A)P(B). In many applications, this will simplify the calculations to a great extent.
STAT 515 – p.34
STAT 515 – p.35
Die example revisited
Problem: Consider the experiment of tossing a fair die, and let
      A = {observe an even number}
and
      B = {observe a number less than or equal to 4}.
Are A and B independent events?
Solution: P(A) = 1/2, P(B) = 2/3, P(A ∩ B) = 1/3. What is the conditional probability P(A|B)? (One may be inclined to check P(B|A) as well.)
Bayes Rule
Given k mutually exclusive and exhaustive events B₁, B₂, . . . , Bₖ such that P(B₁) + P(B₂) + · · · + P(Bₖ) = 1, and given an observed event A, it follows that
      P(Bᵢ|A) = P(Bᵢ ∩ A) / P(A) = P(Bᵢ)P(A|Bᵢ) / [P(B₁)P(A|B₁) + · · · + P(Bₖ)P(A|Bₖ)].
The preceding statement is called Bayes' Theorem.
STAT 515 – p.36
Problem Setup: An unmanned monitoring system uses
high-tech equipment and microprocessors to detect intruders.
One such system has been used outdoors at a weapons
munitions plant. The system is designed to detect intruders
with a probability of .90, however, its performance may vary
with the weather. Naturally design engineers want to test
how reliable the system is. Suppose after a long sequence of
tests, the following information has been available: Given that
the intruder was indeed detected by the system, the weather
was clear 75% of the time, cloudy 20% of the time, and
raining 5% of the time. When the system failed to detect the
intruder, 60% of the days were clear, 30% cloudy, and 10%
rainy. Use this information to find the probability of detecting
an intruder, given rainy weather. (Assume an intruder has
already been released)
STAT 515 – p.38
STAT 515 – p.37
•
So what is the experiment? More importantly, what is a
possible outcome? (detected, rainy), or (nondetected,
clear) .... etc.
• Let D denote the event that the intruder is detected,
then D is the collection of possible outcomes (detected,
*), with the second component unconstrained. Similarly, we can define Clear to be the event that includes all those outcomes whose second component is clear, and likewise the events Cloudy and Rainy.
STAT 515 – p.39
•
From the problem setup, P (D) = .90, P (Clear|D) =
.75, P (Cloudy|D) = .20, P (Rainy|D) = .05; moreover,
P (Clear|Dc ) = .60, P (Cloudy|Dc ) = .30, P (Rainy|Dc ) =
.10
• It follows from conditional probability that
P (Rainy ∩ D) = P (D)P (Rainy|D) = .9 ∗ .05 = .045.
Simultaneously,
P (Rainy ∩ Dc ) = P (Dc )P (Rainy|Dc ) = .1 ∗ .1 = .01 .
• By Bayes Rule, P(D|Rainy) should be equal to
      P(D)P(Rainy|D) / [P(D)P(Rainy|D) + P(Dᶜ)P(Rainy|Dᶜ)],
and it can be computed as
      (.90)(.05) / [(.90)(.05) + (.10)(.10)] = .818.
So the system is not that reliable, but not too bad!
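A one-line check in R:
(0.90 * 0.05) / (0.90 * 0.05 + 0.10 * 0.10)   # P(D | Rainy) ≈ 0.818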
STAT 515 – p.40
STAT 515 – p.41
Homework Question
A straight flush in poker. Consider 5-card poker hands dealt from a standard 52-card bridge deck. Two important events are
      A = {You draw a flush},
      B = {You draw a straight}.
Note: an ace may be considered as having a value of 1 or, alternatively, a value higher than a king.
1. How many different 5-card hands can be dealt from a 52-card bridge deck?
2. Find P(A).
3. Find P(B).
4. The event that both A and B occur is called a straight flush. Find P(A ∩ B).
Ans: 2,598,960; .002; .00394; .0000154.
STAT 515 – p.42
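A sketch of how these counts can be checked in R (here flushes and straights are counted inclusively of straight flushes, which matches the answers above):
choose(52, 5)                        # 2,598,960 possible hands
4 * choose(13, 5) / choose(52, 5)    # P(flush) ≈ .002
10 * 4^5 / choose(52, 5)             # P(straight) ≈ .00394
40 / choose(52, 5)                   # P(straight flush) ≈ .0000154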
Random Variables
• Definition: A random variable is a variable that
assumes numerical values associated with the random
outcomes of an experiment, where one numerical value
is assigned to each sample point.
• One easy random variable would be the so called
Bernoulli random variable, which assigns 1 to a head,
and 0 to a tail for a coin-tossing experiment.
Discrete random variables
• Another one may come from the following experiment: A
panel of 10 experts for the Wine Spectator is asked to
evaluate a new wine and give their ratings of 0,1,2, or 3.
A score is then obtained by adding those ratings
together. What is the random variable of interest here?
Can you justify the wording, random, here?
• Note: there is a common feature in the previous two examples: those random variables can only assume countably many values. These random variables certainly inherit the randomness of the corresponding experiments, because they depend on outcomes that are not certain.
STAT 515 – p.43
Probability Distributions
•
Let us toss two coins, and let X be the number of heads
observed. Can you find the probability associated with
each value of the random variable assuming that the two
coins are fair?
• The foregoing description of the random variable is called the probability mass function; it is a complete characterization of a discrete random variable. By notation, we use p(x) := P(X = x), where x is any possible value of X. So naturally, we have (i) p(x) ≥ 0, and (ii) Σₓ p(x) = 1.
STAT 515 – p.45
STAT 515 – p.44
One Example
Suppose you roll two balanced dice, and you are interested in
the summation of upper face values. Can you identify a
random variable and quantify its randomness?
STAT 515 – p.46
One Example (continued)
[Figure: probability mass function of the sum of two dice, labeled “Summation of Two Dice”, over the values 2 through 12.]
Exercise
Five men and 5 women are ranked according to their scores on an examination. Assume that no two scores are alike and all 10! possible rankings are equally likely. Let X denote the highest ranking achieved by a woman (for instance, X = 2 if the top-ranked person was male and the next-ranked person was female). Find P(X = i), for i = 1, 2, 3, . . . , 8, 9, 10.
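A short R sketch for the two-dice example (enumerating all 36 equally likely outcomes):
s <- outer(1:6, 1:6, "+")    # all 36 possible sums
pmf <- table(s) / 36         # mass function of the sum
plot(pmf, xlab="sum", ylab="probability")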
STAT 515 – p.46
STAT 515 – p.47
Exercise (solution)
P(X = 1) = (5 × 9!)/10!,   P(X = 2) = (5 × 5 × 8!)/10!,
P(X = 3) = (5 × 4 × 5 × 7!)/10!,   P(X = 4) = (5 × 4 × 3 × 5 × 6!)/10!,
P(X = 5) = (5! × 5 × 5!)/10!,   P(X = 6) = (5! × 5!)/10!.
STAT 515 – p.47
Mean and Variance
• The mean (or expected value) of a discrete random variable X is µ = E[X] = Σₓ x p(x). As you can see, the mean comes out of a summation, so it may not be a possible value of X at all; but it certainly tells roughly where X would very much like to take values.
• The variance of a random variable X is
      σ² = E[(X − µ)²] = Σₓ (x − µ)² p(x);
does that equal Σₓ x² p(x) − µ²?
• Again, the standard deviation is defined to be √(σ²).
STAT 515 – p.48
One Toy Example
• Example: Consider the mass function shown below:
      x      1    2    4    10
      p(x)  .2   .4   .2   .2
What is the mean and variance?
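A quick computation in R for this toy example:
x <- c(1, 2, 4, 10); p <- c(.2, .4, .2, .2)
mu <- sum(x * p)       # mean = 3.8
sum((x - mu)^2 * p)    # variance = 10.56, the same as sum(x^2 * p) - mu^2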
Real problem on the mean
• Suppose you work for an insurance company, and you sell a $10,000 one-year term insurance policy at an annual premium of $290. This premium is targeted at those customers (with a certain age, sex, health, etc.) for whom the probability of death in the forthcoming year is calculated as .001 based on actuarial tables. What is the expected gain in the next year for a policy of this type?
Ans = $280
STAT 515 – p.49
STAT 515 – p.50
Some Empirical Rule
•
Just like the sample case, for random variables one also
has the relationship
• P (µ − σ < X < µ + σ) ≈ .68
• P (µ − 2σ < X < µ + 2σ) ≈ .95
• P (µ − 3σ < X < µ + 3σ) ≈ 1.00
• Some R commands to revisit the summation of two dice:
x <- 2:12                                        # possible sums
y <- c((1:6)/36, 5/36, 4/36, 3/36, 2/36, 1/36)   # their probabilities
me <- sum(x*y)                                   # mean (= 7)
stdv <- sqrt(sum((x^2)*y) - me^2)                # standard deviation
low <- me - 2*stdv
up <- me + 2*stdv
c(low, up)                                       # the two-standard-deviation interval
sum(y[c(-1, -11)])                               # probability of landing inside it (≈ .94)
STAT 515 – p.51
Fitness Test Example
The Heart Association claims that only 10% of U.S. adults
over 30 years of age meet the President’s Physical Fitness
Commission’s minimum requirements. Suppose three adults
are randomly selected and each is given the fitness test.
• Find the probability that none of the adults pass the test.
• Find the probability that two of the three adults pass the
test.
• Let X denote the number of passes, what is the mean
and variance of X .
• Can you verify mean=np, variance=np(1-p)?
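A quick check in R for the fitness-test questions (X = number of passes among 3 adults, each passing with probability .10):
dbinom(0, size=3, prob=0.1)   # P(none pass) = 0.729
dbinom(2, size=3, prob=0.1)   # P(exactly two pass) = 0.027
3 * 0.1                       # mean np = 0.3
3 * 0.1 * 0.9                 # variance np(1-p) = 0.27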
Binomial Random Variables
• Consider n independent Bernoulli trials, and let us count the number of heads in those trials. Apparently, this will be a random quantity; what is its probability mass function? Denote the number of heads by X. Then, for x = 0, 1, . . . , n,
      p(x) := P(X = x) = C(n, x) (1/2)^x (1/2)^(n−x).
• It is easy to see that Σₓ p(x) = 1 because of the Binomial Theorem, given below:
      (a + b)^n = Σₖ C(n, k) a^k b^(n−k).
STAT 515 – p.52
Characteristics of a Binomial
1. The experiment consists of n identical trials.
2. There are only two possible outcomes on each trial. We
can denote by S for success, and by F for failure; or just
simply code them by 1 and 0.
3. The probability of seeing S remains the same from trial
to trial.
4. The trials are independent.
5. The binomial random variable X is the number of S ’s in
n trials.
STAT 515 – p.53
Is it binomial?
Before marketing a new product on a large scale, many companies conduct a consumer-preference survey to determine
whether the product is likely to be successful. Suppose a company develops a new diet soda and then conduct a survey in
which 100 randomly chosen consumers state their preferences
among the new soda and two other leading sellers. Let X be
the number of those people who choose the new brand over
the two others. Is X binomial?
STAT 515 – p.54
STAT 515 – p.55
Binomial cont’d
• Noting that the Binomial Theorem is true for any a, b ∈ R, we can generalize those Bernoulli trials to the biased-coin situation; in other words, we can consider the experiment of tossing an unbalanced coin such that the probability of getting a head is p ∈ (0, 1). After repeating n times, we can still count the number of heads, which yields the so-called Binomial random variable, with mass function given by
      P(X = k) = C(n, k) p^k (1 − p)^(n−k),   for k = 0, 1, . . . , n.
• Very useful, and very common in real life, particularly in survey sampling, where many questions only involve yes or no answers.
STAT 515 – p.56
Properties of Binomial
• Mean: E[X] = np.
• Variance:
      Var[X] = E[(X − np)²] = E[X²] + n²p² − 2np E[X] = E[X²] − n²p²,
where
      E[X²] = Σ_{k=0}^n k² C(n, k) p^k (1 − p)^(n−k) = p² n(n − 1) + pn.
STAT 515 – p.57
R Commands
• In R you can easily compute the mass function of a Binomial random variable. For example, you can try dbinom(3,20,0.6), which should return the value of
      C(20, 3) (0.6)³ (0.4)¹⁷,
or pbinom(6,20,0.6), for the value of
      P(X ≤ 6) = P(X = 0) + · · · + P(X = 6).
• Can you verify this from the table?
Cumulative Binomial Probabilities
• Recall the Binomial random variable:
1. an experiment consisting of n independent identical trials, say n = 20;
2. depends on a parameter p, the success probability;
3. counting the number of successes.
• Usually denoted by X ∼ Binomial(n, p).
STAT 515 – p.58
STAT 515 – p.59
Table II on page 785
• How do we describe a discrete random variable? Use the mass function p(x) := P(X = x).
• For X ∼ Binomial(20, .6), what is its mass function p(x)?
      p(x) = C(20, x) (0.6)^x (0.4)^(20−x),
where C(20, x) = 20!/(x!(20 − x)!) and x! = 1 · 2 · 3 · · · x.
• Because of the significance of Binomial distributions, their mass functions are well known and very well tabulated.
• The listed values are cumulative probabilities, P(X ≤ k) = P(X = 0) + · · · + P(X = k).
•
Remark: Knowing mass function is equivalent to
knowing cumulative probabilities.
• Suppose X ∼ Binomial(6, 0.3), by looking at the table
(P.785 in the textbook) please find out P (X = 4), what
about P (X ≤ 3) and P (X ≤ 4)?
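A quick check of those table look-ups in R:
dbinom(4, 6, 0.3)   # P(X = 4)
pbinom(3, 6, 0.3)   # P(X <= 3)
pbinom(4, 6, 0.3)   # P(X <= 4) = P(X <= 3) + P(X = 4)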
STAT 515 – p.60
STAT 515 – p.61
Assigning a passing grade
A literature professor decides to give a 20-question true-false quiz to determine who has read an assigned novel. She wants to choose the passing grade such that the probability of passing a student who guesses on every question is less than .05. What score should she set as the lowest passing grade?
Ans = 15.
STAT 515 – p.62
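One way to check the answer in R (a guesser's score is Binomial(20, 0.5)):
1 - pbinom(13, 20, 0.5)   # P(score >= 14) ≈ 0.058, still above .05
1 - pbinom(14, 20, 0.5)   # P(score >= 15) ≈ 0.021 < .05, so 15 is the lowest passing grade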
Mean of Binomial (revisited)
• What is the mean of the Binomial random variable X ∼ Binomial(n, p)?
• Recall the Bernoulli random variable Y₁; its mass function is given by
      x      0      1
      p(x)  1−p     p
• Suppose there is another Bernoulli trial with its outcome denoted by Y₂; compute the mean E[Y₂] = ?
Mean of Binomial (continued)
• If Y₁, Y₂, . . . are from independent trials, what should be the expected value E[Y₁ + Y₂] = ?
• In this way, can you easily compute the mean of X, which is Binomial(n, p)?
• If not, try Binomial(10, 0.6) first.
STAT 515 – p.63
STAT 515 – p.64
Exercise 2
•
Suppose a poll of 20 voters is taken in a large city. The
purpose is to determine X , the number who favor a
certain candidate for mayor. Suppose that 60% of all the
city’s voters favor the candidate.
a. Find the mean and standard deviation of X .
b. Use the binomial probability tables to find the
probability that X ≤ 10.
c. Use the table to find the probability that X > 12.
d. What is the likelihood of seeing 8 ≤ X ≤ 16.
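A sketch of how parts a–d can be checked in R (X ∼ Binomial(20, 0.6)):
c(20*0.6, sqrt(20*0.6*0.4))                # mean and standard deviation
pbinom(10, 20, 0.6)                        # P(X <= 10)
1 - pbinom(12, 20, 0.6)                    # P(X > 12)
pbinom(16, 20, 0.6) - pbinom(7, 20, 0.6)   # P(8 <= X <= 16)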
STAT 515 – p.65
Simulation setup
• If X ∼ Binomial(10, 0.6), then X = Y₁ + Y₂ + · · · + Y₁₀
where all those Y s are independent Bernoulli trials with
success probability .6.
• Population mean can be approximated by sample mean;
we will check that.
• Computer can help us to draw samples by using the
so-called simple random sampling algorithm.
STAT 515 – p.66
Getting Bernoulli Observations
•
Recall the soda-drink example, and suppose the company was targeting the southern states. Imagine that the whole population has been perfectly surveyed, and 60% said yes. Now if you randomly come across somebody on a street in Columbia and ask the question, how likely are you to get the answer yes? If the answer is yes, you code it by 1.
• You can draw samples like this in R easily, using rbinom(1, size=1, p=.6).
Computer Experiment
•
Based on the very nature of Binomial experiment, one
can approximate the mean of Binomial(10,0.6), say, by
setting up a small computer experiment.
• One can draw many samples, with each sample
consisting of 10 observations from independent Bernoulli
trials (with success probability .6).
• For each sample, we can count the number of successes,
then average across samples.
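A minimal sketch of that computer experiment (the 1000 replications below are an illustrative choice):
samples <- rbinom(1000, size=10, prob=0.6)   # 1000 draws of "number of successes in 10 Bernoulli(.6) trials"
mean(samples)                                # should be close to the true mean np = 6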
STAT 515 – p.67
What about continuous?
•
Continuous random variables certainly abound. For
example, the length of time between arrivals at a
hospital clinic: 0 ≤ x < ∞; or the length of time it takes
a student to complete a one-hour exam.
• Definition: Random variables that assume values
corresponding to any of the points contained in an
interval are called continuous.
STAT 515 – p.69
STAT 515 – p.68
Cumulative Distribution Function
•
Sometimes you may find it is easier to work with
P (X ≤ x), the cumulative distribution function (CDF)
of X . It is particularly true when X is continuous as we
will see in the future.
• Probability distribution is a notion to characterize a
Random variable. It can mean a variety of things, but in
this course we will be referring to CDF mostly.
STAT 515 – p.70
Uniform Distribution
• The simplest continuous random variable.
• A random quantity may assume values in an interval [c,d] equally likely, say [c,d] = [0,1].
• How do we describe the distribution of a continuous random variable? This is not too difficult; recall the histogram. On top of it, there is usually a curve. How would you interpret it?
Probability Density Function
• It is usually some kind of curve, with the area underneath equal to 1.
• We tend to denote the density function by f(x), as plotted in the picture.
• What is the probability that X = a, that a ≤ X ≤ b for 0 ≤ a ≤ b ≤ 7, and that X ≤ 7?
[Figure: a density curve f(x) over x from 0 to 7.]
STAT 515 – p.71
STAT 515 – p.72
Density Function of Uniform
• Suppose you are shooting at an interval [c, d], with an equal chance of hitting any position in the interval. What would you expect the curve to be?
• Can someone draw a picture of this density function?
• What is the probability that you hit a point in [a, b] ⊂ [c, d]?
[Figure: the uniform density curve, flat at height 1/(d − c) over [c, d] and 0 elsewhere.]
STAT 515 – p.73
Mean and standard deviation
• For the previous example, please compute the mean and the standard deviation.
• If you have learned calculus, then it is true that
      P(a < X < b) = ∫ₐᵇ f(x) dx,
and always remember, the area underneath any density curve is 1.
• Indeed, by calculus one may verify that, for X ∼ Uniform(c, d),
      E[X] = (c + d)/2,   Var[X] = (d − c)²/12.
Exercise
• An unprincipled used-car dealer sells a car to an unsuspecting buyer, even though the dealer knows that the car will have a major breakdown within the next 6 months. The dealer provides a warranty of 45 days on all cars sold. Let X represent the length of time until the breakdown occurs. Assume that X is a uniform random variable with values between 0 and 6 months.
(a). Calculate the mean and standard deviation of X.
(b). Calculate the probability that the breakdown occurs while the car is still under warranty.
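A quick sketch of the exercise in R (treating the 45-day warranty as 1.5 months, an assumption about the intended units):
c(mean = (0 + 6)/2, sd = sqrt((6 - 0)^2/12))   # mean 3 months, sd about 1.73 months
punif(1.5, min = 0, max = 6)                   # P(breakdown under warranty) = 0.25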
STAT 515 – p.74
STAT 515 – p.75
Normal Distribution
• One of the most commonly used distributions in both probability and statistics. It was first discovered by Carl F. Gauss, so "Gaussian distribution" can also be used in place of "normal".
• The probability density function is given by
      f(x) = (1/(σ√(2π))) e^{−(1/2)[(x−µ)/σ]²},
and it is perfectly bell shaped. This fact is very useful for fitting data, because many of the errors occurring in real life follow a bell-shaped distribution; for example, the error made in measuring somebody's blood pressure, or the distribution of yearly rainfall data in a certain region.
STAT 515 – p.76
How does the normal density look?
[Figure: the standard normal density curve (mean = 0, variance = 1), plotted for x between −4 and 4.]
• R commands used to draw it:
x<-seq(-5,5,by=0.01)
y<-(1/sqrt(2*pi))*exp(-(.5)*(x^2))
plot(x,y,xlab="x",ylab="f(x)",main="mean=0, variance=1",t="l")
history()   # lists the commands typed so far
• When $\mu=0, \sigma^2=1$, it is called the standard normal distribution.
STAT 515 – p.77
Some Comments
• In the normal density function, µ is the mean of the distribution and σ is the standard deviation; π = 3.1416 . . . and e = 2.71828 . . ..
[Figure: two normal density curves, one with mean 0 and standard deviation 1, the other with mean 1 and standard deviation 2.]
STAT 515 – p.78
EPA data
• Histograms of two samples of size 100 each for car mileage ratings.
[Figure: Histogram of EPA and Histogram of EPAn06.]
• Below is another plot:
[Figure: kernel density estimates, density.default(x = EPA) and density.default(x = EPAn06), each with N = 100.]
STAT 515 – p.79
Properties of Normal
•
Plotting its density in R:
x<-seq(-4,4,by=.01); y<-dnorm(x,mean=0,sd=1);
plot(x,y,t="l",col="red").
• How do you get probabilities like P (X ≤ x), when X is
normal? Use pnorm()
• The cumulative probabilities for Normal distribution are
very important, but not easily computable in any
analytic way. They are usually numerically computed,
and very well documented in tables.
STAT 515 – p.80
•
Find the probability that a standard normal random
variable exceeds 1.96 in absolute value.
Solution:
P (|Z| > 1.96) = P (Z < −1.96 or Z > 1.96).
•
For the command pnorm(), you have the choice of
specifying mean and variance for your normal
distribution, in standard notation N (µ, σ 2 ).
• Find the probability that the standard normal random
variable Z falls between -1.33 and 1.33.
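A sketch of these two calculations in R:
2 * pnorm(-1.96)             # P(|Z| > 1.96) ≈ 0.05
pnorm(1.33) - pnorm(-1.33)   # P(-1.33 < Z < 1.33) ≈ 0.8165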
STAT 515 – p.81
Transformation
• Let X ∼ N(µ, σ²), so it has density
      f(x) = (1/√(2πσ²)) exp[−(x − µ)²/(2σ²)].
• Can you tell me the distribution of (X − µ)/σ? Or, in other words, what is its density?
Normal Quantiles
• It is best to introduce this notion by doing an example.
• Example: Find the value of z (call it z₀) in the standard normal distribution that will be exceeded only 10% of the time. That is, find z₀ such that P(Z ≥ z₀) = .10.
      z₀ = 1.28
• This z₀ is the upper α-percentile (quantile). In R: qnorm(.1, lower.tail=F).
STAT 515 – p.82
STAT 515 – p.83
Normal Curve Areas
STAT 515 – p.84
Exercise
Problem: Suppose the scores x on a college entrance
examination are normally distributed with a mean of 550
and a standard deviation of 100. A certain prestigious
university will consider for admission only those
applicants whose scores exceed the 90th percentile of
the distribution. Find the minimum score an applicant
must achieve in order to receive consideration for
admission to the university.
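A quick check of this problem in R:
qnorm(0.90, mean = 550, sd = 100)   # 90th percentile ≈ 678, the minimum score for consideration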
STAT 515 – p.85
Assessing Normality
• Many future chapters will talk about statistical inference methods for normal populations. These procedures will perform well only when the population is reasonably close to normal.
•
So it is important for us to determine whether the
sample data come from a normal population, before we
apply those techniques properly.
•
A natural method you may think of would be using a
histogram, or a stem-and-leaf plot, and look at the
shape. One should be cautious though it may not be
that reliable.
Methods for Assessing Normality
• Method 1: Find the interquartile range IQR and
standard deviation s for the sample, then calculate the
ratio IQR/s. If the data (sample) come from a normal
population, then IQR/s≈1.34. Why? ... Because for a
standard normal random variable, the 25th and 75th
percentiles are -.67 and .67. So what is the theoretical
IQR/σ ?
•
Method 2: Q-Q normal plot, comparing sample quantiles
with theoretical normal quantiles.
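A small sketch of both methods in R (assuming the sample is in a vector x):
IQR(x) / sd(x)   # close to 1.34 for roughly normal data
qqnorm(x)        # Q-Q normal plot: an approximately straight line suggests normality
qqline(x)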
STAT 515 – p.86
STAT 515 – p.87
Sample Test Problem
Sample data:
  1.6  8.4  3.5  6.5  7.4  5.9  3.1  1.1  8.6  6.7
  4.3  5.0  3.2  4.5  3.3  9.4  2.1  6.3  8.4  6.4
  5.3  7.3  9.7  8.2  5.9  4.0  6.0  4.6
1. Construct a stem-and-leaf plot to assess whether the data are from an approximately normal distribution.
2. Compute s for the sample data. Ans: s = 2.353.
3. Find the values of QL and QU, then use the value of s to assess whether the data come from an approximately normal distribution. Note: 1.34 ± 0.04 = very good.
[Figure: Normal Q-Q plot of the Gas Mileage Data, sample quantiles (about 30 to 45) plotted against theoretical quantiles (−2 to 2).]
STAT 515 – p.88
STAT 515 – p.89
Point Estimation
• Recall the mayor-voters example, for which we hypothesized that the proportion of all voters who would favor the candidate is 60% in a particular city. However, in practice this proportion is usually unknown, and it needs to be estimated from sample data.
Population vs Sample
• Definition: In statistics, a parameter is a numerical measure of a population. Since it is based on the observations in the whole population, its value is usually unknown.
• Definition: A sample statistic is a numerical descriptive measure of a sample. It is calculated from the observations in the sample.
• List of population parameters and corresponding sample statistics:
      Mean:                  µ     x̄
      Variance:              σ²    s²
      Standard deviation:    σ     s
      Binomial proportion:   p     p̂
• Note: the term statistic refers to a sample quantity and the term parameter refers to a population quantity.
STAT 515 – p.90
STAT 515 – p.91
Die Tossing
• When you sit down and toss a balanced die, you know
in principle you may get 6 different values on the upper
face. If you toss it three times, and the observations
appear as 2,3,6; this may be considered as a sample.
•
But what is the relevant population here, in other words,
what is the mechanism generating your data? The
population here can be best described using a random
variable with the uniform distribution on {1, 2, 3, 4, 5, 6}.
This random variable is responsible for generating a
potentially infinite population.
• How about using the sample mean to estimate the population mean? This is particularly relevant if the die is not known to be balanced a priori, so the mean is unknown!
STAT 515 – p.92
Sampling Distribution
• Definition: The sampling distribution of a sample
statistic calculated from a sample of n measurements is
the probability distribution of the statistic.
•
How to find a sampling distribution? Answer: It is
usually difficult and sometimes impossible. Why do we
want to find it? In short: Compare estimators for a
population parameter, draw inference at some confidence
level.
STAT 515 – p.93
Estimating Population Mean
[Figure: schematic of the sample mean and sample median as estimators of the population mean µ.]
Challenging Question
• Consider a game played with a standard 52-card bridge deck in which you can score 0, 3, or 12 points on any one hand. Suppose the population of points scored per hand is described by the probability distribution shown here. A random sample of n = 3 hands is selected from the population.
      Points, x   0     3     12
      p(x)       1/2   1/4   1/4
a. Find the sampling distribution of the mean x̄.
b. Find the sampling distribution of the sample median M.
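A small sketch of how the sampling distribution of the mean can be enumerated in R:
x <- c(0, 3, 12); p <- c(1/2, 1/4, 1/4)
g <- expand.grid(x, x, x)                               # all 27 ordered samples of size 3
prob <- apply(g, 1, function(s) prod(p[match(s, x)]))   # probability of each sample
xbar <- rowMeans(g)
tapply(prob, xbar, sum)                                 # sampling distribution of xbar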
STAT 515 – p.94
STAT 515 – p.95
Another Exercise
The probability distribution shown here describes a population of measurements that can assume values of 0, 2, 4, and 6, each of which occurs with the same relative frequency:
      x      0     2     4     6
      p(x)  1/4   1/4   1/4   1/4
(a). List all the different samples of n = 2 measurements that can be selected from this population.
(b). Calculate the mean of each different sample listed in part (a).
(c). If a sample of n = 12 measurements is randomly selected from the population, what is the probability that a specific sample will be selected?
• How many possibilities for x̄?
• Is M discrete?
• Watch the typo on the bottom row.
STAT 515 – p.96
STAT 515 – p.97
What is a sampling distribution?
•
It is a distribution about a sample statistic, like the
mean x¯, and the sample variance s2 .
•
Sampling distribution usually depends on the size of a
sample.
•
It also depends on the population where samples are
drawn.
•
If population is simple, and the potential possible
samples are finite, then we can get a very good idea of
the sampling distribution.
•
In general we may appeal to central limit theorem.
Exercise
Suppose one can design a rule of counting points for a hand in a bridge game, such that for any given hand the points can only be one of the following three possibilities:
      x      0     1     4
      p(x)  1/3   1/3   1/3
a. Find the population mean and variance.
b. For the sample statistic s² with sample size n = 2, find its sampling distribution. Is it biased when estimating σ²?
STAT 515 – p.98
STAT 515 – p.99
Comparing Estimators
• The same population parameter can sometimes be
estimated in two (or more) different ways, i.e., several
estimators. What is the useful procedure to compare
them?
•
If the sampling distribution of a sample statistic has a
mean equal to the population parameter the statistic is
intended to estimate, the statistic is said to be an
unbiased estimate of the parameter.
•
If the mean of the sampling distribution is not equal to
the parameter, the statistic is said to be a biased
estimate of the parameter.
STAT 515 – p.100
What if both are unbiased?
• If both estimators are unbiased, then we will look at the
spread-out of the sampling distributions. The smaller the
standard deviation is the better.
•
The standard deviation of the sampling distribution of a
statistic is also called the standard error of the
statistic.
STAT 515 – p.101
Which one is unbiased?
• Following that bridge game example, suppose we change the rule of counting the points of a hand a little bit, so that we have the following possibilities for each hand:
      x      0     3     12
      p(x)  1/3   1/3   1/3
The sampling distributions of x̄ (n = 3) and M are:
      x̄      0     1     2     3     4     5     6     8     9     12
      p(x̄)  1/27  3/27  3/27  1/27  3/27  6/27  3/27  3/27  3/27  1/27
      M      0     3     12
      p(M)  7/27  13/27  7/27
Further Considerations
• If you still have some doubt about choosing the right estimator, let us look at the standard deviations of their sampling distributions.
• Solution: σ²_x̄ = 8.6667 vs σ²_M = 20.9136.
• As we can see, the sample mean x̄ is usually better than the sample median M in estimating the population mean.
• Summary: Ideally we want to find an estimator that is unbiased and has the smallest variance among all unbiased estimators. We call this statistic the minimum-variance unbiased estimator (MVUE).
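A sketch of how the sampling distributions above can be reproduced in R:
x <- c(0, 3, 12)
g <- expand.grid(x, x, x)       # all 27 equally likely samples of size 3
xbar <- rowMeans(g)
M <- apply(g, 1, median)
table(xbar) / 27                # sampling distribution of the mean
table(M) / 27                   # sampling distribution of the median
mean(xbar^2) - mean(xbar)^2     # variance of xbar, about 8.67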
STAT 515 – p.102
STAT 515 – p.103
Sampling distribution again
Assuming a random sample of n observations has been selected from any population, the following are true:
(1) The mean of the sampling distribution equals the mean of the sampled population. That is, µ_x̄ (:= E[x̄]) = µ.
(2) The standard deviation of the sampling distribution equals
      (standard deviation of the sampled population) / (square root of the sample size),
that is, σ_x̄ = σ/√n (also called the standard error of the mean).
STAT 515 – p.104
Sample Mean Diagram
[Figure: diagram of a population and the sampling distribution of the sample mean x̄.]
STAT 515 – p.105
Another Example
Consider the following distribution:
      x      1    2    3    8
      p(x)  .1   .4   .4   .1
a. Can someone find the population mean µ and variance σ²?
b. Consider drawing samples of size 2 from the population. Can you work out the sampling distribution of the mean x̄? Can you confirm E[x̄] = µ, and compute σ_x̄?
Central Limit Theorem
• Theorem 6.1 (in the book): If a random sample of n observations is selected from a population with a normal distribution (a normal population), the sampling distribution of x̄ will be a normal distribution.
• Theorem 6.2 (Central Limit Theorem): Consider a random sample of n observations selected from a population (any population) with mean µ and standard deviation σ. Then, when n is sufficiently large, the sampling distribution of x̄ will be approximately a normal distribution with mean µ_x̄ = µ and standard deviation σ_x̄ = σ/√n. The larger the sample size, the better will be the normal approximation to the sampling distribution of x̄. n = 30 is usually good enough.
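A small simulation sketch of the central limit behavior for this population (the 5000 replications are an illustrative choice):
x <- c(1, 2, 3, 8); p <- c(.1, .4, .4, .1)
xbar <- replicate(5000, mean(sample(x, size = 30, replace = TRUE, prob = p)))
hist(xbar)                 # roughly bell shaped
c(mean(xbar), sd(xbar))    # close to µ and σ/sqrt(30)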
STAT 515 – p.106
STAT 515 – p.107
• The left column gives four different kinds of populations, from which the samples could be drawn.
• Can you observe the patterns of those curves when the sample size increases?
• This shows the central limit behavior.
STAT 515 – p.108
Sample Test Problem
• Question: Suppose we have selected a random sample
of n = 36 observations from a population with mean
equal to 80 and standard deviation equal to 6. It is
known that the population is not extremely skewed.
a. Sketch the relative frequency distribution for the
sampling distribution of the sample mean x¯.
b. Find the probability that x¯ will be larger than 82.
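A quick check of part b in R (by the CLT, x̄ is approximately normal with mean 80 and standard error 6/√36):
1 - pnorm(82, mean = 80, sd = 6/sqrt(36))   # P(xbar > 82) ≈ 0.023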
STAT 515 – p.109
Exercise
Question: A manufacturer of automobile batteries claims that the distribution of the lengths of life of its best battery has a mean of 54 months and a standard deviation of 6 months. Suppose a consumer group decides to check the claim by purchasing a sample of 50 of the batteries and subjecting them to tests that estimate battery life.
a. Assuming that the manufacturer's claim is true, describe the sampling distribution of the mean lifetime of a sample of 50 batteries.
b. Assuming that the manufacturer's claim is true, what is the probability that the consumer group's sample has a mean life of 52 or fewer months?
One Sample Inference
• Confidence interval for the population mean.
• The idea is to give an interval such that you can claim, with a certain probability, that the true population parameter is going to be in the interval.
• In the large-sample case,
      x̄ ± 2σ_x̄ = x̄ ± 2σ/√n
has good coverage probabilities. Why?
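A quick check of part b of the battery exercise in R (using σ_x̄ = 6/√50):
pnorm(52, mean = 54, sd = 6/sqrt(50))   # P(xbar <= 52) ≈ 0.009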
STAT 515 – p.110
STAT 515 – p.111
Hospital patients
• For this dataset, we can read from the book (page 307) that x̄ = 4.53 days and s = 3.68 days. So we can construct the interval
      x̄ ± 2σ_x̄ = 4.53 ± 2σ/√100,
but we do not know σ; how can we approximate it?
      x̄ ± 2σ/√100 ≈ x̄ ± 2s/√100 = 4.53 ± 2(3.68/10) = 4.53 ± .74.
STAT 515 – p.112
Confidence Coefficient
• Definition 7.2: An interval estimator (or confidence
interval) is a formula that tells us how to use sample
data to calculate an interval that estimates a population
parameter.
•
Definition 7.3: The confidence coefficient is the
probability that an interval estimator encloses the
population parameter – that is, the relative frequency
with which the interval estimator encloses the population
parameter when the estimator is used repeatedly a very
large number of times. The confidence level is the
confidence coefficient expressed as a percentage.
STAT 515 – p.113
99% confidence interval
• How do you find it?
• Choose α such that 100(1 − α) = 99; solving gives α = 0.01.
• Then look at the normal table and find the upper α/2 percentile, z_{α/2}.
• Then use the standard formula to find that confidence interval.
100(1 − α)% C.I. for µ
• The large-sample 100(1 − α)% confidence interval for µ is
      x̄ ± z_{α/2} σ_x̄ = x̄ ± z_{α/2} σ/√n,
where z_{α/2} is the z value with an area of α/2 to its right, and σ_x̄ = σ/√n. The parameter σ is the standard deviation of the sampled population and n is the sample size.
• Remark: When σ is unknown and n is large (say, n ≥ 30), the confidence interval is approximately equal to
      x̄ ± z_{α/2} s/√n,
where s is the sample standard deviation.
STAT 515 – p.114
STAT 515 – p.115
Example
• Problem: Unoccupied seats on flights cause airlines to lose revenue. Suppose a large airline wants to estimate its average number of unoccupied seats per flight over the past year. To accomplish this, the records of 225 flights are randomly selected, and the number of unoccupied seats is noted for each of the sampled flights. Descriptive statistics for the data are displayed below.
      Variable    N     Mean      StDev    SE Mean
      NOSHOWS    225   11.5956   4.1026    0.2735
Can you construct a 90% confidence interval for µ, the population mean?
Exercise
• A random sample of 90 observations produced a mean x̄ = 25.9 and a standard deviation s = 2.7.
a. Find a 90% confidence interval for µ.
b. Find a 99% confidence interval for µ.
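A sketch of the airline interval in R:
11.5956 + c(-1, 1) * qnorm(0.95) * 4.1026/sqrt(225)   # 90% CI, roughly (11.15, 12.05)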
STAT 515 – p.116
STAT 515 – p.117
Confidence Interval Interpretation
• When we form a 100(1 − α)% confidence interval for µ, we usually express our confidence in the interval with a statement such as "We can be 100(1 − α)% confident that µ lies between the lower and upper bounds of the confidence interval."
• The statement reflects our confidence in the estimation procedure, rather than in the particular interval that is calculated from the sample data.
• We know that repeated applications of the same procedure will result in different lower and upper bounds for the interval. Furthermore, we know that 100(1 − α)% of the resulting intervals will contain µ.
STAT 515 – p.118
Sampled Intervals
[Figure: two panels of 10 sampled confidence intervals each, plotted around the true mean µ. Confidence intervals are meant to be at certain confidence levels.]
STAT 515 – p.119
Narrow the width of a C.I.
• Decrease the confidence level.
• Following up the previous example, a 90% confidence interval for µ is
      x̄ ± 1.645(σ/√n) ≈ 4.53 ± (1.645)(3.68)/√100 = 4.53 ± .61.
• This interval, (3.92, 5.14), is narrower than the previously calculated 95% confidence interval, (3.79, 5.27).
• Remark: although the interval is narrower, we simultaneously have "less confidence" that the narrower interval covers the true population parameter.
• The other way of decreasing the width of an interval without sacrificing "confidence" is to increase the sample size n.
STAT 515 – p.120
Small-sample confidence interval
• There are many actual needs to address small samples. For example, federal legislation requires pharmaceutical companies to perform extensive tests on new drugs before they can be marketed. After testing on animals, and if the drug seems safe, the company can try it out on humans. However, it is unlikely that you will have a large sample, due to ethical standards.
• Suppose a pharmaceutical company must estimate the average increase in blood pressure of patients who take a certain new drug. Assume only 6 patients can be used in the initial phase of human testing. How do you construct a confidence interval for that average increase in blood pressure?
STAT 515 – p.121
Two Remarks
• Remark 1: The sampling distribution of x̄ may not be normal if the population is far from being normal (say, very skewed). But remember, x̄ has an (approximately) normal distribution if the population is (approximately) normal, no matter how small the sample size is.
• Remark 2: In the formula σ_x̄ = σ/√n, the approximation of σ by s would be very poor if the sample size is small.
t-statistic
• The t-statistic is defined in the following way:
      t = (x̄ − µ) / (s/√n).
• It is also called Student's t-statistic, because William Gosset (1876-1937) first found its distribution in a paper published under his pen name, Student.
• Degrees of freedom of the t-statistic: Apparently, the amount of variability in the sampling distribution depends on the sample size. To capture this relationship, we call n − 1 the degrees of freedom.
STAT 515 – p.122
Variability of the t-statistic
• The bigger the number of degrees of freedom associated with the t-statistic, the less variable will be its sampling distribution; see the table on page 318.
• Compared to the standard normal statistic
      z = (x̄ − µ)/σ_x̄ = (x̄ − µ)/(σ/√n),
the t-statistic is more variable. For example, when n = 5, which means the degrees of freedom is 4, the standard normal z-score is z.025 = 1.96, but t.025 = 2.776.
• When the sample size is about 30, there is no real difference between the distributions of the t-statistic and the standard normal statistic.
STAT 515 – p.124
t-distribution
[Figure: the standard normal density and the t density with df = 4, plotted for x between −3 and 3.]
• Critical values
• One-parameter family
• Courtesy of the textbook, p. 796
STAT 515 – p.125
Drug Testing Example
• Remember n − 1 = 5, which is way less than 30, so we use the t-statistic to construct confidence intervals. Reading from the table, t.025 = 2.571.
• From the text (p. 319), the actual data give us x̄ = 2.283 and s = .950, so
      2.283 ± (2.571)(.950/√6) = 2.283 ± .997
gives us the 95% confidence interval (1.286, 3.280).
Summary of the procedure
• The small-sample confidence interval for µ is
      x̄ ± t_{α/2} s/√n,
where t_{α/2} is based on (n − 1) degrees of freedom.
• It is required that the population have a relative frequency distribution that is approximately normal. (It has been empirically found that the t-distribution is not very sensitive to departures from normality in the population.)
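A quick check of the drug-testing critical value and interval in R:
qt(0.975, df = 5)                                 # 2.571
2.283 + c(-1, 1) * qt(0.975, 5) * 0.950/sqrt(6)   # (1.286, 3.280)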
STAT 515 – p.126
STAT 515 – p.127
Example
•
Some quality control experiments require destructive
sampling in order to measure a particular characteristic
of the product. The cost is usually high, so only small
samples are available. Suppose a manufacturer of
printers for personal computers wishes to estimate the
mean number of characters printed before the printhead
fails. The manufacturer tests n = 15 printheads and
records the number of characters printed until failure for
each. The actual data and its summary are located on
p.320.
(1) Form a 99% confidence interval for the mean number
of characters printed before the printhead fails.
(2) What assumption is required for the interval you
found in part (1) to be valid? is that assumption
reasonably satisfied?
STAT 515 – p.128
Exercise
• The following sample of 16 measurements was selected
from a population that is approximately normally
distributed:
91 80 99 110 95 106 78 121
106 100 97 82 100 83 115 104
(a). Construct an 80% confidence interval for the
population mean.
(b). Construct a 95% confidence interval for the
population mean, and compare the width of this
interval with that of part (a).
(c). Carefully interpret each of the confidence intervals,
and explain why the 80% confidence interval is
narrower.
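A sketch of parts (a) and (b) in R:
y <- c(91, 80, 99, 110, 95, 106, 78, 121, 106, 100, 97, 82, 100, 83, 115, 104)
t.test(y, conf.level = 0.80)$conf.int   # 80% confidence interval
t.test(y, conf.level = 0.95)$conf.int   # 95% confidence interval, which is wider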
STAT 515 – p.129
Population Proportion
• Question: Public-opinion polls are conducted regularly to estimate the fraction of U.S. citizens who trust the president. Suppose 1,000 people are randomly chosen and 637 answer that they trust the president. How would you estimate the true fraction of all U.S. citizens who trust the president?
      p̂ = 637/1,000 = .637
This is a point estimate (can we say estimator?); what about an interval estimator?
Confidence Interval for p
1. The mean of the sampling distribution of p̂ is p; that is, p̂ is an unbiased estimator of p.
2. The standard deviation of the sampling distribution of p̂ is √(pq/n), where q = 1 − p.
3. What about its distribution? In the large-sample scenario, it is approximately normal.
4. The large-sample confidence interval for p is
      p̂ ± z_{α/2} σ_p̂ = p̂ ± z_{α/2} √(pq/n) ≈ p̂ ± z_{α/2} √(p̂q̂/n),
where p̂ = x/n and q̂ = 1 − p̂. Note: x counts the number of successes, like those people who trust the president.
STAT 515 – p.130
STAT 515 – p.131
Some Conditions for C.I.
1. A random sample is selected from the target population.
2. The sample size n is large, say bigger than 30. (In
particular, both np̂ ≥ 15 and nq̂ ≥ 15.)
Getting back to the president example, we can construct the
95% confidence interval for the proportion of all U.S. citizens
who trust the president:
p̂ ± zα/2 σp̂ = .637 ± 1.96 √(pq/1,000),
where p can be estimated by p̂, and q ≈ q̂ = 1 − p̂.
STAT 515 – p.132
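For the president example, a minimal R sketch of the large-sample interval (using q̂ = 1 − p̂ in place of the unknown pq; this is a sketch, not a textbook printout):
phat <- 637 / 1000; n <- 1000
se <- sqrt(phat * (1 - phat) / n)            # estimated standard error of phat
phat + c(-1, 1) * qnorm(0.975) * se          # approximate 95% CI for p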
Interesting Example
•
Problem: Many public polling agencies conduct surveys
to determine the current consumer sentiment concerning
the state of the economy. For example, the Bureau of
Economic and Business Research at the University of
Florida conducts quarterly surveys to gauge consumer
sentiment in the sunshine state. Suppose that BEBR
randomly samples 484 consumers and finds that 257 are
optimistic about the state of the economy. Use a 90%
confidence interval to estimate the proportion of all
consumers in Florida who are optimistic about the state
of economy. Based on the confidence interval, can
BEBR infer that the majority of Florida consumers are
optimistic about the economy?
STAT 515 – p.133
Wilson's adjustment for p
•
Sometimes the true population parameter is near 0 or 1;
for example, suppose one wants to estimate the
proportion of people who die from a bee sting. This
proportion may very likely be near 0 (say, p ≈ .0013).
Can you estimate the parameter based on a sample of
size 50, or even 200? The answer is: NO.
• Wilson's adjustment: An adjusted (1 − α)100%
confidence interval for p is
p̃ ± zα/2 √( p̃(1 − p̃)/(n + 4) ),
where p̃ = (x + 2)/(n + 4) is the adjusted sample proportion, and
x = # of successes.
STAT 515 – p.134
Quick Application
•
Application of Wilson's adjustment
• Problem: According to an article, the probability of
being the victim of a violent crime is less than .01.
Suppose that, in a random sample of 200 Americans, 3
were victims of a violent crime. Use a 95% confidence
interval to estimate the true proportion of Americans
who were victims of a violent crime.
• How does Wilson's adjusted confidence interval compare
with (−0.002, 0.032)?
STAT 515 – p.135
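A sketch of the Wilson-adjusted interval for the violent-crime problem (x = 3 successes out of n = 200, 95% confidence), following the formula on the slide above; the unadjusted interval (−0.002, 0.032) is the comparison target.
x <- 3; n <- 200
ptilde <- (x + 2) / (n + 4)                    # adjusted sample proportion
se <- sqrt(ptilde * (1 - ptilde) / (n + 4))
ptilde + c(-1, 1) * qnorm(0.975) * se          # compare with (-0.002, 0.032)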
Determining the sample size
•
Recall that one alternative way of reducing the width of
a confidence interval, while maintaining the confidence
level, is to increase the sample size.
• This is a real issue faced by an experiment designer, who
must decide how big a sample size he or she needs.
• For example, to estimate a population mean by a
potential large sample, one may want a 95% confidence
interval with a certain narrow width to satisfy some
agency requirements. In this situation a sampling
scheme has to be worked out.
Sampling Error
•
Let us introduce the notion of sampling error, which
should be distinguished from the standard error of the
sampling distribution.
• The sampling error is defined to be the half-width of a
100(1 − α)% confidence interval. In formula, it is given by
zα/2 (σ/√n) = SE;
solving this gives us
n = (zα/2)² σ² / (SE)².
If the n above is not an integer, you should round it up to
make the sample size sufficient.
STAT 515 – p.136
STAT 515 – p.137
Value of σ
•
To calculate the sample size n, we need to know σ. In
practice, one may estimate it using the currently available
sample as data collection goes.
•
Or, conservatively, we may use the approximate
relationship σ ≈ R/4, where R is the range. You may still
remember the 2σ or 3σ rule.
•
If you use the R/4 approximation, by all means you
should conservatively make your sample size a little
bigger than the calculated value.
Sample Size Calculation
•
Suppose the manufacturer of official NFL footballs uses
a machine to inflate the new balls to a pressure of 13.5
pounds. When the machine is properly calibrated, the
mean inflation pressure is 13.5 pounds, but
uncontrollable factors cause the pressures of individual
footballs to vary randomly from about 13.3 to 13.7
pounds. For quality control purposes, the manufacturer
wishes to estimate the mean inflation pressure to within
.025 pound of its true value with a 99% confidence
interval. What sample size should be specified for the
experiment?
•
Solution: Note z.005 = 2.575. If the calculation gives n = 107
after rounding up, we may very well require a sample
size of n = 110 to be more certain of attaining the 99%
confidence level.
STAT 515 – p.138
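A minimal R sketch of the football sample-size calculation, under the stated assumption that σ is approximated conservatively by R/4 = (13.7 − 13.3)/4:
sigma <- (13.7 - 13.3) / 4       # conservative guess for sigma via R/4
SE <- 0.025                      # desired half-width of the interval
z <- qnorm(1 - 0.01 / 2)         # z_.005, about 2.575, for 99% confidence
ceiling((z * sigma / SE)^2)      # about 107; the slide pads this to 110 to be safe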
Test of a Hypothesis
STAT 515 – p.139
Null hypothesis and alternative
•
In estimation, we have seen how to make inference
about one single parameter, where the goal was to get
an estimate as exact as possible. There are other
situations, in which we may only want to know some
qualitative relationship. For example, one may want to
know whether the mean of a driver’s blood alcohol
exceeds the legal limit after two drinks.
• One characteristic of this inference is that we are making
inference about how the value of a parameter relates to
a specific numerical value. Is it less than, equal to, or
greater than the specified number? This type of
inference is called a test of hypothesis.
STAT 515 – p.140
•
Pipe manufacturer example: Suppose a certain city
requires residential sewer pipe to have a breaking strength
of more than 2,400 pounds per foot of length. Each manufacturer that
wants to sell pipe will have to pass the inspection. So
there is a testing problem here, and we are still interested
in the population mean µ; but we are less interested in
estimating the value of µ than we are in testing a
hypothesis about its value. What is the hypothesis?
Whether the mean breaking strength of the pipe exceeds
2,400 pounds per linear foot.
STAT 515 – p.141
How to Set Up?
•
Null hypothesis (H0): µ ≤ 2,400
•
Alternative hypothesis (Ha): µ > 2,400.
• Suppose you have a data set from one manufacturer;
how can you decide whether the company will
meet the requirement? In other words, how can you test
whether the mean pipe breaking strength for this
company is bigger than 2,400? You may say: compute
the sample mean x̄ for this data set, and if it is bigger than
2,400, reject the null hypothesis. Would this be fair
to the company? You want to show convincing evidence.
Taking care of sampling variability
•
Generally speaking, we want to find a procedure that takes
care of the sampling variability.
• The rationale is this: suppose the null is true; you look
at your data and ask, does the data support that belief?
• In practice, we compute a test statistic (a sample
statistic used for testing); under the null, the test
statistic has a known sampling distribution, and the
sampling distribution tells us the sampling variability.
If the test statistic evaluated on your particular dataset
is too far away from the center of that sampling
distribution, you should reject the null
hypothesis; in this case we say the test is significant.
STAT 515 – p.142
Pipe example
•
Observation: For the null hypothesis µ ≤ 2,400, if we
can reject the hypothesis µ = 2,400 in favor of
µ > 2,400, then µ ≤ 2,400 is automatically rejected. So
we may look at the test statistic
z = (x̄ − 2,400)/σx̄ = (x̄ − 2,400)/(σ/√n).
•
How large must z be before the city can be convinced
that the null hypothesis can be rejected in favor of the
alternative hypothesis and conclude that the pipe meets
the requirement?
STAT 515 – p.143
How to find convincing evidence?
•
Remember we are testing H0: µ = 2,400 against
Ha: µ > 2,400. If H0 is true, then z has a standard
normal distribution when the sample size n is reasonably big. If
z = (x̄ − 2,400)/(σ/√n) ≈ (x̄ − 2,400)/(s/√n)
is bigger than 1.645, this is indeed convincing evidence that
H0 should be rejected in favor of Ha.
• Of course, there is some probability of making a mistake
if we do this.
STAT 515 – p.144
STAT 515 – p.145
Type I decision error
•
Following the previous procedure, we may make a
mistake by falsely rejecting the null hypothesis. But
that probability is small: if the null hypothesis is indeed
true, and we reject it because z > 1.645, what is the
probability that we are making a mistake? This is called a
Type I decision error. Denote
α = P(Type I error)
  = P(rejecting the null when it is indeed true).
So in our example,
α = P(z > 1.645 when in fact µ = 2,400) = .05.
Example
•
Suppose we test 50 sections of sewer pipe and find the
mean and standard deviation for these 50 measurements
to be, respectively,
x̄ = 2,460 pounds per linear foot
and
s = 200 pounds per linear foot.
What is the z value? Can we reject the null hypothesis
µ = 2,400 in favor of µ > 2,400? How about
H0: µ ≤ 2,400? What's the Type I error probability if we
compare the z-value with 1.645?
STAT 515 – p.146
STAT 515 – p.147
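A hedged R sketch for the example above, computing the z-value from the quoted summary statistics and its one-sided p-value (base R only):
xbar <- 2460; s <- 200; n <- 50
z <- (xbar - 2400) / (s / sqrt(n))   # about 2.12
z > qnorm(0.95)                      # TRUE: z falls in the rejection region z > 1.645
1 - pnorm(z)                         # one-sided p-value, about .017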
Summary
Challenging a claim
•
Null hypothesis: usually something you doubt, but which is
of interest.
• Alternative: what you are really interested in.
• Test statistic: for example,
z = (x̄ − 2,400)/σx̄.
•
Rejection region: for example, z > 1.645 is one rejection
region if we allow the Type I error probability to be α = .05.
STAT 515 – p.148
•
Exercise: A University of Florida economist conducted a
study of Virginia elementary school lunch menus. During
the state-mandated testing period, school lunches
averaged 863 calories (findings available in National
Bureau of Economic Research, Nov. 2002). The
economist claimed that after the testing period ended, the
average caloric content of Virginia school lunches
dropped significantly. Set up the null and alternative
hypothesis to test the economist’s claim.
STAT 515 – p.149
Test of a hypothesis
•
A sewer pipe manufacturer claims that, on average, their
pipes have breaking strength beyond 2,400 pounds.
Suppose you have a dataset consisting of 50
measurements, namely the breaking strength measured
on 50 sections of sewer pipe the company produced.
Can you test the claim in a 'convincing' way?
•
Null hypothesis: H0: µ ≤ 2,400, where µ is the mean
breaking strength of the sewer pipe the company can
produce.
•
Alternative hypothesis: Ha: µ > 2,400.
•
Does the data show convincing evidence to reject H0 in
favor of Ha?
•
From the actual data y, ȳ = 2,460 pounds per linear
foot, and s = 200 pounds per linear foot.
Level of a test
•
Also known as the Type I error probability.
• For the z-value of that particular dataset, suppose we
compare it with the value 1.645, and decide to reject
H0 whenever the sample z-value exceeds 1.645. This
is called a level-α test, and here α is .05.
Another interpretation of this value is the Type I error
probability. Why? Remember we are using the rejection
region (1.645, +∞); if the null is indeed true, we still
have probability α of seeing z-values in that region.
Since we have decided that is the rejection region,
can you figure out the implication?
• The description above gives us the Type I error probability.
STAT 515 – p.150
Type II error probability
•
Suppose we got another sample (dataset), a, with size
n = 50, and it gives us ā = 2,430 and s = 200. For this
dataset, we have the test statistic
z = (2430 − 2400) / (200/√50) = 30/28.28 = 1.06.
If we continue to test the original claim at the .05 level,
the result is not significant for this dataset.
• Note that our estimate ā here does exceed 2,400 quite a bit,
but somehow we fail to reject H0: µ ≤ 2,400 at the
significance level .05.
STAT 515 – p.151
STAT 515 – p.152
Type II error probability (cont’d)
•
What will happen if we accept the null hypothesis? We
may again be making a mistake, with a certain
probability β, the Type II error probability. This probability is
usually very difficult to compute, so statisticians tend
to avoid the issue by claiming 'the sample evidence is
insufficient to reject H0.'
• Rule of thumb: Since β, the Type II error probability, is
very difficult to compute, we will generally avoid the
decision to accept H0, preferring instead to say the
sample evidence is insufficient to reject H0 when the
sample test statistic is not in the rejection region.
• There may indeed exist situations where you can
compute β. In a formal statistical report, it is acceptable
to say you want to accept H0, and then give the α and β
values.
STAT 515 – p.153
Summary–PP. 355-356, Chapter 8
•
Null hypothesis: usually something you doubt, but which is
of interest.
•
Alternative: what you are really interested in.
•
Test statistic: for example,
z = (x̄ − 2,400)/σx̄.
•
Rejection region: for example, z > 1.645 is one rejection
region if we allow the Type I error probability to be α = .05.
•
Type II error probability.
Challenging a claim (cont’d)
•
Exercise: A University of Florida economist conducted a
study of Virginia elementary school lunch menus. During
the state-mandated testing period, school lunches
averaged 863 calories (findings available in National
Bureau of Economic Research, Nov. 2002). The
economist claimed that after the testing period ended, the
average caloric content of Virginia school lunches
dropped significantly. Set up the null and alternative
hypothesis to test the economist’s claim. What is the
test statistic you may want to look at?
STAT 515 – p.154
Further example
•
STAT 515 – p.155
Two-sided test
According to a researcher at the Univ. of Florida wildlife
ecology and conservation center, the average level of
mercury uptake in wading birds in the Everglades has
declined over the past several years (UF News,
December 15, 2000). Five years ago, the average level
was 15 parts per million.
a. Give the null and alternative hypothesis for testing
whether the average level today is less than 15 ppm.
b. Describe a Type I error for this test
c. Describe a Type II error for this test.
d. Describe the rejection region for a .01-level test.
e. For a sample y of 100 measurements, suppose
y¯ = 14.6 and s = 2, can you reject this 0.01-level test?
What about y¯ = 12, s = 2 ?
STAT 515 – p.156
• H0: µ ≤ 2,400 against Ha: µ > 2,400 is called a
one-sided test, because the alternative hypothesis only
has one direction.
• a. One-tailed (upper-tailed)
b. One-tailed (lower-tailed)
c. Two-tailed.
• Remark: The tail is in the direction of the alternative
hypothesis.
• The real difference from a one-sided test is the rejection
region.
STAT 515 – p.157
Problem–P.360 Chapter 8
•
Problem: (The effect of drugs and alcohol on the
nervous system ) Suppose a research neurologist is
testing the effect of a drug on response time by injecting
100 rats with a unit dose of the drug, subjecting each
rat to a neurological stimulus, and recording its response
time. The neurologist knows that the mean response
time for rats not injected with drug is 1.2 seconds. She
wishes to test whether the mean response time for
drug-injected rats differs from 1.2 seconds. Set up the
test of hypothesis for this experiment, using α = .01.
• Can somebody guess what the rejection region should
be? one-sided or two-sided?
Some comments and p-value
•
Rejection region: For a two-sided test with level α, the
critical values (percentiles) are found using α/2, instead
of α. This is very important to keep in mind.
• What is a p-value? Definition from the book: The
p-value, also called the observed significance level, for a
specific statistical test is the probability (assuming that
H0 is true) of observing a value of the test statistic that
is at least as contradictory to the null hypothesis, and
supportive of the alternative hypothesis, as the actual
one computed from the sample data.
• For example, in the testing of sewer pipes we computed
ẑ = 2.12 from one particular sample. So the observed
significance level (p-value) for this test is
p-value = P(z ≥ 2.12) = 0.5 − 0.4830 = 0.0170.
STAT 515 – p.159
STAT 515 – p.158
p-value for two-sided test
•
Step 1: Determine the value of the test statistic (say, z)
corresponding to the result of the sampling experiment.
• Step 2: If the test is two-sided, the p-value is equal to
twice the tail area beyond the observed z-value in the
direction of the sign of z .
• Reporting a p-value: once the test level is chosen, if the
observed significance level (p-value) is less than the
chosen value of α, then reject the null hypothesis.
Otherwise, do not reject the null hypothesis.
STAT 515 – p.160
Problem – Chapter 8, P. 368
•
The lengths of stay (in days) for 100 randomly selected
hospital patients are observed and recorded, as shown in
the book. Suppose we want to test the hypothesis that
the true mean length of stay (LOS) at the hospital is
less than 5 days; that is
H0 : µ = 5 versus Ha : µ < 5.
Assuming that σ = 3.68, use the data in the table to
conduct the test at α = .05. Can you reject the null
hypothesis?
STAT 515 – p.161
Exercise
•
When sample size is small
Consider a test of H0 : µ = 75 performed with the
computer. SPSS reports a two-sided p-value of .1032.
Make the appropriate conclusion for each of the
following situations:
a. Ha : µ < 75, z = −1.63, α = 0.05
b. Ha : µ < 75, z = 1.63, α = .10
c. Ha : µ > 75, z = 1.63, α = .10
d. Ha : µ ≠ 75, z = −1.63, α = .01
• When n is small, the z-statistic (with s substituted for σ)
is no longer approximately normally distributed.
• Remember the confidence interval situation: we can work
out the sampling distribution of the statistic
t = (x̄ − µ0) / (s/√n),
but watch the degrees of freedom.
STAT 515 – p.162
STAT 515 – p.163
Small-Sample test of µ
•
One-sided test
1. H0 : µ = µ0
2. Ha : µ < µ0 (or Ha : µ > µ0)
3. Test statistic: t = (x̄ − µ0)/(s/√n)
4. Rejection region: t < −tα (or t > tα when Ha : µ > µ0)
•
Two-tailed test
1. H0 : µ = µ0
2. Ha : µ ≠ µ0
3. Test statistic: t = (x̄ − µ0)/(s/√n)
4. Rejection region: t < −tα/2 or t > tα/2
One Example
•
In the text there is a water-quality monitoring
experiment, where the goal is to watch whether the pH
value measured in the drinking water falls below
7.0, which is considered dangerous to human health.
One water-treatment plant has a target pH of 8.5, but
only collected 17 water samples; can you test the claim?
Note: for these 17 water samples, it is known that
x̄ = 8.42 and s = .16.
• Solution: on pages 372-373 of the textbook.
STAT 515 – p.164
STAT 515 – p.165
One more exercise
•
A car manufacturer wants to test a new engine to
determine whether it meets new air pollution standards.
The mean emission µ of all engines of this type must be
less than 20 parts per million of carbon. Ten engines are
manufactured for testing purposes, and the emission
level of each is determined. The data (in parts per
million) are listed below.
15.6 16.2 22.5 20.5 16.4
19.4 19.6 17.9 12.7 14.9
Do the data supply sufficient evidence to allow the
manufacturer to conclude that this type of engine meets
the pollution standard? Assume that the manufacturer
is willing to risk a Type I error with probability α = .01.
STAT 515 – p.166
C.I. in action
•
A random sample of 90 observations produced a mean
x̄ = 25.9 and a standard deviation s = 2.7. Find a 95%
confidence interval for the population mean µ. Suppose
another sample of 15 observations is obtained, and it has
a mean x̄ = 23 and a standard deviation s = 3.5; can
you construct a 97% confidence interval for µ?
STAT 515 – p.167
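A sketch of how the engine exercise could be checked in R, assuming the intended test is the one-sided small-sample t test of H0: µ = 20 against Ha: µ < 20 at α = .01 (t.test is base R):
emission <- c(15.6, 16.2, 22.5, 20.5, 16.4,
              19.4, 19.6, 17.9, 12.7, 14.9)
t.test(emission, mu = 20, alternative = "less",
       conf.level = 0.99)           # compare the reported p-value with alpha = .01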
Determining sample size again
•
How to determine sample size for estimating proportion?
• A gigantic warehouse located in Atlanta, GA, stores
approximately 60 million empty aluminum beer and soda
cans. Recently, a fire occurred at the warehouse. The
smoke from the fire contaminated many of the cans with
blackspot, rendering them unusable. A University of
South Florida statistician was hired by the insurance
company to estimate p, the true proportion of cans in
the warehouse that were contaminated by the fire. How
many aluminum cans should be randomly sampled to
estimate the true proportion to within .02 with 90%
confidence.
STAT 515 – p.168
Some Ideas
•
Remember the formula for the confidence interval:
p̂ ± zα/2 √(pq/n).
In this problem we do not have a p̂ available, so, being
conservative, we can take pq = 1/4, because this is
the biggest possible value of the product p·q
(since √(pq) ≤ (p + q)/2 = 1/2).
•
The allowable sampling error is SE = .02, so we should pick
n to solve the equation zα/2/(2√n) = .02, where
α = 0.1, so zα/2 = z.05 is the upper .05 critical value of the
standard normal. Make sure to round n up.
STAT 515 – p.169
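A minimal sketch of the resulting calculation in R, under the conservative choice pq = 1/4 and 90% confidence described above:
z <- qnorm(1 - 0.10 / 2)     # z_.05, about 1.645
SE <- 0.02
ceiling((z / (2 * SE))^2)    # required n, roughly 1,691 cans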
Hypothesis Testing
Solution
•
BusinessWeek.com provides consumers with retail prices
of new cars at dealers from across the country. The July
2006 prices for the hybrid Toyota Prius were obtained
from a sample of 160 dealers. These 160 prices are
saved in the HYBRIDCARS file. Preliminary analysis
showed that the mean of these 160 prices is $ 25476.7.
Suppose one is interested in knowing whether the mean
July 2006 dealer price of the Toyota Prius differs from $
25,000, and after a two-sided analysis in MINITAB, the
p-value is reported to be .014. Can you determine the
sample standard deviation for this sample? If you are
willing to risk a Type I error probability .05, can you
reject the null hypothesis?
• Yes, we can reject H0 because p-value = .014 ≤ .05.
•
Hypothesis setup: H0 : µ = 25,000 vs Ha : µ ≠ 25,000.
•
Use the z-statistic: z = (x̄ − 25,000)/σx̄; for this dataset, after a
two-sided test, the reported p-value is .014.
• So under H0,
P(z > observed z-value) = 0.014/2 = 0.007.
• We need to find z0 such that
P(0 < z < z0) = 0.5 − 0.007 = 0.493; from the standard
normal curve table, z0 = 2.45.
• Then we have the relationship
observed z-value = (25476.7 − 25,000)/(s/√n) = z0 = 2.45;
plugging in n = 160 gives s = 2461.156.
STAT 515 – p.170
STAT 515 – p.171
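The back-calculation on the Solution slide can be reproduced with a short R sketch (a sketch only; the exact value of s depends on how much the normal table is rounded):
n <- 160; xbar <- 25476.7
z0 <- qnorm(1 - 0.014 / 2)             # about 2.457 (the table rounds to 2.45)
(xbar - 25000) * sqrt(n) / z0          # roughly 2454; the slide's 2461 uses z0 = 2.45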
Simple Linear Regression
•
Deterministic model vs Probabilistic model
• The simple regression model (a term first coined by Francis
Galton) is about the relationship of two variables in
the population; say, Height vs IQ index.
• Here is a more interesting example: it is known that the
response time to a certain stimulus is related to the
percentage of a certain drug in the bloodstream.
• One may believe this relationship is deterministic, but
after some careful thinking, you may change your mind.
Given a certain percentage of the drug, there may still
be some variability in the response time, either due to
some other hidden causes, or simply to individual
differences.
STAT 515 – p.172
The Typical Model
•
So we shall mostly be interested in the mean of a certain
variable, given the other variable. In statistics, you will
often hear the terms response variable and predictors.
•
In symbols, for instance,
y = 1.5x + random error.
•
A first-order probabilistic model:
y = β0 + β1 x + ε,
where y = dependent or response variable, x =
independent or predictor variable.
STAT 515 – p.173
Hello, Mr. Euclid
•
In the previous slide, ε is the random error component,
which has mean 0; β0 is the y-intercept of the line, and β1 is
the slope.
• It turns out that this is a very useful model, and the goal
is to estimate β0 and β1 as accurately as possible using
observations on x and y.
• Please notice that E[y] = β0 + β1 x, and this
relationship is deterministic.
• A major focus will be figuring out ways of
estimating β0 and β1; regression serves as a good ground
for both parameter estimation and testing.
•
Below is something you need to know, and I am willing
to bet you already knew it.
•
Suppose a line
y = β0 + β1 x
passes through the points (−2, 3) and (4, 6); what should
β0 and β1 be?
STAT 515 – p.174
STAT 515 – p.175
The Data
Subject   Percent x of Drug   Reaction time y (seconds)
1         1                   1
2         2                   1
3         3                   2
4         4                   2
5         5                   4
STAT 515 – p.176
Scatter plot
Is there any linear relationship between x and y?
[Figure: scatter plot of Reaction time (seconds) versus Percent of Drug]
STAT 515 – p.177
Fitted with visual line
•
Errors of prediction = distance between the fitted value
and the actual observation.
•
It is easy to see: the sum of errors = 0, but the sum of squared
errors (SSE) = 2.
•
By playing with this, you can see there are many lines
which may give you a sum of errors of 0, but there is only
one line which will give you the minimum SSE.
[Figure: scatter plot of Reaction time versus Percent of Drug with a visually fitted line]
STAT 515 – p.178
STAT 515 – p.179
Least Squares Estimate
•
To find the linear relationship of two variables, suppose
we observe a sample of n data points, say,
(x1, y1), (x2, y2), . . . , (xn, yn). We want to find the
least squares line; in other words, to find the β0, β1
minimizing
SSE = Σ_{i=1}^{n} [yi − (β0 + β1 xi)]².
The minimizers, denoted by β̂0 and β̂1, are called the
least squares estimates of the population parameters β0
and β1.
STAT 515 – p.180
Some Formulas
•
Solving this optimization problem, we get the slope:
β̂1 = SSxy / SSxx,
where
SSxy = Σ(xi − x̄)(yi − ȳ) = Σ xi yi − (Σ xi)(Σ yi)/n,
SSxx = Σ(xi − x̄)² = Σ xi² − (Σ xi)²/n.
• y-intercept: β̂0 = ȳ − β̂1 x̄.
STAT 515 – p.181
Example 11.1 – Page 566
• Refer to the reaction data presented in the previous
table. Consider the straight-line model E(y) = β0 + β1 x,
where y = reaction time and x = percent of drug received.
a. Use the method of least squares to estimate the
values of β0 and β1.
b. Predict the reaction time when x = 2%.
c. Find SSE for the analysis.
d. Give practical interpretations of β̂0 and β̂1.
•
It is easy to compute
Σ xi = 15, Σ yi = 10, Σ xi² = 55, Σ xi yi = 37, so
SSxy = Σ xi yi − (Σ xi)(Σ yi)/n = 37 − (15)(10)/5 = 7,
SSxx = Σ xi² − (Σ xi)²/n = 55 − (15)²/5 = 10.
It follows that
β̂1 = SSxy/SSxx = .7,
β̂0 = ȳ − β̂1 x̄ = 10/5 − (.7)(15/5) = −0.1.
•
The least squares line is given by:
ŷ = β̂0 + β̂1 x = −.1 + .7x
STAT 515 – p.182
STAT 515 – p.183
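The hand calculation above can be confirmed with lm() in R, in the same spirit as the R demo shown later in these slides (a sketch, not the textbook output):
x <- 1:5
y <- c(1, 1, 2, 2, 4)
fit <- lm(y ~ x)
coef(fit)                         # intercept -0.1, slope 0.7
predict(fit, data.frame(x = 2))   # fitted reaction time at x = 2%, i.e. 1.3
sum(resid(fit)^2)                 # SSE = 1.1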
Least Squares Line
[Figure 1: scatter plot of the reaction-time data with the least squares line; SSE = 1.1]
STAT 515 – p.184
Fun Exercise
•
Consider the following pairs of measurements:
x  5  3  -1  2  7  6  4
y  4  3   0  1  8  5  3
a. Construct a scattergram (scatter plot) of these data.
b. What does the scattergram suggest about the
relationship between x and y?
c. Can you calculate the least squares estimates of β0
and β1?
d. Plot the least squares line on your scattergram.
Does the line appear to fit the data well?
e. Interpret the y-intercept and slope of the least
squares line.
•
Solutions
STAT 515 – p.185
Regression Inference
•
Recall the model
y = β0 + β1 x + ε.
•
The least squares estimators for β1 and β0 are given by
β̂1 = SSxy / SSxx,   β̂0 = ȳ − β̂1 x̄.
•
By fitting this model, we know the estimates are based
entirely on the sample data. Can we say
something about these estimates across different samples?
In other words, how reliable are they?
STAT 515 – p.186
Ground Assumptions
•
In order to pursue statistical inference, some
assumptions are needed for the random component:
•
ε is assumed to have a probability distribution with mean 0.
•
The variance of the probability distribution of ε is
constant for all settings of the independent variable x. By
notation, we assume Var[ε] ≡ σ².
•
ε has a normal distribution.
•
The values of ε associated with any two observed values
of y are independent.
STAT 515 – p.187
Estimating σ²
•
Recall SSE = Σ(yi − ŷi)²; expanding it out and by some
algebra,
SSE = SSyy − β̂1 SSxy,
in which
SSyy = Σ yi² − (Σ yi)²/n.
One estimator for σ² is given by
s² = SSE / (n − 2),
where n is the sample size.
STAT 515 – p.188
Example
•
Referring to the drug reaction time example, please
compute an estimate of σ; below is the data.
Subject   Percent x of Drug   Reaction time y (seconds)
1         1                   1
2         2                   1
3         3                   2
4         4                   2
5         5                   4
STAT 515 – p.189
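A minimal R sketch of this computation for the drug data, following the identities SSE = SSyy − β̂1·SSxy and s² = SSE/(n − 2) stated above:
x <- 1:5; y <- c(1, 1, 2, 2, 4); n <- 5
SSxy <- sum(x * y) - sum(x) * sum(y) / n   # 7
SSxx <- sum(x^2) - sum(x)^2 / n            # 10
SSyy <- sum(y^2) - sum(y)^2 / n            # 6
SSE  <- SSyy - (SSxy / SSxx) * SSxy        # 1.1
sqrt(SSE / (n - 2))                        # s, about 0.61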
Making inference on the slope
•
In the previous study, we tried to visually inspect the
scattergram to see whether there is some
convincing evidence of a linear relationship.
• More formally, based on data we can perform a
two-sided test
H0 : β1 = 0  vs  Ha : β1 ≠ 0.
•
To perform this test, we must first find an appropriate
test statistic.
Sampling Distribution of β̂1
•
Research efforts suggest that one such test statistic may
be formed based upon β̂1; but its sampling distribution is
needed.
• Under those assumptions about ε, it can be shown that
the sampling distribution of the least squares estimator
β̂1 of the slope is normal with mean β1 and standard
deviation
σ_β̂1 = σ / √SSxx;
the formula above is called the standard error of the
slope estimator; it can be estimated by s/√SSxx.
STAT 515 – p.190
STAT 515 – p.191
Test Statistic
•
Since the sampling distribution of β̂1 is known, the
appropriate test statistic is a t-statistic, formed as
t = (β̂1 − hypothesized value of β1) / s_β̂1,
where
s_β̂1 = s / √SSxx.
The t-statistic above has a t-distribution with df = n − 2.
Exercise
•
For the drug reaction time example, conduct a .01-level
test to determine whether the reaction time (y) is
linearly related to the amount of drug (x).
• Solution to follow.
STAT 515 – p.192
STAT 515 – p.193
R Demo on linear regression
library(faraway)
data(stat500)
ls()
names(stat500)
attach(stat500)
plot(midterm, final)
abline(0, 1)
g <- lm(final ~ midterm, stat500)
summary(g)
g$coef
abline(g$coef, col="red")
Remark: One may have to install the faraway package first.
STAT 515 – p.194
The Fitted Model
[Figure: scatter plot of final versus midterm scores with the fitted regression line]
Summary on Testing β1
•
One-tailed test: H0 : β1 = 0 vs Ha : β1 < 0 (or
Ha : β1 > 0).
Test statistic:
t = β̂1 / s_β̂1 = β̂1 / (s/√SSxx).
Rejection region: t < −tα (or t > tα when Ha : β1 > 0).
• Two-tailed test: H0 : β1 = 0 vs Ha : β1 ≠ 0, and the
rejection region is given by |t| > tα/2,
where tα and tα/2 are based on (n − 2) degrees of
freedom.
STAT 515 – p.195
STAT 515 – p.196
t-statistic for drug reaction time
•
As we can compute, β̂1 = .7, s = .61, and SSxx = 10.
Thus
t = β̂1 / (s/√SSxx) = .7 / (.61/√10) = .7/.19 = 3.7.
If we are willing to risk a Type I error probability
α = 0.05, then the rejection region for t will be
|t| > t.025 = 3.182.
Apparently, 3.7 is in the upper rejection region, so we
reject the null hypothesis and conclude that the slope β1
is not 0. What is the observed p-value for this two-sided
test, approximately?
STAT 515 – p.197
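To answer the p-value question, a hedged R sketch: summary(lm()) reports the slope's t-value, and the two-sided p-value can be computed from the t-distribution with 3 degrees of freedom.
x <- 1:5; y <- c(1, 1, 2, 2, 4)
fit <- lm(y ~ x)
tval <- summary(fit)$coefficients["x", "t value"]   # about 3.66
2 * pt(-abs(tval), df = 3)                          # two-sided p-value, roughly .035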
Reading Output
•
Major statistical software packages, like R, Minitab, and
SPSS, will report a two-tailed p-value for each of the
β-parameters in the regression fitting. For example, in
simple linear regression, the p-value for the two-sided
test H0 : β1 = 0 versus Ha : β1 ≠ 0 is given on the
printout. If you want to perform a one-sided test of
hypothesis, you need to make the adjustment.
• Upper-tailed test Ha : β1 > 0: p-value = p/2 if the
observed t > 0, and 1 − p/2 if t < 0.
• Lower-tailed test Ha : β1 < 0: p-value = p/2 if the
observed t < 0, and 1 − p/2 if t > 0.
Confidence Interval for β1
•
A 100(1 − α)% confidence interval for the simple linear
regression slope is given by
β̂1 ± tα/2 s_β̂1,
where the estimated standard error of β̂1 is computed as
s_β̂1 = s / √SSxx,
and tα/2 is based on (n − 2) degrees of freedom.
STAT 515 – p.198
STAT 515 – p.199
Drug Reaction Time Data
•
What is a 95% confidence interval for β1?
• tα/2 is based on (n − 2) = 3 degrees of freedom. Reading
from the t-table, t.025 = 3.182.
• So a 95% confidence interval is given by
β̂1 ± t.025 s_β̂1 = .7 ± 3.182 (s/√SSxx) = .7 ± 3.182 (.61/√10),
which gives .7 ± .61. So the interval estimate for
the slope parameter β1 runs from .09 to 1.31, and one can
be 95% confident there is a positive correlation between
the response time and drug percentage.
STAT 515 – p.200
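The same interval can be obtained in R with confint(); a quick sketch using the drug data:
x <- 1:5; y <- c(1, 1, 2, 2, 4)
confint(lm(y ~ x), "x", level = 0.95)   # roughly (0.09, 1.31)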
Exercise
•
Consider the following pairs of observations:
y 4 2 5 3 2 4
x 1 4 5 3 2 4
a). Construct a scattergram of the data.
b). Use the method of least squares to fit a straight line
to the 6 data points.
c). Plot the least squares lines on the scattergram of
part a.
d). Compute the test statistic for determining whether x
and y are linearly related.
e). Carry out the test you set up in part d, using α = .01.
f). Find a 99% confidence interval for β1 .
STAT 515 – p.201
Correlation Coefficients
•
The (Pearson) correlation coefficient is defined as
r = SSxy / √(SSxx · SSyy)
for a sample of n paired measurements.
• It can be shown that −1 ≤ r ≤ 1, and r is scaleless. So
this is much better than β̂1. If r = 0, then β̂1 = 0, and
vice versa. So r = 0 means there is no linear relationship
at all. Can you guess what r = 1 (or r = −1) would
mean?
STAT 515 – p.202
Data Clouds
[Figure: four scatter plots illustrating r = 1 (perfect positive linear relationship), r = −1 (perfect negative linear relationship), r = 0 (no linear relationship), and r > 0 (some positive relationship)]
STAT 515 – p.203
The coefficient of determination
•
It turns out that if you square r, there is a relation
r² = (SSyy − SSE)/SSyy = 1 − SSE/SSyy.
This is an important measure of goodness of fit in linear
regression. A perfect fit gives r² = 1.
STAT 515 – p.204
Example
•
For the data
y  4  2  5  3  2  4
x  1  4  5  3  2  4
how much is its r?
• Using x <- c(1,4,5,3,2,4) and y <- c(4,2,5,3,2,4), we
know
SSxy = sum(x*y) − sum(x)*sum(y)/n = 2.67;
similarly, SSyy = 7.33, SSxx = 10.83; so r = 0.3.
STAT 515 – p.205
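In R, cor() returns r directly; a quick sketch checking the hand computation above:
x <- c(1, 4, 5, 3, 2, 4)
y <- c(4, 2, 5, 3, 2, 4)
cor(x, y)      # about 0.30
cor(x, y)^2    # r squared, the coefficient of determination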
Prediction
•
One important goal of regression analysis is prediction.
• For the drug reaction example, if somebody comes in
with a drug percentage measurement of 5%, can you
predict what the response time for this individual would
be?
• Recall the least squares line
ŷ = β̂0 + β̂1 x = −.1 + .7x,
so the predicted value would be ŷ = −.1 + .7(5) = 3.4
seconds. But how reliable is this estimate? What would
other samples tell you about this value?
Prediction Intervals
•
Instead of looking at point estimates, one may derive
confidence intervals for prediction.
• A 100(1 − α)% prediction interval for an individual new
value of y at x = xp is
ŷ ± tα/2 (estimated standard error of prediction),
or
ŷ ± tα/2 · s · √( 1 + 1/n + (xp − x̄)²/SSxx ),
where tα/2 is based on (n − 2) degrees of freedom.
STAT 515 – p.206
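A hedged sketch of how such an interval is obtained in R with predict(); the option interval = "prediction" gives the individual-value interval described above, here at x = 5 (the 5% case discussed on the Prediction slide):
x <- 1:5; y <- c(1, 1, 2, 2, 4)
fit <- lm(y ~ x)
predict(fit, newdata = data.frame(x = 5),
        interval = "prediction", level = 0.95)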
Example
•
Refer again to the drug reaction regression. Predict the
reaction time for the next performance of the experiment
for a subject with a drug concentration of 4%. Use a
95% prediction interval.
STAT 515 – p.207
STAT 515 – p.208
Something of less importance
•
Along with the prediction interval for a new subject at
x = xp, one can talk about estimating the mean of y in
the subpopulation with characteristic (predictor) xp.
•
The confidence interval is going to be different from the
prediction interval because
σ_ŷ = σ √( 1/n + (xp − x̄)²/SSxx )
is different from
σ_(y−ŷ) = σ √( 1 + 1/n + (xp − x̄)²/SSxx ).
STAT 515 – p.209
A walk through linear regression
•
Based on empirical grounds, and looking at the scattergram,
hypothesize the model
E(y) = β0 + β1 x.
•
Estimate the β's using least squares estimation.
• Keep in mind the assumptions on the random errors: (i)
mean(ε) = 0; (ii) Var(ε) = σ² stays constant; and (iii) the ε's
are independent, with normal distribution.
• t-tests for β0 and β1, to check model adequacy. Look for
quantities like s, p-values, and the F statistic.
• Estimation and/or prediction (using interval estimation).
Estimability
•
There are many causes of non-estimability.
• The textbook gives us a very simple one.
• Suppose you want to fit a model relating annual crop
yield y to the total expenditure for fertilizer, x. Let us
propose the following model:
E(y) = β0 + β1 x.
If you observe data with only a single value of x, the
parameters in the model cannot be estimated.
STAT 515 – p.210
Outlier
STAT 515 – p.211
Comparing two population means
•
In regression analysis, it is very important to identify
outliers because they may make your findings spurious.
An outlier is defined to be the observation whose
residual is larger than 3s (in absolute values).
• Example: Consider the following 10 data points.
x 3 5 6 4 3 7 6 5 4 1
y 4 3 2 1 2 3 3 5 4 7
(a) Is there sufficient evidence to indicate that x and y
are linearly correlated?
(b) Can you find any outlier?
(c) Can you compute R² and R²a (the adjusted R²)? Which one is bigger?
Is that always true?
STAT 515 – p.212
•
Problem: A study published in the Journal of American
Academy of Business examined whether the
perception of the quality of service at five-star hotels in
Jamaica differed by gender. Hotel guests were randomly
selected from the lobby and restaurant areas and asked
to rate 10 service-related items. Each item was rated on
a five-point scale (1 = "much worse than I expected,"
5 = "much better than I expected"), and the sum of the
items for each guest was determined. A summary of the
guest scores is provided in the following table:
Gender    Sample size   Mean Score   Standard Deviation
Males     127           39.08        6.73
Females   114           38.79        6.94
STAT 515 – p.213
Question
Is there any gender difference for rating the hotel? Can you
construct a 90% confidence interval for that difference?
•
To answer this question, let µ1 denote the mean score
for the male population, and let µ2 denote the mean score
for the female population. Suppose we have a sample x1 from
the male population and a sample x2 from the female population;
a natural point estimator for µ1 − µ2 is x̄1 − x̄2. In order
to construct a confidence interval, we need to work out
the sampling distribution. As most of you can imagine,
it is ________, with mean = µ1 − µ2 and variance equal to the
combined variability from x̄1 and x̄2.
STAT 515 – p.214
•
Problem: In an experiment to improve the Japanese
reading comprehension levels at the University of Hawaii,
14 students participated in a 10-week extensive reading
program in a second-semester Japanese course. The
numbers of books read by each student and the student's
course grade are reported in the following table:
# of Books   Course Grade      # of Books   Course Grade
53           A                 30           A
42           A                 28           B
40           A                 24           A
40           B                 22           C
39           A                 21           B
34           A                 20           B
34           A                 16           B
STAT 515 – p.216
•
In other words,
σ_(x̄1 − x̄2) = √( σ1²/n1 + σ2²/n2 );
and the 100(1 − α)% confidence interval here is of the
form
(x̄1 − x̄2) ± zα/2 σ_(x̄1 − x̄2).
STAT 515 – p.215
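A minimal R sketch of the 90% interval for the hotel example, plugging the summary statistics from the table into the large-sample formula above (the sample standard deviations stand in for σ1 and σ2):
x1 <- 39.08; s1 <- 6.73; n1 <- 127   # males
x2 <- 38.79; s2 <- 6.94; n2 <- 114   # females
se <- sqrt(s1^2 / n1 + s2^2 / n2)
(x1 - x2) + c(-1, 1) * qnorm(0.95) * se   # 90% CI for mu1 - mu2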
Question: Consider two populations of students who
participate in the reading program prior to taking a
second-semester Japanese course: those who earn an A grade
and those who earn a B or C grade. Of interest is the
difference in the mean number of books read by the two
populations of students. Can you draw inference on how
many more books a B student should read in order to get an A?
STAT 515 – p.217