A.P. Statistics Exam Study Guide
Lacey Kaplan
Unit 1
Important introductory terms:
Individuals (Observational units): The objects described by the set of data.
They can be people, but they can also be just about anything.
Variables: Any characteristic of an individual.
Distribution: Describes the different values of a variable and the frequency
(may be relative frequency) the variable takes on each value.
Categorical Variable: Places individuals into categories based on things like
race, gender, etc. These take the form of bar graphs or pie/circle graphs.
Quantitative variable: Takes on numerical values about the individual such as
height, weight, IQ score, etc. These take the form of histograms, stemplots, and box plots.
3 key features of a distribution:
Shape:
Symmetric: values are balanced
Skewed: one end of the distribution stretches out further than the
other.
Center:
Describes middle of distribution (mean or median)
Spread:
Variation in the data (Range, Standard deviation, or IQR)
Other features of a distribution:
Outliers:
A value not part of the overall pattern.
Mode:
Peak (s) or cluster (s) of a distribution.
Bimodal: a distribution with 2 distinct peaks
Mean ($\bar{x}$ or $\mu$): Average of the values for the variable
Median: The midpoint of the value for the variable
First Quartile (Q1): The midpoint for the lower half of the data
Third Quartile (Q3): The midpoint for the upper half of the data
Five number summary:
Minimum
Q1
Median
Q3
Maximum
The Q1 is the middle value of the bottom half of the data. If the median is
in the middle of two numbers start at the lower number and count back.
Range: Distance between the minimum value and the maximum value
Interquartile range: Distance between Q3 and Q1
Outlier: Any value outside [Q1-1.5xIQR,Q3+1.5xIQR]
Standard Deviation ($s_x$ or $\sigma_x$): Roughly the average distance the
value of the variable is from the mean
Variance ($s_x^2$ or $\sigma_x^2$)
Resistant measure: A value that is relatively unaffected by changing a small
proportion of the total number of values
Density curve: Has an area of exactly 1 underneath it and describes the
overall pattern underneath it
[Figure: three density curves. Skewed right: Mean > Median. Symmetric:
Mean = Median. Skewed left: Mean < Median.]
Standard Z-score: $z = \frac{x - \mu}{\sigma}$. This value tells you how many
standard deviations your value is from the mean.
[Figure: normal curve with the central 68%, 95%, and 99.7% regions marked]
This is the empirical rule (for a normal distribution):
68% of the data is within 1 standard deviation of the mean.
95% of the data is within 2 standard deviations of the mean.
99.7% of the data is within 3 standard deviations of the mean.
Four important formulas:
$\bar{x} = \frac{\sum x}{n}$
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
$z = \frac{x - \bar{x}}{s}$
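The four formulas above can be checked with a short Python sketch. The data values and the function names `summary_stats` and `z_score` are made up for illustration and are not part of the guide.

```python
import math

def summary_stats(values):
    """Return (mean, sample variance, sample standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance divides by n - 1, matching the formula for s^2.
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, variance, math.sqrt(variance)

def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / sd

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set
mean, variance, sd = summary_stats(data)
print(mean, variance, sd, z_score(9, mean, sd))
```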
Z-score applications:
1) Assume the distribution is normal.
2) Sketch the normal curve.
3) Use the formula to find the z-score
4) Use normal cdf to find the %
5) Answer in context of the question
Calculator functions for Unit 1:
To find % normal curve above, below, or between given Z-scores:
Normal cdf (min, max)
To find z-score for a given curve:
Invnorm (proportion below)
Example: Find the z-score associated with the top 20% of the normal curve.
Invnorm (.8)= .842
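If no calculator is handy, the same two functions can be sketched in Python with the standard library's `statistics.NormalDist`. The names `normalcdf` and `invnorm` below are hypothetical stand-ins that mimic the calculator commands.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def normalcdf(z_min, z_max):
    """Proportion of the standard normal curve between two z-scores."""
    return std_normal.cdf(z_max) - std_normal.cdf(z_min)

def invnorm(proportion_below):
    """z-score with the given proportion of the curve below it."""
    return std_normal.inv_cdf(proportion_below)

print(normalcdf(-1, 1))  # about 68% of the curve
print(invnorm(0.8))      # cutoff for the top 20%
```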
Unit 2:
Bivariate data: data involving two quantitative variables
The x-axis has the explanatory variable, and the y-axis has the response
variable.
Key Features:
1) Form: linear
2) Direction (positive/ negative)
3) Strength: weak, moderate, strong
We describe the association between two variables.
The Correlation Coefficient:
The correlation coefficient, r, is the statistic that measures the strength
and direction of a linear association between two variables.
There are several formulas for r but for now it is sufficient to know that:
1) r is a value from -1 to 1, never outside this range
2) Positive values of r indicate positive association, negative values of r
indicate negative association; if r=0, there is no linear association.
3) The choice of x & y as the explanatory and response variables does not
matter
4) The scale of the x and y axes does not matter
5) R is useful only for linear relationships and tells us nothing about nonlinear relationships
6) R is not a resistant measure and is very sensitive to certain kinds of
outliers known as influential points
7) R can only be found for quantitative variables
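What values of r mean:
[Figure: number line for r running from -1 to +1. Negative correlation:
strong near -1 to -.8, moderate near -.5, weak near -.3. No correlation at 0.
Positive correlation: weak near .3, moderate near .5, strong near .8 to +1.]
The formula for the correlation coefficient, r is:
$r = \frac{1}{n - 1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$
You can also think about it like this:
$r = \frac{\sum z_x z_y}{n - 1}$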
Lurking variable: a hidden variable that drives both variables in an
association. Example: temperature explains why crime goes up as ice cream
sales go up. You can't therefore say that a strong association implies
causation.
LSRL (Least squares regression line):
$\hat{y} = b_0 + b_1 x$
b0= where you are starting/ y intercept
b1= slope/ rate of change
Interpreting the LSRL:
$\widehat{\text{APexam}} = 6.97 + .1278(\text{midterm})$
For every additional point you get on the midterm, we would expect your AP
exam score to be 0.1278 higher.
Finding r:
On a plot of $z_y$ against $z_x$, the slope of the regression line is r.
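Example: Fast food sandwiches
Protein: $\bar{x} = 17.2$ g, $s_x = 14.0$ g
Fat: $\bar{y} = 23.5$ g, $s_y = 16.4$ g
r=.83
$b_1 = r\left(\frac{s_y}{s_x}\right) = .83\left(\frac{16.4}{14.0}\right)$
b1= .972
The line of best fit goes through the average, so you can plug in the average
to find b0: $b_0 = \bar{y} - b_1\bar{x} = 23.5 - .972(17.2)$
B0= 6.8
Final equation:
$\widehat{\text{Fat}} = 6.8 + .972(\text{protein})$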
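The slope and intercept in this example can be reproduced from the summary statistics alone. This is a minimal sketch with a hypothetical helper name, `lsrl_from_summary`. It assumes $s_x = 14.0$ g, the value that agrees with the stated slope of .972.

```python
def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    b1 = r * s_y / s_x         # slope
    b0 = y_bar - b1 * x_bar    # the line passes through (x-bar, y-bar)
    return b0, b1

# Protein (x) and fat (y) summary values; s_x = 14.0 is an assumption.
b0, b1 = lsrl_from_summary(x_bar=17.2, s_x=14.0, y_bar=23.5, s_y=16.4, r=0.83)
print(round(b0, 1), round(b1, 3))
```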
Coefficient of determination, r2:
The % variability in the response variable accounted for by the model.
R2 (as a percent) of the variation in (response variable) is accounted for by
this model .
Calculator functions and further notes for Unit 2:
Creating a scatterplot:
Enter data into two lists.
Press 2nd statplot.
Turn the desired plot on, choose 1st type of plot, enter list name
For explanatory variable in xlist, response variable in ylist. Choose
zoom then 9.
Describe scatterplots by characterizing form, direction, and strength.
Form: The general pattern. Can be characterized as linear, non-linear,
quadratic, exponential, etc.
Direction: Can be positively associated (above-average values of explanatory
correspond to above-average values of response), negatively associated
(above-average values of explanatory correspond to below-average values of
response), or none.
Strength: Describe how closely the points follow a clear (not necessarily
linear) form. Scatterplots that show tight adherence to the general form
have strong association, plots where the points have significant spread about
the general form have weak association. Strength is difficult to determine
by eye due to possible distortion based on scaling of the axes.
Finding the correlation coefficient:
Enter data into 2 lists.
Press stat, arrow to calc, choose 4: linreg(ax+b).
Enter two list names separated by a comma, read r (correlation
coefficient) and r2 (coefficient of determination) off the screen.
Characteristics of correlation and the correlation coefficient r:
1) The correlation coefficient will be the same regardless of which of
the two variables is designated as the explanatory variable.
2) R does not change if the units of measure of x or y are changed, or if
each x or y value is multiplied by a constant, or if a constant is added
to each x or y value.
3) Correlation measures only the strength of a linear relationship.
Correlation does not describe curved relationships, no matter how
strong.
4) Correlation is not a resistant measure.
5) Correlation does not imply causation (lurking variables).
Finding the regression line:
Enter data into 2 lists.
Press stat, arrow to calc, choose 4:linreg(ax+b)
Enter two list names separated by a comma, then (if desired) another
comma and a functional variable for graphing the line by choosing vars
arrowing to y vars then 1: function and then choosing y1 or another
desired y variable. The slope and intercept of the regression line will
be displayed, along with r and r2. The equation of the line will be
inserted into y1.
The regression line describes how the response variable (y) changes as an
explanatory variable (x) changes.
LSRL: the line that minimizes the sum of the squared vertical distances
(residuals) from the data points to the line.
$\hat{y}$ = the predicted, or estimated value of the response variable obtained
from the LSRL.
Characteristics of the regression line:
1) It is important to correctly indicate the explanatory and response
variables.
2) The least squares regression line always passes through the point
$(\bar{x}, \bar{y})$.
3) The slope of the LSRL is $b_1 = r \frac{s_y}{s_x}$.
4) The y intercept is $b_0 = \bar{y} - b_1 \bar{x}$.
5) R2, the coefficient of determination, indicates the proportion of the
response variable variation that is explained by the least squares
regression line.
6) In the regression line for the points (zx,zy), r is the slope and
the y-intercept=0.
Prediction with the Regression Line: calculating $\hat{y}$ for a given x using
the regression line equation.
Interpolation: Using the regression line to predict values within the range of
the explanatory variable. Appropriate and usually accurate.
Extrapolation: Using the regression line to predict values outside the range
of the explanatory variable. Dangerous and often inaccurate.
Residuals:
The difference between the observed value of the response variable and the
expected value, which is predicted by the regression line. It is simply the
vertical distance the point is from the regression line. If the point is
above the regression line, the residual is positive. If the point is below
the regression line, the residual is negative.
residual = $y - \hat{y}$
Characteristics of residuals:
1) The sum of the residual is always equal to 0.
2) The mean of the residuals is always equal to 0.
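A quick way to see that the residuals of a least squares line sum to 0 is to fit a line to a small data set and add them up. The data and the function name `fit_lsrl` are made up. The slope is computed as $\sum (x - \bar{x})(y - \bar{y}) / \sum (x - \bar{x})^2$, which is algebraically the same as $r \cdot s_y / s_x$.

```python
import statistics

def fit_lsrl(xs, ys):
    """Return (b0, b1) of the least squares regression line."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values
b0, b1 = fit_lsrl(xs, ys)

# residual = observed y minus predicted y-hat
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, sum(residuals))
```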
Residual plot: A scatterplot of residuals. The residuals are plotted on the
y-axis vs. the explanatory variable on the x-axis. Visualize the regression
line being rotated to make it horizontal while each point's distance from the
line stays the same.
If a residual plot shows no systematic pattern and an even scattering about
the line y=0, then the regression line is a good fit to the data.
Residual Plots:
Whenever a regression is run on the calculator, the residuals will be
placed in a list called resid.
To view a residual plot, follow the directions for a scatterplot: press
2nd statplot, turn the desired plot on, choose 1st type of plot, choose
explanatory variable for xlist, resid for ylist. Choose zoom then 9.
Influential points:
Individual data points which, if removed, dramatically affect the regression
line (slope and or intercept) and correlation coefficient.
1) An influential datapoint may have a small residual because it pulls the
regression line toward itself.
2) The regression line is not resistant.
Unit 3
Lesson A1: Introduction to Probability
The probability of an event is a value between 0 and 1 and is a measure of
how likely an event is to happen.
Probability = (# times event has occurred) / (total # of possibilities)
We use probability to describe how likely random events are to happen.
A random event is one that is unpredictable in the short term, but has a
regular distribution of outcomes in a large number of repetitions.
Examples of random events: tossing a thumbtack, playing roulette, tossing a
die.
Random events are studied through a long series of trials which must be
independent.
Events are independent when one outcome has no bearing on future
outcomes.
Law of large numbers: if you keep flipping coins, the observed proportion of
heads gets closer and closer to the true probability of 50%.
Or: add
And: multiply
Vocabulary:
Event: a particular outcome or set of outcomes (trials)
Sample space: a set of all possible outcomes
Mutually exclusive or disjointed events: two events have no outcomes in
common
Example of a probability table:
            Saturday   Sunday
Rain          40%       70%
No Rain       60%       30%
Total        100%      100%
Rain Sat. and Sun.: (.40)(.70) = 28%
No Rain Sat. and Sun.: (.60)(.30) = 18%
Rules of Basic Probability:
1. $0 \leq P(X) \leq 1$
2. Sum of the probabilities for all possible outcomes in a sample space must
total one
3. P(not X)= 1-P(X)
4. If events A and B are disjoint, then:
P(A or B)= P(A)+P(B)
5. Disjoint events are said to be mutually exclusive. This means that they
have no outcomes in common.
Sample Problems for Lesson A1:
Suppose that 40% of cars in your area are manufactured in the United
States, 30% in Japan, 10% in Germany, and 20% in other countries. If cars
are selected at random, find the probability that:
1) A car is not U.S. made: 60%
2) It is made in Japan or Germany: 40%
3) You see two cars in a row from Japan: (.3)(.3) = 9%
4) None of three cars come from Germany: 72.9%
5) At least one of three cars is U.S. made: 78.4%
6) The first Japanese car is the fourth one you see: 10.29%
7) At least one of five cars is from Germany: 40.95%
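The seven answers follow from the complement, addition, and multiplication rules. The sketch below spells out each calculation, assuming the cars are selected independently. The dictionary keys are descriptive labels invented for this example.

```python
p_us, p_japan, p_germany, p_other = 0.40, 0.30, 0.10, 0.20

answers = {
    "not_us": 1 - p_us,                                   # complement rule
    "japan_or_germany": p_japan + p_germany,              # disjoint: add
    "two_japan_in_a_row": p_japan ** 2,                   # independent: multiply
    "none_of_three_german": (1 - p_germany) ** 3,
    "at_least_one_of_three_us": 1 - (1 - p_us) ** 3,      # 1 - P(none)
    "first_japan_is_fourth": (1 - p_japan) ** 3 * p_japan,
    "at_least_one_of_five_german": 1 - (1 - p_germany) ** 5,
}

for name, value in answers.items():
    print(f"{name}: {value:.4f}")
```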
More on A.P. Statistics Quiz A- Chapter 14 Worksheet
Lesson A2: Mutually Exclusive and Independent Events
Two events are mutually exclusive if they cannot occur at the same time (i.e.
they have no common outcomes). Ex: If you flip a coin, it is heads or tails,
not both.
Two events are independent if the fact that A occurs does not affect the
probability of B occurring. Ex: If you roll 2 dice, the result of the first
does not affect the result of the second.
Addition and Multiplication Rules:
Key:
$\cup$ = or
$\cap$ = and
B/A (also written B|A) = B given that A has occurred
(Addition “Or” Problems):
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
(Multiplication “And” Problems):
$P(A \cap B) = P(A) \times P(B)$
When the events are not independent…
$P(A \cap B) = P(A) \times P(B/A)$
(Given Problems):
$P(B/A) = \frac{P(A \cap B)}{P(A)}$
Sample Problems on Lesson A2:
            Boy    Girl   Total
Grades      117    130    247
Popular      50     91    141
Sports       60     30     90
Total       227    251    478
The table above shows randomly chosen 4th, 5th, & 6th graders classified
by whether their primary goal was to get good grades, to be popular, or to
be good at sports.
1) Based on the table, a randomly chosen child has the following
probabilities:
P(girl)= 251/478
P(girl and popular)= 91/478
P(sports)= 90/478
Conditional Probability takes into account that a given condition has already
occurred. For example, the probability of choosing a student that excels at
sports given that we have selected a girl is:
P(sports/girl)= 30/251
2) Find the probability of choosing a student that desires good grades
given a boy is chosen. 117/227
3) Given that a boy is chosen, find the probability of choosing a student
that desires to be popular. 50/227
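Conditional probabilities from a two-way table come from restricting attention to the "given" group and counting within it. A small sketch of that idea follows. The dictionary layout and the function name `conditional` are illustrative choices, not part of the guide.

```python
from fractions import Fraction

# Counts from the table, keyed by (gender, goal)
counts = {
    ("boy", "grades"): 117, ("girl", "grades"): 130,
    ("boy", "popular"): 50, ("girl", "popular"): 91,
    ("boy", "sports"): 60,  ("girl", "sports"): 30,
}

def conditional(goal, gender):
    """P(goal | gender): count the goal within the given gender only."""
    gender_total = sum(c for (g, _), c in counts.items() if g == gender)
    return Fraction(counts[(gender, goal)], gender_total)

print(conditional("sports", "girl"))   # 30/251
print(conditional("grades", "boy"))    # 117/227
print(conditional("popular", "boy"))   # 50/227
```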
Police report that 78% of drivers stopped on suspicion of drunk driving are
given a breath test, 36% a blood test, and 22% both tests. What is the
probability that a randomly selected DWI suspect is given:
Make a Venn Diagram:
Breath only: 78% - 22% = 56%
Both: 22%
Blood only: 36% - 22% = 14%
Neither: 100% - (56% + 22% + 14%) = 8%
More on SCPM- General Rules of Probability Sheet
Lesson A3: Tree Diagrams
Sample Problems on Lesson A3:
Employment data at a large company revealed that 72% of workers are
married, that 44% are college graduates, and that half of college graduates
are married.
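MARRIED (.72)
   COLLEGE: .30 -> married and college: 22%
   NO COLLEGE: .70 -> married and no college: 50.4%
UNMARRIED (.28)
   COLLEGE: 78% -> unmarried and college: 22%
   NO COLLEGE: 22% -> unmarried and no college: 6%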
More on AP STAT Tree Diagram Practice Sheet
See “AP STAT- Probability Quiz Review” and “AP STAT- Probability Quiz”
for more on Lesson A
Lesson B1: Expected Value
$E(x) = \sum x \cdot P(x)$
Example 1:
Game with a die. Roll 1 Win $1
Roll 2 Win $2
Roll 3 Win $3
Roll 4 Win $4
Roll 5 Win $5
Roll 6 Win $6
Win ($):  1     2     3     4     5     6
Prob.:    1/6   1/6   1/6   1/6   1/6   1/6
E(x)= 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6)
E(x)= 21/6
E(x)= 3 (1/2)= $3.50
$3.50 is the average winning.
Example 2: Mrs. Smith has a litter of 7 golden puppies: 3 male and 4 female.
You randomly choose 2 puppies. Find the expected number of male puppies.
# Males   Prob.
0         (4/7)(3/6) = 12/42
1         (3/7)(4/6) + (4/7)(3/6) = 24/42
2         (3/7)(2/6) = 6/42
*All of the probabilities must add up to 1
E(x)= 0(12/42) + 1(24/42) + 2(6/42) = 36/42 = 6/7 male puppies
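The expected value calculation can be double-checked with exact fractions. This is only a sketch. `expected_value` is a made-up helper that applies E(x) = sum of x times P(x) to a table stored as a dictionary.

```python
from fractions import Fraction

def expected_value(distribution):
    """E(x) = sum of x * P(x); the probabilities must add up to 1."""
    assert sum(distribution.values()) == 1
    return sum(x * p for x, p in distribution.items())

# Example 1: the die game, each payout has probability 1/6
die_game = {win: Fraction(1, 6) for win in range(1, 7)}

# Example 2: number of males when 2 of 7 puppies (3 male, 4 female) are chosen
males = {
    0: Fraction(4, 7) * Fraction(3, 6),
    1: Fraction(3, 7) * Fraction(4, 6) + Fraction(4, 7) * Fraction(3, 6),
    2: Fraction(3, 7) * Fraction(2, 6),
}

print(expected_value(die_game))  # 7/2
print(expected_value(males))     # 6/7
```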
How to find expected value on your calculator:
List 1: Row 1
List 2: Row 2
Stat- Calc- 1 Var Stat L1,L2
$\bar{x}$ = expected value (A.K.A. average)
$\sigma$ = standard deviation
Example 3: A couple plans to have children until they have a girl, but they
agree that they will not have any more than three children even if all are
boys.
1) Find the expected number of children.
# Children   Prob.
1            1/2
2            (1/2)(1/2) = 1/4
3            (1/2)^3 + (1/2)^3 = 1/4
E(x)= 1(1/2) + 2(1/4) + 3(1/4) = 1.75 children
2) Find the expected number of boys that they will have
# Boys   Prob.
0        1/2
1        1/4
2        1/8
3        1/8
E(x)= 0(1/2) + 1(1/4) + 2(1/8) + 3(1/8) = 7/8 boys
Example 4: In a litter of seven kittens, three are female. You pick two
kittens at random.
1) What is the expected number of male kittens you would get?
# Males   Prob.
0         (3/7)(2/6) = 6/42
1         (4/7)(3/6) + (3/7)(4/6) = 24/42
2         (4/7)(3/6) = 12/42
E(x)= 0(6/42) + 1(24/42) + 2(12/42) = 48/42 = 8/7 male kittens
See “AP Stat-Probability Models” and “Probability Model Quiz” for more on
Lesson B1.
Lesson B2: The Wieland and Liza Problem
Given random variable x
E(x+c)= E(x)+c
(c is a constant)
E(x+y)= E(x)+E(y)
Standard deviation of a sum (x+y) or a difference (x-y), for independent x
and y:
$\sigma_{x \pm y} = \sqrt{\sigma_x^2 + \sigma_y^2}$
Example 1:
Wieland and Liza have cereal for breakfast each morning. Wieland has an
average of 14 oz of cereal (st. dev. 3 oz) and 8 oz of milk (st. dev. 1.4 oz).
Liza has an average of 7.5 oz of cereal (st. dev. 1.3 oz) and 5 oz of milk
(st. dev. .8 oz).
1) What is the average size of Wieland’s breakfast and standard
deviation?
Average Size: (14+8)=22
Standard Deviation: $\sqrt{3^2 + 1.4^2} = 3.31$
2) Wieland goes on a diet and eats only ½ of his usual breakfast. Find the
new average and standard deviation.
New average: $\frac{22}{2} = 11$
New standard deviation: $\frac{3.31}{2} = 1.655$
3) What is Liza’s average breakfast for the entire week?
Per day average: 12.5
Per day St. Dev.: 1.53
Per week average:
12.5+12.5+12.5+12.5+12.5+12.5+12.5= 87.5
$12.5 \times 7 = 87.5$
Per week St. Dev.:
$\sqrt{(1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2} = 4.05$
$\sqrt{7(1.53)^2} = 4.05$
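The pattern in this example (add the variances, then take the square root) can be sketched in a few lines. It assumes the quantities being combined are independent. `sd_of_sum` is a hypothetical helper name.

```python
import math

def sd_of_sum(*sds):
    """SD of a sum (or difference) of independent variables."""
    return math.sqrt(sum(sd ** 2 for sd in sds))

wieland_sd = sd_of_sum(3, 1.4)       # cereal + milk
half_sd = wieland_sd / 2             # scaling by 1/2 scales the SD by 1/2
week_sd = sd_of_sum(*[1.53] * 7)     # seven independent days for Liza

print(round(wieland_sd, 2), round(half_sd, 3), round(week_sd, 2))
```

Note the difference between the two cases: halving one breakfast divides the SD by 2, while adding seven separate days multiplies it by the square root of 7, not by 7.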
If each event is independent and there is a normal distribution:
For two variables:
$z = \frac{x - \mu}{\sigma}$
z= Z-score
x= difference in x and y
$\mu$ = average difference
$\sigma$ = standard deviation of the difference
For one variable:
$z = \frac{x - \mu}{\sigma}$
z= Z score
x= the value in question
$\mu$ = average
$\sigma$ = standard deviation
Example of “for one variable”:
Wieland has an average of 22 oz and a standard deviation of 3.31. Find
the probability that Wieland has more than 25 oz.
$z = \frac{25 - 22}{3.31} = .906$
Because you are looking for the probability that he has MORE than 25 oz:
2nd Vars- normal cdf- (.906,99)
Answer: P(z>.906)= .182
18.2% chance he has more than 25 oz.
Example of “for two variables”:
Find the probability that Liza has more cereal than Wieland.
W: $\mu$ = 22, $\sigma$ = 3.31
L: $\mu$ = 12.5, $\sigma$ = 1.53
Average difference: 22-12.5=9.5
Standard deviation of the difference:
$\sqrt{(3.31)^2 + (1.53)^2} = 3.65$
$z = \frac{0 - 9.5}{3.65} = -2.6$
(You make x zero because you want the difference between Wieland and
Liza to be less than zero, because that would mean that Liza has MORE
cereal than Wieland. You are looking for a negative Z score.)
Normal cdf(-99, -2.6)= .00466
P(z<-2.6)=.00466
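Both Wieland and Liza calculations can be reproduced with `statistics.NormalDist`, which plays the role of normal cdf here. This is a sketch that assumes, as the lesson does, that the amounts are independent and normally distributed.

```python
import math
from statistics import NormalDist

# One variable: P(Wieland has more than 25 oz)
wieland = NormalDist(mu=22, sigma=3.31)
p_more_than_25 = 1 - wieland.cdf(25)

# Two variables: difference D = Wieland - Liza; Liza has more when D < 0
difference = NormalDist(mu=22 - 12.5, sigma=math.sqrt(3.31 ** 2 + 1.53 ** 2))
p_liza_more = difference.cdf(0)

print(round(p_more_than_25, 3))
print(round(p_liza_more, 4))
```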
Lesson B3: Geometric and Binomial Probability
Preliminary Example: 20% of cheerios boxes have a Dora prize. How many
boxes would Wieland have to buy so Liza gets a prize?
P(win on 5th box)= $(.8)^4 (.2)$ = .08192
Geometric and Binomial Restrictions:
1) Each trial is independent
2) Only the option of “success” or “failure” (“p” or “q”) (Only 2 options)
Basic Vocabulary:
Permutation: order matters
Combination: order doesn’t matter
${}_nC_r$ = number of combinations
Calculator: N, Math, PRB, nCr, R
${}_nC_r = \frac{n!}{r!(n - r)!}$
$P = {}_nC_r \, p^r (1 - p)^{n - r}$
binompdf(n,p,r)
2nd- vars- binompdf
Example 1: Picking 5 boxes and winning exactly 3 times.
$P = {}_5C_3 (.2)^3 (.8)^2 = .0512$
binompdf(5, .2, 3)
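For readers without the calculator, the two probability formulas in this lesson can be sketched directly. The function names `binompdf` and `geometpdf` are borrowed from the calculator commands as labels. They are plain Python functions defined here, not library calls.

```python
from math import comb

def binompdf(n, p, r):
    """P(exactly r successes in n independent trials)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def geometpdf(p, k):
    """P(first success happens on trial k)."""
    return (1 - p) ** (k - 1) * p

print(binompdf(5, 0.2, 3))   # exactly 3 prizes in 5 boxes
print(geometpdf(0.2, 5))     # first prize in the 5th box
```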
Lesson B4: Binomial Probability
Make sure:
1) Each trial is independent
2) Only success and failure
Example 1: 10% of the population is left handed. 4 people walk in the
room. Find the probabilities:
# LH Ppl   Prob.
0          $1(.9)^4$
1          $4(.1)^1 (.9)^3$
2          $6(.1)^2 (.9)^2$
3          $4(.1)^3 (.9)^1$
4          $1(.1)^4$
How you would use binompdf for these problems:
Example, for 2 LH people you would do:
binompdf (4,.10,2) -> Exactly 2 LH people
Lesson B5: Binomial Probability and the Normal Model
When the numbers are large, the normal model can be used.
If we expect:
$np \geq 10$
$nq \geq 10$
E(x)= np
$\sigma = \sqrt{npq}$
Example 1: The American Red Cross needs at least 1850 O-Negative
donors. If 6% of donors are O-Negative, find the probability that a
group of 35,000 people has at least 1850 O-Negative donors.
n= 35,000
p= .06
q= .94
E(x)= (35,000)(.06)= 2100
$\sigma = \sqrt{(35,000)(.06)(.94)} = 44.4$
Now go back to the Z score idea from Lesson B2:
$z = \frac{1850 - 2100}{44.4}$
z= -5.63
$P(z > -5.63) \approx 1.00$
Lesson B6: binomcdf Uses
Example: An archer hits the bull’s eye 80% of the time. He shoots 6 arrows.
# Bulls Eyes   Prob.
0              $6.4 \times 10^{-5}$
1              .001536
2              .01536
3              .08192
4              .24576
5              .3932
6              .26214
(Find the probabilities based on what you learned in Lesson B4)
1) P(6th is first miss)= $(.8)^5 (.2)$
2) P(misses exactly once)= binompdf (6, .8, 5) or binompdf (6, .2, 1)
3) P(more than 3 bulls eye)=
binompdf (6, .8, 4)+ binompdf (6, .8, 5)+ binompdf (6, .8, 6)
Wait, there must be a quicker way…
binomcdf adds up everything AT OR BELOW the number you put in for “c”.
This means for “more than”, or anything having to do with going ABOVE
that number, you must do 1-binomcdf.
1-binomcdf(6, .8, 3)
Visual Expression of that:
0 1 2 3 | 4 5 6
4 numbers below 4, so do 4-1=3. 3 is put in for c.
4) P(less than 5 bullseye):
0 1 2 3 4 | 5 6
5 numbers below 5, so do 5-1=4. 4 is put in for c.
binomcdf (6, .8, 4)
5) P(at least 4):
At least has the same idea as “more than”. It is asking you to go above a
number, so you have to do 1-binomcdf.
0 1 2 3 | 4 5 6
4 numbers below 4, so do 4-1=3. 3 is put in for c.
1-binomcdf (6, .8, 3)
6) P(at most 3):
0 1 2 3 | 4 5 6
At most 3 means 3 or below, so 3 is put in for c.
binomcdf(6, .8, 3)
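The four cases above can be checked with a small sketch of binomcdf built from the binomial formula. `binompdf` and `binomcdf` here are plain Python functions named after the calculator commands. The third argument is the largest count included in the sum.

```python
from math import comb

def binompdf(n, p, r):
    """P(exactly r successes in n trials)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def binomcdf(n, p, c):
    """P(c or fewer successes in n trials)."""
    return sum(binompdf(n, p, r) for r in range(c + 1))

more_than_3 = 1 - binomcdf(6, 0.8, 3)
less_than_5 = binomcdf(6, 0.8, 4)
at_least_4 = 1 - binomcdf(6, 0.8, 3)
at_most_3 = binomcdf(6, 0.8, 3)

print(more_than_3, less_than_5, at_least_4, at_most_3)
```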
Lesson B7: Simulations
Simulation is a way to estimate probabilities when we are either unable to
determine probabilities analytically or do not have the time, resources, or
money to estimate probability by observation.
Some probabilities associated with gaming (poker, 21) are difficult to
calculate by hand because the composition of a deck of cards is
constantly changing. High speed computer simulations with many, many
trials are used to predict probabilities like these.
Example 1:
Pascack Valley A.P. Students have a 70% probability of achieving a score
of “4” or “5” on the A.P. Exam in May. What is the probability that out of
5 randomly selected students at least 4 obtain a score of “4” or “5”.
Use the random number table so that 1,2,3,4,5,6,7= success and 8,9,0=
failure. Choices could vary here.
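Trial #   Digits   # Successes   At least 4?
1         19223    4             Yes
2         95034    3             No
3         05756    4             Yes
4         28713    4             Yes
5         96409    2             No
6         12531    5             Yes
7         42544    5             Yes
8         82853    3             No
9         73676    5             Yes
10        47150    4             Yes
7/10 trials are “Yes” (at least 4 of the 5 students score a 4 or a 5), so the
estimated probability is .70.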
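The same simulation can be run by computer with many more trials. The sketch below first scores the ten trials from the random number table, then repeats the process with Python's `random` module. The function names, the seed, and the trial count are arbitrary choices for this example.

```python
import random

def count_successes(digits):
    """Digits 1-7 count as a success (score of 4 or 5); 8, 9, 0 are failures."""
    return sum(1 for d in digits if d in "1234567")

# The ten trials from the random number table
table_trials = ["19223", "95034", "05756", "28713", "96409",
                "12531", "42544", "82853", "73676", "47150"]
table_estimate = sum(count_successes(t) >= 4 for t in table_trials) / len(table_trials)

def simulate(num_trials, seed=0):
    """Estimate P(at least 4 of 5 students succeed) from random digits."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_trials):
        digits = "".join(str(rng.randint(0, 9)) for _ in range(5))
        if count_successes(digits) >= 4:
            hits += 1
    return hits / num_trials

computer_estimate = simulate(100_000)
print(table_estimate, computer_estimate)
```

With only ten trials the estimate is rough. The long-run value should settle near the binomial answer, 5(.7)^4(.3) + (.7)^5, which is about .528.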
Unit 4:
Methods of Sampling:
Random sampling: Each member of the population has an equal chance of
being selected. Computers are often used to generate random telephone
numbers.
Stratified Sampling: Classify the population into at least two strata, then
draw a sample from each.
Systematic sampling: Select every nth member.
Cluster sampling: Divide the population area into sections, randomly select a
few of those sections, and then choose all members in them.
Convenience sampling: Use results that are readily available (NOT GOOD!).
Cautions about sample surveys:
1) Undercoverage: some groups in the population are left out of the
sample selection process
2) Nonresponse: individual chosen for a sample won’t cooperate or can’t
be contacted
3) Bias: a study is biased if it systematically favors certain outcomes
4) Response bias: a respondent may not be truthful
5) Interviewer bias: interviewers may try to obtain certain answers by
their attitude
6) Question wording: can influence a respondent’s answer by leading
them towards a particular point of view
7) Small samples: are not as accurate as large ones. Large samples
decrease the margin of error of a sample.
Important terms:
Census: survey of the entire population
Sample: piece of the entire population
You must write in detail how you take your random sample.
Population parameter: a number that describes the whole population (what we
are trying to estimate)
Statistic: a number that summarizes the sample data
Notes on Observational Studies and Experimental Design:
Observational Studies
1) Retrospective study: a look back on past events
2) Prospective study: researcher identifies subjects in advance and
collects data as events unfold
3) Observational studies can be used to find trends or identify possible
relationships. Not cause and effect though.
4) Observational studies do not demonstrate a causal relationship
Randomized, Comparative Experiments
1) A study that allows us to prove a cause and effect relationship
2) Researcher identifies at least one explanatory variable (known as a
factor) to manipulate and at least one response variable to measure
3) An experiment:
a. Manipulates factor levels to create treatments
b. Randomly assigns subjects to these treatment levels
c. Compares the responses of the subject groups across treatment
levels
d. The individuals that we experiment on are experimental units
(subjects/ participants if they are human)
e. The values for a factor are called levels
f. A treatment is a combination of specific levels from all the factors
that an experimental unit receives
4) Four principles of experimental design:
a. Control: we must control sources of variation other than the
factors we are testing
b. Randomization: allows us to equalize the effects of unknown or
uncontrollable sources of variation
c. Replication: repeat the experiment, applying the treatments to a
number of subjects. We sometimes replicate on different groups
allowing us to generalize the results
d. Blocking: grouping similar individuals together and then randomize
within these blocks (not required)
5) New terms
a. Control treatments: a baseline measurement (control group)
b. Blinding: not allowing individuals who can influence the results
(subjects, treatment administrators, technicians) or those who
evaluate the results (judges, treating physicians, etc) to know which
treatment was assigned
• Single-blind: When one group is blinded
• Double-blind: When both groups are blinded
c. Placebo: A “fake” treatment that looks just like the actual
treatment
Further notes:
• Undercoverage is when someone is not asked that should be, and
voluntary response bias is when they don’t answer for a specific
reason
• Census: everyone in the population
• Population: who you are trying to generalize to
• People that answer are your sample
• Sample are only people that stop to answer you
• Simple random sample: everyone has an equal chance of getting chosen
• Multistage sample: Use of more than one type of sampling method.
Example: stratified by grades, broke up honors and not honors,
stratify and cluster
• Study can be observational or experimental.
• Factors: what you are looking at
o Example: Trying to find out if warming up or sleeping make
people run faster.
o Factors: sleep and warm up time
o Levels: 6 hours sleep, 8 hours sleep, 10 hours sleep; 20 minutes
warm up, 0 minutes warm up
o Treatments: 6 treatments (3 sleep levels x 2 warm up levels)
• Experimental units: who you run the experiment on
• Response variable: running times (for above example)
• 4 principles in more depth:
o Randomized: must randomly assign treatment
o Replication: have to be able to replicate it and run it exactly the
same way. Different trials with different people
o Control: must have control to avoid confounding variables
o Blocking: don’t have to block, but if you think you have to, you
should
• Statistically significant: difference is greater than what we would
expect to happen randomly. Can find statistical significance in an
observational study, you just can’t prove causation with one
• Placebo: method of control and blinding: like a pill with nothing in it
• In an experiment, you randomize the treatment, not the sample
• Control group: group that represents the norm (in an experiment)
• Placebo effect: taking sugar pill and thinking you are better
• Matched pairs: When we block into specific units to compare to each
other. Example: twins, before and after pictures
Unit 5:
The variability we see from sample to sample is the sampling error, or
sampling variability.
Sampling distribution with proportions:
Assumptions and conditions:
Sample values are independent
Sample size is large enough
Random condition (for experiments: randomize assignment. For study:
randomize sample)
Can’t just write: assumptions and conditions have been met. You need to be
specific about what exactly has been met.
Example: We know that 13% of the population is left handed. A 200 seat
auditorium has been built with 15 lefty seats that have the built-in desk on
the left rather than the right arm of the chair. In a class of 90 students,
what is the probability that there will be enough seats for the left handed
students?
Assumptions and Conditions:
• Assume each student is independent and that 90 students is a large
enough sample size.
• Though the 90 students are not a random sample, we proceed as if they
are.
• 90 is <10% of all students that may end up in the lecture
• (.13)(90)=11.7>10 successes, (.87)(90)=78.3>10 failures
Find SD($\hat{p}$) and conduct the z score work.
Find probability that >15 students are left handed.
$\hat{p}$ =15/90= .167
p =13/100=.13
$SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(.13)(.87)}{90}} = .0354$
$z = \frac{\hat{p} - p}{SD(\hat{p})} = \frac{.167 - .13}{.0354} = 1.05$
P(z > 1.05) is about .147, so the probability that there are enough lefty
seats is about .853.
Conditions in general: n<10% of population. We expect at least 10 successes
and 10 failures.
Given a population proportion, p:
$SD(\hat{p}) = \sqrt{\frac{pq}{n}}$
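The lefty-seat calculation can be sketched as follows. The code keeps full precision for 15/90, so its z-score comes out near 1.03 rather than the 1.05 obtained from the rounded value .167. The variable names are illustrative.

```python
import math
from statistics import NormalDist

p, n = 0.13, 90
q = 1 - p

# Success/failure condition: expect at least 10 of each
assert n * p >= 10 and n * q >= 10

p_hat = 15 / 90                      # cutoff: 15 lefty seats out of 90 students
sd_p_hat = math.sqrt(p * q / n)      # SD(p-hat) = sqrt(pq/n)
z = (p_hat - p) / sd_p_hat
p_too_many_lefties = 1 - NormalDist().cdf(z)
p_enough_seats = 1 - p_too_many_lefties

print(round(sd_p_hat, 4), round(z, 2), round(p_enough_seats, 3))
```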
Central Limit Theorem (CLT): With averages
• As the sample size gets larger, the sampling distribution of the mean gets
more normal
• Same conditions and assumptions except you don’t need np>10 and
nq>10.
• We use $\bar{x}$ as an estimate of $\mu$ (population mean)
$SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}$
Standard deviation of sampling distribution: “standard error”
Example: If the average GPA is 3.2 with a standard deviation of .8, what is
the probability that a picked person has a GPA > 4?
$z = \frac{4 - 3.2}{.8} = 1$
Example: What is the probability that the average is > 3.3 if the average is
taken among 25 students?
$z = \frac{3.3 - 3.2}{.8 / \sqrt{25}} = .625$
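The two GPA examples differ only in the standard deviation used: $\sigma$ for one person, $\sigma / \sqrt{n}$ for an average of n people. A minimal sketch, assuming GPAs are roughly normal and the standard deviation is .8 as in the z-score work:

```python
import math
from statistics import NormalDist

mu, sigma = 3.2, 0.8

# One randomly picked person: use sigma itself
z_single = (4 - mu) / sigma
p_single = 1 - NormalDist().cdf(z_single)

# Average of 25 students: use the standard deviation of x-bar, sigma / sqrt(n)
n = 25
z_mean = (3.3 - mu) / (sigma / math.sqrt(n))
p_mean = 1 - NormalDist().cdf(z_mean)

print(round(z_single, 3), round(p_single, 4))
print(round(z_mean, 3), round(p_mean, 4))
```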
*Make sure to look at former tests and quizzes to review.
Unit 6:
The first thing you need to know, is that there are two types of data
you can be given:
Percents
and
Averages
Note: Anything written on this sheet in pink pertains to percents, and
anything on this sheet written in green pertains to averages.
Example of a problem using percents: Out of 109 people polled, 10 said
they were in favor of the new policy. The percent is 9% ($\hat{p}$ represents
percents).
Example of a problem using averages: The average grade for fourth
period statistics is a 42. The average is a 42 ($\bar{x}$ represents averages)
Okay, there are four types of intervals/ tests that can be done with
percents, and five types of intervals/tests that can be done with
averages.
Everything I can do with percents:
• 1-proportion z-interval (Confidence Interval)
• 1-proportion z-test (Hypothesis Test)
• 2-proportion z-interval (Confidence Interval)
• 2-proportion z-test (Hypothesis Test)
Everything I can do with averages:
• 1-sample t-interval (Confidence Interval)
• 1-sample t-test (Hypothesis Test)
o Matched Pair Test (Hypothesis Test)
• 2-sample t-interval (Confidence Interval)
• 2-sample t-test (Hypothesis Test)
Before we explain the specifics of all of the nine intervals/tests, let’s
review exactly what all of these things are doing:
Using a sample ($\hat{p}$ or $\bar{x}$, depending on whether it is a percent or
average you are given) to make an estimate (confidence interval) or
judgment (hypothesis test) about a population parameter (p or $\mu$,
depending on whether it is a percent or average you are given).
We are now going to review everything pertaining to percents, or all of
your “proportion” z-tests and z-intervals.
1. The 1-proportion z-interval
When do I use it?
When you are asked to perform a confidence interval and you are only given
one percent. A confidence interval is an estimate of a population parameter.
The assumptions and conditions:
1. Independence
2. Randomization (random sample or randomized experiment)
3. $\hat{p}n \geq 10$ and $\hat{q}n \geq 10$
4. n<10% of the population
The Steps:
1) Write out your proportion
$\hat{p}$ = whatever percent you are given
2) Solve for your standard error
$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$
3) Find your z*
Ones you should know so you don’t always have to solve for it:
90%=1.65
95%=1.96
98%=2.33
He will usually not ask you for any other than those, but if you just want to
know how…
Example: Finding it for a 90% C.I.
.100-.90= .10
.10/2= .05
Invnorm (.05)= -1.65
Invnorm (.95<-.90+.05)= 1.65
It is 1.65.
4) Calculate your margin of error
$ME = z^* \times SE(\hat{p})$
5) Write out your confidence interval
$\hat{p} \pm ME$
6) Write out your conclusion
I am _% confident that the population proportion is between ____ and
____.
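The six steps above can be sketched in Python. This is only an illustration, not part of the calculator routine the guide assumes: the function name `one_prop_z_interval` is made up here, and `NormalDist().inv_cdf` from the standard library stands in for Invnorm. It reuses the poll example from earlier (10 in favor out of 109).

```python
import math
from statistics import NormalDist

def one_prop_z_interval(successes, n, confidence=0.95):
    """Return (p_hat, z_star, margin_of_error, (low, high))."""
    failures = n - successes
    # Success/failure condition: at least 10 of each.
    if successes < 10 or failures < 10:
        raise ValueError("need at least 10 successes and 10 failures")
    p_hat = successes / n                      # step 1
    q_hat = 1 - p_hat
    se = math.sqrt(p_hat * q_hat / n)          # step 2
    tail = (1 - confidence) / 2                # e.g. .10 / 2 = .05 for 90%
    z_star = NormalDist().inv_cdf(1 - tail)    # step 3, like Invnorm(.95)
    me = z_star * se                           # step 4
    return p_hat, z_star, me, (p_hat - me, p_hat + me)  # step 5

p_hat, z_star, me, (low, high) = one_prop_z_interval(10, 109, confidence=0.95)
z_star_90 = NormalDist().inv_cdf(0.95)         # z* for a 90% interval

# Step 6: the conclusion sentence.
print(f"I am 95% confident that the population proportion is "
      f"between {low:.3f} and {high:.3f}.")
```

The condition check uses the raw counts rather than n times the sample proportion, which avoids rounding trouble when the count is exactly 10.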
2. The 1-proportion z-test
When do I use it?
When I am given one percent and asked if I think the true percent is
actually greater than, less than, or simply not that percent.
The assumptions and conditions:
1. Independence
2. Randomization (random sample or randomized experiment)
3. $np \ge 10$ and $nq \ge 10$ (using the hypothesized $p$, with $q = 1 - p$)
4. n<10% of the population
Steps:
1) Write out the null and alternative hypotheses
$H_0: p =$ whatever the problem says the percent is
$H_A: p <, >, \text{ or } \ne$ whatever the problem says the percent is
2) Choose alpha level
Always make $\alpha = .05$
3) Write out knowns
You always know $\hat{p}$ and n
4) Calculate standard deviation
$SD(\hat{p}) = \sqrt{\dfrac{pq}{n}}$
5) Calculate z-score and find p-value
$z = \dfrac{\hat{p} - p}{SD(\hat{p})}$
Then find the probability of getting a z-score greater than or less
than the one you calculated, depending on your alternative hypothesis. Or,
if you are testing whether p is simply not equal to a given value, you do a
two-tailed test: find the tail probability beyond your z-score and
multiply it by 2. You use normalcdf for this just as you would for any z-score.
Your answer is your p-value.
If your p-value is less than alpha:
In repeated samples, and assuming the null hypothesis is true, we would
expect results at least as extreme as these (p-value) percent of the time.
Since the p-value is less than alpha, I reject the null hypothesis. There is
evidence that the alternative hypothesis is true.
Possibility of a: Type I Error.
If your p-value is greater than alpha:
In repeated samples, and assuming the null hypothesis is true, we would
expect results at least as extreme as these (p-value) percent of the time.
Since the p-value is greater than alpha, we fail to reject the null hypothesis.
There is not enough evidence that the alternative hypothesis is true.
Possibility of a: Type II Error.
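Here is a rough Python sketch of the 1-proportion z-test steps, with `NormalDist().cdf` playing the role of normalcdf. The function name `one_prop_z_test` and the data (60 successes out of 100, testing against p = .50) are hypothetical and only there to show the arithmetic.

```python
import math
from statistics import NormalDist

def one_prop_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for H0: p = p0."""
    p_hat = successes / n
    # The standard deviation uses the hypothesized p, not p-hat.
    sd = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd
    normal = NormalDist()
    if alternative == "greater":        # H_A: p > p0
        p_value = 1 - normal.cdf(z)
    elif alternative == "less":         # H_A: p < p0
        p_value = normal.cdf(z)
    else:                               # H_A: p != p0, double one tail
        p_value = 2 * (1 - normal.cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 of 100 say yes; is the true percent different from 50%?
z, p_value = one_prop_z_test(60, 100, p0=0.5)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p-value = {p_value:.4f}, so {decision}")
```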
3. The 2-proportion z-interval
You simply do:
$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\dfrac{\hat{p}_1\hat{q}_1}{n_1} + \dfrac{\hat{p}_2\hat{q}_2}{n_2}}$
When do I use this?
When I am given two percents (one from each of two samples) and asked to
estimate the difference between them.
Note* The assumptions and conditions are the same as for a 2-proportion
z-test; you just use $\hat{p}$ in a confidence interval assumption and $p$ in a
hypothesis test assumption.
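A small Python sketch of the 2-proportion z-interval formula, for checking calculator work. The function name and the counts (45 of 100 versus 30 of 100) are hypothetical.

```python
import math
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 from success counts x1 and x2."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error: each sample keeps its own p-hat.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    me = z_star * se
    diff = p1 - p2
    return diff - me, diff + me

# Hypothetical data: 45 of 100 in group 1, 30 of 100 in group 2.
low, high = two_prop_z_interval(45, 100, 30, 100)
print(f"95% C.I. for p1 - p2: ({low:.3f}, {high:.3f})")
```

If the interval does not contain 0, that suggests the two population percents really are different.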
4. The 2-proportion z-test
Assumptions and Conditions:
1. Both samples must be simple random samples
2. The samples are chosen independently
3. Each sample is less than 10% of the population
4. $n_1 p_1 \ge 10$, $n_1 q_1 \ge 10$, $n_2 p_2 \ge 10$, and $n_2 q_2 \ge 10$
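For 2-proportion z-tests ONLY! You use something called $\hat{p}_{pooled}$:
$\hat{p}_{pooled} = \dfrac{x_1 + x_2}{n_1 + n_2}$
Here $x_1$ and $x_2$ are the numbers of successes in each sample
($x_1 = n_1\hat{p}_1$ and $x_2 = n_2\hat{p}_2$), not the percents themselves.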
Your standard error pooled is then found by using:
$SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\dfrac{\hat{p}_{pooled}\,\hat{q}_{pooled}}{n_1} + \dfrac{\hat{p}_{pooled}\,\hat{q}_{pooled}}{n_2}}$
You then find your z-score using:
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_{pooled}}$
*Note, the reason it is minus zero is because in all 2-proportion and
2-sample tests the null hypothesis is that the difference between the
two percents or averages is 0. ALWAYS.
You then find your p-value the same way you would for the 1-proportion
tests.
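The pooled calculation is easy to slip on by hand, so here is a hedged Python sketch of it. The function name and the counts (45 of 100 versus 30 of 100) are hypothetical, and the p-value shown is for the two-tailed case.

```python
import math
from statistics import NormalDist

def two_prop_z_test(x1, n1, x2, n2):
    """Pooled 2-proportion z-test of H0: p1 - p2 = 0 (two-tailed)."""
    p1, p2 = x1 / n1, x2 / n2
    # Pool the successes from both samples, since H0 says p1 = p2.
    p_pooled = (x1 + x2) / (n1 + n2)
    q_pooled = 1 - p_pooled
    se_pooled = math.sqrt(p_pooled * q_pooled / n1 + p_pooled * q_pooled / n2)
    z = ((p1 - p2) - 0) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_pooled, z, p_value

# Hypothetical data: 45 of 100 in group 1, 30 of 100 in group 2.
p_pooled, z, p_value = two_prop_z_test(45, 100, 30, 100)
print(f"p pooled = {p_pooled:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
```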
Alright, time for t-tests/intervals
1. The 1-sample t-interval
When do I use this?
When you are given one mean.
Assumptions and Conditions:
1. Random sample or random assignment of treatments
2. Independence
3. n<10% of the population
4. If n<15, distribution must be nearly normal. If 15<n<40, distribution
must be unimodal and symmetric. If n>40, any distribution without
extreme skewness. Always draw a histogram of your data to show the
distribution!
Finding a confidence interval for a one-sample t-interval is extremely
easy. You just do:
$\bar{X} \pm t^* \times SE(\bar{X})$
$\bar{X}$ = whatever average you are given
$t^*$ = found with the program inv-t (if Wieland has not programmed this into
your calculator you need to get it programmed asap). It will ask for
degrees of freedom, which is just n-1.
$SE(\bar{X}) = \dfrac{s_X}{\sqrt{n}}$
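For anyone without the inv-t program handy, the sketch below shows one way to get t* and the interval in plain Python. It is a stand-in, not the calculator program: `t_star` finds the critical value by numerically integrating the t density (Simpson's rule) and bisecting, and all function names and the grade data are hypothetical.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_star(confidence, df):
    """Critical value t* (a stand-in for the calculator's inv-t)."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:    # widen until t* is bracketed
        hi *= 2
    for _ in range(60):              # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def one_sample_t_interval(data, confidence=0.95):
    n = len(data)
    x_bar = mean(data)
    se = stdev(data) / math.sqrt(n)          # SE = s / sqrt(n)
    me = t_star(confidence, n - 1) * se      # df = n - 1
    return x_bar - me, x_bar + me

# Hypothetical sample of 10 grades.
grades = [42, 38, 45, 50, 41, 39, 44, 47, 40, 44]
low, high = one_sample_t_interval(grades)
print(f"95% C.I. for the mean: ({low:.2f}, {high:.2f})")
```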
2. The 1-sample t-test
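$H_0: \mu =$ something
$H_A: \mu <, >, \text{ or } \ne$ something
$t = \dfrac{\bar{X} - \mu}{SE(\bar{X})}$
You then do the same thing you would do with a z-score to find the p-value,
except you use the t-distribution with n-1 degrees of freedom.
The conclusions are found the same way!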
Note* Matched pair tests are when you use one sample and measure each
individual twice, changing something between the two measurements. It is
done exactly like a 1-sample t-test on the differences, you just have:
$H_0: \mu_D = 0$
$H_A: \mu_D <, >, \text{ or } \ne 0$
The average difference = $\bar{D}$
*There are the same assumptions as you would have for a 1-sample t-test.
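Since a matched pair test is just a 1-sample t-test on the differences, one function covers both. The sketch below is an illustration under assumptions: the t-distribution probability is approximated with Simpson's rule rather than a calculator, and the function names and before/after scores are hypothetical.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def one_sample_t_test(data, mu0, alternative="two-sided"):
    """Return (t, df, p_value) for H0: mu = mu0."""
    n = len(data)
    se = stdev(data) / math.sqrt(n)
    t = (mean(data) - mu0) / se
    df = n - 1
    if alternative == "greater":
        p_value = 1 - t_cdf(t, df)
    elif alternative == "less":
        p_value = t_cdf(t, df)
    else:
        p_value = 2 * (1 - t_cdf(abs(t), df))
    return t, df, p_value

# Hypothetical matched pairs: the same 8 students before and after a review.
before = [72, 80, 65, 90, 78, 85, 70, 88]
after = [75, 84, 66, 93, 80, 85, 74, 91]
differences = [a - b for a, b in zip(after, before)]

# H0: mu_D = 0 versus H_A: mu_D > 0.
t, df, p_value = one_sample_t_test(differences, mu0=0, alternative="greater")
print(f"t = {t:.2f}, df = {df}, p-value = {p_value:.5f}")
```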
3. The 2-sample t-interval
The only thing that changes as far as the assumptions and conditions is you
make two histograms instead of one, because there are two samples.
$(\bar{X}_1 - \bar{X}_2) \pm t^* \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
4. The 2-sample t-test
$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Note* You do not pool with t-tests
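A quick Python sketch of the unpooled 2-sample t statistic from summary statistics. The guide does not say which degrees of freedom to use, so the Welch-Satterthwaite formula below is an assumption added for illustration (it is a standard choice for unpooled tests). The function name and the summary numbers are hypothetical.

```python
import math

def two_sample_t(x_bar1, s1, n1, x_bar2, s2, n2, null_diff=0):
    """Return (se, t, df) for the unpooled 2-sample t-test."""
    v1 = s1 ** 2 / n1            # s1^2 / n1
    v2 = s2 ** 2 / n2            # s2^2 / n2
    se = math.sqrt(v1 + v2)      # no pooling of the two samples
    t = ((x_bar1 - x_bar2) - null_diff) / se
    # Assumption: Welch-Satterthwaite degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, t, df

# Hypothetical summaries: class 1 (mean 85, s = 6, n = 36),
# class 2 (mean 82, s = 8, n = 64).
se, t, df = two_sample_t(85, 6, 36, 82, 8, 64)
print(f"SE = {se:.4f}, t = {t:.4f}, df = {df:.1f}")
```

The same `se` goes into the 2-sample t-interval; only the t* multiplier is new there.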
That’s not that hard, right? Okay, now here is everything you need to
know about errors.
A Type I error occurs when you reject a null hypothesis that is actually true.
A Type II error occurs when you fail TO reject a null hypothesis that is
actually false. (II, TO, yeah, you get it)
You:                    Prob. of Type I    Prob. of Type II    Power
Increase sample size    No change          Decrease            Increase
Increase alpha          Increase           Decrease            Increase
Power is $1 - \beta$, and $\beta$ is the probability of a Type II error. This makes power
the probability of NOT having a Type II error, that is, of correctly rejecting
a false null hypothesis.
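To see the table's pattern in numbers, here is a rough Python sketch that approximates power for a one-sided 1-proportion z-test using the normal model. Everything specific in it is an assumption for illustration: the function name, the null value p = .50, and the "true" value p = .60.

```python
import math
from statistics import NormalDist

def power_one_prop_upper(p0, p_true, n, alpha=0.05):
    """Approximate power of the test H0: p = p0 versus H_A: p > p0."""
    normal = NormalDist()
    # Smallest sample proportion that leads to rejecting H0.
    cutoff = p0 + normal.inv_cdf(1 - alpha) * math.sqrt(p0 * (1 - p0) / n)
    # Chance of landing above that cutoff when the truth is p_true.
    sd_true = math.sqrt(p_true * (1 - p_true) / n)
    return 1 - normal.cdf((cutoff - p_true) / sd_true)

# Hypothetical setup: H0 says p = .50 but the truth is p = .60.
base = power_one_prop_upper(0.5, 0.6, n=50, alpha=0.05)
bigger_n = power_one_prop_upper(0.5, 0.6, n=200, alpha=0.05)
bigger_alpha = power_one_prop_upper(0.5, 0.6, n=50, alpha=0.10)

for label, power in [("n=50, alpha=.05", base),
                     ("n=200, alpha=.05", bigger_n),
                     ("n=50, alpha=.10", bigger_alpha)]:
    beta = 1 - power   # probability of a Type II error
    print(f"{label}: power = {power:.3f}, beta = {beta:.3f}")
```

Both changes raise power and lower beta, matching the two rows of the table.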