AP Statistics







Chapter Reminders
Chapter 2 Reminders
A categorical variable places individuals in a category.
A quantitative variable has a numerical value that measures something.
Quantitative data should have units.
Just because a value is a number, don’t assume that it is a quantitative variable.
A statistic is a numerical summary of data.
Know the difference between statistics and data.
Know the meaning of univariate and bivariate analysis.
Chapter 3 Reminders



ALWAYS make a picture.
Know how to create and interpret these graphs: bar charts, pie charts, contingency tables
(also called two-way tables), segmented bar charts
Two way tables: Know how to find marginal distributions and conditional distributions
Example: A group of students were asked if they preferred the number 2 or the number 5.
Then they were asked if they preferred the color blue or the color green. The results are given
below.
              Blue   Green
Prefers 2       18       7
Prefers 5        6      15
The marginal distribution for color preference is:
Color preference    Blue   Green
Count                 24      22
The marginal distribution for number preference is:
Number preference      2       5
Count                 25      21
The conditional distribution of those who chose blue is:
Number preference      2       5
Count                 18       6
The conditional distribution of those who did not choose blue is:
Number preference      2       5
Count                  7      15
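
The marginal and conditional distributions above can be checked with a short Python sketch. The dictionary layout and variable names are illustrative choices, not part of the original notes; the conditional distribution is shown as proportions of the blue column total.

```python
# Counts from the example: rows are number preference, columns are color preference.
table = {
    "2": {"Blue": 18, "Green": 7},
    "5": {"Blue": 6, "Green": 15},
}

# Marginal distributions: total each row and each column.
number_marginal = {num: sum(row.values()) for num, row in table.items()}
color_marginal = {
    color: sum(table[num][color] for num in table) for color in ("Blue", "Green")
}

# Conditional distribution of number preference among those who chose blue,
# written as proportions of the blue column total.
blue_total = color_marginal["Blue"]
number_given_blue = {num: table[num]["Blue"] / blue_total for num in table}

print(number_marginal)    # {'2': 25, '5': 21}
print(color_marginal)     # {'Blue': 24, 'Green': 22}
print(number_given_blue)  # {'2': 0.75, '5': 0.25}
```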

Don’t confuse similar sounding percentages/proportions: The proportion of American men who
are US Senators is very small. The proportion of US Senators who are American men is very
large.
Chapters 4 – 5 Reminders

When describing a distribution always mention:
shape (symmetric, right skewed, left skewed, uniform, bimodal, multimodal, ...)
center (median or mean)
spread (range, IQR, or standard deviation)
any unusual characteristics (gaps, clusters, possible outliers, ...)

Know how to create and interpret these graphs: dotplots, stemplots (also called stem-and-leaf
plots), histograms, relative frequency histograms, boxplots, cumulative frequency graphs (also
called ogives).
**Title and Label all graphs**

Know how to find the mean and median when given a frequency table.

In a skewed distribution, the mean is farther out in the tail than the median.

Know how to locate the mode(s) on a histogram.




Know how to find the first quartile (Q1) and the third quartile (Q3).
Interquartile range (IQR) = Q3 - Q1
Outliers are observations less than Q1 - (1.5)(IQR) or greater than
Q3 + (1.5)(IQR)
The five number summary is: min, Q1, median, Q3, max
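
A minimal Python sketch of the five number summary and the 1.5(IQR) outlier rule. The data are made up, and the quartiles are taken as the medians of the lower and upper halves (excluding the middle value when n is odd), which is an assumption: other quartile conventions give slightly different answers.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]              # values below the median position
    upper = xs[half + n % 2:]      # values above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [2, 4, 5, 7, 8, 9, 10, 12, 30]   # made-up data
minimum, q1, med, q3, maximum = five_number_summary(data)

iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(minimum, q1, med, q3, maximum)   # 2 4.5 8 11.0 30
print(outliers)                        # [30]
```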

Know how to find variance and standard deviation of a set of data:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
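
As a quick check of the sample variance and standard deviation formulas, here is a small Python sketch on made-up data. It computes the sum of squared deviations by hand and compares with the standard library's `statistics.variance`, which also divides by n - 1.

```python
from math import sqrt
from statistics import mean, variance

data = [2, 4, 4, 5, 7, 8]          # made-up data
n = len(data)
x_bar = mean(data)                 # sample mean

# Sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = sqrt(s2)

print(s2)                          # 4.8
print(round(s, 4))                 # 2.1909
```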


Typically, use mean and standard deviation when distribution is relatively mound shaped.
Use the five number summary when distribution is skewed.

Know which measures are resistant to extreme values.
Chapter 6 Reminders

A density curve is a curve that
- is always on or above the horizontal axis
- has an area exactly 1 underneath it

Normal distributions are denoted as N(μ, σ) where μ = population mean and σ = population
standard deviation

68-95-99.7 rule (also known as the Empirical Rule) :
**True for Normal distributions only.**
- 68% of observations fall within 1 σ of μ
- 95% of observations fall within 2 σ of μ
- 99.7% of observations fall within 3 σ of μ

In a normal distribution the points of inflection are at μ - σ and μ + σ .

The larger σ, the flatter the curve -- The smaller σ, the taller the curve

normalcdf(lowerbound, upperbound) gives the proportion of observations between those
two points on the standard normal curve N(0,1).

normalcdf(lowerbound, upperbound, mean, standard deviation) gives the proportion of
observations between those two points for any normal curve
**Notice the c: use normalcdf, not normalpdf.**
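
For readers without the calculator at hand, a rough Python stand-in for normalcdf can be built from the standard library's `statistics.NormalDist`. The function below only borrows the calculator's name and argument order for illustration; it is not the calculator's implementation.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Proportion of a N(mean, sd) distribution between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# The 68-95-99.7 rule on the standard normal curve:
print(round(normalcdf(-1, 1), 4))   # 0.6827
print(round(normalcdf(-2, 2), 4))   # 0.9545
print(round(normalcdf(-3, 3), 4))   # 0.9973
```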

Standardized values: $z = \frac{x - \mu}{\sigma}$

- Know how to use the standard normal table
- Know how to find the percentile for an observation value
Know how to get an observation when given a proportion of values under the curve.
Example: Find the 90th percentile for the normal distribution with mean 50
and standard deviation 6.
- Find z using the table backwards or invNorm(proportion to the left):
invNorm(.9) ≈ 1.28
- Plug μ, σ, and z into the formula to find x:
$z = \frac{x - \mu}{\sigma} \Rightarrow 1.28 = \frac{x - 50}{6} \Rightarrow x = 57.68$
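
The same 90th percentile can be reproduced in Python with `statistics.NormalDist.inv_cdf`, used here as a stand-in for invNorm. Because the code keeps the unrounded z (about 1.2816 rather than 1.28), it gives x of about 57.69 instead of 57.68.

```python
from statistics import NormalDist

mu, sigma = 50, 6

# Step 1: z-score with 90% of the area to its left on N(0, 1).
z = NormalDist().inv_cdf(0.9)

# Step 2: solve z = (x - mu) / sigma for x.
x = mu + z * sigma

# Same answer in one step, asking the N(50, 6) curve directly.
direct = NormalDist(mu, sigma).inv_cdf(0.9)

print(round(z, 2), round(x, 2))   # 1.28 57.69
```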

Know how to use a normal probability plot to determine if a set of data is likely to have come
from a normal distribution.
Chapters 7 – 10 Reminders


A response variable measures an outcome of a study (y)
An explanatory variable attempts to explain the observed outcomes (x)

Scatterplots: **Title, label and mark axes**
If there is an explanatory variable it always goes on the x axis.
Two variables are positively associated when large values of one go with large values of the
other
Two variables are negatively associated when large values of one go with small values of the
other



When describing a scatterplot mention: strength (strong, weak, ...), direction (positive or
negative), and form (linear, exponential, …) and unusual features.

The correlation coefficient, r
- measures the strength and direction of the linear relationship between two quantitative variables, x and y.
- has no unit of measure.
- is always between -1 and 1.
- does not change when the unit of measure changes or if the variables are exchanged.

Correlation …
-is strongly affected by extreme observations
-helps establish association but not causation
-is not a complete description of two variable data
-Don’t say ‘correlation’ unless you mean r.

The coefficient of determination, r², is the fraction of the variation in the values of y that is
explained by the least-squares regression of y on x.

Know how to graph a scatterplot on the calculator.

Know how to use the calculator to find LSRL when given the data.
- Stat-Calc-#8 (x-list, y-list) (L1 and L2 are the default lists)
- Turn Diagnostic On to get r in the output (Use catalog to find Diagnostic On)

Find LSRL formulas on formula sheet

The center of gravity (x̄, ȳ) is always on the LSRL

Residuals
- Residual = observed y - predicted y = y - ŷ
- The sum of the residuals is always 0.
- The sum of the residuals squared is always smaller than it would be for any other line.
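
The residual facts above can be checked numerically. This is a sketch on made-up data, using the usual least-squares slope and intercept formulas written out by hand (an assumption about which form to use; the formula sheet may present them differently).

```python
from statistics import mean

x = [1, 2, 3, 4, 5]                 # made-up explanatory values
y = [2.1, 3.9, 6.2, 7.8, 10.1]      # made-up response values

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                      # slope of the LSRL
b0 = y_bar - b1 * x_bar             # intercept of the LSRL

def sum_sq_resid(intercept, slope):
    """Sum of squared residuals for the line y-hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(abs(sum(residuals)) < 1e-9)                        # residuals sum to 0
print(abs((b0 + b1 * x_bar) - y_bar) < 1e-9)             # (x-bar, y-bar) is on the line
print(sum_sq_resid(b0, b1) < sum_sq_resid(b0, b1 + 0.1)) # a different line does worse
```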

Residual plot: A scatterplot of the x (explanatory variable) values and the residuals.

Residual plots of “good” models
-have close to the same number of positive and negative values
-are scattered with no pattern
-have small residuals


An outlier in this context is a point whose residual is an outlier compared to the other
residuals. (a point that falls outside the general pattern of the plot)
An influential point is a point for which the slope of the LSRL changes a good bit if it is
removed. (usually far in the horizontal direction)

Know how to interpret the information in a LSRL computer printout.

Exponential model: plot (x, log y)
$\log \hat{y} = b_1 x + b_0$
$\hat{y} = 10^{b_0} \cdot 10^{b_1 x}$

Power functions: plot (log x, log y)
$\log \hat{y} = b_1 \log x + b_0$
$\hat{y} = 10^{b_0} \cdot x^{b_1}$

You’d have to have a lot of
POWER to pick up two logs!
Ha ha ha!!
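
A short Python sketch of the exponential model: fit a line to (x, log y), then back-transform. The data are made up to follow y = 3(2^x) exactly, and `fit_line` is a hypothetical helper written for this example.

```python
from math import log10
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for ys against xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

x = [0, 1, 2, 3, 4]
y = [3, 6, 12, 24, 48]              # made-up data: y = 3 * 2**x

# Linearize: regress log y on x.
b0, b1 = fit_line(x, [log10(v) for v in y])

def predict(x_new):
    """Back-transform: y-hat = 10**b0 * 10**(b1 * x)."""
    return 10 ** b0 * 10 ** (b1 * x_new)

print(round(10 ** b0, 6), round(10 ** b1, 6))   # 3.0 2.0
print(round(predict(5), 6))                      # 96.0
```

For a power function the only change would be to regress log y on log x and back-transform with x raised to the slope.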

A lurking variable is a variable that has an important effect on the relationship among the
variables in a study but is not included among the variables studied.

Extrapolation is the practice of using the regression equation to predict outside the domain of
the explanatory values that were used to form the line. **It is not recommended.**

Association does not imply causation

Causation: Changes in x cause changes in y.
Chapter 11 Reminders
Simulations
You must include the following:
1. State the problem or describe the experiment
2. State assumptions (usually something about probabilities of outcomes and each trial
being independent)
3. Explain process in detail (include digit assignment, any ignored digits,
stopping rule, what is counted, replacement issues)
4. Simulate “many” times
5. State conclusions.
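
The five steps can be sketched in Python. The problem itself (how many rolls of a fair die until the first 6) is a made-up example, and a seeded random number generator stands in for a random digit table.

```python
import random

# 1. Problem: on average, how many rolls of a die does it take to get the first 6?
# 2. Assumptions: the die is fair (each face has probability 1/6); rolls are independent.
# 3. Process: draw an integer 1-6 for each roll; stop at the first 6; count the rolls.
def rolls_until_six(rng):
    count = 0
    while True:
        count += 1
        if rng.randint(1, 6) == 6:
            return count

# 4. Simulate many times (seeded so the run is repeatable).
rng = random.Random(2024)
trials = [rolls_until_six(rng) for _ in range(10_000)]

# 5. Conclusion: report the average number of rolls needed.
estimate = sum(trials) / len(trials)
print(f"Estimated average number of rolls: {estimate:.2f}")
```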
Chapter 12 Reminders

Types of samples:
Simple Random Sample (SRS): subjects are selected without replacement, every individual
has an equal chance of being chosen, and every subgroup of the same size has an equal chance
of being the subgroup chosen
Voluntary response sample: people choose themselves by responding
Convenience sample: chooses people easiest to reach
Probability sample: gives each member of the population a known chance (>0) to be selected.
Stratified random sample: divide population into groups of similar individuals, then choose a
separate SRS from each group, combine them to form the full sample
Multistage cluster sample: Example: 1. choose counties from all the counties in the US,
2. choose towns in each chosen county, 3. choose subdivisions within each town, 4. choose
households within each subdivision
Quota: Subjects are chosen around categories (age, gender, …) according to known
demographic information
Systematic sample: Example: Choose #1, #51, #101, …

A sampling frame is the actual list of possible subjects. Ideally, the sampling frame should
include everyone in the population.
The placebo effect occurs when subjects have some type of different response
(improvement) that is not due to the treatment itself – maybe thinking they are receiving a
treatment causes some improvement
A census is a method of collecting data from all members of the population.
A study is biased if it systematically favors certain outcomes.
Types of bias:
- Undercoverage bias: when some groups are left out of the sampling process
- Nonresponse bias: when someone refuses to participate
- Response bias: people not giving reliable responses
- Measurement bias: the way measurements are taken favors particular results
Chapter 13 Reminders


Observational studies observe individuals or measure variables of interest but do not attempt
to influence responses.
Experiments
- impose some type of treatment
- are the only source of fully convincing data when trying to determine cause and effect.

Principles of experimental design:
1. Control of effects of lurking variables
2. Randomization
3. Replication

Types of experimental design
- block design: similar to the stratified sampling design
- matched pairs design: data from two samples are paired, differences are found, one
sample t-procedures are used

Terms
- factor: the explanatory variables
- treatment: the specific experimental condition
- experimental units: what the treatment is imposed on
- subject: human experimental units
- double blind: neither the units themselves nor anyone working directly with
them knows which group (control or treatment) the units are in
- confounding: Two variables are confounded if we can’t separately identify their
effects on the response variable.
Chapter 14 – 15 Reminders

Random does not mean haphazard

Permutations (order matters): $P(n, r) = \frac{n!}{(n - r)!}$

Combinations (order does not matter): $C(n, r) = \frac{n!}{r!\,(n - r)!}$

Probability: $\frac{\text{number of favorable outcomes}}{\text{number of outcomes}}$

Tree diagrams: Example: Roll a die then toss a coin.
P(1,H) =
P(1|H) =
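
A Python sketch of the counting formulas and the die-then-coin tree. It assumes a fair die and a fair coin (the notes do not state this), so that all 12 branches of the tree are equally likely; n = 5 and r = 2 are arbitrary illustration values.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial, perm

n, r = 5, 2
print(perm(n, r), factorial(n) // factorial(n - r))                    # 20 20
print(comb(n, r), factorial(n) // (factorial(r) * factorial(n - r)))   # 10 10

# Every branch of the tree: (die face, coin side).
outcomes = list(product(range(1, 7), "HT"))

# P(1, H): favorable outcomes over all outcomes.
p_1_and_h = Fraction(sum(1 for o in outcomes if o == (1, "H")), len(outcomes))

# P(1 | H): restrict attention to the branches that end in heads.
heads = [o for o in outcomes if o[1] == "H"]
p_1_given_h = Fraction(sum(1 for o in heads if o[0] == 1), len(heads))

print(p_1_and_h, p_1_given_h)   # 1/12 1/6
```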

Terms:
- complement – A’ or Ac denote the complement of A
- union – ‘or’, ‘∪’
- intersection – ‘and’, ‘∩’
- disjoint (mutually exclusive) – can’t occur together
- conditional event – P(B|A) means the probability of B given A
- independent – A and B are independent if P(A) = P(A|B) = P(A|B’)
- sample space – set of all possible outcomes

P(A ∩ B) = P(A) * P(B) if and only if A and B are independent.

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Two way table probabilities: AP Statistics students were asked to select their favorite from
each of the following lists: {mountains, beach} and {fall, spring}. The results are described
below:
            Mountains   Beach
Fall            2         1
Spring          4         2

P(fall) = 3/9 = 1/3
P(fall | mountains) = 2/6 = 1/3
(Completely ignore the beach column.)
P(mountains | fall) = 2/3
(Completely ignore the spring row.)
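
The conditional probability formula can be tied to the table with a small Python sketch. The dictionary layout is an illustrative choice; the counts are the ones in the table, and the example works P(mountains | fall) both ways.

```python
from fractions import Fraction

# Counts from the two-way table, keyed by (season, place).
counts = {
    ("fall", "mountains"): 2, ("fall", "beach"): 1,
    ("spring", "mountains"): 4, ("spring", "beach"): 2,
}
total = sum(counts.values())

# P(fall): the fall row total over the grand total.
fall_total = sum(v for (season, _), v in counts.items() if season == "fall")
p_fall = Fraction(fall_total, total)

# P(mountains | fall) via the formula P(A and B) / P(B) ...
p_fall_and_mountains = Fraction(counts[("fall", "mountains")], total)
p_mountains_given_fall = p_fall_and_mountains / p_fall

# ... and via the shortcut of looking only at the fall row.
shortcut = Fraction(counts[("fall", "mountains")], fall_total)

print(p_fall, p_mountains_given_fall, shortcut)   # 1/3 2/3 2/3
```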
Chapter 16 Reminders

mean of a random variable X: μX (population mean)

mean of several actual values of X: x̄ (sample mean)

The mean is also called the expected value.

Mean and variance of a discrete random variable:
X             x1   x2   …   xn
Probability   p1   p2   …   pn

μX = x1p1 + x2p2 + ... + xnpn
$\sigma_X^2 = (x_1 - \mu_X)^2 p_1 + (x_2 - \mu_X)^2 p_2 + \cdots + (x_n - \mu_X)^2 p_n$
$\sigma_X = \sqrt{\sigma_X^2}$
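
A worked instance of the mean and variance of a discrete random variable, as a Python sketch. The distribution is made up, and exact fractions are used so the arithmetic is easy to follow.

```python
from fractions import Fraction
from math import sqrt

# Made-up probability distribution for X.
values = [0, 1, 2, 3]
probs = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
assert sum(probs) == 1              # probabilities must total 1

# Mean: each value weighted by its probability.
mu_x = sum(x * p for x, p in zip(values, probs))

# Variance: each squared deviation from the mean weighted by its probability.
var_x = sum((x - mu_x) ** 2 * p for x, p in zip(values, probs))
sd_x = sqrt(var_x)

print(mu_x, var_x, sd_x)            # 2 1 1.0
```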
Law of Large Numbers: As the number of observations increases, x̄ approaches μX (and stays
that close)
Rules for means
$\mu_{a + bX} = a + b\mu_X$
$\mu_{X + Y} = \mu_X + \mu_Y$

Rules for variances
$\sigma^2_{a + bX} = b^2 \sigma^2_X$
(X and Y must be independent):
$\sigma^2_{X + Y} = \sigma^2_X + \sigma^2_Y$
$\sigma^2_{X - Y} = \sigma^2_X + \sigma^2_Y$

Standard deviations do not add, variances do (even with subtraction)
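
These rules can be verified exactly on a small case. The sketch below takes X and Y to be two independent fair dice (a made-up example) and uses a = 3, b = 2 for the linear rule; `combine` is a hypothetical helper that builds the distribution of X + Y or X - Y.

```python
from fractions import Fraction
from itertools import product

def mean_and_variance(dist):
    """dist maps each value to its probability."""
    mu = sum(v * p for v, p in dist.items())
    var = sum((v - mu) ** 2 * p for v, p in dist.items())
    return mu, var

def combine(dist_x, dist_y, op):
    """Distribution of op(X, Y) when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        out[op(x, y)] = out.get(op(x, y), 0) + px * py
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}

mu_x, var_x = mean_and_variance(die)
mu_sum, var_sum = mean_and_variance(combine(die, die, lambda x, y: x + y))
mu_diff, var_diff = mean_and_variance(combine(die, die, lambda x, y: x - y))
mu_lin, var_lin = mean_and_variance({3 + 2 * v: p for v, p in die.items()})

print(var_x, var_sum, var_diff)     # 35/12 35/6 35/6
print(mu_lin, var_lin)              # 10 35/3
```

Note that the variance of the difference equals the variance of the sum: subtracting an independent variable still adds variability.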
Chapter 17 Reminders

The Binomial Setting
B Binary outcomes - just two possibilities “success” and “failure”
I Independence - the n observations are independent
N Number of observations is fixed
S Same probability of a success for each trial

Binomial distribution: B(n,p) where n = number of trials, p = probability of a success

The variable of interest, X, is the number of successes in the n trials.

The probability that X = k, P(X=k) is obtained by binompdf(n, p, k)

The probability that X ≤ k, P(X ≤ k) is obtained by binomcdf(n, p, k)

Mean of a Binomial Random Variable μ = np

Standard deviation of a Binomial Random Variable: $\sigma = \sqrt{np(1 - p)}$
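
For reference, the binomial calculations can be written out in Python. The functions below reuse the calculator names binompdf and binomcdf only for illustration; they are hand-written stand-ins, and n = 10, p = 0.5 are made-up values.

```python
from math import comb, sqrt

def binompdf(n, p, k):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomcdf(n, p, k):
    """P(X <= k) for X ~ B(n, p)."""
    return sum(binompdf(n, p, i) for i in range(k + 1))

n, p = 10, 0.5
mu = n * p                          # mean
sigma = sqrt(n * p * (1 - p))       # standard deviation

print(binompdf(n, p, 5))            # 0.24609375
print(binomcdf(n, p, 5))            # 0.623046875
print(mu, round(sigma, 4))          # 5.0 1.5811
```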

The Geometric Setting
1. just two possibilities “success” and “failure”
2. Independence - the observations are independent
3. Same probability of a success for each trial

The variable of interest, X, is the number of trials necessary to get first success.

The probability that X = k, P(X=k) is obtained by geometpdf(p, k)

The probability that X ≤ k, P(X≤k) is obtained by geometcdf(p, k)

Mean of a Geometric Random Variable: $\mu = \frac{1}{p}$
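
The geometric probabilities follow the same pattern. Again the function names mirror the calculator's geometpdf and geometcdf for illustration only, and p = 0.25 is a made-up value.

```python
def geometpdf(p, k):
    """P(X = k): k - 1 failures followed by the first success."""
    return (1 - p) ** (k - 1) * p

def geometcdf(p, k):
    """P(X <= k): one minus the chance that the first k trials all fail."""
    return 1 - (1 - p) ** k

p = 0.25
mu = 1 / p                          # mean number of trials to the first success

print(geometpdf(p, 3))              # 0.140625
print(geometcdf(p, 3))              # 0.578125
print(mu)                           # 4.0
```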
Chapter 18 Reminders

A parameter describes a population.

A statistic is a number obtained from a sample.

A sampling distribution of a statistic is the distribution of values taken by the statistic in many
samples of the same size from the same population.

A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution equals
the parameter.

p represents a population proportion

p̂ represents a sample proportion


The sampling distribution of p̂
- is close to normal when n is large (np ≥ 10, n(1-p) ≥ 10)
- has mean = p and standard deviation = $\sqrt{\frac{p(1 - p)}{n}}$
μ represents a population mean

x̄ represents a sample mean

The sampling distribution of x̄
- is normal if X has a normal distribution
- is close to normal if n ≥ 30, regardless of distribution of X.
- has mean = μ and standard deviation = $\frac{\sigma}{\sqrt{n}}$
The Central Limit Theorem:
As the sample size increases, the sampling distribution of x̄ approaches a normal distribution
– regardless of the distribution of X.
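
A simulation sketch of the sampling distribution of the sample mean. The population is assumed to be exponential with mean 1 and standard deviation 1 (a strongly right-skewed, made-up choice), with n = 40 and a seeded generator so the run is repeatable.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(1)
n = 40                  # sample size
num_samples = 5000      # number of repeated samples

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(num_samples)
]

center = mean(sample_means)         # should be near mu = 1
spread = stdev(sample_means)        # should be near sigma / sqrt(n)

print(f"mean of sample means: {center:.3f} (mu = 1)")
print(f"sd of sample means:   {spread:.3f} (sigma/sqrt(n) = {1 / sqrt(n):.3f})")
```

A histogram of `sample_means` would look roughly mound shaped even though the population is skewed.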
Chapter 19 – 25 Reminders
Inference Overview
A confidence interval is a method of estimating a parameter.

Two parts to a confidence interval: the interval and the
confidence level (denoted by C)

Form of a confidence interval: estimate ± margin of error
estimate ± (# of standard deviations on either side)(standard deviation)

Margin of error decreases
- when n increases or
- when confidence level decreases

Know how to find a sample size necessary for a given margin of error and a given confidence
level


Confidence intervals are used to estimate a parameter
Significance tests are used to assess evidence for a particular claim

A significance test does the following – Suppose the null hypothesis is true. With that assumption,
is our sample outcome unusual?

A p-value is the probability, assuming the null hypothesis is true, that we would get by chance
a result at least as extreme as our sample result.


Small p-values give evidence against Ho
Large p-values fail to give evidence against Ho – they do not give evidence of anything.

A significance level, α, is sometimes used as a decisive boundary for rejecting Ho and failing to
reject Ho (α = .1, α = .05 and α = .01 are typical values)

Statistical significance does not mean ‘important’, it means ‘not likely to occur by chance’

Inference (significance tests and confidence intervals) is based on the laws of probability

Randomization ensures the probability laws apply.




Type I Error – the null hypothesis is true and it is rejected
Probability of a Type I error = α (the significance level)
Type II Error – the null hypothesis is false, but not rejected
Probability of a Type II error = β (can be computed if you have a specific alternative in mind)
The Power of the test is the probability that the null hypothesis is rejected given that it is false.
Power of the test = 1 – β
Quantitative Data - When population standard deviation, σ, is known: one sample z interval or one
sample z test
Confidence Interval: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
Test statistic: $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Assumptions:
SRS
Pop. is normal OR n ≥ 30
Pop. size ≥ 10n
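
A sketch of the one sample z interval and z test in Python. All numbers are made up (sample mean 52, σ = 6, n = 36, hypothesized mean 50, 95% confidence), and the test is taken to be two-sided, which is an assumption about the alternative hypothesis.

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 52, 6, 36         # made-up sample mean, known sigma, sample size
mu_null = 50                        # hypothesized mean under Ho
confidence = 0.95

std_dev_of_mean = sigma / sqrt(n)   # sigma / sqrt(n)

# Confidence interval: x-bar +/- z* times sigma / sqrt(n).
z_star = NormalDist().inv_cdf((1 + confidence) / 2)
margin = z_star * std_dev_of_mean
interval = (x_bar - margin, x_bar + margin)

# Test statistic and two-sided p-value.
z = (x_bar - mu_null) / std_dev_of_mean
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(margin, 2), tuple(round(v, 2) for v in interval))   # 1.96 (50.04, 53.96)
print(z, round(p_value, 4))                                       # 2.0 0.0455
```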
When we use s instead of σ in the test statistic, we get a t-statistic instead of a z-statistic
t-distributions
- are similar in shape to normal distributions
- have a larger variance than the normal distribution
- approach a normal curve as the degrees of freedom increase
Quantitative Data - When population standard deviation, σ, is NOT known: one sample t interval or
one sample t test
Confidence Interval: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$, d.f. = n - 1
Test statistic: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, d.f. = n - 1
Assumptions:
SRS
Pop. is normal OR n ≥ 30
Pop. size ≥ 10n
The t statistic for comparing two means, $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, does not actually have a t-distribution. It is
close if we estimate the degrees of freedom with a complicated formula (that is what the calculator
does), or we could use the conservative estimate of min{n1 – 1, n2 – 1}.
Quantitative Data - When comparing two means (with population standard deviations NOT known): two
sample t interval or two sample t test
Confidence Interval: $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, d.f. = min(n1 - 1, n2 - 1)
Test statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, d.f. = min(n1 - 1, n2 - 1)
Assumptions:
2 SRSs
distinct populations
Independent samples
Normal populations OR n1 + n2 ≥ 40
Pop. 1 ≥ 10n1
Pop. 2 ≥ 10n2
Categorical Data - one proportion z interval or one proportion z test
Confidence Interval: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
Assumptions:
SRS
Pop. size ≥ 10n
np̂ ≥ 10 AND n(1 - p̂) ≥ 10
Test statistic: $z = \frac{\hat{p} - p_o}{\sqrt{\frac{p_o(1 - p_o)}{n}}}$
Assumptions:
SRS
Pop. size ≥ 10n
npo ≥ 10 AND n(1 - po) ≥ 10
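
A sketch of the one proportion z interval and z test in Python, on made-up data (60 successes in 100 trials, hypothesized proportion 0.5, 95% confidence). Note that the interval uses the sample proportion in its standard deviation, while the test uses the hypothesized proportion.

```python
from math import sqrt
from statistics import NormalDist

successes, n = 60, 100              # made-up sample
p_null = 0.5                        # hypothesized proportion under Ho
confidence = 0.95

p_hat = successes / n

# Large-sample checks for the interval (p-hat) and the test (hypothesized p).
interval_ok = n * p_hat >= 10 and n * (1 - p_hat) >= 10
test_ok = n * p_null >= 10 and n * (1 - p_null) >= 10

# Confidence interval: the standard deviation is estimated with p-hat.
z_star = NormalDist().inv_cdf((1 + confidence) / 2)
margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
interval = (p_hat - margin, p_hat + margin)

# Test statistic: the standard deviation uses the hypothesized proportion.
z = (p_hat - p_null) / sqrt(p_null * (1 - p_null) / n)

print(interval_ok, test_ok)         # True True
print(round(margin, 3), round(z, 2))   # 0.096 2.0
```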
Choosing a sample size for a specific margin of error
- margin of error = $z^* \sqrt{\frac{p(1 - p)}{n}}$
- Since we don’t know p, we use a guess from a previous study or the
conservative guess of 0.5
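
Solving the margin of error formula for n gives n = (z*/m)² p(1 - p), rounded up to the next whole number. A Python sketch, with `sample_size` as a hypothetical helper and a 3% margin at 95% confidence as a made-up target:

```python
from math import ceil
from statistics import NormalDist

def sample_size(margin, confidence, p_guess=0.5):
    """Smallest n whose margin of error is at most `margin`."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    return ceil((z_star / margin) ** 2 * p_guess * (1 - p_guess))

print(sample_size(0.03, 0.95))                 # 1068, using the conservative guess 0.5
print(sample_size(0.03, 0.95, p_guess=0.2))    # smaller, since p(1 - p) is largest at 0.5
```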
Categorical Data – When comparing two proportions: two proportion z interval or two proportion z
test
Confidence Interval: $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
Test statistic: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where p̂ = overall combined proportion
Assumptions:
2 SRSs
2 distinct populations
Population 1 ≥ 10n1
Population 2 ≥ 10n2
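
A sketch of the two proportion z interval and z test in Python, using made-up counts (45 of 100 in the first sample, 30 of 100 in the second). The combined proportion pools both samples and is used only in the test statistic.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 45, 100                    # made-up successes and size, sample 1
x2, n2 = 30, 100                    # made-up successes and size, sample 2

p1, p2 = x1 / n1, x2 / n2
p_combined = (x1 + x2) / (n1 + n2)  # overall combined proportion

# Test statistic: pooled standard deviation under Ho: p1 = p2.
z = (p1 - p2) / sqrt(p_combined * (1 - p_combined) * (1 / n1 + 1 / n2))

# 95% confidence interval: unpooled standard deviation.
z_star = NormalDist().inv_cdf(0.975)
margin = z_star * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
interval = (p1 - p2 - margin, p1 - p2 + margin)

print(round(z, 4))                  # 2.1909
print(round(margin, 4))             # 0.1326
```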
Chapter 26 Reminders

Chi-square test for goodness of fit
- used to see how well an observed distribution fits a hypothesized distribution
- Can be done on the calculator if OBSERVED values are in L1 and EXPECTED values
are in L2
- Some calculators have the GOF test under STATS → TESTS, others have it under the
PROGRAMS menu

Chi-square test for independence (same process as the test for homogeneity)
- used to determine if two categorical variables recorded for ONE SAMPLE are
independent (or ‘associated’ … or ‘related’)
- Can be done on the calculator if the TWO WAY TABLE is entered as a MATRIX
- Expected counts do not need to be entered
- Use the Chi-square test under STATS → TESTS

Chi-square test for homogeneity (same process as the test for independence)
- Used to determine if a distribution is the same across the categories for different groups
(TWO different SAMPLES)
- Can be done on the calculator if the TWO WAY TABLE is entered as a MATRIX
- Expected counts do not need to be entered
- Use the Chi-square test under STATS → TESTS
Chi-square test for Goodness of Fit
Test statistic: $\chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$, df = # of categories - 1
Assumptions:
Random selection
(All) expected values ≥ 5
Chi-square test for Independence and Chi-square test for Homogeneity
Test statistic: $\chi^2 = \sum \frac{(\text{obs} - \text{exp})^2}{\text{exp}}$, df = (# of rows - 1)(# of columns - 1)
Assumptions:
Random selection – necessary only if generalizing results
(All) expected values ≥ 5
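
A sketch of the chi-square statistic for a two-way table in Python. As an illustration it reuses the number/color preference counts from the Chapter 3 example; each expected count is (row total)(column total)/(grand total). The p-value is left to a table or the calculator, since the Python standard library has no chi-square distribution.

```python
# Observed counts: rows = number preference (2, 5), columns = color (Blue, Green).
observed = [
    [18, 7],
    [6, 15],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if the two variables were independent.
expected = [
    [r * c / grand_total for c in col_totals] for r in row_totals
]

# Chi-square statistic: sum of (obs - exp)^2 / exp over every cell.
chi_square = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Condition check: all expected counts should be at least 5.
all_expected_ok = all(exp >= 5 for row in expected for exp in row)

print(round(chi_square, 4), df, all_expected_ok)   # 8.6265 1 True
```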