Download Chapters 1-6: Population vs. sample → Parameter vs. statistic

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Degrees of freedom (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Central limit theorem wikipedia , lookup

Statistical inference wikipedia , lookup

Gibbs sampling wikipedia , lookup

Law of large numbers wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Chapters 1-6:
Population vs. sample Æ Parameter vs. statistic
Statistics: Descriptive vs. inferential
Types of variables
Quantitative vs. Qualitative
/
|
Discrete Continuous
Tables, charts & graphs
- frequency tables
- qualitative: bar graph/pie chart
- stem-and-leaf plot/dot plot
- time plot
- histogram (modality)
- traits: # of modes, tail weight, overall shape (symmetry, skewness)
- identify skewness by TAIL
- boxplot (skewness)
- outliers, overall shape (symmetry, skewness)
- identify skewness inside box or entire graph
Measures of center/spread/position
- center: mean, median, mode
Æ Outlier effect? Skewness effect?
- spread: range, variance, standard deviation, IQR
Æ Why use squared and (n – 1)? Ever negative? Empirical Rule?
- position: min, max, percentiles (quartiles)
Æ recall that we INCLUDE the median when determining quartiles
Æ 5-number summary, boxplot, types of outliers
Chapters 7-10:
Displaying bivariate data
- scatterplot: visual aid to see form/strength/direction of relationship
and/or outliers (large residual, high leverage, influential)
- correlation: numerical aid to see strength/direction of relationship (range?)
Æ Warning: assumes linearity, sensitive to outliers
Simple linear regression analysis
- regression line: ŷ = b0 + b1x
⎛ sy ⎞
- least-squares estimation gives b1 = r ⎜⎜ ⎟⎟ and b0 = y − b1 x
⎝ sx ⎠
- estimation: interpolation vs. extrapolation (BAD!)
- R-squared: r2 = proportion of variation in y explained by x
- causation: association does NOT imply causation
- residual plots: observed vs. theoretical appearance
- transformation of a variable can help improve linearity
Chapter 11-13:
- observational/retrospective/prospective study, experiment/controlled clinical trial
Æ population and causal inferences (what needs to be present for each?)
- types of bias (response, undercoverage, nonresponse)
- types of sampling: with/without replacement, SRS/stratified/cluster/
voluntary/convenience/systematic
- controlling factors: randomization, blocking, direct control, replication
- more experiment design definitions
Chapters 14-15:
- types of events: marginal, conditional, union, intersection, complement,
- What common words identify them?
- relating events: dependent vs. disjoint vs. independent
- Do these relations affect the rules below? If so, how?
- Do they allow certain rules to be easily extended?
- probability laws:
P( A ∩ B)
- conditional probability: P ( A | B) =
P( B)
C
- complement rule: P(A ) = 1 – P(A)
- multiplication rule: P( A ∩ B ) = P(A and B) = P(A) × P(B | A) = P(B) × P(A | B)
- addition rule: P(A or B) = P(A) + P(B) – P(A and B)
- total probability rule: P ( A) = P ( A ∩ B ) + P A ∩ BC
(
)
- recall examples where we combined a few of these together
Chapter 16-17:
Distributions
- discrete (exact probability or intervals) vs. continuous (only intervals)
Discrete: If P(X = a) > 0, then P(X ≤ a) ≠ P(X < a)
Continuous: If P(X = a) = P(X = b) = 0, then P(a ≤ X ≤ b) = P(a < X < b)
- discrete distributions:
- determine probability distribution (values of X and corresponding probabilities)
- mean: µ = ∑ xi P ( X = xi )
- variance: σ 2 = ∑ ( xi − µ ) 2 P( X = xi )
- continuous distributions:
- uniform distribution: finding an area of a rectangle (with a twist!)
- normal distribution: symmetric, 2 parameters: µ and σ, other properties
Standard Normal Distribution (and its applications)
- µ = 0 and σ = 1
- Table Z only gives areas to left of value z, conversion to these values required
Æ use diagrams, complements, symmetry, etc.
⎛ X −µ x−µ⎞
≤
- standardizing: P(X ≤ x) Æ P⎜
⎟ = P(Z ≤ z)
σ ⎠
⎝ σ
- identifying values for a given probability: x = µ + zσ
Combinations and Functions of Random Variables
For any constants a and b,
Means:
Variances:
1. E(a) = a
1. V(a) = 0
2. E(aX) = aE(X)
2. V(aX) = a2V(X)
3. E(aX + b) = aE(X) + b
3. V(aX + b) = a2V(X)
4. E(aX ± bY) = aE(X) ± bE(Y)
4. V(aX ± bY) = a2V(X) + b2V(Y) ± 2abcov(X, Y)
Y = a1X1 + a2X2 + … + anXn + b, E(Y) = a1E(X1) + a2E(X2) + … + anE(Xn) + b
If X1, X2, …, Xn are independent, V (Y ) = a12V ( X 1 ) + a 22V ( X 2 ) + ... + a n2V ( X n )
Chapter 18:
Sampling Distributions
- sample proportion:
p (1 − p )
pq
=
.
n
n
Rule 3: If np and n(1 – p) are both ≥ 10, then p̂ has an approx. normal dist’n.
pˆ − p
All 3 rules Æ If rule 3 holds, Z =
∼ N (0,1)
p (1 − p )
n
- sample mean:
Rule 1: µ p̂ = p.
Rule 2: σ pˆ =
Rule 1: µ y = µ
Rule 2: σ y =
σ
n
Rule 3: When the population distribution is normal, the sampling distribution of
y is also normal for any sample size n.
Rule 4 (CLT): When n > 30, the sampling distribution of y is well approximated
by a normal curve, even when the population distribution is not itself normal.
All 4 rules Æ If n is large OR the population is normal, Z =
Y −µ
∼ N (0,1)
σ/ n
Chapter 19:
- how to interpret CI?
- generic CI: point estimate ± (critical value) × (standard error)
Æ confidence level increases, ME increases
Æ n increases, ME decreases
- sample proportion:
Assumptions: random sample, npˆ ≥ 10 and n(1 − pˆ ) ≥ 10.
pˆ ± zα / 2
pˆ (1 − pˆ )
n
- choosing n:
⎛ CV ⎞
n ≈ p *(1 − p*) ⎜
⎟
⎝ ME ⎠
No previous info? Use p* = 0.5 for safety.
2
where CV = zα /2
Chapters 20/21:
Null vs. alternative hypotheses; 2-tailed vs. 1-tailed (lower or upper)
Rules for hypothesis construction:
1. No statistics, only parameters.
2. Parameter symbols and claimed values appear the same in H0 and HA.
3. The signs must be different between H0 and HA.
4. Equality signs ONLY appear in the null.
Decision
Do not reject H0
Reject H0
Actual situation
H0 is true
H0 is false
Correct
Type II or
decision
β error
Type I or
Correct
α error
decision
Test statistic, P-value Æ judgment approach (JA) OR significance level approach (SLA)
Significance Level Approach:
P-value ≤ α Æ reject H0;
P-value > α Æ do not reject H0
Judgment Approach:
0 < P-value < 0.01 indicates convincing to strong evidence against H0
0.01 < P-value < 0.05 indicates strong to moderate evidence
0.05 < P-value < 0.1 indicates moderate to suggestive, but inconclusive evidence
0.1 < P-value < 1 indicates weak evidence
Steps in a Hypothesis-Testing Analysis:
1. Assumptions; 2. Hypotheses (select α); 3. Test statistic; 4. P-value; 5. Conclusion
Test for Population Proportion
Assumptions: categorical variable, random sample, np0 ≥ 10 and n(1 – p0) ≥ 10.
pˆ − p 0
z0 =
p 0 (1 − p 0 )
n
Chapter 22:
Two Independent Population Proportions
Assumptions:
independent random samples, n1 & n2 large (nipi ≥ 10 & ni(1 – pi) ≥ 10 for i = 1,2)
CI: pˆ 1 − pˆ 2 ± zα/2
H0: p1 – p2 = 0
pˆ 1 (1 − pˆ 1 ) pˆ 2 (1 − pˆ 2 )
+
n1
n2
Æ Because of H0, we use pˆ pooled =
Then, under same assumptions, z0 =
n1 pˆ1 + n2 pˆ 2 y1 + y2
=
n1 + n2
n1 + n2
pˆ1 − pˆ 2 − ( p1 − p2 )
⎛1 1 ⎞
pˆ pooled (1 − pˆ pooled ) ⎜ + ⎟
⎝ n1 n2 ⎠
Chapter 23:
- introduce t-distribution (same as z except for parameter (df) and NOT knowing σ)
- sample mean:
Assumptions: random sample, n ≥ 30 OR population is normal, σ unknown
⎛ s ⎞
⎟⎟
y ± tα/2, n – 1 × ⎜⎜
⎝ n⎠
- choosing n:
2
⎛ CV ⎞ 2
n≈ ⎜
⎟ σˆ
⎝ ME ⎠
No σ̂ ? Use σ̂ ≈ range/6 for approximately normal data.
where CV = zα /2
Test for Population Mean
Assumptions: num. var., random sample, n ≥ 30 OR population is normal, σ unknown
Y − µ0
t0 =
s/ n
Chapters 24/25:
Independent samples
Assumptions:
independent random samples, n1 & n2 ≥ 30 OR both pop’ns are normal, unknown and
unequal standard deviations (σ1 ≠ σ2)
Æ how do you check?
H0: µ1 – µ2 = 0
SE ( y1 − y 2 ) =
s12 s 22
+
n1 n2
t0 =
y1 − y 2 − ( µ1 − µ 2 )
SE ( y1 − y 2 )
(df ≥ min{n1 – 1, n2 – 1})
CI (same assumptions): y1 − y2 ± tα/2, df × SE ( y1 − y2 )
Assumptions:
independent random samples, n1 & n2 ≥ 30 OR both pop’ns are normal, unknown and
equal standard deviations (σ1 = σ2)
Æ how do you check?
H0: µ1 – µ2 = 0
sp =
(n1 − 1) s12 + (n2 − 1) s 22
n1 + n2 − 2
SE ( y1 − y2 ) = s p
1 1
+
n1 n2
CI (same assumptions): y1 − y2 ± tα/2, df × SE ( y1 − y2 )
t0 =
y1 − y 2 − ( µ1 − µ 2 )
SE ( y1 − y 2 )
(df = n1 + n2 – 2)
Paired samples
Assumptions: paired samples, random sample of d’s, n ≥ 30 OR pop’n dist’n is normal,
σd unknown
H0: µd = 0
(define ‘d’ first)
d − µd
⎛s ⎞
(df = n – 1)
CI (same assumptions): d ± tα/2, df × ⎜⎜ d ⎟⎟
t0 =
sd
⎝ n⎠
n
Chapter 26:
(Obs − Exp) 2
χ2 = ∑
follows χ2-dist’n
Exp
cells
Goodness-of-fit test
- single random sample
- one categorical variable
- every expected count ≥ 5
- df = #Categories – 1
- expected counts: npi
Test of homogeneity
Test of independence
- indep. random samples
- single random sample
- one variable per sample
- two categorical variables
- every expected count ≥ 5
- every expected count ≥ 5
- df = (R – 1)(C – 1)
- df = (R – 1)(C – 1)
(row marginal total)(column marginal total)
grand total
R = # of rows; C = # of columns
Chapter 28:
ANOVA Assumptions:
independent and random samples, similar σ, normal distributions
- yij = observation for ith subject in jth group
- y j vs. y to predict yij
- H0: µ1 = … = µk
- HA: the µi are not all equal (OR, at least 2 µi are different)
SST = (variability between samples)
SSE = (variability within samples)
MST
SST / (k − 1)
F0 =
=
∼ FNk −−1k
MS E SS E / ( N − k )
Reject H0 when F0 is large (greater than 10 works for most values of α) or use P-value.
ANOVA Table:
Source
Between
Within
Total
df
k–1
N–k
N–1
SS
SST
SSE
SSY
MS
MST
MSE
F
MST/MSE
P-value
?
Note that SSY = SST + SSE; df(total) = df(Between) + df(Within);
For each of the top two rows, MS = SS/df.
Also note that the estimate for common variance = σˆ 2 = MS E .
How to determine data structure?
1. Are you dealing with proportions or means?
2. How many samples are there?
3a. Are the samples independent or paired? 3b. How many variables/levels are there?
4a. Are the variances equal/unequal?