Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Degrees of freedom (statistics) wikipedia , lookup
History of statistics wikipedia , lookup
Bootstrapping (statistics) wikipedia , lookup
Taylor's law wikipedia , lookup
Central limit theorem wikipedia , lookup
Statistical inference wikipedia , lookup
Gibbs sampling wikipedia , lookup
Law of large numbers wikipedia , lookup
Chapters 1-6: Population vs. sample Æ Parameter vs. statistic Statistics: Descriptive vs. inferential Types of variables Quantitative vs. Qualitative / | Discrete Continuous Tables, charts & graphs - frequency tables - qualitative: bar graph/pie chart - stem-and-leaf plot/dot plot - time plot - histogram (modality) - traits: # of modes, tail weight, overall shape (symmetry, skewness) - identify skewness by TAIL - boxplot (skewness) - outliers, overall shape (symmetry, skewness) - identify skewness inside box or entire graph Measures of center/spread/position - center: mean, median, mode Æ Outlier effect? Skewness effect? - spread: range, variance, standard deviation, IQR Æ Why use squared and (n – 1)? Ever negative? Empirical Rule? - position: min, max, percentiles (quartiles) Æ recall that we INCLUDE the median when determining quartiles Æ 5-number summary, boxplot, types of outliers Chapters 7-10: Displaying bivariate data - scatterplot: visual aid to see form/strength/direction of relationship and/or outliers (large residual, high leverage, influential) - correlation: numerical aid to see strength/direction of relationship (range?) Æ Warning: assumes linearity, sensitive to outliers Simple linear regression analysis - regression line: ŷ = b0 + b1x ⎛ sy ⎞ - least-squares estimation gives b1 = r ⎜⎜ ⎟⎟ and b0 = y − b1 x ⎝ sx ⎠ - estimation: interpolation vs. extrapolation (BAD!) - R-squared: r2 = proportion of variation in y explained by x - causation: association does NOT imply causation - residual plots: observed vs. theoretical appearance - transformation of a variable can help improve linearity Chapter 11-13: - observational/retrospective/prospective study, experiment/controlled clinical trial Æ population and causal inferences (what needs to be present for each?) - types of bias (response, undercoverage, nonresponse) - types of sampling: with/without replacement, SRS/stratified/cluster/ voluntary/convenience/systematic - controlling factors: randomization, blocking, direct control, replication - more experiment design definitions Chapters 14-15: - types of events: marginal, conditional, union, intersection, complement, - What common words identify them? - relating events: dependent vs. disjoint vs. independent - Do these relations affect the rules below? If so, how? - Do they allow certain rules to be easily extended? - probability laws: P( A ∩ B) - conditional probability: P ( A | B) = P( B) C - complement rule: P(A ) = 1 – P(A) - multiplication rule: P( A ∩ B ) = P(A and B) = P(A) × P(B | A) = P(B) × P(A | B) - addition rule: P(A or B) = P(A) + P(B) – P(A and B) - total probability rule: P ( A) = P ( A ∩ B ) + P A ∩ BC ( ) - recall examples where we combined a few of these together Chapter 16-17: Distributions - discrete (exact probability or intervals) vs. continuous (only intervals) Discrete: If P(X = a) > 0, then P(X ≤ a) ≠ P(X < a) Continuous: If P(X = a) = P(X = b) = 0, then P(a ≤ X ≤ b) = P(a < X < b) - discrete distributions: - determine probability distribution (values of X and corresponding probabilities) - mean: µ = ∑ xi P ( X = xi ) - variance: σ 2 = ∑ ( xi − µ ) 2 P( X = xi ) - continuous distributions: - uniform distribution: finding an area of a rectangle (with a twist!) - normal distribution: symmetric, 2 parameters: µ and σ, other properties Standard Normal Distribution (and its applications) - µ = 0 and σ = 1 - Table Z only gives areas to left of value z, conversion to these values required Æ use diagrams, complements, symmetry, etc. ⎛ X −µ x−µ⎞ ≤ - standardizing: P(X ≤ x) Æ P⎜ ⎟ = P(Z ≤ z) σ ⎠ ⎝ σ - identifying values for a given probability: x = µ + zσ Combinations and Functions of Random Variables For any constants a and b, Means: Variances: 1. E(a) = a 1. V(a) = 0 2. E(aX) = aE(X) 2. V(aX) = a2V(X) 3. E(aX + b) = aE(X) + b 3. V(aX + b) = a2V(X) 4. E(aX ± bY) = aE(X) ± bE(Y) 4. V(aX ± bY) = a2V(X) + b2V(Y) ± 2abcov(X, Y) Y = a1X1 + a2X2 + … + anXn + b, E(Y) = a1E(X1) + a2E(X2) + … + anE(Xn) + b If X1, X2, …, Xn are independent, V (Y ) = a12V ( X 1 ) + a 22V ( X 2 ) + ... + a n2V ( X n ) Chapter 18: Sampling Distributions - sample proportion: p (1 − p ) pq = . n n Rule 3: If np and n(1 – p) are both ≥ 10, then p̂ has an approx. normal dist’n. pˆ − p All 3 rules Æ If rule 3 holds, Z = ∼ N (0,1) p (1 − p ) n - sample mean: Rule 1: µ p̂ = p. Rule 2: σ pˆ = Rule 1: µ y = µ Rule 2: σ y = σ n Rule 3: When the population distribution is normal, the sampling distribution of y is also normal for any sample size n. Rule 4 (CLT): When n > 30, the sampling distribution of y is well approximated by a normal curve, even when the population distribution is not itself normal. All 4 rules Æ If n is large OR the population is normal, Z = Y −µ ∼ N (0,1) σ/ n Chapter 19: - how to interpret CI? - generic CI: point estimate ± (critical value) × (standard error) Æ confidence level increases, ME increases Æ n increases, ME decreases - sample proportion: Assumptions: random sample, npˆ ≥ 10 and n(1 − pˆ ) ≥ 10. pˆ ± zα / 2 pˆ (1 − pˆ ) n - choosing n: ⎛ CV ⎞ n ≈ p *(1 − p*) ⎜ ⎟ ⎝ ME ⎠ No previous info? Use p* = 0.5 for safety. 2 where CV = zα /2 Chapters 20/21: Null vs. alternative hypotheses; 2-tailed vs. 1-tailed (lower or upper) Rules for hypothesis construction: 1. No statistics, only parameters. 2. Parameter symbols and claimed values appear the same in H0 and HA. 3. The signs must be different between H0 and HA. 4. Equality signs ONLY appear in the null. Decision Do not reject H0 Reject H0 Actual situation H0 is true H0 is false Correct Type II or decision β error Type I or Correct α error decision Test statistic, P-value Æ judgment approach (JA) OR significance level approach (SLA) Significance Level Approach: P-value ≤ α Æ reject H0; P-value > α Æ do not reject H0 Judgment Approach: 0 < P-value < 0.01 indicates convincing to strong evidence against H0 0.01 < P-value < 0.05 indicates strong to moderate evidence 0.05 < P-value < 0.1 indicates moderate to suggestive, but inconclusive evidence 0.1 < P-value < 1 indicates weak evidence Steps in a Hypothesis-Testing Analysis: 1. Assumptions; 2. Hypotheses (select α); 3. Test statistic; 4. P-value; 5. Conclusion Test for Population Proportion Assumptions: categorical variable, random sample, np0 ≥ 10 and n(1 – p0) ≥ 10. pˆ − p 0 z0 = p 0 (1 − p 0 ) n Chapter 22: Two Independent Population Proportions Assumptions: independent random samples, n1 & n2 large (nipi ≥ 10 & ni(1 – pi) ≥ 10 for i = 1,2) CI: pˆ 1 − pˆ 2 ± zα/2 H0: p1 – p2 = 0 pˆ 1 (1 − pˆ 1 ) pˆ 2 (1 − pˆ 2 ) + n1 n2 Æ Because of H0, we use pˆ pooled = Then, under same assumptions, z0 = n1 pˆ1 + n2 pˆ 2 y1 + y2 = n1 + n2 n1 + n2 pˆ1 − pˆ 2 − ( p1 − p2 ) ⎛1 1 ⎞ pˆ pooled (1 − pˆ pooled ) ⎜ + ⎟ ⎝ n1 n2 ⎠ Chapter 23: - introduce t-distribution (same as z except for parameter (df) and NOT knowing σ) - sample mean: Assumptions: random sample, n ≥ 30 OR population is normal, σ unknown ⎛ s ⎞ ⎟⎟ y ± tα/2, n – 1 × ⎜⎜ ⎝ n⎠ - choosing n: 2 ⎛ CV ⎞ 2 n≈ ⎜ ⎟ σˆ ⎝ ME ⎠ No σ̂ ? Use σ̂ ≈ range/6 for approximately normal data. where CV = zα /2 Test for Population Mean Assumptions: num. var., random sample, n ≥ 30 OR population is normal, σ unknown Y − µ0 t0 = s/ n Chapters 24/25: Independent samples Assumptions: independent random samples, n1 & n2 ≥ 30 OR both pop’ns are normal, unknown and unequal standard deviations (σ1 ≠ σ2) Æ how do you check? H0: µ1 – µ2 = 0 SE ( y1 − y 2 ) = s12 s 22 + n1 n2 t0 = y1 − y 2 − ( µ1 − µ 2 ) SE ( y1 − y 2 ) (df ≥ min{n1 – 1, n2 – 1}) CI (same assumptions): y1 − y2 ± tα/2, df × SE ( y1 − y2 ) Assumptions: independent random samples, n1 & n2 ≥ 30 OR both pop’ns are normal, unknown and equal standard deviations (σ1 = σ2) Æ how do you check? H0: µ1 – µ2 = 0 sp = (n1 − 1) s12 + (n2 − 1) s 22 n1 + n2 − 2 SE ( y1 − y2 ) = s p 1 1 + n1 n2 CI (same assumptions): y1 − y2 ± tα/2, df × SE ( y1 − y2 ) t0 = y1 − y 2 − ( µ1 − µ 2 ) SE ( y1 − y 2 ) (df = n1 + n2 – 2) Paired samples Assumptions: paired samples, random sample of d’s, n ≥ 30 OR pop’n dist’n is normal, σd unknown H0: µd = 0 (define ‘d’ first) d − µd ⎛s ⎞ (df = n – 1) CI (same assumptions): d ± tα/2, df × ⎜⎜ d ⎟⎟ t0 = sd ⎝ n⎠ n Chapter 26: (Obs − Exp) 2 χ2 = ∑ follows χ2-dist’n Exp cells Goodness-of-fit test - single random sample - one categorical variable - every expected count ≥ 5 - df = #Categories – 1 - expected counts: npi Test of homogeneity Test of independence - indep. random samples - single random sample - one variable per sample - two categorical variables - every expected count ≥ 5 - every expected count ≥ 5 - df = (R – 1)(C – 1) - df = (R – 1)(C – 1) (row marginal total)(column marginal total) grand total R = # of rows; C = # of columns Chapter 28: ANOVA Assumptions: independent and random samples, similar σ, normal distributions - yij = observation for ith subject in jth group - y j vs. y to predict yij - H0: µ1 = … = µk - HA: the µi are not all equal (OR, at least 2 µi are different) SST = (variability between samples) SSE = (variability within samples) MST SST / (k − 1) F0 = = ∼ FNk −−1k MS E SS E / ( N − k ) Reject H0 when F0 is large (greater than 10 works for most values of α) or use P-value. ANOVA Table: Source Between Within Total df k–1 N–k N–1 SS SST SSE SSY MS MST MSE F MST/MSE P-value ? Note that SSY = SST + SSE; df(total) = df(Between) + df(Within); For each of the top two rows, MS = SS/df. Also note that the estimate for common variance = σˆ 2 = MS E . How to determine data structure? 1. Are you dealing with proportions or means? 2. How many samples are there? 3a. Are the samples independent or paired? 3b. How many variables/levels are there? 4a. Are the variances equal/unequal?