Stat 112
D. Small
Review of Ideas from Lectures 13-19
I. Comparisons Among Several Groups – One Way Layout
Data: Samples of size $n_1, \ldots, n_I$ from several populations $1, 2, \ldots, I$ with means
$\mu_1, \mu_2, \ldots, \mu_I$.
1. Ideal model used for making inferences
A. Random, independent samples from each population.
B. All populations have the same standard deviation ($\sigma$).
C. Each population is normal.
2. Planned comparison of two means.
A. Use the usual t-tools but estimate $\sigma$ from a weighted average of the sample standard
deviations in all groups ($s_p$, the pooled standard deviation). The estimate of $\sigma$ has $n - I$ degrees of freedom, where
$n_1 + \cdots + n_I = n$. In JMP, $s_p$ is the Root Mean Square Error.
B. 95% confidence interval for $\mu_i - \mu_j$: $\bar{Y}_i - \bar{Y}_j \pm t_{.975,\, n-I} \cdot s_p \sqrt{\frac{1}{n_i} + \frac{1}{n_j}}$
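As a rough illustration outside JMP, here is a minimal Python sketch of the pooled standard deviation and the planned-comparison interval above; the group data are hypothetical, made up for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical data for I = 3 groups (values made up for illustration)
groups = [np.array([5.1, 4.8, 6.0, 5.5]),
          np.array([6.2, 5.9, 6.8, 7.1, 6.5]),
          np.array([4.0, 4.4, 3.9, 4.7])]

n = sum(len(g) for g in groups)   # total sample size
I = len(groups)                   # number of groups

# Pooled standard deviation s_p (weighted average of sample SDs), df = n - I
pooled_var = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - I)
s_p = np.sqrt(pooled_var)

# 95% CI for mu_i - mu_j (planned comparison of groups 1 and 2)
i, j = 0, 1
diff = groups[i].mean() - groups[j].mean()
se = s_p * np.sqrt(1/len(groups[i]) + 1/len(groups[j]))
t_crit = stats.t.ppf(0.975, df=n - I)
print("95% CI:", (diff - t_crit * se, diff + t_crit * se))
```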
3. One-way ANOVA F-test
A. Is there any difference between any of the means?
$H_0: \mu_1 = \mu_2 = \cdots = \mu_I$ vs. $H_a$: at least two of the means are not equal.
B. F test statistic measures how much the sample means vary relative to the
population standard deviation. Large values of F are implausible under $H_0$.
C. The F test in JMP is reported under the Analysis of Variance table.
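The same F test can also be computed outside JMP. This is a minimal sketch using scipy's one-way ANOVA on the hypothetical groups from the sketch above.

```python
import numpy as np
from scipy import stats

# Hypothetical groups (same made-up data as in the earlier sketch)
g1 = np.array([5.1, 4.8, 6.0, 5.5])
g2 = np.array([6.2, 5.9, 6.8, 7.1, 6.5])
g3 = np.array([4.0, 4.4, 3.9, 4.7])

# H0: mu_1 = mu_2 = mu_3; a large F (small p-value) is evidence against H0
F, p = stats.f_oneway(g1, g2, g3)
print(F, p)
```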
4. Robustness of inferences to assumptions
A. Normality is not critical. Nonnormality is only a problem if distributions are
extremely long-tailed or skewed and the sample size in a group is <30.
B. Assumptions of random sampling and independence of observations are critical.
C. Assumption of equal standard deviations in the populations is critical. Rule of thumb:
check whether the largest sample standard deviation divided by the smallest sample standard
deviation is <2.
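A tiny sketch of the rule-of-thumb check in C above; the sample standard deviations are hypothetical.

```python
import numpy as np

# Hypothetical sample standard deviations, one per group
sds = np.array([1.2, 0.9, 1.6])

# Rule of thumb: equal-SD assumption is plausible if max/min < 2
ratio = sds.max() / sds.min()
print(ratio, "OK" if ratio < 2 else "equal-SD assumption questionable")
```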
5. Multiple comparisons
A. Family of tests: When several tests (or confidence intervals) are considered
simultaneously, they constitute a family of tests.
B. Unplanned comparisons: In a one-way layout, we are often interested in
comparing all pairs of groups. The family of tests consists of all possible pairwise
comparisons, I*(I-1)/2 tests for a one-way layout with I groups.
C. Individual Type I error rate: Probability for a single test that the null hypothesis
will be rejected assuming that the null hypothesis is true.
D. Familywise Type I error rate: Probability for a family of tests that at least one null
hypothesis will be rejected assuming that all of the null hypotheses are true.
Familywise type I error rate will be at least as large and usually considerably
larger than the individual type I error rate when the family consists of more than
one test.
E. Multiple comparisons procedure: Seeks to make sure that the familywise Type I
error rate is kept at a tolerable level.
F. Tukey-Kramer procedure: Multiple comparisons procedure designed specifically
for the family of unplanned comparisons in a one-way layout. Instead of rejecting
$H_0: \mu_i = \mu_j$ vs. $H_a: \mu_i \neq \mu_j$ at the 0.05 level if
$|t| = \frac{|\bar{Y}_i - \bar{Y}_j|}{SE(\bar{Y}_i - \bar{Y}_j)} \geq 2$ (approximately) as for a planned comparison, reject if $|t| > q^*$,
where $q^*$ is greater than 2. In JMP, reject $H_0: \mu_i = \mu_j$ if the entry in “Comparison of
All Pairs Using Tukey’s HSD” is positive (see the sketch at the end of this section).
G. Bonferroni procedure (see the sketch at the end of this section):
a. General method for doing multiple comparisons for any family of k tests.
b. Denote the familywise error rate we want by $p^*$.
c. Compute p-values for each test: $p_1, \ldots, p_k$.
d. Reject the null hypothesis for the ith test if $p_i \leq \frac{p^*}{k}$.
e. Guarantees that the familywise Type I error rate of the k tests is at most $p^*$.
H. Multiplicity: General problem in data analysis is that we may be implicitly
looking at many things and only notice the interesting patterns. To deal with
multiplicity properly, we should determine what our family of tests is (the family
of all comparisons we have considered) and use a multiple comparisons procedure
such as Bonferroni. Multiplicity makes it hard to detect when a null hypothesis is
false because it makes it harder for a particular null hypothesis to be rejected.
One way around the multiplicity problem is to design a study to search
specifically for a pattern that was suggested by an exploratory data analysis. This
converts an unplanned comparison into a planned comparison.
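A sketch of the two multiple-comparison procedures above (F and G). It assumes statsmodels is available for the Tukey-Kramer (Tukey HSD) part; the one-way layout data and the family of p-values are hypothetical.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical one-way layout: response values and their group labels
y = np.array([5.1, 4.8, 6.0, 5.5, 6.2, 5.9, 6.8, 7.1, 6.5, 4.0, 4.4, 3.9, 4.7])
grp = np.array(["A"] * 4 + ["B"] * 5 + ["C"] * 4)

# Tukey-Kramer procedure for all I*(I-1)/2 unplanned pairwise comparisons
print(pairwise_tukeyhsd(y, grp, alpha=0.05))

# Bonferroni procedure for a family of k tests with familywise rate p* = 0.05
p_values = [0.001, 0.020, 0.030]   # hypothetical p-values, one per test
p_star, k = 0.05, len(p_values)
for i, p in enumerate(p_values, start=1):
    decision = "reject H0" if p <= p_star / k else "do not reject H0"
    print(f"test {i}: {decision}")
```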
6. Linear combinations of group means:
A. Parameter of interest: $\gamma = C_1\mu_1 + C_2\mu_2 + \cdots + C_I\mu_I$.
B. Point estimate: $g = C_1\bar{Y}_1 + C_2\bar{Y}_2 + \cdots + C_I\bar{Y}_I$.
C. Standard error: $SE(g) = s_p \sqrt{\frac{C_1^2}{n_1} + \frac{C_2^2}{n_2} + \cdots + \frac{C_I^2}{n_I}}$.
D. 95% confidence interval for $\gamma$: $g \pm t_{.975,\, n-I} \cdot SE(g)$.
E. Test of $H_0: \gamma = \gamma^*$ vs. $H_a: \gamma \neq \gamma^*$: For a level .05 test, reject $H_0$ if and only if
$\gamma^*$ does not belong to the 95% confidence interval.
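A sketch of the point estimate, standard error, and confidence interval for a linear combination of group means, reusing the hypothetical groups from the earlier sketches; the coefficients C are made up.

```python
import numpy as np
from scipy import stats

groups = [np.array([5.1, 4.8, 6.0, 5.5]),
          np.array([6.2, 5.9, 6.8, 7.1, 6.5]),
          np.array([4.0, 4.4, 3.9, 4.7])]
C = np.array([1.0, -0.5, -0.5])   # hypothetical coefficients (a contrast: sum to zero)

n = sum(len(g) for g in groups)
I = len(groups)
s_p = np.sqrt(sum((len(g) - 1) * g.var(ddof=1) for g in groups) / (n - I))

means = np.array([g.mean() for g in groups])
sizes = np.array([len(g) for g in groups])

g_hat = (C * means).sum()                      # point estimate g
se_g = s_p * np.sqrt((C**2 / sizes).sum())     # SE(g)
t_crit = stats.t.ppf(0.975, df=n - I)
print("g =", g_hat, " 95% CI:", (g_hat - t_crit * se_g, g_hat + t_crit * se_g))
```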
II. Chi-Squared Goodness of Fit Test for Nominal Data
A. Nominal data: Data that place an individual into one of several categories, e.g.,
the color of an M&M, the candidate a person voted for.
B. Population of nominal data: A population with k categories can be described by the
proportion in each category, $p_1$ in category 1, $p_2$ in category 2, ..., $p_k$ in category
k ($\sum_{i=1}^{k} p_i = 1$).
C. One sample test for nominal data: Analogue of one sample problem with interval
population. Take random sample of size n from a population of nominal data.
We want to test $H_0: p_1 = p_1^*, p_2 = p_2^*, \ldots, p_k = p_k^*$ vs. $H_a$: at least one
$p_i \neq p_i^*$ ($i = 1, \ldots, k$).
D. Chi-squared test: Method for doing a one sample test for nominal data. Based on
comparing expected frequencies under $H_0$ to the observed frequencies. A large
test statistic is evidence against $H_0$. Test is only valid if expected frequencies in
each category are 5 or more. When necessary, categories should be combined in
order to satisfy this condition.
E. Chi-square test in JMP: We use the Pearson chi-square test in JMP.
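A sketch of the Pearson chi-squared goodness-of-fit test using scipy; the observed counts and null proportions are hypothetical (e.g., counts of M&M colors).

```python
import numpy as np
from scipy import stats

observed = np.array([30, 25, 20, 25])       # hypothetical observed counts
p_null = np.array([0.3, 0.3, 0.2, 0.2])     # H0: category proportions p_i*
expected = p_null * observed.sum()           # expected frequencies under H0

# Validity condition: every expected frequency should be 5 or more
assert (expected >= 5).all()

# Pearson chi-squared statistic and p-value (df = k - 1)
chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p)
```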
III. Simple Linear Regression
1. Regression analysis
A. Setup: We have a response variable Y and an explanatory variable X. We
observe pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$.
B. Goal: Estimate the mean of Y for the subpopulation with explanatory variable
X=x, $\mu\{Y \mid X = x\}$.
C. Application to prediction: The mean of Y given $X = X_0$, $\mu\{Y \mid X = X_0\}$, provides a
good prediction of a future $Y_0$ if we know that the future observation has $X = X_0$.
2. Ideal Simple Linear Regression model
A. Assumptions of model:
a. The mean of Y given X=x is a straight line function of x,
$\mu\{Y \mid X = x\} = \beta_0 + \beta_1 x$.
b. The standard deviation of Y given X=x is the same for all x and is denoted
by $\sigma$.
c. The distribution of Y given X=x is normally distributed.
d. The sampled observations Y1 , Y2 ,..., Yn are independent.
B. Interpretation of parameters.
a. Intercept $\beta_0$: The mean of Y given that X=0.
b. Slope $\beta_1$: The change in the mean of Y given X=x that is associated with
each one unit increase in X.
c. Standard deviation $\sigma$: Measures how accurate the predictions of y based
on x from the regression line will be. If the ideal simple linear regression
model holds, then approximately 68% of the observations will fall within
$\sigma$ of the regression line and 95% of the observations will fall within $2\sigma$
of the regression line.
C. Estimation of parameters
a. $\beta_0$ and $\beta_1$: Estimate by the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum
of squared prediction errors when the regression line is used to predict $y_i$
given $x = x_i$.
b. $\sigma$: The residual for observation i is the prediction error of using
$\hat{\mu}\{Y \mid X = x_i\}$ to predict $Y_i$: $res_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$. We estimate $\sigma$ by
taking the standard deviation of the residuals, corrected for degrees of
freedom: $\hat{\sigma} = \sqrt{\frac{\text{sum of all squared residuals}}{n-2}}$. $\hat{\sigma}$ is the Root Mean Square
Error in JMP.
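A sketch of the least-squares estimates and $\hat{\sigma}$ computed directly with numpy; the (x, y) pairs are hypothetical.

```python
import numpy as np

# Hypothetical (x, y) pairs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

# Least-squares slope and intercept (minimize the sum of squared prediction errors)
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()

# Residuals and sigma-hat = sqrt(sum of squared residuals / (n - 2))
res = y - (b0 + b1 * x)
sigma_hat = np.sqrt((res**2).sum() / (n - 2))   # Root Mean Square Error in JMP
print(b0, b1, sigma_hat)
```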
3. Cautions about regression analysis
A. Interpolation vs. Extrapolation:
a. Interpolation: Drawing inference about $\mu\{Y \mid X = x\}$ for x within the range of
observed X, $(X_1, \ldots, X_n)$. Strong advantage of regression model over a
one-way layout analysis is the ability to interpolate from regression
analysis.
b. Extrapolation: Drawing inference about $\mu\{Y \mid X = x\}$ for x outside the range
of observed X. Dangerous. The simple linear regression model may hold
approximately over the region of observed X but not for all X.
B. Association is not causation: Regression analysis shows how the mean of Y for
different subpopulations X=x is associated with x. No cause-and-effect
relationship between X and Y can be inferred unless X is randomly assigned to
units as in a randomized experiment. Alternative explanations for a strong relationship
between the mean of Y given X=x and x, other than that X causes Y to change: (i) the reverse
is true, Y causes X; (ii) there may be a lurking (confounding) variable related to
both X and Y which is the common cause of X and Y.
4. Inference for ideal simple linear regression model
A. Sampling distribution of $\hat{\beta}_1$: $SE(\hat{\beta}_1) = \hat{\sigma}\sqrt{\frac{1}{(n-1)s_x^2}}$.
B. Hypothesis tests for $H_0: \beta_0 = 0$ and $H_0: \beta_1 = 0$: Based on the t-test statistic
$|t| = \frac{|\text{Estimate}|}{SE(\text{Estimate})}$
C. Confidence intervals for $\beta_0$ and $\beta_1$: 95% confidence intervals:
$\hat{\beta}_0 \pm t_{n-2}(.975)\, SE(\hat{\beta}_0) \approx \hat{\beta}_0 \pm 2\, SE(\hat{\beta}_0)$
$\hat{\beta}_1 \pm t_{n-2}(.975)\, SE(\hat{\beta}_1) \approx \hat{\beta}_1 \pm 2\, SE(\hat{\beta}_1)$
D. Confidence intervals for the mean of Y at $X = X_0$ (confidence intervals for the mean
response): (95% interval) $\hat{\mu}\{Y \mid X = X_0\} \pm t_{n-2}(.975)\, SE[\hat{\mu}\{Y \mid X = X_0\}]$ where
$SE[\hat{\mu}\{Y \mid X = X_0\}] = \hat{\sigma}\sqrt{\frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)s_x^2}}$ and $\hat{\mu}\{Y \mid X = X_0\} = \hat{\beta}_0 + \hat{\beta}_1 X_0$. Note that
precision in estimating $\mu\{Y \mid X = X_0\}$ decreases as $X_0$ gets farther away from the
sample mean of the X's.
E. Prediction interval: Interval of likely values for a future randomly sampled $Y_0$ from
the subpopulation $X = X_0$.
a. Property: 95% prediction interval for $X_0$: If repeated samples $(y_1, \ldots, y_n)$
are obtained from the subpopulations $(x_1, \ldots, x_n)$ and a prediction interval is
formed, the prediction interval will contain the value of $Y_0$ for a future
observation from the subpopulation $X_0$ 95% of the time.
b. Prediction interval must account for two sources of uncertainty:
i. Uncertainty about the location of the subpopulation mean
$\mu\{Y \mid X = X_0\}$ because we estimate $\mu\{Y \mid X = X_0\}$ from the data
using least squares.
ii. Uncertainty about where the future value $Y_0$ will be in relation to its
mean.
Confidence intervals for the mean response only need to account for the
first source of uncertainty and are consequently narrower than prediction
intervals.
c. 95% prediction interval at $X = X_0$:
$\hat{\mu}\{Y \mid X = X_0\} \pm t_{n-2}(.975)\sqrt{\hat{\sigma}^2 + SE[\hat{\mu}\{Y \mid X = X_0\}]^2}$
d. Comparison with confidence interval for mean response at $X = X_0$:
Prediction interval is always wider. As sample size n becomes large,
margin of error of confidence interval for mean response goes to zero but
margin of error of prediction interval does not go to zero.
F. R-squared statistic
a. Definition: $R^2$ is the percentage of the variation in the response variable
that is explained by the linear regression of the response variable on the
explanatory variable:
$R^2 = \frac{\text{Total sum of squares} - \text{Residual sum of squares}}{\text{Total sum of squares}}$
where Total sum of squares $= \sum_{i=1}^{n}(Y_i - \bar{Y})^2$ and Residual sum of squares
$= \sum_{i=1}^{n}(Y_i - \hat{\mu}\{Y \mid X = X_i\})^2$.
b. $R^2$ takes on values between 0 and 1, with values nearer to 1 indicating a
stronger linear relationship between X and Y. $R^2$ provides a unitless measure of the
strength of the relationship between x and y.
c. Caveats about $R^2$: Not useful for assessing whether the ideal simple linear
regression model is correct (use residual plots); not useful for deciding whether or
not Y is associated with X (use the hypothesis test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$).
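A sketch that puts the inference formulas above together for the hypothetical regression data from the earlier sketch: SE of the slope, confidence interval and t statistic for $\beta_1$, the mean-response interval and prediction interval at a chosen $X_0$, and $R^2$.

```python
import numpy as np
from scipy import stats

# Same hypothetical data as the earlier regression sketch
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
res = y - (b0 + b1 * x)
sigma_hat = np.sqrt((res**2).sum() / (n - 2))
sx2 = x.var(ddof=1)                              # sample variance of the X's

# SE of the slope, 95% CI for beta_1, and t statistic for H0: beta_1 = 0
se_b1 = sigma_hat * np.sqrt(1 / ((n - 1) * sx2))
t_crit = stats.t.ppf(0.975, df=n - 2)
print("beta1 CI:", (b1 - t_crit * se_b1, b1 + t_crit * se_b1))
print("t for H0: beta1 = 0:", b1 / se_b1)

# 95% CI for the mean response and 95% prediction interval at X = X0
X0 = 4.5
mu_hat = b0 + b1 * X0
se_mean = sigma_hat * np.sqrt(1/n + (X0 - x.mean())**2 / ((n - 1) * sx2))
print("mean response CI:", (mu_hat - t_crit * se_mean, mu_hat + t_crit * se_mean))
se_pred = np.sqrt(sigma_hat**2 + se_mean**2)
print("prediction interval:", (mu_hat - t_crit * se_pred, mu_hat + t_crit * se_pred))

# R^2 = (Total SS - Residual SS) / Total SS
total_ss = ((y - y.mean())**2).sum()
resid_ss = (res**2).sum()
print("R^2:", (total_ss - resid_ss) / total_ss)
```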
5. Regression diagnostics:
A. Conditions required for inference from ideal simple linear regression model to be
accurate must be checked:
a. Linearity (mean of Y given X=x is a straight line function of x).
Diagnostic: Residual plot
b. Constant variance. Diagnostic: Residual plot
c. Normality. Diagnostic: Histogram of residuals
d. Independence. Diagnostic: Residual plot
B. Residual plot: Scatterplot of residuals versus X or any other variable (such as time
order of observations). If the ideal simple linear regression model holds, the residual
plot should look like random scatter – there should be no pattern in the residual plot.
a. Pattern in the mean of the residuals, i.e., the residuals have a mean less than
zero for some range of x and a mean greater than zero for another range of x.
Indicates nonlinearity.
b. Pattern in the variance of the residuals, i.e., the residuals have greater variance
for some range of x and less variance for another range of x. Indicates
nonconstant variance.
c. Pattern in the residuals over time indicates violation of independence. Pattern
in the mean of residuals over time potentially indicates lurking variable that is
associated with time.
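A minimal residual-plot sketch with matplotlib, continuing the hypothetical data from the earlier sketches; if the ideal model holds, the points should look like random scatter around zero with no pattern in mean or spread.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
b0 = y.mean() - b1 * x.mean()
res = y - (b0 + b1 * x)

# Residual plot: residuals vs. X, with a horizontal reference line at zero
plt.scatter(x, res)
plt.axhline(0, linestyle="--")
plt.xlabel("X")
plt.ylabel("Residual")
plt.show()
```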
6. Transformations
A. Basic idea: If we detect nonlinearity, we may still be able to use the ideal
simple linear regression model by transforming X to g(X) and/or Y to
f(Y), fitting the ideal simple linear regression model using f(Y) as the
response variable and g(X) as the explanatory variable, and then
“back-transforming” to recover $\hat{\mu}\{Y \mid X = x\}$.
B. Choice of transformation: Use Tukey’s Bulging rule to decide which
transformations to try. Make residual plots of the residuals for the
transformed model vs. the original X. Choose transformations which
make the residual plot look most like random scatter with no pattern in the
mean of the residuals vs. X.
C. Prediction after transformation: To predict Y given X=x (or estimate
$\mu\{Y \mid X = x\}$) when Y has been transformed to f(Y) and X to g(X),
$\hat{\mu}\{Y \mid X = x\} = f^{-1}(\hat{\mu}\{f(Y) \mid g(X) = g(x)\})$
(it's easier to think through examples than to use the formula directly).