PS 3 FINAL STUDY GUIDE

- Central Limit Theorem: holds only for a random sample
Standard Error:
o As the sample size gets larger → more confidence, less standard error (the SE of the mean shrinks in proportion to 1/√n)
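A minimal simulation sketch of this point (the population mean of 50 and SD of 10 are made-up values, not from the guide):

```python
# Simulate how the standard error of the sample mean shrinks as n grows.
import numpy as np

rng = np.random.default_rng(0)
population_sd = 10.0

for n in [25, 100, 400]:
    # Draw 5000 random samples of size n and take each sample's mean
    sample_means = rng.normal(loc=50, scale=population_sd, size=(5000, n)).mean(axis=1)
    # The spread of those means approximates the standard error
    print(f"n={n:3d}  simulated SE={sample_means.std():.3f}  "
          f"theoretical SD/sqrt(n)={population_sd / np.sqrt(n):.3f}")
```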
Confidence Interval:
- A range around the sample estimate within which we expect the population parameter to fall at a stated level of confidence
P-values:
- P-values range between 0 and 1 and represent the probability that we would see the observed relationship due to random chance alone (i.e., if the null hypothesis were true)
- Lower p-values increase our confidence that there is indeed a relationship between the two variables in question
o Statistically significant at the .05 level
o DOES NOT MEAN THE RELATIONSHIP IS STRONG OR CAUSAL
o The p-value conveys the level of confidence with which we can reject the null hypothesis
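A short sketch tying both ideas together: a 95% confidence interval for a mean, plus a p-value for an (invented) null; all data values are made up:

```python
# 95% confidence interval for a sample mean, plus the p-value for H0: mean = 10.
import numpy as np
from scipy import stats

data = np.array([12.1, 9.8, 11.4, 10.9, 12.7, 10.2, 11.8, 9.5])
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))      # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(data) - 1)   # two-tailed 95% critical value
print(f"95% CI: [{mean - t_crit * se:.2f}, {mean + t_crit * se:.2f}]")

t_stat, p_value = stats.ttest_1samp(data, 10.0) # test H0: population mean = 10
print(f"p = {p_value:.4f}")                     # small p -> reject H0 at .05
```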
Tabular analysis = when you have a categorical independent variable and a categorical dependent variable…crosstabs…you must determine what the individual cell values represent: proportions or percentages
Chi Squared Test:
o O = observed frequency
o E = expected frequency
o χ² = Σ (O − E)² / E, summed over every cell
 The more O differs from E, the larger χ² gets
 If O = E in every cell, χ² is zero…consistent with the NULL HYP.
If the calculated value is greater than the critical value, then we conclude that there is a relationship between the two variables (in the population); if the calculated value is less than the critical value, we cannot make such a conclusion
Degrees of freedom
o Df= (rows-1)(columns-1)
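A quick sketch of the test on a made-up 2×2 crosstab (all counts are invented):

```python
# Chi-squared test of independence on a 2x2 table of observed frequencies.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 20],   # row 1 of the crosstab
                     [15, 35]])  # row 2 of the crosstab
chi2, p, df, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, df={df}, p={p:.4f}")  # df = (2-1)*(2-1) = 1
```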
Difference of Means Test:
o Two different samples of data
o Used when we have a continuous dependent variable & a categorical independent variable
o The test determines whether the difference between the means of the two samples is statistically significant
T-test:
- Standard error: measure of uncertainty about a statistical estimate
- t = (difference between the two sample means) / (standard error of the difference)
- If t < critical value → not confident in the relationship
- If t > critical value → conclude that there IS a relationship
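A sketch of the test with two invented samples (group labels and values are made up):

```python
# Two-sample t-test: continuous DV, categorical IV with two groups.
import numpy as np
from scipy import stats

group_a = np.array([5.1, 6.2, 5.8, 6.5, 5.9, 6.1])
group_b = np.array([4.2, 4.9, 5.0, 4.4, 4.8, 4.6])
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t={t_stat:.2f}, p={p_value:.4f}")  # reject the null if p < .05
```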
Normal distribution is symmetrical around the mean (mean, mode and median are all the same)- the 68-95-99.7 rule
Sampling distribution is the hypothetical distribution of sample means
o A nonrandom sample of convenience does very little to justify inferences from the sample to the population
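A sketch of the idea: even when the population is heavily skewed, the distribution of sample means comes out roughly normal (the exponential population with mean 2.0 is an arbitrary choice):

```python
# The sampling distribution of the mean is approximately normal (CLT),
# even when the underlying population is heavily right-skewed.
import numpy as np

rng = np.random.default_rng(1)
sample_means = rng.exponential(scale=2.0, size=(10000, 50)).mean(axis=1)
print(f"mean of sample means = {sample_means.mean():.3f} (population mean = 2.0)")
print(f"SD of sample means   = {sample_means.std():.3f} "
      f"(theory: 2/sqrt(50) = {2 / np.sqrt(50):.3f})")
```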
HYPOTHESIS TESTING:
(i) testing the null, H0 = no relationship
(ii) minimizing type I error- FALSE POSITIVE (e.g., convicting an innocent person / rejecting the null when it is true)
(iii) P-values- Levels of Significance- the probability of a type I error…the probability of rejecting the null when the null is true
o we want small values
o the value does NOT tell us about strength or causality
(iv) statistical significance- χ²- both the dependent and independent variables must be categorical
o difference of means- dependent is continuous, independent is categorical
covariance:
- cov(X,Y) = Σ(Xi − X̄)(Yi − Ȳ) / (n − 1)…positive when X and Y move together, negative when they move in opposite directions
correlation coefficient:
- r = cov(X,Y) / (sX · sY), which standardizes covariance to run from −1 to +1
- if all the points on the plot line up perfectly on a straight, positively sloping line, r = 1…if negatively sloping, r = −1
t-test for correlation coefficient:
- t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom
- tests the null hypothesis that there is no linear relationship in the population
- compare the calculated t to the critical value, as with any t-test
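A sketch computing covariance, r, and the t-test for r (the x and y values are invented):

```python
# Covariance, Pearson's r, and the t-test for r on made-up data.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

cov_xy = np.cov(x, y, ddof=1)[0, 1]                # sample covariance
r, p = stats.pearsonr(x, y)                        # r and its two-tailed p-value
t = r * np.sqrt(len(x) - 2) / np.sqrt(1 - r**2)    # the same test computed by hand
print(f"cov={cov_xy:.2f}, r={r:.3f}, t={t:.2f}, p={p:.4f}")
```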
CORRELATION & BIVARIATE REGRESSION:
Continuous dependent variable and continuous independent variable…allows us to control for confounders (multiple regression)
o Y = MX + B
 M = slope, B = y-intercept
 M & B are the line's parameters
 POPULATION REGRESSION MODEL:
o Yi = α + βXi + µi
 α = the y-intercept of the population regression line
 β = the slope: the effect of X on Y
 i = index of the data set (observation number)
 µi = stochastic, or "random," component of the dependent variable (the error term)
In bivariate regression we use information from the sample regression model to make inferences about the unseen population regression model
o The sample regression model places hats (^) on terms to indicate that they are estimates of terms from the unseen population regression model
o The expectation E(Y|Xi) is the expected value of Y given Xi
Line of "best fit"
Add together the squared value of each of the residuals…we want to choose the line with the smallest total
Ordinary least squares (OLS) regression finds the line that minimizes the sum of the squared residuals
In bivariate regression, you NEVER observe the entire population
- Yi = α + βXi + µi → the actual value of Y
- Ŷi = α̂ + β̂Xi → the predicted value of Y
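A sketch of fitting the line by hand with numpy (same invented x and y data as above):

```python
# Bivariate OLS "by hand": beta_hat = cov(X,Y)/var(X), alpha_hat = Ybar - beta_hat*Xbar.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()
y_pred = alpha_hat + beta_hat * x      # predicted values (Y-hat)
residuals = y - y_pred                 # estimated stochastic component
print(f"alpha_hat={alpha_hat:.3f}, beta_hat={beta_hat:.3f}")
```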
Look for "goodness of fit"…R²: the proportion of the variation (from the mean) in the dependent variable that the model accounts for
Total variation = explained + unexplained
0 ≤ R² ≤ 1…R² = 1 only when all the points fall exactly on the line; R² = 0 means the model explains none of the variation
H0 (null hypothesis): β = 0
ORDINARY LEAST SQUARES:
The formulae for the OLS parameter estimates come from minimizing the sum of squared residuals (setting its derivatives equal to zero) and solving for β hat and α hat.
o The denominator for β hat is the sum of squared deviations of Xi from the mean value of X
 For a given covariance, the more spread out X is, the less steep the estimated slope.
Y variance is broken into explained and residual components
R-squared statistic: ranges between zero and 1…indicates goodness of fit…the proportion of variation in the dependent variable accounted for by the model
Total Sum of Squares (TSS): the total variation in Y
o TSS = Σ(Yi − Ȳ)²
Residual Sum of Squares (RSS): the residual variation in Y (not accounted for by X)
o RSS = Σ(Yi − Ŷi)²
Model Sum of Squares (MSS): the explained variation in Y
o MSS = Σ(Ŷi − Ȳ)² = TSS − RSS
o R² = MSS / TSS
R² tells us what share of the variation in the dependent variable the model accounts for (e.g., if R² = .55, the model accounts for 55% of the variance in the dependent variable)
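A sketch computing the three sums of squares and R² for the same invented data:

```python
# TSS, RSS, MSS, and R-squared for the bivariate fit above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()
y_pred = alpha_hat + beta_hat * x

tss = np.sum((y - y.mean()) ** 2)   # total variation in Y
rss = np.sum((y - y_pred) ** 2)     # unexplained (residual) variation
mss = tss - rss                     # explained (model) variation
print(f"TSS={tss:.3f}, RSS={rss:.3f}, MSS={mss:.3f}, R2={mss / tss:.3f}")
```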
The unseen error variance (σ²) is estimated from the residuals µ̂i after the parameters of the sample regression model have been estimated, using the formula:
o σ̂² = Σµ̂i² / (n − 2)
Larger residuals mean the individual point is further from the regression line
o Larger residuals → larger estimated variance and a larger SE for the slope parameter estimate
The further the points are from the regression line, the less confidence we have in the value of the slope
The more variation there is in X, the more precisely we will be able to estimate the relationship between X & Y…larger sample sizes → smaller standard errors
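A sketch of the estimated error variance and the slope's standard error, SE(β̂) = σ̂ / √Σ(Xi − X̄)², on the same invented data:

```python
# Estimate the error variance and the standard error of the OLS slope.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()
residuals = y - (alpha_hat + beta_hat * x)

sigma2_hat = np.sum(residuals ** 2) / (len(x) - 2)           # RSS / (n - 2)
se_beta = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))  # smaller when X is spread out
print(f"sigma_hat={np.sqrt(sigma2_hat):.3f}, SE(beta_hat)={se_beta:.3f}")
```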
T-Ratio:
o t = β̂ / SE(β̂)
For the t-calculation we need degrees of freedom (equal to the # of cases [n] minus the # of parameters estimated [k])
Level of significance in the t-table → critical value
o If the calculated t > the critical value → REJECT the null hypothesis & conclude that there IS a relationship
 Rejecting the null hypothesis means the slope of the regression line (the effect of X on Y) is statistically significant
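A sketch putting the pieces together for a two-tailed .05-level test (same invented data; df = n − 2 for the bivariate model):

```python
# The t-ratio for the slope and a decision against the critical value.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()
residuals = y - (alpha_hat + beta_hat * x)

df = len(x) - 2                                              # n cases, k = 2 parameters
se_beta = np.sqrt(np.sum(residuals**2) / df / np.sum((x - x.mean())**2))
t_ratio = beta_hat / se_beta
t_crit = stats.t.ppf(0.975, df)                              # two-tailed .05 level
print(f"t={t_ratio:.2f} vs critical value {t_crit:.2f}; reject H0: {abs(t_ratio) > t_crit}")
```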
R² helps measure the variance explained in an outcome…we CANNOT compare the magnitudes of the coefficients (because the variables are measured on different scales)
Multiple regression measures the independent effect of each variable by holding the values of the other variables constant
If a causal variable is excluded from the model, we cannot be confident that our results will be correct (omitted variable bias)
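A sketch of multiple regression via least squares (two made-up predictors with known true coefficients, so the estimates can be checked against them):

```python
# Multiple regression: each coefficient is the effect of its variable
# holding the other variables constant.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])       # design matrix with an intercept
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"alpha={coefs[0]:.2f}, beta1={coefs[1]:.2f}, beta2={coefs[2]:.2f}")
```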
Problems related to observational research: no manipulation of which cases receive treatments
Estimates can be inaccurate if there is a non-linear relationship between the independent and dependent variables (see the sketch below)
o Ex) a parabolic relationship
o Inaccurate if you have a small sample size → wide confidence intervals
o Inaccurate if the model predicts one particular range of the dependent variable better than others
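A sketch of the parabolic case: a straight-line fit reports a slope near zero even though X strongly determines Y (the quadratic data are invented):

```python
# A linear fit badly mis-states a parabolic relationship.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.size)   # true relationship is parabolic

beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
print(f"linear slope estimate: {beta_hat:.3f}")  # near zero, as if 'no relationship'
# Adding an x**2 term to the model would recover the curvature.
```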