- Central Limit Theorem: requires a random sample
- Standard Error: a measure of uncertainty about a statistical estimate
  o As the sample size gets larger: more confidence, less standard error
- Confidence Interval: the range of values around a sample estimate within which the population value is likely to fall
- P-values:
  o Range between 0 and 1 and represent the probability that we would see the observed relationship due to random chance
  o Lower p-values increase our confidence that there is indeed a relationship between the two variables in question
  o Statistically significant at the .05 level
  o DOES NOT MEAN THE RELATIONSHIP IS STRONG OR CAUSAL
  o The p-value conveys the level of confidence with which we can reject the null hypothesis
- Tabular analysis: used when you have a categorical independent variable and a categorical dependent variable (crosstabs); you must determine what the individual cell values represent (proportions or percentages)
- Chi-Squared Test:
  o χ² = Σ (O − E)² / E
  o O = observed frequency
  o E = expected frequency
  o If O > E, χ² is large; if O = E, χ² is zero and the NULL HYPOTHESIS IS TRUE
  o If the calculated value is greater than the critical value, we conclude that there is a relationship between the two variables (in the population); if the calculated value is less than the critical value, we cannot make such a conclusion
- Degrees of freedom:
  o df = (rows − 1)(columns − 1)
- Difference of Means Test:
  o Compares two different samples of data
  o Used when we have a continuous dependent variable and a categorical independent variable
  o Determines whether the difference between the means of the two samples is statistically significant
- T-test:
  o t = (difference between sample means) / (standard error of the difference)
  o If t < critical value: not confident in the relationship
  o If t > critical value: conclude that there IS a relationship
- The normal distribution is symmetrical around the mean (mean, mode, and median are all the same): the 68-95-99.7 rule
- The sampling distribution is the hypothetical distribution of sample means
  o A nonrandom sample of convenience does very little to justify inferences from the sample to the population

HYPOTHESIS TESTING:
(i) Test the null hypothesis, H0 = no relationship
(ii) Minimize Type I error, a FALSE POSITIVE (e.g., convicting an innocent person; rejecting the null when it is true)
(iii) P-values (levels of significance): the probability of a Type I error, i.e., of rejecting the null when the null is true
  o We want small values
  o The value does NOT tell us about strength or causality
(iv) Statistical significance by variable type:
  o χ²: dependent and independent variables must both be categorical
  o Difference of means: dependent variable is continuous, independent variable is categorical

COVARIANCE & CORRELATION:
- Covariance: cov(X, Y) = Σ (Xi − X̄)(Yi − Ȳ) / (n − 1)
- Correlation coefficient (r): if all the points on the plot line up perfectly on a straight, positively sloping line, r = 1; if on a negatively sloping line, r = −1
- T-test for the correlation coefficient: t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom; tests the null hypothesis that the correlation in the population is zero

CORRELATION & BIVARIATE REGRESSION:
- A continuous dependent variable and a continuous independent variable; regression also allows us to control for confounders (multiple regression)
  o Y = MX + B: the slope M and the y-intercept B are the line's parameters
- POPULATION REGRESSION MODEL:
  o Yi = α + βXi + µi
  o i = index over the data set
  o µi = the stochastic, or "random," component of the dependent variable
- In bivariate regression we use information from the sample regression model to make inferences about the unseen population regression model
  o The sample regression model places hats (^) on its terms to indicate that they are estimates of terms from the unseen population regression model
  o The expected value of Y given Xi is E(Y | Xi)
- Line of "best fit": add together the squared value of each of the residuals and choose the line with the smallest total
  o Ordinary least squares (OLS) regression is the line that minimizes the sum of the squared residuals
- In bivariate regression, you NEVER observe the entire population
  o Yi = α + βXi + µi: the actual value of Y
  o Ŷi = α̂ + β̂Xi: the predicted value of Y
- Look for "goodness of fit" with R²: the proportion of variation from the mean in the dependent variable accounted for by the model
  o Total variation = explained variation + unexplained variation
  o 0 ≤ R² ≤ 1; R² = 1 would mean all points fall exactly on the line (no unexplained variation), which essentially never happens with real data
- H0 (null hypothesis): β = 0
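The chi-squared computation described above can be sketched in Python. This is a minimal illustration, not an example from the notes: the 2x2 table of observed frequencies is made-up data.

```python
# Chi-squared test of independence for a categorical IV and categorical DV.
# The 2x2 table of observed frequencies is made-up illustrative data.
observed = [[30, 10],   # rows = categories of one variable
            [20, 40]]   # columns = categories of the other

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency for each cell under the null of no relationship:
# E = (row total * column total) / n
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e   # grows as O moves away from E

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (rows-1)(cols-1)
print(round(chi2, 2), df)   # -> 16.67 1
```

With df = 1, the .05 critical value is 3.841; a χ² of about 16.67 exceeds it, so with this made-up table we would reject the null.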
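The difference-of-means t-test can be sketched the same way. The two groups below are made-up data, and the pooled-variance form used here assumes the groups share a common variance.

```python
# Two-sample difference-of-means t-test: continuous DV, categorical IV.
# The two samples are made-up illustrative data.
from math import sqrt

group_a = [5.1, 4.9, 6.0, 5.5, 5.8, 5.2]
group_b = [4.0, 4.4, 3.9, 4.6, 4.1, 4.3]

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

na, nb = len(group_a), len(group_b)
# Pooled variance: weighted average of the two sample variances.
pooled = ((na - 1) * sample_var(group_a)
          + (nb - 1) * sample_var(group_b)) / (na + nb - 2)
se = sqrt(pooled * (1 / na + 1 / nb))     # standard error of the difference
t = (mean(group_a) - mean(group_b)) / se  # the t-statistic
df = na + nb - 2                          # degrees of freedom

print(round(t, 2), df)   # -> 5.86 10
```

With df = 10, the two-tailed .05 critical value is 2.228; the t here is well above it, so we would conclude the two group means differ.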
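Covariance and the correlation coefficient r, as defined above, in a minimal sketch; the data are made-up, with y an exact positive linear function of x so that r comes out at 1.

```python
# Covariance and the correlation coefficient r for two continuous variables.
# Made-up data: y is an exact linear function of x, so r should equal 1.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # y = 2x: a perfect positive line

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Sample covariance: average co-movement of x and y around their means.
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# r rescales covariance by the standard deviations, so -1 <= r <= 1.
sx = sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
sy = sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
r = cov / (sx * sy)

print(round(r, 4))   # -> 1.0, a perfect positively sloping line
```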
ORDINARY LEAST SQUARES:
- The formulae for the OLS parameter estimates come from minimizing the sum of squared residuals (setting its partial derivatives to zero) and solving for β̂ and α̂
  o β̂ = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²; α̂ = Ȳ − β̂X̄
  o The denominator of β̂ is the sum of squared deviations of Xi from the mean value of X, so for a given covariance, the more spread out X is, the less steep the estimated slope
- The variance of Y is broken into explained and residual components
- R-squared statistic: ranges between 0 and 1; indicates goodness of fit; the proportion of variation in the dependent variable accounted for by the model
  o Total Sum of Squares (TSS): the total variation in Y, Σ (Yi − Ȳ)²
  o Residual Sum of Squares (RSS): the residual variation in Y not accounted for by X, Σ µ̂i²
  o Model Sum of Squares (MSS): the explained variation, TSS − RSS
  o R² = MSS / TSS: the model accounts for that share of the variation in the dependent variable (e.g., if R² = .55, the model accounts for 55% of the variance in the dependent variable)
- The unseen error variance (σ²) is estimated from the residuals µ̂i after the parameters of the sample regression model have been estimated: σ̂² = Σ µ̂i² / (n − k)
  o Larger residuals mean individual points are further from the regression line
  o Larger residuals mean a larger variance and a larger standard error of the slope parameter estimate
  o The further the points are from the regression line, the less confidence we have in the value of the slope
  o The more variation in X, the more precisely we can estimate the relationship between X and Y; larger sample sizes mean smaller standard errors
- T-Ratio: t = β̂ / SE(β̂)
  o The t-calculation needs degrees of freedom, equal to the number of cases (n) minus the number of parameters estimated (k)
  o Look up the critical value in a t-table at the chosen level of significance
  o If the calculated t > critical value: REJECT the null hypothesis and conclude that there IS a relationship
  o Rejecting the null hypothesis means the slope of the regression line (the effect of X on Y) is statistically significant
- R² measures the variance explained in an outcome; we CANNOT compare the magnitudes of coefficients (because the variables are measured on different scales)
- Multiple regression measures the independent effect of each variable by holding the values of the other variables constant
  o If a causal variable is excluded from the model, we cannot be confident that our results will be correct
- Problems related to observational research: no manipulation of which cases receive treatments
  o Estimates can be inaccurate if there is a non-linear relationship between the independent and dependent variables (e.g., parabolic)
  o Inaccurate if you have a small sample size (wide confidence intervals)
  o Inaccurate if the model predicts well only within a particular range of the dependent variable
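The OLS quantities in this section (β̂, α̂, R², σ̂², the slope's standard error, and the t-ratio) can be computed by hand in a short sketch. The x/y data are made-up: roughly y = 2x + 1 plus small errors.

```python
# Bivariate OLS by hand: slope, intercept, R-squared, and the t-ratio
# for the slope. Made-up data: roughly y = 2x + 1 plus small errors.
from math import sqrt

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.2, 4.9, 7.1, 8.8, 11.2, 12.8]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# beta-hat = sum (Xi - Xbar)(Yi - Ybar) / sum (Xi - Xbar)^2
sxx = sum((xi - mx) ** 2 for xi in x)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
beta = sxy / sxx
alpha = my - beta * mx               # alpha-hat = Ybar - beta-hat * Xbar

resid = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
tss = sum((yi - my) ** 2 for yi in y)   # total variation in Y
rss = sum(u ** 2 for u in resid)        # unexplained variation
r2 = 1 - rss / tss                      # R^2 = MSS / TSS

k = 2                                # parameters estimated: alpha and beta
sigma2 = rss / (n - k)               # estimate of the error variance
se_beta = sqrt(sigma2 / sxx)         # SE shrinks as X spreads out
t = beta / se_beta                   # compare to critical value at n - k df

print(round(beta, 3), round(alpha, 3), round(r2, 3))   # -> 1.96 1.14 0.998
```

With n − k = 4 degrees of freedom, the two-tailed .05 critical value is 2.776; the slope's t-ratio here is far above it, so in this made-up example we would call the slope statistically significant.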
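The parabolic caveat above is easy to see concretely: with made-up data where y = x² exactly and x is symmetric around zero, the OLS slope comes out at zero, wrongly suggesting no relationship even though X and Y are perfectly (but non-linearly) related.

```python
# Pitfall: fitting a straight line to a parabolic relationship.
# Made-up data: y = x^2 exactly, with x symmetric around 0.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

n = len(x)
mx = sum(x) / n                      # 0.0 by symmetry
my = sum(y) / n

sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
beta = sxy / sxx                     # the linear slope estimate

print(beta)   # -> 0.0: the straight line "sees" no relationship at all
```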