Statistics - Summary - Michigan State University
PRR 844 Spring 2001  Page 1

STATISTICS - SUMMARY

1. Functions of statistics
a. description: summarize a set of data.
b. inference: make generalizations from sample to population - parameter estimates, hypothesis tests.

2. Types of statistics
i. Descriptive statistics: describe a set of data.
a. frequency distribution - SPSS Frequency
b. central tendency: mean, median (order statistics), mode. SPSS - Descriptives
c. dispersion: range, variance & standard deviation in Descriptives
d. others: shape - skewness, kurtosis.
e. EDA procedures (exploratory data analysis). SPSS Explore
   Stem & leaf display: ordered array, frequency distribution & histogram all in one.
   Box and Whisker plot: five number summary - min., Q1, median, Q3, and max.
   Resistant statistics: trimmed and winsorized means, midhinge, interquartile deviation.
ii. Inferential statistics: make inferences from samples to populations.
a. Parameter estimation - confidence intervals around population parameters.
b. Hypothesis testing - test relationships between variables.
iii. Parametric vs nonparametric statistics
a. parametric: assume interval scale measurements and normally distributed variables.
b. nonparametric (distribution free statistics): generally weaker assumptions - ordinal or nominal measurements; don't specify the exact form of the distribution.

3. General rules for interpreting hypothesis tests
i. You test a NULL hypothesis - the NULL hypothesis is a statement of NO relationship between the two variables (e.g., means are the same for different subgroups, correlation is zero, no relationship between row and column variable in a crosstab table).
a. Pearson Correlation: rxy = 0.
b. T-Test: mean of X = mean of Y.
c. One Way ANOVA: M1 = M2 = M3 = ... = Mn.
d. Chi square: no relationship between X and Y. Formally, this is captured by the "expected table", which assumes cells in the X-Y table can be generated completely from row and column totals.
ii. Tests are conducted at a given "confidence level" - most common is a 95% level.
At this level there is a 5% chance of incorrectly rejecting the null hypothesis when it is true. For a stricter test, use a 99% confidence level and look for SIG's < .01. For a weaker test, use 90% and SIG's < .10.
iii. On computer output, look for the SIGnificance or PROBability associated with the test. The F, T, Chi-square, etc. are the actual "test statistics", but the SIG's are what you need to complete the test. SIG gives the probability that you could get results like those you see from a random sample of this size IF there were no relationship between the two variables in the population from which it is drawn. If the probability is small (< .05), you REJECT the assumption of no relationship (the null hypothesis).
   For the 95% level, you REJECT the null hypothesis if SIG < .05. If SIG > .05, you FAIL TO REJECT. REJECTING the NULL HYPOTHESIS means the data suggest that there is a relationship.
iv. Hypothesis tests assess whether one can generalize from information in the sample to draw conclusions about relationships in the population. With very small samples most null hypotheses cannot be rejected, while with very large samples almost any hypothesized relationship will be "statistically significant" even when not practically significant. Be cognizant of sample size (N) when making tests.
   Type I error: rejecting the null hypothesis when it is true. Prob of a Type I error is 1 - confidence level.
   Type II error: failing to reject the null hypothesis when it is false. Power of a test = 1 - prob of a Type II error.

Guidance to Statistical Tests - Hypothesis Tests

Nominal x Nominal
Chi Square: Tests the null hypothesis of no relationship between two nominal scale variables. You need a minimum of 5 cases per cell in your table, so don't run it with variables that have too many categories (recode if necessary to collapse categories). Use the Pearson Chi Square statistic in SPSS.

Nominal x Interval
T-Test: Tests the null hypothesis of no difference in means between two groups.
The independent variable divides the population into two groups; estimate means for each group on the dependent variable (interval scale).
   One Sample T-test: tests if the mean of a single group differs from some constant.
   Independent Samples: the most common test - use this one to compare two groups.
   Paired T-test: for repeated measures on a single sample, e.g. a pre- and post-test score, when you want to test "gain scores" of individuals vs means of the two tests.
   Note: Two sample T tests generally use different formulas depending on whether the variances in the two samples are assumed to be the same or different. An F-test is performed to choose between a "pooled sample" variance estimate or "separate" variances. In most cases it doesn't matter, but the procedure is to use the pooled variance estimator if the F-test of differences in variances is not significant.

One Way ANOVA: Generalization of the T-test to more than two groups. The independent variable is nominal and forms the groups; means are computed for each group. The F-test tests the null hypothesis that all the group means are the same. You can run "Contrasts" to do t-tests between pairs of groups and can adjust for multiple tests using Bonferroni and related adjustments. Read about this in a statistics reference if you wish.

Interval by Interval
Pearson Correlation: Run the CORRELATION procedure to get the correlation coefficient between the two variables AND a test of the null hypothesis that the correlation in the population is zero. Be sure you understand the distinction here between the measure of association between the two variables in the sample (the correlation coefficient) and the test of the hypothesis that the correlation is zero (making an inference to the population).

Regression: The multivariate extension of correlation. A linear relationship between a dependent variable and several independent variables is estimated.
The t-statistics for each regression coefficient test for a relationship between X and Y while controlling for the other independent variables. Standardized regression coefficients (betas) indicate the relative importance of each independent variable. The R square statistic (use adjusted R square) measures the amount of variation in Y explained by the X's.

Nonparametric Tests (or Distribution free statistics)

All of the above except Chi square are called parametric statistics, meaning they assume interval scale measurements and that variables are normally distributed in the population. A number of nonparametric tests have been developed for situations where variables are measured at nominal or ordinal scales. These tests do not assume interval scale properties. The ordinal tests are based only on rankings or "order statistics". The most common nonparametric tests are:
   Chi square: nominal x nominal, described above.
   Rank order or Spearman correlation: for two ordinal variables; Kendall's tau if many "ties".
   Mann Whitney U: corresponds to the T-Test; the difference in means when the dependent variable is ordinal becomes a difference in ranks.
   Wilcoxon matched pairs: corresponds to the paired t-test with ordinal data.
   Kruskal Wallis: one way ANOVA with ordinal data.
   Friedman: two way ANOVA with two or more related samples and ordinal data.
   Others: Kolmogorov-Smirnov, Runs, Sign test.

EXAMPLES OF T-TEST AND CHI SQUARE

T-TEST. Tests for differences in means (or percentages) across two subgroups. The null hypothesis is mean of Group 1 = mean of Group 2. This test assumes an interval scale measure of the dependent variable (the one you compute means for) and that the distribution in the population is normal. The generalization to more than two groups is called a one way analysis of variance, and the null hypothesis is that all the subgroup means are identical. These are parametric statistics since they assume interval scale and normality.
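The pooled-variance t statistic described above can be sketched in Python. This is a minimal illustration of the formula, not SPSS output; the function name and sample data are made up for the example.

```python
import math

def pooled_t(sample1, sample2):
    """Two-sample t statistic using the pooled-variance formula
    (appropriate when the F-test does not reject equal variances)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    # unbiased sample variances
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    # pooled variance weights each sample variance by its degrees of freedom
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2  # t statistic and its df

t, df = pooled_t([5, 6, 7, 8, 9], [3, 4, 5, 6, 7])
```

Compare t against a t table with the returned degrees of freedom (or let SPSS report the SIG directly).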
In SPSS use Compare Means; several options as follows:
   Means: compare subgroup means; Options > ANOVA for the statistical test.
   One Sample T-Test: test H0: mean of variable = some constant.
   Indep. Samples T-Test: two groups; test H0: mean for group 1 = mean for group 2.
   Paired Samples T-Test: paired variables - applies in the pre-test, post-test situation.
   One Way ANOVA: compare means for more than two groups.

Chi square is a nonparametric statistic to test if there is a relationship in a contingency table, i.e. Is the row variable related to the column variable? Is there any discernible pattern in the table? Can we predict the column variable Y if we know the row variable X?

The Chi square statistic is calculated by comparing the observed table from the sample with an "expected" table derived under the null hypothesis of no relationship. If Fo denotes a cell in the observed table and Fe the corresponding cell in the expected table, then

   Chi square = sum over cells of (Fo - Fe)^2 / Fe

The cells in the expected table are computed from the row (nr) and column (nc) totals for the sample as follows: Fe = nr * nc / n.

CHI SQUARE TEST EXAMPLE: Suppose a sample (n=100) from the student population yields the following observed table of frequencies:

                 GENDER
   IM-USE    Male   Female   Total
   Yes        20      40       60
   No         30      10       40
   Total      50      50      100

EXPECTED TABLE UNDER NULL HYPOTHESIS (NO RELATIONSHIP)

                 GENDER
   IM-USE    Male   Female   Total
   Yes        30      30       60
   No         20      20       40
   Total      50      50      100

   Chi square = (20-30)^2/30 + (40-30)^2/30 + (30-20)^2/20 + (10-20)^2/20
              = 100/30 + 100/30 + 100/20 + 100/20 = 16.67

Chi square tables report the probability of getting a Chi square value this high for a particular random sample, given that there is no relationship in the population. If doing the test by hand, you would look up the probability in a table. There are different Chi square distributions depending on the number of cells in the table. Determine the number of degrees of freedom for the table as (rows - 1) x (columns - 1). In this case it is (2-1)*(2-1) = 1.
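The observed-to-expected comparison above can be sketched in Python. This is a minimal illustration of the formula (the function name is my own), not the SPSS procedure:

```python
def chi_square(observed):
    """Chi square statistic for an r x c table of observed counts.
    Expected cells come from row and column totals: Fe = nr * nc / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected cell
            chi2 += (fo - fe) ** 2 / fe
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# IM-USE x GENDER table from the example: rows Yes/No, columns Male/Female
chi2, df = chi_square([[20, 40], [30, 10]])
```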
The probability of obtaining a Chi square of 16.67 given no relationship is less than .001. (The last entry in my table gives 10.83 as the Chi square value corresponding to a probability of .001, so 16.67 would have a smaller probability.) If using a computer package, it will normally report both the Chi square and the probability or significance level corresponding to this value. In testing your null hypothesis, REJECT if the reported probability is less than .05 (or whatever confidence level you have chosen). FAIL TO REJECT if the probability is greater than .05.

REVIEW OF STEPS IN HYPOTHESIS TESTING: For the above example:
(1) Nominal level variables, so we used Chi square.
(2) State the null hypothesis: no relationship between GENDER and IM-USE.
(3) Choose the confidence level: 95%, so alpha = .05 and the critical region is Chi square > 3.84.
(4) Draw the sample and calculate the statistic: Chi square = 16.67.
(5) 16.67 > 3.84, so inside the critical region: REJECT the null hypothesis. Alternatively, SIG = .001 on the computer printout, and .001 < .05, so REJECT the null hypothesis. Note we could have rejected the null hypothesis at the .001 level here.

WHAT HAVE WE DONE? We have used probability theory to determine the likelihood of obtaining a contingency table with a Chi square of 16.67 or greater given that there is no relationship between GENDER and IM-USE. If there is no relationship (the null hypothesis is true), obtaining a table that deviates as much as the observed table does from the expected table would be very rare - a chance of less than one in 1,000. We therefore assume we didn't happen to get this rare sample, but instead that our null hypothesis must be false. Thus we conclude there is a relationship between GENDER and IM-USE. The test doesn't tell us what the relationship is, but we can inspect the observed table to find out. Calculate row or column percents and inspect these. For row percents, divide each entry on a row by the row total.
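The row- and column-percent arithmetic can be sketched in Python (a minimal illustration; the function names are my own):

```python
def row_percents(table):
    """Divide each entry by its row total."""
    return [[round(v / sum(row), 2) for v in row] for row in table]

def col_percents(table):
    """Divide each entry by its column total."""
    totals = [sum(col) for col in zip(*table)]
    return [[round(v / t, 2) for v, t in zip(row, totals)] for row in table]

obs = [[20, 40], [30, 10]]  # rows Yes/No, columns Male/Female
```

Running both on the observed table reproduces the two percent tables shown below.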
Row percents:

                 GENDER
   IM-USE    Male   Female   Total
   Yes        .33     .67     1.00
   No         .75     .25     1.00
   Total      .50     .50     1.00

To find the "pattern" in the table, compare the row percents for each row with the "Total" row at the bottom. Thus, half of the sample are men, whereas only a third of IM users are male and three quarters of nonusers are male. Conclusion - men are less likely to use IM.

Column percents: Divide entries in each column by the column total.

                 GENDER
   IM-USE    Male   Female   Total
   Yes        .40     .80     .60
   No         .60     .20     .40
   Total     1.00    1.00    1.00

PATTERN: 40% of males use IM, compared to 80% of women. Conclude women are more likely to use IM. Note in this case the column percents provide a clearer description of the pattern than the row percents.

OTHER STATISTICAL NOTES AND SAMPLE PROBLEMS

a. Measures of strength of a relationship vs a statistical test of a hypothesis. There are a number of statistics that measure how strong a relationship is, say between variable X and variable Y. These include parametric statistics like the Pearson correlation coefficient, rank order correlation measures for ordinal data (Spearman's rho and Kendall's tau), and a host of nonparametric measures including Cramer's V, phi, Yule's Q, lambda, gamma, and others. DO NOT confuse a measure of association with a test of a hypothesis. The Chi square statistic tests a particular hypothesis. It tells you little about how strong the relationship is, only whether you can reject a hypothesis of no relationship based upon the evidence in your sample. The problem is that the size of Chi square depends on the strength of the relationship as well as the sample size and number of cells. There are measures of association based on Chi square that control for the number of cells in the table and the sample size. Correlation coefficients from a sample tell how strong the relationship is in the sample, not whether you can generalize this to the population.
There is a test of whether a correlation coefficient is significantly different from zero that evaluates generalizability from the sample correlation to the population correlation. It tests the null hypothesis that the correlation in the population is zero.

b. Statistical significance versus practical significance. Hypothesis tests merely test how confidently we can generalize from what was found in the sample to the population we have sampled from. They assume random sampling - thus, you cannot do statistical hypothesis tests from a non-probability sample or a census. The larger the sample, the easier it is to generalize to the population. For very large sample sizes, virtually ALL hypothesized relationships are statistically significant. For very small samples, only very strong relationships will be statistically significant. What is practically significant is a quite different matter from what is statistically significant. Check to see how large the differences really are to judge practical significance, i.e. does the difference make a difference?

c. Confidence intervals around parameter estimates. When you use a sample statistic to estimate a population parameter, you base your estimate on a single sample. Estimates will vary somewhat from one sample to another. Reporting results as confidence intervals acknowledges this variation due to sampling error. When probability samples are used, we can estimate the size of this error. The standard error of the mean (SEMean) is the standard deviation of the sampling distribution - i.e. how much do means for different samples of a given size from the same population vary? The SEMean provides the basic measure of likely sampling error in a sample estimate. A 95% confidence interval is two (1.96) standard errors (SE) either side of the sample mean.

   SEMean = standard deviation in the population / square root of n (sample size)

SPSS computes standard deviations and/or standard errors for you.
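The two-standard-error rule above can be sketched in Python (a minimal illustration using the handout's rule of thumb of 2 rather than 1.96; the function name is my own):

```python
import math

def ci95(mean, sd, n):
    """95% confidence interval: mean +/- ~2 standard errors,
    where SEMean = sd / sqrt(n)."""
    se = sd / math.sqrt(n)
    return mean - 2 * se, mean + 2 * se

# e.g. a sample percentage of 40%, population sd 30%, n = 100:
lo, hi = ci95(40, 30, 100)   # SEMean = 3, so the CI is (34, 46)
```

Quadrupling n to 400 would halve the standard error and so halve the width of the interval.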
You should be able to compute a 95% confidence interval if you have the sample mean (say X) and
a) the standard error of the mean (SEMean): 95% CI = (X - 2*SEMean, X + 2*SEMean)
b) the standard deviation of the variable in the population (σ) and the sample size (n): SEMean = σ/sqrt(n), so 95% CI = (X - 2*σ/sqrt(n), X + 2*σ/sqrt(n))

Examples:
a) In a sample of size 100, the pct reporting a previous visit to the park is 40%. If SEMean is 5%, then the 95% CI is (40% +/- 2 * 5%) = (30%, 50%).
b) In a sample of size 100, the pct reporting a previous visit to the park is 40%. If the standard deviation in the population is 30%, then SEMean = σ/sqrt(n) = 30/sqrt(100) = 30/10 = 3, and the 95% CI = (40 +/- 2*SEMean) = (40 +/- 2*3%) = (40 +/- 6%) = (34%, 46%).
c) With the same mean and standard deviation as b) but a bigger sample of 900, note that the 95% CI = (40 +/- 2 * 30%/sqrt(900)) = (40 +/- 2*(30/30)) = (40 +/- 2%) = (38%, 42%).

Brief Summary of Multivariate Analysis Methods. SPSS procedure in capitals.

1. Linear Regression: Estimate a linear relationship between a dependent variable and a set of independent variables. All must be interval scale or dichotomous (dummy variables). (See Babbie p 437; T&H p 619; also JLR 15(4).) Examples: estimating participation in recreation activities, cost functions, spending. REGRESSION, Linear.

2. Non-linear models: Similar to the above except for the functional form of the relationship. Gravity models, logit models, and some time series models are examples. (See Stynes & Peterson JLR 16(4) for logit; Ewing Leisure Sciences 3(1) for gravity.) Examples: similar to the above when relationships are non-linear. Gravity models are widely used in trip generation and distribution models; logit models in predicting choices. REGRESSION, Non-linear; TIME SERIES.

3. Cluster analysis: A host of different methods for grouping objects based upon their similarity across several variables. (See Romesburg JLR 11(2) & the book review in the same issue.)
Examples: used frequently to form market segments or otherwise group cases. See Michigan Ski Market Segmentation Bulletin #391 for a good example. CLASSIFY, K-means or Non-hierarchical.

4. Factor analysis: Method for reducing a large number of variables into a smaller number of independent (orthogonal) dimensions or factors. (See Kass & Tinsley JLR 11(2); Babbie p 444; T&H pp 627.) Examples: used in theory development (e.g. What are the underlying dimensions of leisure attitudes?) and data reduction (reduce the number of independent variables to a smaller set). DATA REDUCTION, FACTOR.

5. Discriminant analysis: Predicts group membership using linear "discriminant" functions. This is a variant of linear regression suited to predicting a nominal dependent variable. (See JLR 15(4); T&H pp 625.) Examples: predict whether an individual will buy a sail, power, or pontoon boat based upon demographics and socioeconomics. CLASSIFY, Discriminant.

6. Analysis of Variance (ANOVA): Identifies sources of variation in a dependent variable across one or more independent variables. Tests the null hypothesis of no difference in means of the dependent variable for three or more subgroups (levels or categories of the independent variable). The basic statistical analysis technique for experimental designs. (See T&H pp 573, 598.) Multivariate analysis of variance (MANOVA) is the extension to more complex designs. (See JLR 15(4).) Compare Means, ANOVA for one way; GENERAL LINEAR MODEL for more complex designs.

7. Multidimensional scaling (MDS): Refers to a number of methods for forming scales and identifying the structure (dimensions) of attitudes. Differs from factor analysis in employing non-linear methods. MDS can be based on measures of similarities between objects. Applications in recreation & tourism: mapping images of parks or travel destinations; identifying dimensions of leisure attitudes. SCALE, Reliability and MDS. (See T&H pp 376.)

8. Others: Path analysis (LISREL) (Babbie p 441), canonical correlation, conjoint analysis (see T&H p 359, App C), multiple classification analysis, time series analysis (Babbie p 443), log linear analysis (LOGLINEAR, HILOGLINEAR), linear programming, simulation. SPSS has a routine called AMOS for structural modeling, path analysis, and causal modeling. See the AMOS web site: http://www.smallwaters.com/amos/index.html

Measures of Association

These measure the strength of a relationship in a sample. Most measures of association are like the correlation coefficient: they often vary between 0 and 1 (or -1 and 1 if directional). A value of zero means no (linear) association; one, a perfect relationship. The preferred measures of association have a PRE (proportionate reduction in error) interpretation. This means they tell us how much better we can predict the dependent variable if we know the value of the independent variable. See Babbie pp 416-420 or the SPSS Data Analysis Guide, Chapter 19, for details.

Nominal Variables
There are several measures of association based on the Chi square statistic. Beware of Chi square as a measure of association vs a test of a hypothesis, as the magnitude of Chi square depends on the sample size N and the number of cells in the table. You can't compare Chi squares across samples or tables of different size to indicate weaker or stronger relationships.
   Phi: a PRE measure for 2 by 2 tables; normalizes Chi square for sample size. The contingency coefficient and Cramer's V are not PRE measures.
   Lambda: the most common PRE measure of association for nominal x nominal - based on improving predictions of one variable knowing the value of the other. Symmetric and asymmetric versions depending on which is the dependent variable.

Ordinal Variables
   Gamma (Goodman and Kruskal's gamma): from concordant & discordant pairs.
   Spearman rank order correlation.
   Kendall's tau-b: adjusts for ties. Tau-c and Somers' d are NOT PRE measures.
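The concordant/discordant-pair idea behind gamma can be sketched in Python (a minimal illustration for small ordinal data sets; the function name is my own):

```python
def gamma(x, y):
    """Goodman and Kruskal's gamma: (C - D) / (C + D), where C counts
    concordant pairs and D discordant pairs. Pairs tied on either
    variable are ignored."""
    c = d = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1      # pair ordered the same way on both variables
            elif s < 0:
                d += 1      # pair ordered oppositely
    return (c - d) / (c + d)
```

A perfect positive ordering gives gamma = 1, a perfect reversal gives -1.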
Interval
   Pearson correlation: assumes two interval scale variables; the correlation squared gives the variation explained.
   Eta: for nominal or ordinal x interval; eta squared is the variation in the dependent variable explained by the independent variable.
These and other measures are produced by the SPSS CROSSTABS procedure. Choose Statistics > All.

The Normal Distribution

Important because:
1. Many continuous phenomena follow this distribution.
2. It can be used to approximate many discrete distributions.
3. The central limit theorem makes it the centerpiece of classical statistical inference, i.e., sampling distributions are normal.

Properties: Bell shaped; mean = median = mode give the center of the distribution; symmetric about the mean; infinite range; interquartile range = 1.33 standard deviations, i.e. plus or minus two-thirds of a standard deviation from the mean. 68% of values fall within one standard deviation of the mean, 95% within two, and roughly 99% within three standard deviations.

The probability density for N(μ, σ) is f(x) = [1/(σ*sqrt(2π))] * exp[-(x-μ)²/(2σ²)].

(Figure: Areas Under the Normal Distribution)

Standardizing the Normal Distribution: The standardized normal distribution has mean = 0 and standard deviation = 1, denoted N(0,1). One can transform a random variable X with a normal distribution with mean μ and standard deviation σ to a random variable Z with N(0,1) by the transformation Z = (X - μ)/σ. For each X, the corresponding Z is known as its Z-score. Areas under the standard normal distribution are tabulated in tables of the standardized normal distribution. (Be careful to check exactly which areas are given.)

Sampling distribution of the mean = the distribution of means in all possible samples of size n from a given population. There is a different distribution for each value of n, which summarizes the range of sample means one may obtain. From this distribution we estimate the probability of obtaining a given sample mean, i.e., determine confidence intervals for sample estimates.
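The standardization and the normal areas quoted above can be sketched in Python; instead of looking areas up in a table, this minimal illustration uses the error function (the function names are my own):

```python
import math

def z_score(x, mu, sigma):
    """Standardize: Z = (X - mu) / sigma."""
    return (x - mu) / sigma

def normal_cdf(z):
    """Area under N(0,1) to the left of z, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# About 95% of values fall within two standard deviations of the mean:
coverage = normal_cdf(2) - normal_cdf(-2)
```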
The standard error of the mean is defined to be the standard deviation of the sampling distribution of the mean.

Central limit theorem: When drawing samples of size n from a population with mean μ and standard deviation σ, the sampling distribution of the mean approaches a normal distribution with mean equal to the population mean and standard deviation equal to σ/sqrt(n). Hence, the standard error of the mean is σ/sqrt(n), where σ is the standard deviation in the population and n is the sample size.

Sampling from finite populations: The above formula assumes sampling with replacement. Most surveys don't sample with replacement. For small populations (N), one must correct for the difference with the finite population correction factor (fpc): fpc = sqrt[(N-n)/(N-1)]. The standard error of the mean becomes [σ/sqrt(n)] * fpc.

Sample size determination for the mean: To estimate the sample size needed to achieve a given level of accuracy and confidence in the estimate of the mean, you need:
1. The confidence level desired - this determines the value of Z.
2. The sampling error permitted, e.
3. The standard deviation in the population, σ.
Then the necessary sample size is n = Z²σ²/e². Transform this equation to estimate the sampling error for a given sample size, population variance & confidence level: e = Z*σ/sqrt(n). For Z = 2 (95% confidence level), e = 2σ/sqrt(n), two standard errors.

Statistical help on the WEB

Several outstanding statistics web sites:

Electronic textbook from StatSoft http://www.statsoft.com/textbook/stathome.html : can download the complete text (5 MB zipped) or browse on line. The Elementary Concepts and Basic Statistics chapters provide a good foundation.

HyperStat Online (Rice Univ) http://davidmlane.com/hyperstat/index.html : a more basic statistics text including lots of links and references.

DAU Stat Refresher http://www.cne.gmu.edu/modules/dau/stat/ : an interactive tutorial from the "Defense Acquisition University" (whatever that is - developed at George Mason Univ.).
SPSS Web Site: http://www.spss.com
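As a closing worked example, the sample size and sampling error formulas from the normal distribution section can be sketched in Python (a minimal illustration; the function names are my own, and the optional N argument applies the fpc):

```python
import math

def sample_size(z, sigma, e):
    """n = Z^2 * sigma^2 / e^2, rounded up to the next whole case."""
    return math.ceil((z * sigma / e) ** 2)

def sampling_error(z, sigma, n, N=None):
    """e = Z * sigma / sqrt(n); multiplied by the finite population
    correction sqrt[(N - n)/(N - 1)] when a small population N is given."""
    e = z * sigma / math.sqrt(n)
    if N is not None:
        e *= math.sqrt((N - n) / (N - 1))
    return e

# With Z = 2, sigma = 30, and a permitted error of 2 points, n = 900:
n = sample_size(2, 30, 2)
```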