Statistical Analysis SC504/HS927, Spring Term 2008
Week 17 (25th January 2008): Analysing data

Recap
• Why
• How
• What

Questions to consider
• What is my question / interest?
• Can the sample tell me about it?
• What are the relevant variables, and what are their characteristics? Are they in the form I want? (If not, change them into that form.)
• What is the most appropriate method to use?
• If it works, what will / might it show me? NB negative results are not necessarily uninformative.
• If it doesn't work, why might that be?

Levels of measurement
• Nominal, e.g. colours: numbers are not meaningful
• Ordinal, e.g. order in which you finished a race: numbers don't indicate how far ahead the winner of the race was
• Interval, e.g. temperature: equal intervals between each number on the scale but no absolute zero
• Ratio, e.g. time: equal intervals between each number, with an absolute zero

Univariate analysis
• Measures of central tendency
– Mean = $\bar{x} = \frac{x_1 + x_2 + x_3 + x_4 + \dots + x_n}{n}$
– Median = midpoint of the distribution
– Mode = most common value

Mode: value or category that has the highest frequency (count)

age      frequency (count)
16-25    12
26-35    20
36-45    32
46-55    27
56+      18

sex      frequency (count)
male     435
female   654

Median: value that is halfway in the distribution (50th percentile)

age (7 values): 12, 14, 18, 21, 36, 41, 42    median = 21
age (6 values): 12, 14, 18, 21, 36, 41        median = (18+21)/2 = 19.5

Mean: the sum of all scores divided by the number of scores
• What most people call the average
• Mean: $\bar{x} = \sum x / N$

Which one to use?

            Mode   Median   Mean
Nominal     yes    no       no
Ordinal     yes    yes      no
Interval    yes    yes      yes

Measures of dispersion
– Range = highest value - lowest value
– Variance, $s^2 = \frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \dots + (Y_n - \bar{Y})^2}{n - 1} = \frac{\sum_i (Y_i - \bar{Y})^2}{n - 1}$
– Standard deviation, $s$ (or SD) $= \sqrt{s^2}$

The standard error of the mean and confidence intervals
– $SE = \frac{s}{\sqrt{n}}$

Definitions: Measures of Dispersion
• Variance: indicates how far the scores lie from the mean. To account for both + and - differences from the mean (so that they don't just cancel each other out), we square each difference and add the squares together (the Sum of Squares). This gives the total error within the sample, but the larger the sample the larger this total, so we divide by N-1 to get the average error.
• Standard deviation: because we squared each score's difference from the mean, the variance tells us the average error². To get the SD we take the square root of the variance. The SD is a measure of how representative the mean is: the smaller the SD, the more representative the mean is of your sample.
• Standard error: the standard error is the standard deviation of sample means. If you took a lot of separate samples and worked out their means, the standard deviation of these means would indicate the variability between the means of different samples. The smaller the standard error, the more representative your sample mean is of the population mean.
• Confidence intervals: a 95% confidence interval means that if we collected 100 samples and calculated the mean and confidence interval for each, we would expect about 95 of those confidence intervals to contain the population mean.

Describing data
• Numbers / tables
– Analyze > Descriptive Statistics > Frequencies / Descriptives
• Charts / graphs
– Graphs > Pie / Histogram / Bar
– Using Excel for charts

E.g.

Descriptive Statistics
                             N      Range   Minimum   Maximum   Mean      Std. Deviation
@/what is your exact age?    3880   18.00   12.00     30.00     21.2317   6.20229
Valid N (listwise)           3880

Statistics: Highest Qualification
N Valid      1000
N Missing    0
Median       1.0000
Mode         .00

Highest Qualification
                    Frequency   Percent   Valid Percent   Cumulative Percent
NONE                387         38.7      38.7            38.7
VOC & O LEVEL       346         34.6      34.6            73.3
A LEVEL & ABOVE     267         26.7      26.7            100.0
Total               1000        100.0     100.0

Descriptive Statistics
                             N      Minimum   Maximum   Mean      Std. Error (of mean)   Std. Deviation   Variance
@/what is your exact age?    3880   12.00     30.00     21.2317   .09957                 6.20229          38.468
Valid N (listwise)           3880

Bivariate relationships
• Asking research questions involving two variables:
– Categorical and interval
– Interval and interval
– Categorical and categorical
• Describing relationships
• Testing relationships

Interval and interval
• Correlation
– To be covered next week with OLS

Categorical (dichotomous) and interval
• T-tests
– Analyze > Compare Means > Independent-Samples T Test
– Check for equality of variances
– t value = observed difference between the means for the two groups, divided by the standard error of the difference
– Significance of the t statistic; upper and lower confidence bounds based on the standard error

E.g. (with stats sceli.sav)
• Average age in sample = 37.34
• Average age of single = 31.55
• Average age of partnered = 39.45
• t = -7.9 / .74 ≈ -10.7
• Upper bound = -7.9 + (1.96 × .74)
• Lower bound = -7.9 - (1.96 × .74)

Independent Samples Test (age)

Levene's Test for Equality of Variances
                               F      Sig.
Equal variances assumed        .026   .872

t-test for Equality of Means
                               t         df        Sig. (2-tailed)   Mean Difference   Std. Error Difference   95% CI Lower   95% CI Upper
Equal variances assumed        -10.721   998       .000              -7.905            .737                    -9.352         -6.458
Equal variances not assumed    -10.661   470.095   .000              -7.905            .741                    -9.362         -6.448

Categorical and categorical
• Chi Square Test
– Tabulation of two variables
– What is the observed variation compared to what would be expected if the distributions were equal?
– What is the size of that observed variation compared to the number of cells across which variation could occur?
(the chi-square statistic) $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
– What is its significance? (the chi-square distribution and degrees of freedom)

E.g.
• Are the proportions within employment status similar across the sexes?
• Could also think about it the other way round

sex * EMPLOYMENT STATUS Crosstabulation
                                          EMP FULLTIME   EMP PARTTIME   UNEMP    Total
male     Count                            383            2              100      485
         % within EMPLOYMENT STATUS       62.5%          1.1%           48.5%    48.5%
female   Count                            230            179            106      515
         % within EMPLOYMENT STATUS       37.5%          98.9%          51.5%    51.5%
Total    Count                            613            181            206      1000
         % within EMPLOYMENT STATUS       100.0%         100.0%         100.0%   100.0%

sex * EMPLOYMENT STATUS Crosstabulation
                            EMP FULLTIME   EMP PARTTIME   UNEMP    Total
male     Count              383            2              100      485
         Expected Count     297.3          87.8           99.9     485.0
         % within sex       79.0%          .4%            20.6%    100.0%
female   Count              230            179            106      515
         Expected Count     315.7          93.2           106.1    515.0
         % within sex       44.7%          34.8%          20.6%    100.0%
Total    Count              613            181            206      1000
         Expected Count     613.0          181.0          206.0    1000.0
         % within sex       61.3%          18.1%          20.6%    100.0%

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             209.840a   2    .000
Likelihood Ratio               265.507    2    .000
Linear-by-Linear Association   48.211     1    .000
N of Valid Cases               987
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 87.17.

Now look through your background notes before starting the exercises…
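For anyone who wants to check the univariate measures by hand outside SPSS, the following is a minimal Python sketch using only the standard library. It reuses the ages from the median example and the age-band counts from the mode example in these notes; the variable names are illustrative choices, not part of the course materials, and the 1.96 multiplier assumes a large-sample 95% interval.

```python
import math
import statistics

# Ages from the median example (7 values, then the same list without 42).
ages_odd = [12, 14, 18, 21, 36, 41, 42]
ages_even = [12, 14, 18, 21, 36, 41]

median_odd = statistics.median(ages_odd)    # middle value of an odd-length list
median_even = statistics.median(ages_even)  # mean of the two middle values

# Mode of grouped data: the category with the highest count.
age_band_counts = {"16-25": 12, "26-35": 20, "36-45": 32, "46-55": 27, "56+": 18}
modal_band = max(age_band_counts, key=age_band_counts.get)

# Mean, sample variance (divides by n - 1), standard deviation, standard error.
mean_age = statistics.mean(ages_odd)
variance = statistics.variance(ages_odd)
sd = statistics.stdev(ages_odd)
se = sd / math.sqrt(len(ages_odd))

# Standard error recomputed from the SPSS output in the notes
# (Std. Deviation 6.20229, N 3880), plus an approximate 95% interval
# around the reported mean of 21.2317 (assumes the 1.96 multiplier).
se_spss = 6.20229 / math.sqrt(3880)
ci_95 = (21.2317 - 1.96 * se_spss, 21.2317 + 1.96 * se_spss)

print(median_odd, median_even, modal_band)
print(round(mean_age, 2), round(variance, 2), round(sd, 2), round(se, 2))
print(round(se_spss, 5), ci_95)
```

The recomputed standard error should agree with the Std. Error column of the Descriptive Statistics output to the precision SPSS prints.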
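The t-test arithmetic from the age example (single vs partnered) can be retraced from the summary figures alone. This is a rough sketch, assuming the rounded group means quoted in the notes and the standard error of the difference from the "equal variances assumed" row; it is not a replacement for running the test on the raw data, and the variable names are made up for illustration.

```python
# Summary figures quoted in the notes (rounded).
mean_single = 31.55
mean_partnered = 39.45
se_diff = 0.737  # Std. Error Difference, equal variances assumed

# t = observed difference between the group means / standard error of the difference
diff = mean_single - mean_partnered
t_value = diff / se_diff

# Approximate 95% confidence bounds using the 1.96 multiplier from the notes.
lower = diff - 1.96 * se_diff
upper = diff + 1.96 * se_diff

print(round(diff, 2), round(t_value, 2))
print(round(lower, 2), round(upper, 2))
```

The results land close to, but not exactly on, the SPSS row (t = -10.721, interval -9.352 to -6.458). The small gap is presumably because SPSS works from the unrounded difference (-7.905) and a t-distribution critical value rather than 1.96. Because the whole interval lies below zero, the difference in mean age is significant at the 5% level.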
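The chi-square calculation can also be followed step by step from the observed counts in the sex by employment status crosstabulation. The sketch below is plain Python with illustrative names; it computes the expected counts, the statistic and the degrees of freedom, and compares the statistic with 5.991, the standard 5% critical value for 2 degrees of freedom. It does not compute an exact p-value.

```python
# Observed counts from the crosstabulation (N = 1000).
observed = {
    "male": {"full-time": 383, "part-time": 2, "unemployed": 100},
    "female": {"full-time": 230, "part-time": 179, "unemployed": 106},
}

row_totals = {sex: sum(cells.values()) for sex, cells in observed.items()}
columns = list(next(iter(observed.values())))
col_totals = {c: sum(observed[sex][c] for sex in observed) for c in columns}
n = sum(row_totals.values())

# Expected count under independence: row total * column total / N.
expected = {
    sex: {c: row_totals[sex] * col_totals[c] / n for c in columns}
    for sex in observed
}

# Chi-square statistic: sum over cells of (f_o - f_e)^2 / f_e.
chi_square = sum(
    (observed[sex][c] - expected[sex][c]) ** 2 / expected[sex][c]
    for sex in observed
    for c in columns
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(columns) - 1)

CRITICAL_05_DF2 = 5.991  # 5% critical value of chi-square with 2 df
significant = chi_square > CRITICAL_05_DF2

print({sex: {c: round(v, 1) for c, v in cells.items()} for sex, cells in expected.items()})
print(round(chi_square, 2), df, significant)
```

The expected counts match the Expected Count rows of the crosstabulation. The statistic computed from these 1000 cases comes out at about 210.7, slightly above the 209.840 in the Chi-Square Tests output, which reports 987 valid cases and so appears to come from a slightly different subset of the data.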