What we won’t cover
• Lots of maths!
• Students coming to statistics support usually want help with using SPSS, choosing the right analysis and interpreting output
• They often find maths scary, so you need to think of ways of explaining without the maths

Data and variables
DATA: the answers to questions or measurements from the experiment
VARIABLE: a measurement which varies between subjects, e.g. height or gender
• One row per subject
• One variable per column

Data types
• Scale: measurements, numerical or count data
  – Continuous: measurements that can take any value
  – Discrete: counts/integers
• Categorical: appear as categories, e.g. tick boxes on questionnaires
  – Ordinal: obvious order
  – Nominal: no meaningful order

Populations and samples
• Taking a sample from a population
• Sample data ‘represents’ the whole population
statstutor.ac.uk

Point estimation
Sample data is used to estimate parameters of a population.
Statistics are calculated using sample data. Parameters are the characteristics of population data.
• The sample mean $\bar{x}$ estimates the population mean $\mu$
• The sample SD $s$ estimates the population SD $\sigma$

“Outliers”
• Minority cases, so different from the majority that they merit separate consideration
  – are they errors?
  – are they indicative of a different pattern?
• Think about possible outliers with care, but beware of mechanical treatments…
• The significance of outliers depends on your research interests

Summaries of distributions
• Graphic vs. numeric
  – graphic may be better for visualization
  – numeric are better for statistical/inferential purposes

General characteristics
• Kurtosis [“peakedness”]
  [Figure: two density curves of D against X, one ‘leptokurtic’ and one ‘platykurtic’]
• Skew (skewness)
  [Figure: two density curves of D against X, one with right (positive) skew and one with left (negative) skew]

Descriptive Statistics
An Illustration: Which Group is Smarter?
Class A: IQs of 13 Students
102 115 128 109 131 89 98 106 140 119 93 97 110
Class B: IQs of 13 Students
127 162 131 103 96 111 80 109 93 87 120 105 109
Each individual may be different. If you try to understand a group by remembering the qualities of each member, you become overwhelmed and fail to understand the group.

Descriptive Statistics
Which group is smarter now?
Class A: Average IQ 110.54
Class B: Average IQ 110.23
They’re roughly the same! With a summary descriptive statistic, it is much easier to answer our question.

Descriptive Statistics
Types of descriptive statistics:
• Organize Data
  – Tables
    • Frequency Distributions
  – Graphs
    • Bar Chart or Histogram
    • Frequency Polygon
• Summarize Data
  – Central Tendency
  – Variation

Frequency Distribution
Frequency Distribution of IQ for Two Classes
IQ        Frequency
82.00     1
87.00     1
89.00     1
93.00     2
96.00     1
97.00     1
98.00     1
102.00    1
103.00    1
105.00    1
106.00    1
107.00    1
109.00    1
111.00    1
115.00    1
119.00    1
120.00    1
127.00    1
128.00    1
131.00    2
140.00    1
162.00    1
Total     24

Descriptive Statistics
Summarizing Data:
– Central Tendency (or Groups’ “Middle Values”)
  • Mean
  • Median
  • Mode
– Variation (or Summary of Differences Within Groups)
  • Range
  • Interquartile Range
  • Variance
  • Standard Deviation

Mean
Most commonly called the “average.”
Add up the values for each case and divide by the total number of cases.
$\bar{Y} = \frac{Y_1 + Y_2 + \dots + Y_n}{n}$
$\bar{Y} = \frac{\sum Y_i}{n}$

Mean
What’s up with all those symbols, man?
Some Symbolic Conventions in this Class:
• Y = your variable (could be X or Q or even “Glitter”)
• “-bar” or line over the symbol of your variable = mean of that variable
• $Y_1$ = first case’s value on variable Y
• “. . .” = ellipsis = continue sequentially
• $Y_n$ = last case’s value on variable Y
• n = number of cases in your sample
• Σ = Greek letter “sigma” = sum or add up what follows
• i = a typical case or each case in the sample (1 through n)

Mean
Class A: IQs of 13 Students
102 115 128 109 131 89 98 106 140 119 93 97 110
$\sum Y_i = 1437$
$\bar{Y}_A = \frac{\sum Y_i}{n} = \frac{1437}{13} = 110.54$
Class B: IQs of 13 Students
127 162 131 103 96 111 80 109 93 87 120 105 109
$\sum Y_i = 1433$
$\bar{Y}_B = \frac{\sum Y_i}{n} = \frac{1433}{13} = 110.23$

Mean
1. Means can be badly affected by outliers (data points with extreme values unlike the rest)
2. Outliers can make the mean a bad measure of central tendency or common experience
[Figure: Income in the U.S., with labels “All of Us”, “Mean” and “Bill Gates” (outlier)]

Median
The middle value when a variable’s values are ranked in order; the point that divides a distribution into two equal halves.
When data are listed in order, the median is the point at which 50% of the cases are above and 50% below it.
The 50th percentile.

Median
Class A: IQs of 13 Students (ranked)
89 93 97 98 102 106 109 110 115 119 128 131 140
Median = 109 (six cases above, six below)

Median
If the first student (IQ 102) were to drop out of Class A, there would be a new median:
89 93 97 98 106 109 110 115 119 128 131 140
Median = (109 + 110)/2 = 219/2 = 109.5 (six cases above, six below)

Median
1. The median is unaffected by outliers, making it a better measure of central tendency than the mean when data are skewed: it better describes the “typical person.”
   [Figure: income distribution with labels “All of Us”, “Median” and “Bill Gates” (outlier)]
2. If the recorded values for a variable form a symmetric distribution, the median and mean are identical.
3. In skewed data, the mean lies further toward the skew than the median.
   [Figure: a symmetric distribution with mean and median together, and a skewed distribution with the mean further toward the tail than the median]

Median
The middle score or measurement in a set of ranked scores or measurements; the point that divides a distribution into two equal halves.
When data are listed in order, the median is the point at which 50% of the cases are above and 50% below.
The 50th percentile.

Mode
The most common data point is called the mode.
The combined IQ scores for Classes A & B:
80 87 89 93 93 96 97 98 102 103 105 106 109 109 109 110 111 115 119 120 127 128 131 131 140 162
The mode is 109, which occurs three times. A la mode!!
BTW, it is possible to have more than one mode!

Mode
It may not be at the center of a distribution.
The data distribution on the right is “bimodal” (even statistics can be open-minded).
[Figure: bar chart of Count (1.0 to 2.0) against IQ (82 to 162), with two bars of height 2]

Mode
1. It may give you the most likely experience rather than the “typical” or “central” experience.
2. In symmetric, single-peaked distributions, the mean, median, and mode are the same.
3. In skewed data, the mean and median lie further toward the skew than the mode.
[Figure: a symmetric distribution with mean, median and mode together, and a skewed distribution ordered mode, median, mean toward the tail]

Descriptive Statistics
Summarizing Data:
– Central Tendency (or Groups’ “Middle Values”)
  • Mean
  • Median
  • Mode
– Variation (or Summary of Differences Within Groups)
  • Range
  • Interquartile Range
  • Variance
  • Standard Deviation

Range
The spread, or the distance, between the lowest and highest values of a variable.
To get the range for a variable, you subtract its lowest value from its highest value.
Class A: IQs of 13 Students
102 115 128 109 131 89 98 106 140 119 93 97 110
Class A Range = 140 - 89 = 51
Class B: IQs of 13 Students
127 162 131 103 96 111 80 109 93 87 120 105 109
Class B Range = 162 - 80 = 82

Interquartile Range
A quartile is a value that marks one of the divisions that break a series of values into four equal parts.
• The median is a quartile and divides the cases in half.
• The 25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
• The 75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.
The interquartile range is the distance or range between the 25th percentile and the 75th percentile.
Below, what is the interquartile range?
[Figure: a scale from 0 to 1000 divided at 250, 500 and 750, with 25% of cases in each of the four segments]

Variance
A measure of the spread of the recorded values on a variable. A measure of dispersion.
• The larger the variance, the further the individual cases are from the mean.
• The smaller the variance, the closer the individual scores are to the mean.
[Figure: a wide distribution and a narrow distribution, each centered on its mean]

Variance
Variance is a number that at first seems complex to calculate.
Calculating variance starts with a “deviation.”
A deviation is the distance of a case’s score from the mean:
$Y_i - \bar{Y}$

Variance
Class A: IQs of 13 Students
102 115 128 109 131 89 98 106 140 119 93 97 110
$\bar{Y}_A = 110.54$
The deviation of 102 from 110.54 is? 102 - 110.54 = -8.54
Deviation of 115? 115 - 110.54 = 4.46

Variance
• We want to add these to get total deviations, but if we were to do that, we would get zero every time. Why?
• We need a way to eliminate negative signs. Squaring the deviations will eliminate negative signs...
A Deviation Squared: $(Y_i - \bar{Y})^2$
Back to the IQ example, a deviation squared
for 102 is: $(102 - 110.54)^2 = (-8.54)^2 = 72.93$
for 115 is: $(115 - 110.54)^2 = (4.46)^2 = 19.89$

Variance
If you were to add all the squared deviations together, you’d get what we call the “Sum of Squares.”
Sum of Squares (SS) $= \sum (Y_i - \bar{Y})^2$
$SS = (Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \dots + (Y_n - \bar{Y})^2$

Variance
Class A: IQs of 13 Students
102 115 128 109 131 89 98 106 140 119 93 97 110
$\bar{Y} = 110.54$
Class A, sum of squares:
$(102 - 110.54)^2 + (115 - 110.54)^2 + (128 - 110.54)^2 + (109 - 110.54)^2 + (131 - 110.54)^2 + (89 - 110.54)^2 + (98 - 110.54)^2 + (106 - 110.54)^2 + (140 - 110.54)^2 + (119 - 110.54)^2 + (93 - 110.54)^2 + (97 - 110.54)^2 + (110 - 110.54)^2 = SS = 2891.23$

Variance
The last step…
The approximate average sum of squares is the variance.
SS/N = Variance for a population.
SS/(n - 1) = Variance for a sample.
$\text{Variance} = \frac{\sum (Y_i - \bar{Y})^2}{n - 1}$

Variance
For Class A, Variance = 2891.23 / (n - 1) = 2891.23 / 12 = 240.94
How helpful is that???

Standard Deviation
To convert variance into something of meaning, let’s create the standard deviation.
The square root of the variance gives the typical (roughly average) deviation of the observations from the mean.
$\text{s.d.} = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n - 1}}$

Standard Deviation
For Class A, the standard deviation is: $\sqrt{240.94} = 15.52$
On average, a student’s IQ deviates from the mean IQ of 110.54 by roughly 15.52 points.
Review:
1. Deviation
2. Deviation squared
3. Sum of squares
4. Variance
5. Standard deviation

Standard Deviation
1. Larger s.d. = greater amounts of variation around the mean. For example:
   [Figure: two distributions, both with $\bar{Y}$ = 25; the left has axis marks 19, 25, 31 and s.d. = 3, the right has axis marks 13, 25, 37 and s.d. = 6]
2. s.d. = 0 only when all values are the same (only when you have a constant and not a “variable”)
3. If you were to “rescale” a variable, the s.d. would change by the same factor: multiplying every value above by 10 would make the mean 250, the s.d. on the left 30, and on the right 60
4. Like the mean, the s.d. will be inflated by an outlier case value.

Standard Deviation
• Note about computational formulas:
  – A book provides a useful short-cut formula for computing the variance and standard deviation.
  – This is intended to make hand calculations as quick as possible.
  – Such formulas obscure the conceptual understanding of our statistics.
  – SPSS and the computer are the “computational formulas” now.

Practical Application for Understanding Variance and Standard Deviation
Even though we live in a world where we pay real money in rupees for goods and services (not percentages of income), most Indian employers issue raises based on percent of salary.
Why do supervisors think the fairest raise is a percentage raise?
Answer:
1) Because higher paid persons win the most money.
2) The easiest thing to do is raise everyone’s salary by a fixed percent. If your budget went up by 5%, salaries can go up by 5%.
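The five review steps (deviation, deviation squared, sum of squares, variance, standard deviation) can be traced in a few lines of code. The sketch below is only an illustration: the function name `spread` and its `ddof` argument are assumptions introduced here, not part of the slides. It applies the sample formula SS/(n - 1) to the Class A IQs, and the population formula SS/N to a four-person payroll before and after a flat 5% raise (the incomes are those of the worked payroll example in these slides).

```python
from math import sqrt


def spread(values, ddof):
    """Return mean, range, sum of squares, variance and s.d.

    ddof=1 divides SS by n - 1 (sample); ddof=0 divides SS by N (population).
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [y - mean for y in values]      # step 1: Yi - Y-bar
    ss = sum(d ** 2 for d in deviations)         # steps 2 and 3: sum of squares
    variance = ss / (n - ddof)                   # step 4
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "ss": ss,
        "variance": variance,
        "sd": sqrt(variance),                    # step 5
    }


# Sample statistics for Class A (data as listed on the slides).
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_a_stats = spread(class_a, ddof=1)

# Population statistics for a payroll before and after a flat 5% raise.
incomes = [100_000, 50_000, 40_000, 10_000]
raised = [pay * 105 // 100 for pay in incomes]   # integer arithmetic, exact
before = spread(incomes, ddof=0)
after = spread(raised, ddof=0)

for name in ("mean", "range", "variance", "sd"):
    ratio = after[name] / before[name]
    print(name, before[name], after[name], round(ratio, 4))
```

The printed ratios show how each measure scales under the raise: measures in the original units scale by the raise factor, while the variance, being in squared units, scales by the square of that factor.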
The problem is that the flat percent raise gives unequal increased rewards. . .

Practical Application for Understanding Variance and Standard Deviation
TANDROOST Toilet Cleaning Services
Salary Pool: Rs. 200,000
Incomes: President: Rs. 100K; Manager: 50K; Secretary: 40K; and Toilet Cleaner: 10K
Mean: Rs. 50K
Range: Rs. 90K
Variance: Rs. 1,050,000,000
Standard Deviation: Rs. 32.4K
The range, variance and standard deviation can be considered “measures of inequality.”
Now, let’s apply a 5% raise.

Practical Application for Understanding Variance and Standard Deviation
After a 5% raise, the pool of money increases by Rs. 10K to Rs. 210,000
Incomes: President: Rs. 105K; Manager: 52.5K; Secretary: 42K; and Toilet Cleaner: 10.5K
Mean: Rs. 52.5K (went up by 5%)
Measures of Inequality:
• Range: Rs. 94.5K (went up by 5%)
• Variance: Rs. 1,157,625,000 (went up by 10.25%, since 1.05² = 1.1025)
• Standard Deviation: Rs. 34K (went up by 5%)
The flat percentage raise increased inequality. The top earner got 50% of the new money. The bottom earner got 5% of the new money. The range and standard deviation went up by 5%.
Last year’s statistics: TANDROOST Toilet Cleaning Services annual payroll of Rs. 200K
Incomes: Rs. 100K, 50K, 40K, and 10K
Mean: Rs. 50K
Range: Rs. 90K; Variance: Rs. 1,050,000,000; Standard Deviation: Rs. 32.4K

Descriptive Statistics
Summarizing Data:
– Central Tendency (or Groups’ “Middle Values”)
  • Mean
  • Median
  • Mode
– Variation (or Summary of Differences Within Groups)
  • Range
  • Interquartile Range
  • Variance
  • Standard Deviation
– …Wait! There’s more

Box-Plots
A way to graphically portray almost all the descriptive statistics at once is the box-plot.
A box-plot shows:
• Upper and lower quartiles
• Mean
• Median
• Range
• Outliers

Box-Plots
[Figure: box plot of IQ (axis 80 to 180): minimum 82, lower quartile 96.5, median 106.5, mean M = 110.5, upper quartile 123.5, maximum 162]
IQR = 27; there is no outlier.

The Data: 2 Regions, Monthly Sales
The Calculations
• Calculate Min, Max, Median, Quartile 1 and Quartile 3
• Calculate Box Heights

SAMPLING
Sampling denotes the selection of a part of the aggregate statistical material with a view to obtaining information about the whole.
This aggregate or totality of statistical information on a particular character of all the members covered by an investigation is called the population.

Types of sampling
• Simple random, Stratified, Systematic, Cluster, and Multistage.

Simple random
It is the simplest of all sampling techniques: all units in a population have an equal chance of being included in the sample.
There are two methods: 1. With replacement; 2. Without replacement
• In the ‘with replacement’ method, the probability of selecting any particular member of the population at any drawing remains a constant 1/N, because before each draw the population contains all N members.
• Interestingly, this result is also true in the ‘without replacement’ method, although the population size varies at each stage of selection. Thus the probability of obtaining the population member $X_k$ (say) at the i-th draw is a constant 1/N in both cases, i.e.,
• $P(X_i = X_k) = 1/N$ for $i = 1, 2, \dots, n$ and $k = 1, 2, \dots, N$.
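The claim that every member has probability 1/N at every draw, with or without replacement, can be checked by brute force for a small population. The sketch below is an illustration under the assumption that all ordered samples are equally likely; the function name `draw_probabilities` and the four-member population are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product


def draw_probabilities(population, n, replacement):
    """P(member appears at draw i) for every draw i and every member.

    Enumerates all equally likely ordered samples of size n.
    """
    if replacement:
        outcomes = list(product(population, repeat=n))
    else:
        outcomes = list(permutations(population, n))
    total = len(outcomes)
    table = {}
    for i in range(n):
        for member in population:
            hits = sum(1 for sample in outcomes if sample[i] == member)
            table[(i + 1, member)] = Fraction(hits, total)
    return table


population = ["A", "B", "C", "D"]          # N = 4, hypothetical labels
with_repl = draw_probabilities(population, n=3, replacement=True)
without_repl = draw_probabilities(population, n=3, replacement=False)

print(set(with_repl.values()))     # every entry is the same fraction
print(set(without_repl.values()))  # and the same again without replacement
```

Exact fractions are used so the comparison with 1/N does not depend on floating-point rounding.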
Random number series - 1
98 89 87 75 51 69 41 10 35 8
79 100 98 85 31 95 29 17 99 57
65 33 98 67 42 62 60 72 79 14
22 78 11 78 11 70 90 50 1 8
72 28 33 47 61 80 13 59 81 91
41 79 35 98 58 8 51 27 34 46
41 79 81 28 33 46 44 87 46 85
82 32 17 57 12 93 69 28 30 47
93 7 48 26 82 76 15 21 11 30
15 75 61 69 91 15 26 94 15 47

Random number series - 2
297 117 273 66 214 293 256 140 108 80
42 169 15 33 281 156 13 214 165 241
299 284 198 122 279 237 197 163 203 47
26 112 58 138 44 39 98 15 274 79
198 81 113 60 114 142 149 91 150 269

Stratified sampling
Stratified sampling is generally used when the population is heterogeneous but can be subdivided into strata within each of which the heterogeneity is not so prominent.
Some prior knowledge is necessary for this subdivision, which is termed stratification.
If a proper stratification can be made such that the strata differ from one another as much as possible, but there is much homogeneity within each of them, then a stratified sample will yield better estimates than a random sample of the same size.
This is because in stratified sampling different sections of the population are suitably represented through the sub-samples, whereas in random sampling some of these sections may be over- or under-represented or may even be omitted.
The principal purposes of stratification are:
I. To increase the precision of the overall estimates,
II. To ensure that all sections of the population are adequately represented,
III. To reduce the effect of heterogeneity in the population.

Solve this problem
A company has a total of 360 employees in four different categories. How many from each category should be included in a stratified random sample of size 20?
Managers                36
Drivers                 54
Administrative Staff    90
Production Staff        180
To create a sample of size 20 we need 20/360 or 1/18 of the workforce. So we take this fraction of the number of employees in each category.
Managers                1/18 × 36 = 2
Drivers                 1/18 × 54 = 3
Administrative Staff    1/18 × 90 = 5
Production Staff        1/18 × 180 = 10
TOTAL = 20

Systematic sampling
Systematic sampling involves the selection of sample units at equal intervals, after all the units in the population have been arranged in some order.
If the population size is finite, the units may be serially numbered and arranged. From the first k of these, a single unit is chosen at random. This unit and every k-th unit thereafter constitute a systematic sample.
In order to obtain a systematic sample of 500 villages out of 40,000 in Assam, i.e., one out of 80 on average, all the villages have to be numbered serially. From the first 80 of these a village is selected at random, suppose with the serial number 27. Then the villages with serial numbers 27, 107, 187, 267, 347, … constitute the systematic sample.
If the characteristic under study is independent of the order of arrangement of the units, then a systematic sample is practically equivalent to a random sample.
The actual selection of the sample is easier and quicker.
Systematic sampling is suitable when the units are described on serially numbered cards, e.g., workers listed on cards. Then the sample can be drawn easily by looking at the serial numbers.
The sample may be biased if there are periodic features associated with the sampling interval.

Multi-stage sampling
Multi-stage sampling refers to a sampling procedure which is carried out in several stages.
The population is divided into large groups, called first-stage units. These 1st stage units are again divided into smaller units, called 2nd stage units, the 2nd stage units into 3rd stage units, and so on, until we reach the ultimate units.
E.g., in order to introduce a scheme on an experimental basis in the villages, we may have to select a few villages from the whole of the state. If we apply 3-stage sampling, sub-divisions may be used as 1st stage units.
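Returning to the stratified and systematic examples above, both selections reduce to simple arithmetic. The sketch below is a minimal illustration; the function names `proportional_allocation` and `systematic_sample` are assumptions introduced here. It assumes the sampling fraction gives whole numbers in every stratum (as it does for the 360 employees) and that the population size is a multiple of the sample size (as with 40,000 villages and a sample of 500); it does not handle rounding.

```python
import random
from fractions import Fraction


def proportional_allocation(strata, sample_size):
    """Give each stratum the same sampling fraction, sample_size / total."""
    total = sum(strata.values())
    fraction = Fraction(sample_size, total)      # 20/360 = 1/18 in the example
    return {name: size * fraction for name, size in strata.items()}


def systematic_sample(population_size, sample_size, start=None):
    """Serial numbers of a systematic sample: a start in 1..k, then every k-th unit."""
    k = population_size // sample_size           # sampling interval
    if start is None:
        start = random.randint(1, k)             # random unit among the first k
    return [start + j * k for j in range(sample_size)]


employees = {
    "Managers": 36,
    "Drivers": 54,
    "Administrative Staff": 90,
    "Production Staff": 180,
}
allocation = proportional_allocation(employees, sample_size=20)
for category, count in allocation.items():
    print(category, count)

# Villages example: interval 80, with the random start fixed at 27 as in the text.
villages = systematic_sample(40_000, 500, start=27)
print(villages[:5])
```

Leaving `start` unset draws the starting unit at random, which is what the procedure requires in practice; it is fixed here only so the output matches the serial numbers quoted in the text.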
Cluster sampling
It involves grouping the population and then selecting the groups or clusters, rather than individual elements, for inclusion in the sample.
Suppose a department store wishes to sample its credit card holders. It has issued its cards to 15,000 customers. The sample size is to be kept at, say, 450. For cluster sampling this list of 15,000 card holders could be formed into 100 clusters of 150 card holders each. Three clusters might then be selected randomly.
The sample size must be larger than for simple random sampling to ensure the same level of accuracy, because the possibility of both sampling and non-sampling error is greater.
The clustering approach can make the sampling procedure relatively easier and increase the efficiency of field work, especially in the case of personal interviews.

How can exam score data be summarised?
Exam marks for 60 students (marked out of 65)
mean = 30.3
sd = 14.46

Summary statistics
• Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• Standard deviation (s) is a measure of how much the individuals differ from the mean: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
• Large SD = very spread out data
• Small SD = there is little variation from the mean
For exam scores, mean = 30.5, SD = 14.46

IQ is normally distributed
Mean = 100, SD = 15.3
95% of values lie within 1.96 SDs of the mean:
mean - 1.96 × SD = 100 - (1.96 × 15.3) ≈ 70
mean + 1.96 × SD = 100 + (1.96 × 15.3) ≈ 130
95% of people have an IQ between 70 and 130
P(score > 130) = 0.025
[Figure: normal curve of IQ centered on 100, with the central 95% between 70 and 130 and regions labelled “Average” and “Above average”]

Assessing Normality
Charts can be used to informally assess whether data is:
• Normally distributed
• Or… skewed
The mean and median are very different for skewed data.

Sometimes the median makes more sense!
[Figure: income distribution chart with annotations “2/3rd people” and “50% people”]
Source: Households Below Average Income: An analysis of the income distribution 1994/95 – 2011/12, Department for Work and Pensions

Choosing summary statistics
Which average and measure of spread?
Scale
• Normally distributed: Mean (Standard deviation)
• Skewed data: Median (Interquartile range)
Categorical
• Ordinal: Median (Interquartile range)
• Nominal: Mode (None)

Hypothesis Testing

Hypothesis testing
• An objective method of making decisions or inferences from sample data (evidence)
• Sample data are used to choose between two choices, i.e. hypotheses or statements about a population
• We typically do this by comparing what we have observed to what we would expect if one of the statements (the Null Hypothesis) was true

Hypothesis testing framework
What the text books might say!
• Always two hypotheses:
HA: Research (Alternative) Hypothesis
• What we aim to gather evidence of
• Typically that there is a difference/effect/relationship etc.
H0: Null Hypothesis
• What we assume is true to begin with
• Typically that there is no difference/effect/relationship etc.

Discussion
• How could you help a student understand what hypothesis testing is and why they need to use it?

Could try explaining things in the context of “The Court Case”?
• Members of a jury have to decide whether a person is guilty or innocent based on evidence
  Null: The person is innocent
  Alternative: The person is not innocent (i.e. guilty)
• The null can only be rejected if there is enough evidence to doubt it
• i.e. the jury can only convict if the evidence against the null of innocence is beyond reasonable doubt
• They do not know whether the person is really guilty or innocent, so they may make a mistake

Types of Errors
• H0 is true (difference does NOT exist in population)
  – Study reports NO difference (do not reject H0): correct decision
  – Study reports IS a difference (reject H0): Type I Error. Typically restrict to a 5% risk = level of significance
• HA is true (difference DOES exist in population)
  – Study reports NO difference (do not reject H0): Type II Error. Controlled via sample size (= 1 - Power of test)
  – Study reports IS a difference (reject H0): correct decision. Prob of this = Power of test

Steps to undertaking a Hypothesis test
1. Define study question
2. Set null and alternative hypothesis
3. Choose a suitable test
4. Calculate a test statistic
5. Calculate a p-value
6. Make a decision and interpret your conclusions

What does it mean for two categorical variables to be related?
• Remember that Chi-Square is used to test for a relationship between 2 categorical variables.
  H0: There is no relationship between the variables.
  Ha: There is a relationship between the variables.
• If two categorical variables are related, it means the chance that an individual falls into a particular category for one variable depends upon the particular category they fall into for the other variable.
• Let’s say that we wanted to determine if there is a relationship between religion (Christian, Jewish, Muslim, Other) and smoking. When we test if there is a relationship between these two variables, we are trying to determine if being part of a particular religion makes an individual more likely to be a smoker. If that is the case, then we can say that Religion and Smoking are related or associated.
Chi-squared test statistic
• The chi-squared test is used when we want to see if two categorical variables are related
• The test statistic for the chi-squared test uses the sum of the squared differences between each pair of observed (O) and expected (E) values, each divided by the expected value:
$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$

TABLE A [chi-square distribution table]

Chi-Square test for 2-way tables
• Suppose we are studying two categorical variables in a population, where the first variable has r levels (i.e. possible outcomes) and the second one has c levels.
• We can summarize a sample from this population using a table with r rows and c columns.
• A two-way table, also called a contingency table, displays the counts of how many individuals fall into each possible combination of categories of two categorical variables. So, each cell of the table (the total number of cells is r × c) represents a combination of categories of the two variables.
• The following table presents the data on race and smoking. The two variables of interest, race and smoking, have r = 4 and c = 2, resulting in 4 × 2 = 8 combinations of categories.
Race         NSmoke   Smoke
Caucasian    620      75
Black        240      41
Hispanic     130      29
Other        190      38

Chi-Square test for 2-way tables
• By considering the number of observations falling into each category, we will see how to test hypotheses of the form:
  H0: The two variables are not associated.
  Ha: The two variables are associated.
• Two different experimental situations will lead to contingency tables:
  1. If we have two populations under study, both of which have a particular trait with respect to a categorical variable. In this case the null hypothesis is a statement of homogeneity among the two populations.
  2. If we have one population under study, and we are interested in checking the relationship between two categorical variables. In this case the null hypothesis is a statement of independence between the two variables.
• For sufficiently large samples, the same test is appropriate for both of these situations.
This test is called the chi-square test, and in the following we will go over the steps for testing the relationship between two variables.

Some Notation!
• For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:
  $R_i$ = total count of observations in the i-th row.
  $C_j$ = total count of observations in the j-th column.
  $O_{ij}$ = observed count for the cell in the i-th row and the j-th column.
  $E_{ij}$ = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 was true.
  These counts are calculated as
  $\text{Expected} = \frac{\text{Row total} \times \text{Column total}}{\text{Total sample size}}$, thus $E_{ij} = \frac{R_i \times C_j}{n}$

Example
Race         NSmoke        Smoke       Total
Caucasian    O11 = 620     O12 = 75    R1 = 695
Black        O21 = 240     O22 = 41    R2 = 281
Hispanic     O31 = 130     O32 = 29    R3 = 159
Other        O41 = 190     O42 = 38    R4 = 228
Total        C1 = 1180     C2 = 183    n = 1363

E11 = (695 × 1180)/1363    E12 = (695 × 183)/1363
E21 = (281 × 1180)/1363    E22 = (281 × 183)/1363
E31 = (159 × 1180)/1363    E32 = (159 × 183)/1363
E41 = (228 × 1180)/1363    E42 = (228 × 183)/1363

Chi-Square Analysis Details
The 5 Steps in a Chi-Square Test:
• Step 1: Write the null and alternative hypothesis.
  H0: There is no relationship between the variables.
  Ha: There is a relationship between the variables.
• Step 2: Compute expected values
• Step 3: Calculate the test statistic and p-value.
  The test statistic measures the difference between the observed counts and the expected counts assuming independence:
  $\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum_{\text{all cells } i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
  This is called the chi-square statistic because if the null hypothesis is true, then it has a chi-square distribution with (r-1) × (c-1) degrees of freedom.

Chi-Square Analysis Details
• Step 3 Cont. Find the p-value.
  If the χ² statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a “large” χ² gives evidence against the null hypothesis, and supports the alternative.
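The expected counts and the test statistic defined above can be computed directly from the two-way table. The sketch below uses plain Python on the race and smoking counts; the function names `expected_counts` and `chi_square` are assumptions introduced here, and the p-value step (which needs the chi-square distribution or Table A) is left out.

```python
def expected_counts(observed):
    """E_ij = (row total x column total) / n for every cell of a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Return the chi-square statistic and its degrees of freedom (r-1)(c-1)."""
    expected = expected_counts(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Rows: Caucasian, Black, Hispanic, Other; columns: NSmoke, Smoke.
race_smoking = [[620, 75], [240, 41], [130, 29], [190, 38]]

for row in expected_counts(race_smoking):
    print([round(e, 2) for e in row])

statistic, df = chi_square(race_smoking)
print(round(statistic, 3), df)
```

The statistic is then compared with a chi-square distribution on `df` degrees of freedom to obtain the p-value.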
  The p-value of the chi-square test is the probability that the χ² statistic is as large as or larger than the value we obtained if H0 is true. Thus, the p-value for the chi-square test is ALWAYS the area to the right of the test statistic under the curve, i.e. p-value = P(X > χ²), where X has a chi-square distribution with (r-1) × (c-1) df. To get this probability we need to use a chi-square distribution with (r-1) × (c-1) df (Table A).

Chi-Square Analysis Details
• Step 4: Decide whether or not the result is statistically significant.
  The results are statistically significant if the p-value is less than alpha, where alpha is the significance level (usually α = 0.05).
• Step 5: Report the conclusion in the context of the situation.
  – The p-value is ______ which is < α, so this result is statistically significant. Reject H0. Conclude that (the two variables) are related.
  – The p-value is ______ which is > α, so this result is NOT statistically significant. We cannot reject H0. Cannot conclude that (the two variables) are related.

Detailed Example
• Derek wants to know if the geographical area that a student grew up in is associated with whether or not the student drinks alcohol. Below are the results he obtained from a random sample of PSU students.
              No    Yes    Total
Big City      21    65     86
Rural         11    130    141
Small Town    18    198    216
Suburban      37    345    382
Total         87    738    825

Detailed Example
1. H0: There is no relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
   Ha: There is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
2. To check the conditions we need to calculate the expected counts for each cell.
   E11 = (R1 × C1)/n = (86 × 87)/825 = 9.07,
   E12 = (R1 × C2)/n = (86 × 738)/825 = 76.93, …
   E32 = (R3 × C2)/n = ___________________, …

Detailed Example
3. Chi-square statistic and p-value:
   χ² = sum {(Observed - Expected)²/Expected}
      = (21-9.07)²/9.07 + (65-76.93)²/76.93 + (11-14.87)²/14.87 + (130-126.13)²/126.13 + (18-22.78)²/22.78 + (198-193.22)²/193.22 + (37-40.28)²/40.28 + (345-341.72)²/341.72
      = 20.091
   df = (4-1) × (2-1) = 3
   p-value = P(X > 20.091) < P(X > 16.27) = 0.001 (see Table A)
4. Since the p-value < 0.05, the test is significant, and we can reject the null.
5. We can conclude that there is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.

Example: Titanic
• The ship Titanic sank in 1912 with the loss of most of its passengers
• 809 of the 1,309 passengers and crew died = 61.8%
• Research question: Did class (of travel) affect survival?

Chi squared Test?
• Null: There is NO association between class and survival
• Alternative: There IS an association between class and survival

What would be expected if the null is true?
• Same proportion of people would have died in each class!
• Overall, 809 people died out of 1309 = 61.8%

Chi-Squared Test Actually Compares Observed and Expected Frequencies
Expected number dying in each class = 0.618 × no. in class

Using SPSS
Analyse → Descriptive Statistics → Crosstabs
Click on the ‘Statistics’ button & select Chi-squared
Test Statistic = 127.859
p-value: p < 0.001
Note: Double clicking on the output will display the p-value to more decimal places

Hypothesis Testing: Decision Rule
• We can use statistical software to undertake a hypothesis test, e.g. SPSS
• One part of the output is the p-value (P)
• If P < 0.05, reject H0 => evidence of HA being true (i.e. there IS an association)
• If P > 0.05, do not reject H0 (i.e. no evidence of an association)

Comparing means
T-tests
Paired or Independent (Unpaired) Data?
T-tests are used to compare two population means
– Paired data: same individuals studied at two different times or under two conditions: PAIRED T-TEST
– Independent: data collected from two separate groups: INDEPENDENT SAMPLES T-TEST

Comparison of hours worked in 1988 to today
Paired or unpaired?
• If the same people have reported their hours for 1988 and 2014, we have PAIRED measurements of the same variable (hours).
  Paired null hypothesis: The mean of the paired differences = 0
• If different people are used in 1988 and 2014, we have independent measurements.
  Independent null hypothesis: The mean hours worked in 1988 is equal to the mean for 2014
  $H_0: \mu_{1988} = \mu_{2014}$

SPSS data entry
[Figure: SPSS data layouts for Paired Data and for Independent Groups]

What is the t-distribution?
The t-distribution is similar to the standard normal distribution but has an additional parameter called degrees of freedom (df or v)
• For a paired t-test, v = number of pairs - 1
• For an independent t-test, $v = n_{\text{group1}} + n_{\text{group2}} - 2$
Used for small samples and when the population standard deviation is not known.
For small sample sizes the t-distribution has heavier tails.

Relationship to normal
• As the sample size gets big, the t-distribution matches the normal distribution
[Figure: t-distribution curves compared with the normal curve]

Oneway ANOVA
• Analysis of variance is used to test for differences among more than two populations. It can be viewed as an extension of the t-test we used for testing two population means.
• The specific analysis of variance test that we will study is often referred to as the oneway ANOVA. ANOVA is an acronym for ANalysis Of VAriance. The adjective oneway means that there is a single variable that defines group membership (called a factor). Comparisons of means using more than one variable are possible with other kinds of ANOVA analysis.

Why Not Use Multiple T-tests?
• It might seem logical to use multiple t-tests if we wanted to compare a variable for more than two groups.
For example, if we had three groups, we might do three t-tests: group 1 versus group 2, group 1 versus group 3, and group 2 versus group 3.
• However, doing three hypothesis tests to compare groups changes the probability that we are making an error (the alpha error rate). When conducting multiple tests of significance, the chance of making at least one alpha error over the series of tests is greater than the selected alpha level for each individual test. Thus, if we do multiple t-tests on the same variables with an alpha level of 0.05, the chance that we are making a mistake in applying our findings to the population is actually greater than 0.05. For instance, if the three tests were independent, the chance of at least one alpha error would be 1 − 0.95³ ≈ 0.14 rather than 0.05.

Step 1. Assumptions for the Test
• Level of measurement of the group variable can be any level of variable that identifies groups.
• Level of measurement of the test variable is interval.
• The test variable is normally distributed in the population:
– skewness and kurtosis between –1.0 and +1.0, or
– number in each group is greater than 10 (central limit theorem)
• The variances (dispersion) of the groups are equal.

Step 2. Hypotheses and alpha
• The research hypothesis is that the mean of at least one of the population groups is different from the means of the other groups.
• The null hypothesis is that the means of all of the population groups are equal.
• If we don’t have a specific reason for setting the level of significance to a specific probability, we can use the traditional benchmark of 0.05. This means that, when the null hypothesis is actually true, we are willing to risk wrongly rejecting it about once in every 20 such decisions.

Step 3. Sampling distribution and test statistic
• In the ANOVA test, the probability is obtained from the “F” distribution instead of the normal curve distribution.
• The test statistic is also referred to as the F-ratio or F-test because it follows the F distribution.

Step 4.
Computing the Test Statistic
• Conceptually the test statistic is computed in a way similar to the independent samples t-test. Both divide a measure of the differences between the group means by a measure of the variability within the groups.
• We identify the probability of the test statistic from the SPSS statistical output.

Step 5. Decision and Interpretation
• If the probability of the test statistic is less than or equal to the level of significance (alpha error rate), we reject the null hypothesis and conclude that our data support the research hypothesis.
• If the probability of the test statistic is greater than the level of significance (alpha error rate), we fail to reject the null hypothesis and conclude that our data do not support the research hypothesis.

Interpreting Differences in Population Means
• If we fail to reject the null hypothesis, we can state that we found no differences among the means for the population groups for this characteristic. We do not say they are equal.
• If we reject the null hypothesis, we can conclude that the mean for at least one population group is different from the others. The ANOVA test itself does NOT tell us which group means are different. To determine this, we use a Post Hoc test, such as the Tukey HSD (honestly significant difference) or the LSD (least significant difference) test.

Post Hoc Test for Difference in Means
• Just as we used a post hoc test to identify which cells in a frequency table were responsible for the statistically significant result, we use a post hoc test to identify the differences in pairs of means that produce a statistically significant result in an ANOVA table.
• We only look at the post hoc test when the probability of the ANOVA statistic causes us to reject the null hypothesis, i.e. the probability of the test statistic is less than the level of significance.
• The Post Hoc Test may NOT reveal differences among group means even when we reject the null hypothesis in the ANOVA test.

Inflation of Type I Error (Alpha)
• Type I Error: falsely rejecting the null hypothesis when it is true; alpha is the probability of doing so.
• The only time you need to worry about inflation of Type I error rate is when you look for a lot of effects in your data.
• The more effects you look for, the more likely it is that you will turn up an effect that doesn't really exist (Type I error!).
• Doing all possible pair-wise comparisons (t-test) on a oneway ANOVA would increase the overall Type I error rate.

ANOVA post hoc Test in SPSS (1)
The next step is to examine the distribution of the dependent variable. You can check whether the dependent variable is normally distributed or not in: Analyze > Descriptive Statistics > Descriptives…

ANOVA post hoc Test in SPSS (2)
After moving [age] into the “Variable(s):” box, click the “Options…” button to select the distribution statistics.

ANOVA post hoc Test in SPSS (3)
Select “Kurtosis” and “Skewness” to examine whether [age] is normally distributed or not. Then, click the “Continue” and “OK” buttons.

ANOVA post hoc Test in SPSS (4)
[Age] satisfied the criteria for a normal distribution. The skewness of the distribution (.590) was between -1.0 and +1.0 and the kurtosis of the distribution (-.150) was between -1.0 and +1.0.

ANOVA post hoc Test in SPSS (5)
You can conduct ANOVA by clicking: Analyze > Compare Means > One-Way ANOVA…

ANOVA post hoc Test in SPSS (6)
Now, click the “Post Hoc…” button to select the post hoc test option.

ANOVA post hoc Test in SPSS (7)
Select “Tukey” in the “Equal Variances Assumed” panel. Enter alpha in the “Significance level:” textbox. It should be the same as the alpha level in the problem (.01). Then, click the “Continue” and “OK” buttons.
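For anyone who wants to see the arithmetic behind the F-ratio outside SPSS, here is a minimal Python sketch. It is only an illustration under stated assumptions: the three groups are hypothetical numbers (not taken from these slides), and the function name one_way_anova_f is our own. It computes the between-groups mean square divided by the within-groups mean square, which is the ratio described in Step 4.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers.

    Hypothetical helper for illustration; assumes at least two groups
    and some variation within the groups (otherwise the ratio is undefined).
    """
    k = len(groups)                              # number of groups
    n_total = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-groups sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: spread of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical illustration data: three groups of three observations each
groups = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]
f_ratio, df_b, df_w = one_way_anova_f(groups)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")  # prints "F(2, 6) = 12.00"
```

The p-value is then read from the F distribution with these two degrees of freedom, which is the "Sig." value SPSS shows in the ANOVA table.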
Data Sheet Control Mean S.Dev T1 32 32 1.50 34 Mean S.Dev T2 Mean S.Dev T3 Mean S.Dev T4 Mean S.Dev 34 0.42 33 33 0.38 33 32 0.25 35 34 34 33 32 35 31 33 33 32 34

Arrangement for ANOVA
Treatment    Observation
Control-1    32
Control-2    34
Control-3    31
T2R1         34
T2R2         34
T2R3         33
T3R1         33
T3R2         33
T3R3         33
T4R1         33
T4R2         32
35 0.38

Correlation
Correlation quantifies the extent to which two quantitative variables, X and Y, “go together.” When high values of X are associated with high values of Y, a positive correlation exists. When high values of X are associated with low values of Y, a negative correlation exists.

Now we have some data, but how do we start?
The first step is to create a scatter plot of the data.

Let us deal with an example!!
We use the following data set to illustrate correlational methods. In this cross-sectional data set, each observation represents a district of Assam. The X variable is socioeconomic status measured as the percentage of children in the district receiving midday meals at school. The Y variable is the percentage of school children owning a bicycle. Twelve districts are considered:

District     X (% receiving midday meal)    Y (% owning bicycle)
Dhubri       50                             22.1
Kokrajhar    11                             35.9
Dhemaji      2                              57.9
Dibrugarh    19                             22.2
Morigaon     26                             42.4
Kamrup       73                             5.8
Goalpara     81                             3.6
Sonitpur     51                             21.4
Sivsagar     11                             55.2
Darrang      2                              33.3
Nagaon       19                             32.4
Barpeta      25                             38.4

[Scatter plot: X (% receiving midday meal) against Y (% owning bicycle)]
A scatter plot of the illustrative data is shown to the right. The plot reveals that high values of X are associated with low values of Y. That is to say, as the percentage of children receiving midday meals rises, the percentage owning a bicycle tends to fall.

Correlation Coefficient
Correlation coefficients (denoted r) are statistics that quantify the relation between X and Y in unit-free terms. When all points of a scatter plot fall directly on a line with an upward incline, r = +1. When all points fall directly on a downward incline, r = -1. Such perfect correlation is seldom encountered.
We still need to measure correlational strength, defined as the degree to which data points adhere to an imaginary trend line passing through the “scatter cloud.” Strong correlations are associated with scatter clouds that adhere closely to the imaginary trend line. Weak correlations are associated with scatter clouds that adhere marginally to the trend line. The closer r is to +1, the stronger the positive correlation. The closer r is to -1, the stronger the negative correlation. Examples of strong and weak correlations are shown below.

Note: Correlational strength cannot be quantified visually. Visual judgement is too subjective and is easily influenced by axis-scaling. The eye is not a good judge of correlational strength.
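Since the eye is a poor judge, the correlation coefficient r for the twelve-district example can be computed directly. The Python sketch below is a minimal illustration, assuming the X and Y values pair up with the districts in the order they are listed in the data set; the function name pearson_r is our own, and the formula used is the usual Pearson product-moment coefficient.

```python
from math import sqrt

# X = % receiving midday meal, Y = % owning bicycle, in the order the districts are listed
# (assumption: the i-th X value belongs with the i-th Y value)
x = [50, 11, 2, 19, 26, 73, 81, 51, 11, 2, 19, 25]
y = [22.1, 35.9, 57.9, 22.2, 42.4, 5.8, 3.6, 21.4, 55.2, 33.3, 32.4, 38.4]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    s_xx = sum((a - mean_x) ** 2 for a in xs)
    s_yy = sum((b - mean_y) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(x, y)
print(f"r = {r:.3f}")  # prints "r = -0.849"
```

A value of about -0.85 is close to -1, which agrees with the scatter plot: a fairly strong negative correlation between X and Y.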