Non-parametric tests (examples)
Some repetition of key concepts (time permitting)
Free experiment status
Exercise: group tasks on non-parametric tests (worked examples will be provided!)
Free experiment supervision/help

Did you get the compendium? Remember: for week 12 (regression and correlation) there are 100+ pages in the compendium. No need to read all of it – read the introductions to each chapter and get a feel for the first simple examples; multiple regression and multiple correlation are for future reference.

Two types of statistical test:

Parametric tests: Based on the assumption that the data have certain characteristics, or "parameters". Results are only valid if:
(a) the data are normally distributed;
(b) the data show homogeneity of variance;
(c) the data are measurements on an interval or ratio scale.

[Figure: bar chart of two example groups – Group 1: M = 8.19 (SD = 1.33); Group 2: M = 11.46 (SD = 9.18)]

Non-parametric tests: Make no assumptions about the data's characteristics. Use if any of the three properties below is true:
(a) the data are not normally distributed (e.g. skewed);
(b) the data show inhomogeneity of variance;
(c) the data are measurements on an ordinal scale (ranks).
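Assumption (a) can be checked in software before choosing a test. A minimal sketch in Python with scipy (an addition to these notes, not part of the lecture, which uses SPSS; the seeded data are purely illustrative):

```python
# Sketch: testing whether sample data depart from normality, using the
# Shapiro-Wilk test from scipy (SPSS offers Kolmogorov-Smirnov for this).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)                 # illustrative data, seeded
normal_scores = rng.normal(loc=10, scale=2, size=500)
skewed_scores = rng.exponential(scale=2, size=500)

for name, scores in [("normal", normal_scores), ("skewed", skewed_scores)]:
    w, p = stats.shapiro(scores)
    # p < .05 -> significant departure from normality -> prefer a
    # non-parametric test for these data
    print(name, f"W = {w:.3f}, p = {p:.4g}")
```

For the strongly skewed sample, p will be far below .05, signalling that a non-parametric test is the safer choice.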
Non-parametric tests are used when we do not have ratio/interval data, or when the assumptions of parametric tests are broken. Just like parametric tests, which non-parametric test to use depends on the experimental design (repeated measures or independent groups) and the number of/levels of IVs. Non-parametric tests are minimally affected by outliers, because scores are converted to ranks.

Examples of parametric tests and their non-parametric equivalents:

Parametric test                        Non-parametric counterpart
Pearson correlation                    Spearman's correlation
(No parametric equivalent)             Chi-square test
Independent-means t-test               Mann-Whitney test
Dependent-means t-test                 Wilcoxon test
One-way independent-measures ANOVA     Kruskal-Wallis test
One-way repeated-measures ANOVA        Friedman's test

Non-parametric tests make few assumptions about the distribution of the data being analyzed. They get around this by not using the raw scores, but by ranking them: the lowest score gets rank 1, the next lowest rank 2, etc. How the ranking is carried out differs from test to test, but the principle is the same. The analysis is carried out on the ranks, not the raw data.

Ranking data means we lose information – we do not know the distance between the ranks. This means that non-parametric tests are less powerful than parametric tests: they are less likely to discover an effect in our data (increased chance of a Type II error).

The Mann-Whitney test is the non-parametric equivalent of the independent t-test. Used when you have two conditions, each performed by a separate group of subjects. Each subject produces one score. Tests whether there is a statistically significant difference between the two groups.

Example: Difference between men and dogs. We count the number of "doglike" behaviors in a group of 20 men and 20 dogs over 24 hours. The result is a table with 2 groups and their numbers of doglike behaviors. We run a Kolmogorov-Smirnov test (the "vodka test") to see if the data are normally distributed.
The test is significant, though (p = .009), so we need a non-parametric test to analyze the data. The Mann-Whitney test looks for differences in the ranked positions of scores in the two groups (samples).

Mann-Whitney test, step-by-step:

Does it make any difference to students' comprehension of statistics whether the lectures are in English or in Serbo-Croat?
Group 1: Statistics lectures in English.
Group 2: Statistics lectures in Serbo-Croat.
DV: Lecturer intelligibility ratings by students (0 = "unintelligible", 100 = "highly intelligible").
Ratings – so Mann-Whitney is appropriate.

English (raw)  English (rank)  Serbo-Croat (raw)  Serbo-Croat (rank)
18             17              17                 15
15             10.5            13                 8
17             15              12                 5.5
13             8               16                 12.5
11             3.5             10                 1.5
16             12.5            15                 10.5
10             1.5             11                 3.5
17             15              13                 8
                               12                 5.5
Mean: 14.63    S.D.: 2.97      Mean: 13.22        S.D.: 2.33
Median: 15.5                   Median: 13

Step 1: Rank all the scores together, regardless of group.

How to rank scores:
(a) The lowest score gets rank "1"; the next lowest gets "2"; and so on.
(b) If two or more scores with the same value are "tied":
(i) Give each tied score the rank it would have had, had it been different from the other scores.
(ii) Add the ranks for the tied scores, and divide by the number of tied scores. Each of the ties gets this average rank.
(iii) The next score after the set of ties gets the rank it would have obtained, had there been no tied scores.

Example:
raw score:        6    34    34    48
"original" rank:  1    2     3     4
"actual" rank:    1    2.5   2.5   4

Formula for the Mann-Whitney test statistic, U:

                 Nx (Nx + 1)
U = N1 * N2 + --------------- - Tx
                      2

T1 and T2 = sum of ranks for groups 1 and 2
N1 and N2 = number of subjects in groups 1 and 2
Tx = the larger of the two rank totals
Nx = number of subjects in the Tx group

Step 2: Add up the ranks for group 1, to get T1. Here, T1 = 83. Add up the ranks for group 2, to get T2. Here, T2 = 70.

Step 3: N1 is the number of subjects in group 1; N2 is the number of subjects in group 2. Here, N1 = 8 and N2 = 9.
Step 4: Call the larger of these two rank totals Tx. Here, Tx = 83. Nx is the number of subjects in this group; here, Nx = 8.

Step 5: Find U. In our example:

               8 * (8 + 1)
U = 8 * 9 + --------------- - 83  =  72 + 36 - 83  =  25
                    2

If there are unequal numbers of subjects – as in the present case – calculate U for both rank totals and then use the smaller U. In the present example, for T1, U = 25, and for T2, U = 47. Therefore, use 25 as U.

Step 6: Look up the critical value of U (in a table), taking into account N1 and N2. If our obtained U is smaller than the critical value of U, we reject the null hypothesis and conclude that our two groups differ significantly.

Critical values of U:

N2 \ N1    5    6    7    8    9   10
  5        2    3    5    6    7    8
  6        3    5    6    8   10   11
  7        5    6    8   10   12   14
  8        6    8   10   13   15   17
  9        7   10   12   15   17   20
 10        8   11   14   17   20   23

Here, the critical value of U for N1 = 8 and N2 = 9 is 15. Our obtained U of 25 is larger than this, and so we conclude that there is no significant difference between our two groups.

Conclusion: Ratings of lecturer intelligibility are unaffected by whether the lectures are given in English or in Serbo-Croat.

Mann-Whitney using SPSS – output: SPSS gives us two boxes.

Ranks:
Language      N    Mean Rank    Sum of Ranks
English       8    10.38        83.00
Serbo-Croat   9     7.78        70.00
Total        17

Test Statistics (grouping variable: Language):
Mann-Whitney U                    25.000
Wilcoxon W                        70.000
Z                                 -1.067
Asymp. Sig. (2-tailed)            .286
Exact Sig. [2*(1-tailed Sig.)]    .321   (not corrected for ties)

The significance value of the test can be halved if you have a one-tailed (directional) hypothesis.

The Wilcoxon test: Used when you have two conditions, both performed by the same subjects. Each subject produces two scores, one for each condition. Tests whether there is a statistically significant difference between the two conditions.
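For readers working outside SPSS, the Mann-Whitney worked example above can be reproduced in Python with scipy (a sketch added to these notes, not part of the original lecture):

```python
# Sketch: the Mann-Whitney worked example, using scipy instead of SPSS.
from scipy import stats

english     = [18, 15, 17, 13, 11, 16, 10, 17]
serbo_croat = [17, 13, 12, 16, 10, 15, 11, 13, 12]

# Step 1: rank all 17 scores together; ties get averaged ranks.
ranks = stats.rankdata(english + serbo_croat)
t1, t2 = ranks[:8].sum(), ranks[8:].sum()
print(t1, t2)                       # 83.0 70.0 - matches Steps 2-3

res = stats.mannwhitneyu(english, serbo_croat, alternative="two-sided")
# scipy reports U for the first sample; take the smaller of the two Us
# to match the hand calculation in Step 5.
u = min(res.statistic, len(english) * len(serbo_croat) - res.statistic)
print(u)                            # 25.0
print(res.pvalue > 0.05)            # True: no significant difference
```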
Wilcoxon test, step-by-step:

Does background music affect the mood of factory workers? Eight workers; each tested twice.
Condition A: Background music.
Condition B: Silence.
DV: Worker's mood rating (0 = "extremely miserable", 100 = "euphoric").
Ratings data, so use the Wilcoxon test.

Worker    Silence    Music    Difference    Rank
1         15         10        5            4.5
2         12         14       -2            2.5
3         11         11        0            ignore
4         16         11        5            4.5
5         14          4       10            6
6         13          1       12            7
7         11         12       -1            1
8          8         10       -2            2.5
Silence: mean 12.5 (SD 2.56), median 12.5. Music: mean 9.13 (SD 4.36), median 10.5.

Step 1: Find the difference between each pair of scores, keeping track of the sign (+ or -) of the difference. (Different from the Mann-Whitney test, where the data themselves are ranked!)

Step 2: Rank the differences, ignoring their sign. Lowest = 1. Tied scores are dealt with as before. Ignore zero difference-scores.

Step 3: Add together the positive-signed ranks: 22. Add together the negative-signed ranks: 6.

Step 4: "W" is the smaller sum of ranks; W = 6. N is the number of differences, omitting zero differences: N = 8 - 1 = 7.

Step 5: Use a table of critical W-values to find the critical value of W for your N. Your obtained W has to be smaller than this critical value for it to be statistically significant.

Critical values of W:
One-tailed significance level:    0.025    0.01    0.005
Two-tailed significance level:    0.05     0.02    0.01
N = 6                             0        -       -
N = 7                             2        0       -
N = 8                             4        2       0
N = 9                             6        3       2
N = 10                            8        5       3

The critical value of W (for an N of 7) is 2. Our obtained W of 6 is bigger than this. Our two conditions are not significantly different.

Conclusion: Workers' mood appears to be unaffected by the presence or absence of background music.

Wilcoxon using SPSS – output:

Ranks (silence - music):
                   N    Mean Rank    Sum of Ranks
Negative ranks     4    5.50         22.00
Positive ranks     3    2.00          6.00
Ties               1
Total              8
a. silence < music (negative ranks: silence score lower than with music)
b. silence > music (positive ranks: silence score higher than with music)
c. silence = music (ties: no change in score with vs. without music)
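The same Wilcoxon example can be run outside SPSS; a sketch in Python with scipy (an addition to these notes, not part of the original lecture):

```python
# Sketch: the Wilcoxon worked example, using scipy instead of SPSS.
from scipy import stats

silence = [15, 12, 11, 16, 14, 13, 11, 8]
music   = [10, 14, 11, 11, 4, 1, 12, 10]

# zero_method="wilcox" drops the zero difference (worker 3), as in Step 2.
res = stats.wilcoxon(silence, music, zero_method="wilcox",
                     alternative="two-sided")
print(res.statistic)                # 6.0 - the smaller sum of ranks, W
print(res.pvalue > 0.05)            # True: conditions not significantly different
```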
As for the Mann-Whitney test, the z-score becomes more accurate with higher sample size.

Test Statistics (Wilcoxon Signed Ranks Test; silence - music):
Z                        -1.357   (number of SDs from the mean; based on positive ranks)
Asymp. Sig. (2-tailed)   .175     (significance value of the test)

Non-parametric tests for comparing three or more groups or conditions:

Kruskal-Wallis test: Similar to the Mann-Whitney test, except that it enables you to compare three or more groups rather than just two. Different subjects are used for each group.

Friedman's test (Friedman's ANOVA): Similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group). Each subject does all of the experimental conditions.

One IV, with multiple levels. Levels can differ:
(a) qualitatively/categorically, e.g.:
effects of managerial style (laissez-faire, authoritarian, egalitarian) on worker satisfaction;
effects of mood (happy, sad, neutral) on memory;
effects of location (Scotland, England or Wales) on happiness ratings.
(b) quantitatively, e.g.:
effects of age (20 vs 40 vs 60 year olds) on optimism ratings;
effects of study time (1, 5 or 10 minutes) before being tested on recall of faces;
effects of class size on 10-year-olds' literacy;
effects of temperature (60, 100 and 120 deg.) on mood.

Why have experiments with more than two levels of the IV?
(1) It increases the generality of the conclusions: e.g. comparing young (20) and old (70) subjects tells you nothing about the behaviour of intermediate age-groups.
(2) Economy: Getting subjects is expensive – may as well get as much data as possible from them, i.e. use more levels of the IV (or more IVs).
(3) We can look for trends: What are the effects on performance of increasingly large doses of cannabis (e.g. 100mg, 200mg, 300mg)?

Kruskal-Wallis test, step-by-step:

Does it make any difference to students' comprehension of statistics whether the lectures are given in English, Serbo-Croat – or Cantonese?
(Similar case to the Mann-Whitney example, just with one more language, i.e. one more group of people.)
Group A – 4 people: Lectures in English.
Group B – 4 people: Lectures in Serbo-Croat.
Group C – 4 people: Lectures in Cantonese.
DV: Student rating of the lecturer's intelligibility on a 100-point scale ("0" = "incomprehensible").
Ratings – so use a non-parametric test. 3 groups – so the Kruskal-Wallis test.

English (raw)  English (rank)  Serbo-Croat (raw)  Serbo-Croat (rank)  Cantonese (raw)  Cantonese (rank)
20             3.5             25                 7.5                 19               1.5
27             9               33                 10                  20               3.5
19             1.5             35                 11                  25               7.5
23             6               36                 12                  22               5

Step 1: Rank the scores, ignoring which group they belong to. The lowest score gets the lowest rank. Tied scores get the average of the ranks they would otherwise have obtained (note the difference from the Wilcoxon test!).

Formula:

          12           Tc²
H = ------------- * Σ ----- - 3 (N + 1)
      N (N + 1)        nc

N is the total number of subjects; Tc is the rank total for each group; nc is the number of subjects in each group; H is the test statistic.

Step 2: Find Tc, the total of the ranks for each group. Tc1 (the total for the English group) is 20. Tc2 (for the Serbo-Croat group) is 40.5. Tc3 (for the Cantonese group) is 17.5.

Step 3: Find H.

     Tc²     20²    40.5²    17.5²
Σ  ----- = ----- + ------ + ------ = 100 + 410.06 + 76.56 = 586.62
     nc      4       4        4

          12
H = ------------- * 586.62 - 3 * 13  =  45.12 - 39  =  6.12
      12 * 13

Step 4: In the Kruskal-Wallis test, we use degrees of freedom: the number of groups minus one. d.f. = 3 - 1 = 2.

Step 5: H is statistically significant if it is larger than the critical value of Chi-Square for this many d.f. (Chi-Square is a test statistic distribution we use.) Here, H is 6.12. This is larger than 5.99, the critical value of Chi-Square for 2 d.f. (SPSS gives us this; no need to look it up in a table, but we could do so.)

So: The three groups differ significantly. The language in which statistics is taught does make a difference to the lecturer's intelligibility.
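The Kruskal-Wallis worked example above can be reproduced in Python with scipy (a sketch added to these notes, not part of the original lecture):

```python
# Sketch: the Kruskal-Wallis worked example, using scipy instead of SPSS.
from scipy import stats

english     = [20, 27, 19, 23]
serbo_croat = [25, 33, 35, 36]
cantonese   = [19, 20, 25, 22]

h, p = stats.kruskal(english, serbo_croat, cantonese)
# scipy (like SPSS) corrects H for tied ranks, so it reports about 6.19
# rather than the uncorrected 6.12 worked out by hand.
print(round(h, 3))                  # 6.19
print(p < 0.05)                     # True: the three groups differ
```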
NB: the test merely tells you that the three groups differ; inspect the group medians to decide how they differ.

Using SPSS for the Kruskal-Wallis test: code the groups "1" for "English", "2" for "Serbo-Croat", "3" for "Cantonese". Independent-measures test type: one column gives the scores, another column identifies which group each score belongs to. Analyze > Nonparametric tests > k independent samples; choose the test variable and identify the groups.

Ranks:
Language      N    Mean Rank
English       4     5.00
Serbo-Croat   4    10.13
Cantonese     4     4.38
Total        12

Test Statistics (Kruskal Wallis Test; grouping variable: language):
Chi-Square (H)    6.190
df                2
Asymp. Sig.       .045

How do we find out how the three groups differed? One way is to construct a box-whisker plot and look at the median values. What we really need is contrasts and post-hoc tests, as for ANOVA. One solution is to run a series of Mann-Whitney tests, controlling for the build-up of Type I error: we need several Mann-Whitney tests, each with a 5% chance of a Type I error, and when running them in series this chance builds up (language 1 vs. language 2, language 1 vs. 3, etc.). We therefore apply a Bonferroni correction – use p < 0.05 divided by the number of Mann-Whitney tests conducted. We can get away with only comparing against the control condition – so a Mann-Whitney test for each language compared to the control group. We then see if any differences are significant.

Friedman's test (Friedman's ANOVA): Similar to the Wilcoxon test, except that you can use it with three or more conditions (for one group). Each subject does all of the experimental conditions.

Friedman's test, step-by-step:

Effects on worker mood of different types of music. Five workers; each is tested three times, once under each of the following conditions:
Condition 1: Silence.
Condition 2: "Easy-listening" music.
Condition 3: Marching-band music.
DV: Mood rating ("0" = unhappy, "100" = euphoric).
Ratings – so use a non-parametric test.
NB: To avoid practice and fatigue effects, the order of presentation of the conditions is varied/randomized across subjects.

Step 1: Rank each subject's scores individually. Worker 1's scores are 4, 5, 6: these get ranks of 1, 2, 3. Worker 4's scores are 3, 7, 5: these get ranks of 1, 3, 2.

Worker    Silence raw (rank)    Easy raw (rank)    Band raw (rank)
1         4  (1)                5  (2)             6  (3)
2         2  (1)                7  (2.5)           7  (2.5)
3         6  (1.5)              6  (1.5)           8  (3)
4         3  (1)                7  (3)             5  (2)
5         3  (1)                8  (2)             9  (3)

Step 2: Find the rank total for each condition, using the ranks from all subjects within that condition. Rank total for the "Silence" condition: 1 + 1 + 1.5 + 1 + 1 = 5.5. Rank total for the "Easy-listening" condition: 11. Rank total for the "Marching-band" condition: 13.5.

Step 3: Work out χr² (the test statistic for Friedman's ANOVA):

             12
χr² = --------------- * Σ Tc² - 3 N (C + 1)
        N C (C + 1)

C is the number of conditions (here 3 types of music). N is the number of subjects (here 5 workers). Σ Tc² is the sum of the squared rank totals for the conditions (rank totals 5.5, 11 and 13.5 respectively for the three types of music).

To get Σ Tc²:
(1) Square each rank total: 5.5² = 30.25; 11² = 121; 13.5² = 182.25.
(2) Add these squared totals together: 30.25 + 121 + 182.25 = 333.5.

In our example:

            12
χr² = ------------- * 333.5 - 3 * 5 * 4  =  66.7 - 60  =  6.7
        5 * 3 * 4

Step 4: Degrees of freedom = number of conditions minus one. d.f. = 3 - 1 = 2.

Step 5: Assessing the statistical significance of χr² depends on the number of subjects and the number of groups.
(a) Fewer than 9 subjects: Use a special table of critical values for χr².
(b) 9 or more subjects: Use a Chi-Square table for the critical values. Compare your obtained χr² value to the critical value of Chi-Square for your number of d.f. If your obtained χr² is bigger than the critical Chi-Square value, your conditions are significantly different.

The test only tells you that some kind of difference exists; look at the median score for each condition to see where the difference comes from.

We have 5 subjects and 3 conditions, so use the Friedman table for small sample sizes. Obtained χr² is 6.7. For N = 5, a χr² value of 6.4 would occur by chance with a probability of 0.039. Our obtained value is bigger than 6.4, so p < 0.039.

Conclusion: The conditions are significantly different. Music does affect worker mood.

Using SPSS to perform Friedman's ANOVA: Repeated measures – each row is one participant's data, just as for the Wilcoxon and other repeated-measures tests. Analyze > Nonparametric Tests > k related samples; drag over the variables to be included in the test. (Note: in this dialog you can also select a Kolmogorov-Smirnov test for checking whether your sample data are normally distributed.)

Output from Friedman's ANOVA:

Descriptive Statistics:
           N    Mean      Std. Deviation    Minimum    Maximum
silence    5    3.6000    1.51658           2.00       6.00
easy       5    6.6000    1.14018           5.00       8.00
marching   5    7.0000    1.58114           5.00       9.00

Ranks:
           Mean Rank
silence    1.10
easy       2.20
marching   2.70

Test Statistics (Friedman Test):
N              5
Chi-Square     7.444
df             2
Asymp. Sig.    .024
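The Friedman worked example above can be reproduced in Python with scipy (a sketch added to these notes, not part of the original lecture):

```python
# Sketch: the Friedman worked example, using scipy instead of SPSS.
from scipy import stats

silence  = [4, 2, 6, 3, 3]
easy     = [5, 7, 6, 7, 8]
marching = [6, 7, 8, 5, 9]

chi2, p = stats.friedmanchisquare(silence, easy, marching)
# scipy (like SPSS) corrects for tied ranks, giving 7.444 rather than
# the uncorrected 6.7 from the hand calculation.
print(round(chi2, 3))               # 7.444
print(p < 0.05)                     # True: music affects worker mood
```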
The Chi-Square statistic in the SPSS output is Friedman's χr². NB: SPSS reports 7.444, a slightly different value from the 6.7 worked out by hand (SPSS applies a correction for tied ranks).

Summary:
Mann-Whitney: Two conditions, two groups, each participant contributes one score.
Wilcoxon: Two conditions, one group, each participant contributes two scores (one per condition).
Kruskal-Wallis: 3+ conditions, different people in all conditions, each participant contributes one score.
Friedman's ANOVA: 3+ conditions, one group, each participant contributes 3+ scores.

Which non-parametric test?
1. Differences in fear ratings for 3, 5 and 7-year-olds in response to sinister noises from under their bed.
2. Effects of cheese, brussels sprouts, wine and curry on the vividness of a person's dreams.
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone.
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners.
Consider: How many groups? How many levels of the IV/conditions?

Answers:
1. Fear ratings for 3, 5 and 7-year-olds: 3 groups, each participant one score, 3 conditions – Kruskal-Wallis.
2. Effects of cheese, brussels sprouts, wine and curry on dreams: one group, each participant 4 scores, 4 conditions – Friedman's ANOVA.
3. Eardrum-spearing after Britney Spears, Beyonce, Robbie Williams and Boyzone: one group, each participant 4 scores, 4 conditions – Friedman's ANOVA.
4. Pedestrians rating the aggressiveness of car owners: 4 groups, each participant one score – Kruskal-Wallis.

What is a "population"? (What, again!?!?) Also: types of measures, the normal distribution, standard error, effect size.

The term "population" does not necessarily refer to a set of individuals or items (e.g. cars).
Rather, it refers to a state of the individuals or items. Example: After a major earthquake in a city (in which no one died), the actual set of individuals remains the same, but their anxiety level, for example, may change. The anxiety levels of the individuals before and after the quake define them as two populations. "Population" is an abstract term we use in statistics.

My brain is the size of a walnut!

Scientists are interested in how variables change, and what causes the change. Anything that we can measure and which changes is called a variable. "Why do people like the color red?" Variable: preference for the color red. Variables can take many forms, i.e. numbers, abstract values, etc. Values are measurable. Measuring the size of variables is important for comparing results between studies/projects.

Different measures provide different quality of data:
Nominal (categorical) data and ordinal data: non-parametric.
Interval data and ratio data: parametric.

Nominal data (categorical, frequency data): When numbers are used as names. There is no relationship between the size of the number and what is being measured; two things with the same number are equivalent, and two things with different numbers are different. E.g. the numbers on the shirts of soccer players. Nominal data are only used for frequencies: how many times "3" occurs in a sample; how often player 3 scores compared to player 1.

Ordinal data: Provide information about the ordering of the data, but do not tell us about the relative differences between values. For example: the order in which people complete a race – from the winner to the last to cross the finish line. The typical scale for questionnaire data.

Interval data: When measurements are made on a scale with equal intervals between points on the scale, but the scale has no true zero point. Example: the Celsius temperature scale: 100 is water's boiling point; 0 is an arbitrary zero-point (where water freezes), not a true absence of temperature.
Equal intervals represent equal amounts, but ratio statements are meaningless – e.g., 60 deg C is not twice as hot as 30 deg!

Ratio data: When measurements are made on a scale with equal intervals between points on the scale, and the scale has a true zero point. E.g. height, weight, time, distance. Measurements of relevance include: reaction times, numbers of correct answers, error scores in usability tests.

His brain has a standard error ...

If we take repeated samples, each sample has a mean, a standard deviation (s), and a shape/distribution. Due to random fluctuations, each sample is different – from other samples and from the parent population. These differences are predictable, so we can use samples to make inferences about their parent populations.

Often we have more than one sample from a population. This permits the calculation of different sample means, whose values will vary, giving us a sampling distribution.

[Figure: ten sample means (M = 8 to 12) plotted as a frequency histogram – the sampling distribution, here with mean = 10 and SD = 1.22]

The sampling distribution informs us about the behavior of samples from the population. We can calculate the SD of the sampling distribution. This is called the Standard Error of the Mean (SE). The SE shows how much variation there is within a set of sample means, and therefore also how likely a specific sample mean is to be erroneous as an estimate of the true population mean.

SE = the SD of the distribution of sample means. We can estimate the SE from a single sample:

estimated SE = s / √n

i.e. the SD of the sample divided by the square root of the sample size (n).

If the SE is small, our obtained sample mean is more likely to be similar to the true population mean than if the SE is large. Increasing n reduces the size of the SE: a sample mean based on 100 scores is probably closer to the population mean than a sample mean based on 10 scores(!). Variation between samples decreases as sample size increases, because extreme scores become less important to the mean.

Example: with s = 2 and n = 100:

SE = 2 / √100 = 2 / 10 = 0.20

Suppose n = 16 instead of 100:

SE = 2 / √16 = 2 / 4 = 0.50

Almost finished ...

[Figure: histogram of frequency of errors made vs. number of errors made (1-10)]

The Normal curve is a mathematical abstraction which conveniently describes ("models") many frequency distributions of scores in real life – e.g. the length of time before someone looks away in a staring contest, or the length of pickled gherkins.

Francis Galton (1876), 'On the height and weight of boys aged 14, in town and country public schools', Journal of the Anthropological Institute, 5, 174-180:

[Figure: heights of 14-year-old children, town vs. country – frequency (%) against height (inches, 51-70); an approximately normal distribution. X-axis: size of score; y-axis: frequency.]

Properties of the Normal Distribution:
1. It is bell-shaped and asymptotic at the extremes.
2. It is symmetrical around the mean.
3. The mean, median and mode all have the same value.
4. It can be specified completely, once the mean and SD are known.
5. The area under the curve is directly proportional to the relative frequency of observations. E.g. 50% of scores fall below the mean, as does 50% of the area under the curve; if 85% of scores fall below score X, this corresponds to 85% of the area under the curve.
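The standard-error arithmetic above (s = 2 with n = 100 vs. n = 16) can be scripted directly; a minimal sketch in Python (added to these notes):

```python
# Sketch: the estimated standard error of the mean, SE = s / sqrt(n).
import math

def standard_error(s, n):
    """SD of one sample divided by the square root of the sample size."""
    return s / math.sqrt(n)

print(standard_error(2, 100))       # 0.2 - larger n, smaller SE
print(standard_error(2, 16))        # 0.5 - smaller n, larger SE
```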
Relationship between the normal curve and the standard deviation (SD): All normal curves share this property: the SD cuts off a constant proportion of the distribution of scores.

About 68% of scores fall in the range of the mean plus and minus 1 SD;
95% in the range of the mean +/- 2 SDs;
99.7% in the range of the mean +/- 3 SDs.

E.g.: I.Q. is normally distributed, with a mean of 100 and an SD of 15. Therefore:
68% of people have I.Q.s between 85 and 115 (100 +/- 15);
95% have I.Q.s between 70 and 130 (100 +/- 2*15);
99.7% have I.Q.s between 55 and 145 (100 +/- 3*15).

Just by knowing the mean, the SD, and that scores are normally distributed, we can tell a lot about a population. If we encounter someone with a particular score, we can assess how they stand in relation to the rest of their group. E.g.: someone with an I.Q. of 145 is quite unusual: this is 3 SDs above the mean. I.Q.s of 3 SDs or above occur in only 0.15% of the population [(100 - 99.7) / 2]. Note: divide by 2, as there are two sides to the normal distribution!

Conclusions: Many psychological/biological properties are normally distributed. This is very important for statistical inference (extrapolating from samples to populations).

My scaly butt is of large size!
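The 68/95/99.7 figures and the I.Q. example can be verified numerically; a sketch using scipy's normal distribution (added to these notes, not part of the lecture):

```python
# Sketch: proportions under the normal curve for I.Q. (mean 100, SD 15).
from scipy import stats

iq = stats.norm(loc=100, scale=15)

for k in (1, 2, 3):
    prop = iq.cdf(100 + k * 15) - iq.cdf(100 - k * 15)
    print(f"within {k} SD: {prop:.1%}")   # 68.3%, 95.4%, 99.7%

# Proportion with I.Q. 145 (3 SDs above the mean) or higher:
print(f"{iq.sf(145):.2%}")                # 0.13% (the slide's 0.15% comes
                                          # from the rounded 99.7% figure)
```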
Just because the test statistic is significant does not mean that the effect measured is important – it may account for only a very small part of the variance in the dataset, even though it is bigger than the random variance. So we calculate effect sizes – a measure of the magnitude of an observed effect.

A common effect size is Pearson's correlation coefficient, normally used to measure the strength of the relationship between two variables. We call this "r". Its square, r², gives the proportion of the total variance in the dataset that can be explained by the experiment. r falls between 0 (the experiment explains no variance at all, effect size = zero) and 1 (the experiment explains all the variance, a perfect effect size).

Three conventional levels of r:
r = 0.1 – small effect, 1% of the total variance explained;
r = 0.3 – medium effect, 9% of the total variance explained;
r = 0.5 – large effect, 25% of the variance explained.

Note: this is not a linear scale – an r of 0.2 is not twice the effect of an r of 0.1. r is standardized, so we can compare across studies. Effect sizes are objective measures of the importance of a measured effect. The bigger the effect size of something, the easier it is to find experimentally: if the IV manipulation has a major effect on the DV, the effect size is large.

r can be calculated from a lot of test statistics, notably z-scores:

r = Z / √N

i.e. the z-score divided by the square root of the sample size.
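As a sketch (added to these notes), the formula can be applied to the Mann-Whitney output from earlier, where SPSS reported Z = -1.067 for N = 17 ratings:

```python
# Sketch: effect size r = Z / sqrt(N), computed from a test's z-score.
import math

def effect_size_r(z, n):
    """The sign of Z is irrelevant to the size of the effect."""
    return abs(z) / math.sqrt(n)

print(round(effect_size_r(-1.067, 17), 2))   # 0.26 - between small and medium
```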