Research Methods, Winter 2008
Chapter 9: Measures of Central Tendency and Dispersion
Instructor: Dr. Harry Webster

Chapter 9
Describing Distributions with Numbers

A) Finding the median involves ordering the data and locating positions.
B) The mean, mode, and standard deviation do not involve position.

Measures of central tendency are: median, mode, and mean.
Measures of spread are: variance and standard deviation.

A) Median, quartiles, five-number summary and boxplots

Median: the midpoint of an ordered (smallest to largest) distribution of data; the 50th percentile. (Percentile: ranking out of 100.)

Calculating the median:
1. Arrange scores from smallest to largest.
2. Use the formula (n + 1)/2 to find the location of the median.
3a. If you have an odd number of scores, the formula will lead you to the median score.

Ex. 1: 2 3 4 5 6 7 8 9 10
Formula: (n + 1)/2 = (9 + 1)/2 = 5 (location of median)
Count 5 scores and we get 6. 6 is the median score.

Ex. 2: 2 3 4 5 5 5 6 7 8 8 8
Formula: (n + 1)/2 = (11 + 1)/2 = 6 (location of median)
Count 6 scores and we get 5. As there are several 5's, we must indicate (underline) which 5 is the median score; here it is the third 5.

B) Mean and Standard Deviation

Mean: the average of the scores. Pronounced 'x-bar'; symbol = x̄
Sum of scores divided by number of cases.
Ex.: 1 2 3 4: 10/4 = 2.5
Sensitive to outliers. Ex.: 1 2 3 40: 46/4 = 11.5

Standard Deviation
The most frequently used expression of spread/variability.
It is a measure of the average spread of scores from the mean.
Small standard deviations involve a set of scores that are close to the mean.
Large standard deviations involve a set of scores that are further away from the mean.
It is influenced by outliers (the mean is used to calculate the standard deviation).

The Standard Deviation Formula

$\text{St. Dev.} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Σ = sum of everything in parentheses
x = each score
x̄ = mean
n = number of scores

To calculate the standard deviation (S.D. or St. Dev.):
Step 1. Find the mean.
Step 2. Find the distance of each score from the mean.
Step 3. Square each result to get rid of negatives.
Step 4. Add up the squared deviations (from the mean).
Step 5. Divide by n - 1. This gives the variance.
Step 6. Find the square root. This gives the St. Dev.

Example: Data set: 1 2 4 6 7

1. Find the mean: 20/5 = 4

2. Deviation      3. Squared deviation
   1 - 4 = -3        3 x 3 = 9
   2 - 4 = -2        2 x 2 = 4
   4 - 4 =  0        0
   6 - 4 =  2        2 x 2 = 4
   7 - 4 =  3        3 x 3 = 9

4. Total of the squared deviations: 26
5. Divide the sum of the squared deviations by n - 1: 26/(5 - 1) = 6.5. This is the variance.
6. Take the square root of the variance: square root of 6.5 = 2.55. This is the standard deviation.

Use medians when there are outliers. Ex.: income.
Use means and standard deviations when the distribution appears symmetrical. Ex.: test grades, performance on athletic variables that are measured in time.
Use the mode with Nominal, Ordinal, Interval, and Ratio levels of measurement. The mode is the only measure of central tendency that can be used with Nominal data such as gender of respondents, preferred type of music, marital status, etc.

6. We have used two sets of data (7 2 2 1 3 4 5 6 and 7 2 2 1 3 4 50 6) to determine five-number summaries and standard deviations. Using the numbers, show the effects of outliers.

CHAPTER 9
NORMAL DISTRIBUTIONS

When a graph depicts the proportion of scores instead of the frequency of scores, it is called a density graph. The proportions add up to 1 (100%).
When the density graph is smoothed into a line, it is called a density curve.

• In a skewed distribution, the mean lies further towards the tail than the median, as it takes into account the size of the scores there (ex.: outliers).
• The median depicts position in a distribution of data only; it is not affected by the more extreme scores.
[Figures: a normal curve; a curve skewed to the right]

• Normal Curves/Normal Distributions:
• The most important curve in Social Science and Commerce statistics.
• Many biological variables fall on a normal curve. Ex.: height.
• Many psychological variables are 'forced' into a normal curve.
Ex.: I.Q., some psychological inventories.
• Many sociological/economic variables don't fall into a normal curve. Ex.: income, education.

Features of Normal Curves (Normal Distributions):
1. Given the mean and standard deviation, we can draw the normal curve.
2. The mean is the center of the distribution; it cuts it in half. This is also the median or 50th percentile.
3. The curve is symmetrical; one side of the mean mirrors the other.
4. The standard deviation determines the shape of the curve. The smaller the standard deviation, the closer the scores are to one another and the 'taller' the curve.

The standard deviation breaks the normal curve into segments that reflect the percent of scores in the set of scores. The 50th percentile is at zero standard deviations from the mean.
[Figure: normal curve with the horizontal axis marked in standard deviations from the mean]

• The 68-95-99.7 Rule
• About 68% of all scores fall within 1 standard deviation of the mean (between -1 and +1).
• About 95% of all scores fall within 2 standard deviations of the mean (between -2 and +2).
• About 99.7% of all scores fall within 3 standard deviations of the mean (between -3 and +3).
• As the tails of the normal curve do not touch the horizontal axis, we cannot determine the number of standard deviations for 100% of the scores.
• This is to leave room for extreme outliers.

Ex.: Women's height. Mean = 65 inches; St. Dev. = 2.5 inches. By the rule, about 68% of women are between 62.5 and 67.5 inches tall, about 95% between 60 and 70 inches, and about 99.7% between 57.5 and 72.5 inches.

Chapter 5
What Is a Confidence Interval?

INTRODUCTION TO STATISTICAL INFERENCE

Statistical inference is a technique for making decisions regarding the probability that the population would behave in the same way as the sample.
As it is based on probability, the rules of probability must be followed. Therefore, the assumptions which must be met are:
1) Randomness: the predictable pattern of outcomes after very many trials.
1a) If samples are chosen randomly, then the pattern of outcomes is approximately a normal distribution. This is called a sampling distribution.
2) We assume the mean of the normal distribution reflects the population parameter.
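The idea of a sampling distribution can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population proportion (0.57), the sample size (1000), the number of samples (1000), and the function names are all hypothetical choices made for illustration, not values taken from the course. It draws many random samples and shows that the sample proportions pile up around the population value, with roughly 95% of them within 2 standard deviations of their center.

```python
import random
import statistics


def sample_proportion(p, n, rng):
    """Draw one random sample of size n and return the proportion of 'yes' answers."""
    yes = sum(1 for _ in range(n) if rng.random() < p)
    return yes / n


def simulate_sampling_distribution(p=0.57, n=1000, samples=1000, seed=1):
    """Return the sample proportions from many independent random samples.

    p, n, samples and seed are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    return [sample_proportion(p, n, rng) for _ in range(samples)]


p_hats = simulate_sampling_distribution()

# Center and spread of the simulated sampling distribution
center = statistics.mean(p_hats)
spread = statistics.stdev(p_hats)

# Share of sample proportions within 2 standard deviations of the center
within_2 = sum(1 for x in p_hats if abs(x - center) <= 2 * spread) / len(p_hats)

print(f"mean of sample proportions: {center:.3f}")
print(f"standard deviation of sample proportions: {spread:.4f}")
print(f"share within 2 standard deviations: {within_2:.3f}")
```

Changing the seed gives slightly different numbers each time, but the center stays close to the population proportion, which is what assumption 2 above relies on.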
Statistical inference helps us determine how confident we are about where a result falls on the sampling distribution in two ways:
1. Confidence Intervals: how confident we are that our sample's result captured the population parameter within a certain range (margin of error).
2. Tests of significance: we make a claim about the population and use the sample's results to test that claim. We want to determine the probability of our claim being right.

CHAPTER 5
WHAT IS A CONFIDENCE INTERVAL?

A confidence interval estimates a population parameter from a sample statistic at a certain level of confidence. Here confidence means the probability of being right. We also refer to it as a Confidence Statement.

We take the sample's statistic (data) and estimate what the population's answer would be. This involves how sure we are (confidence level) and a margin of error (the margin within which we believe the population's answer falls).

We can develop Confidence Statements for:
A) Data given in percents/proportions.
B) Data given in means.
(The only difference is a change of formula.)

p = population parameter
p̂ = statistic (result) from a sample
Take any statistic and estimate the probability (confidence level) of it capturing the population parameter within a certain margin of scores (margin of error).

A) When the statistic is given in percents or proportions.
The formula to find a confidence interval for any level of confidence is:

$\hat{p} \pm z^* \sqrt{\hat{p}(1 - \hat{p})/n}$

p̂ = sample statistic (proportion or percent)
z* = z score (standard score) for the chosen confidence level
n = number of subjects in the sample (sample size)

Example: Mayor Tremblay is two weeks from election day. He wants to know his chances of winning the election.
A polling company asks 1000 people who they would vote for if the election were held today, and 57% say Mayor Tremblay. Tremblay wants to be 90% confident that he will win (z* = 1.64 for 90% confidence).

$0.57 \pm 1.64 \sqrt{0.57(1 - 0.57)/1000} = 0.57 \pm 0.0257$

or 57% plus or minus about 2.6%.

The margin of error is about 2.6%. By subtracting it from and adding it to the percent of people who said they would vote for Mayor Tremblay (57%), we find the range of scores within which we are 90% confident the population parameter lies.

Confidence Statement
Mayor Tremblay can be 90% confident that between 54.4% and 59.6% of all voters would vote for him if the election were held today. (The 'all' reflects the population parameter.)
The confidence statement is the whole sentence; the margin of error is plus or minus 2.6%, giving the interval from 54.4% to 59.6%; the confidence level is 90%.

CHAPTER 9
DESCRIBING RELATIONSHIPS: SCATTERPLOTS AND CORRELATIONS

Scatterplots: These involve the relationship between two or more quantitative (ordinal, interval or ratio; NB: not nominal) variables measured on the same individuals/objects. (For our course, we will deal with two variables.)
The graph that depicts this relationship is called a scatterplot.

Sometimes, scatterplots have an explanatory variable (on the horizontal axis) and a response variable (on the vertical axis).
The explanatory variable is the independent variable.
The response variable is the dependent variable.

Each dot in a scatterplot reflects two pieces of information (variables) about an individual.
In this example, the individuals are countries. The graph depicts the relationship between gross domestic product per person and longevity. (p. 271)

Some scatterplots have no explanatory and response variables; they show only the relationship between two variables.
Ex.: The Archaeopteryx: the femur (leg bone) and humerus (arm bone); the size of one does not 'explain' or 'contribute' to the size of the other. (p.
274)

This scatterplot has a definite shape: as one variable increases, the other tends to increase. This is called a positive association.
[Figure: Association between Ice Cream Sales and Temperature; ice cream sales on the vertical axis, temperature on the horizontal axis]

When one variable decreases as the other increases, it is called a negative association.
[Figure: Ice Cream Price and Sales; sales on the vertical axis, price on the horizontal axis]

When there is no relationship between the change in one variable and the change in another variable, there is no association.
[Figure: Scatterplot of Ice Cream Sales and TV Violence; ice cream sales on the vertical axis, TV violence ratings on the horizontal axis]

To examine a scatterplot:
1. Look at the overall pattern and any important deviations.
2. Describe the scatterplot using the form, direction and strength of the relationship.
3. Look for outliers.
4. The closer the data are to forming a straight line, the stronger the association (either negative or positive).
Ex.: The Archaeopteryx: There is a strong positive association between the size of the femur and the humerus, with no outliers.

When the association between two variables is expressed mathematically, it is called a correlation.

Features of Correlations
1. It is expressed as r.
2. The range is from -1.00 to +1.00.
3. -1.00 is a perfect negative correlation; +1.00 is a perfect positive correlation. These are almost never seen with real data. Zero is no correlation: there is no linear relationship between the variables.
4. Correlations use standard scores, so we can compute them for any two variables (they don't have to be in the same unit of measurement).
5. Correlations measure the strength of straight-line (linear) associations between variables.
6. Correlations are affected by outliers. The more data there are, the less an outlier will influence the correlation.

Correlations between (ignoring sign):    Are considered:
.8 - 1.00                                Very Strong
.6 - .79                                 Strong
.4 - .59                                 Moderate
.2 - .39                                 Weak
0.0 - .19                                Very Weak

2.
What is wrong with the following statements?
a) The correlation between the first snow storm of any given year and the number of car accidents that day is r = -1.3
b) The correlation between gender and income is about r = .66

3. Give an example for each of the following:
a) A strong positive correlation
b) A strong negative correlation
c) No correlation

EXERCISES

1. Professor Lively runs every day for at least 30 minutes and checks her pulse rate.

Time    34.12   35.52   34.52   34.05   34.13   35.52   36.17
Pulse   152     124     140     152     146     128     136

1a) Draw a scatterplot for these data.
1b) The correlation is r = -.815. Briefly describe what this means.

2. The following are the closing quotes for the Nasdaq and Microsoft for ten trading days.

Nasdaq   Microsoft
1742     54
1785     57
1770     55
1789     56
1784     56
1804     57
1862     60
1845     60
1826     59
1824     59

2a) The correlation is r = .974. Describe what this means:
a. Does NASDAQ performance cause Microsoft sales to rise?
b. Do Microsoft sales cause NASDAQ performance to rise?
c. Neither a. nor b.
Justify your answer.

Causation:
The reason something occurs; what makes it happen.
Requires experimental research designs where there is a great deal of control of all variables.
Philosophically, causation requires a 'leap of faith' from excluding all other possible explanations to granting the independent variable the power to have caused the behavior.

a) Simple Causation: Very rare in real life. A causes B to happen.
Ex.: paying students $250 to get 80%+ in a course. This would increase the number of students who get 80%+. If everything else is kept constant, we could say that the $250 had an effect on students' behavior; it caused an increase in grades.
[Diagram: A → B]

b) Common Response: A causes B and C.
When changes in two variables are caused by a third, common, variable.
Ex.: July is the season for the highest ice cream sales; July is also the month when the most people drown.
Ice cream does not cause drowning; the warm weather increases both sales and drownings.
[Diagram: A → B and A → C]

c) Confounding Response: We know two variables cause a change in a third, but we don't know the 'weight' of each variable.
Ex.: a person smokes and drinks too much. The heart is affected; we know that both contribute but do not know how much each contributes.
We need to do experimental research to 'sort out' the influences. This helps to isolate each variable's effect on the heart.

When experimentation is not possible, we can approach causation if the following conditions are met:
1. The association between two variables is strong.
2. The association between two variables is consistent.
3. The alleged cause precedes the effect.
4. The alleged cause is plausible.

3. People who drink diet soft drinks tend to gain more weight over a one-year period than people who do not. Does drinking diet drinks make people gain weight? Give a more plausible explanation.
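The correlations quoted in Exercises 1 and 2 (r = -.815 for Professor Lively's running times and pulse rates, r = .974 for the Nasdaq and Microsoft quotes) can be checked with a few lines of code. The sketch below assumes the usual definition of r as the average product of standard scores, dividing by n - 1; the function and variable names are illustrative only. The printed values can be compared with the ones given in the exercises. A value of r close to +1 or -1 describes the strength of a straight-line association only; it says nothing by itself about causation.

```python
import math


def correlation(xs, ys):
    """Correlation r: average product of standard scores (dividing by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard deviations, using the n - 1 formula
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Multiply the paired standard scores and average them
    products = (((x - mean_x) / sd_x) * ((y - mean_y) / sd_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)


# Exercise 1: running time and pulse rate
time = [34.12, 35.52, 34.52, 34.05, 34.13, 35.52, 36.17]
pulse = [152, 124, 140, 152, 146, 128, 136]

# Exercise 2: closing quotes for ten trading days
nasdaq = [1742, 1785, 1770, 1789, 1784, 1804, 1862, 1845, 1826, 1824]
microsoft = [54, 57, 55, 56, 56, 57, 60, 60, 59, 59]

r_run = correlation(time, pulse)
r_stock = correlation(nasdaq, microsoft)

print(f"Exercise 1: r = {r_run:.3f}")
print(f"Exercise 2: r = {r_stock:.3f}")
```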