Research Methods, Winter 2008
Organizing & Analyzing Data
Chapter 9: Graphs & Charts
Instructor: Dr. Harry Webster

Where Do Data Come From?
Anything that we can put into a number is data. Good data require valid measurements and an appropriate way to investigate the topic. Types of investigations in Social Science and Commerce that usually have data are:

1. Observational Study
Uses predetermined categories (target behaviors/events) and observes their frequency. Seeks to describe a behavior or event.
Ex., observe children for helpful behavior. Identify the target behaviors before beginning the observation, and define helpful behavior precisely.
Ex., p. 7 textbook: compared proximity of residence to power lines for children with leukemia and those without leukemia.

2. Survey Studies
Offer a series of questions, usually to a large number of people. Ex., ability and attitude/opinion questionnaires. Look at how many people gave a certain answer (frequencies and percents).
2a) Sample surveys: a survey where much attention is given to securing the sample in a random manner where, as much as possible, everyone has an equal chance of being selected. Ex., public opinion polls, marketing research.

3. Census
Tries to gather data from everyone in a country. Underestimates the homeless and some minority groups. Governments do this to establish voting districts and to track economic and social trends.
For Canadian Census

4. Experiments
The researcher changes one thing (the independent variable, IV; called the treatment in the textbook), keeps everything else the same, and determines whether behavior (the dependent variable, DV) is affected.
The basic design is two groups, where one group gets one level of the IV and the other group gets another level of the IV (this includes its absence). Since everything else is kept constant, any change in behavior (DV) is due to the change in the IV.
We can make cause and effect (what causes what) statements with experiments. Cause and effect statements can only be made about groups, not individuals.
They are probable causes, not certain ones (remember: statistics is about probability, not certainty).

Ex., Does a tutor impact academic performance?
Group A: No tutor in QM
Group B: A tutor in QM
Both groups are otherwise the same. Compare grades in QM.

Vocabulary (From Ch 4 Test 1)
Individual: anything that is measured; includes people, objects, animals. The ‘thing’ that provides the measurement. Ex., person.
Variable: any characteristic of an individual; anything that we actually measure. Ex., height, weight, schooling, singing ability, opinion.

A measurement can be a count (frequency) or a rate (proportion or percent). Usually, a rate is better, because it allows a fair comparison of two different measurements.
Ex., Metro is 2 minutes late 10 times = count. Bus is 2 minutes late 25 times = count.
Metro is 2 minutes late 10% of the time = rate. Bus is 2 minutes late 15% of the time = rate.

Predictive Validity: A measurement has predictive validity if it can be used to predict success or performance on tasks that are related to what we measured.
Ex., Do college grades predict university grades? To some extent yes, but motivation and drive are missing from the prediction.
(ref: p. 141) SAT (Scholastic Aptitude Tests) scores and high school grades predict about 34% of the variation in college grades. This means that about one third of the grade differences in college are predicted by high school grades and SAT scores.

Scales or Levels of Measurement:
Nominal Scale: categorizes objects/people. No indication of size. Ex., male (1) and female (2).
Ordinal Scale: reflects larger/smaller but not by how much. Ex., professional tennis rankings.
Interval Scale: reflects differences in equal units. Gives an idea of the size of a difference. Ex., Temperature: 10 & 20 degrees Celsius. They are measured in equal units but one is not twice the other. There is no absolute zero. Zero degrees does not mean an absence of temperature.
Ex., IQ (intelligence quotient), where a person’s IQ is determined in relation to the average in the population. There is no absolute zero IQ.
Ratio Scale: reflects equal units and an absolute zero. Ex., weight in kg or lbs; distance in metres, …
Few in Social Science and Commerce disciplines. Ex., an item costs $.50, then $.25, then zero (it is given away).

Exercises Chapter 8
1. Give a valid way to measure determination.
2. Give an example of a biased measurement.
3. Give an example of a reliable measurement.
4. Identify the scales of measurement for the following:
a) Ice skating scores.
b) Points in a hockey game.
c) Times for speed skating.
d) Gold, silver and bronze medals.
e) Penalties in a hockey match.
5. If you had a Canadian Savings Bond that gave you $100 interest on $2000 per year and a Hydro Quebec Bond that gave you $60 interest on $1000 per year:
a) What is the rate of return on the CSB?
b) What is the rate of return on the Hydro Qc Bond?
c) What is your total rate of return?

Part Two: Organizing Data
1. Graphics, Good and Bad. Chapter 10
Nominal data: Bar Graph & Pie Chart
Ordinal, Interval, Ratio data: Line Graph
2. Displaying Distributions with Graphs. Chapter 11
3. Describing Distributions with Numbers. Chapter 12
Measures of central tendency: median, mean.
Measures of spread of distribution: quartiles, minimum score, maximum score, range, standard deviation.
4. Normal distributions. Chapter 13

Chapter 10 Graphs: Good and Bad
Good graphs represent the data without distorting it.
Categorical variable: Nominal data that reflects categories. Ex., gender.
Quantitative (continuous) variable: Ordinal, interval or ratio. Data that can be averaged.

For nominal (categorical) data use either pie charts or bar graphs. (Ref: p. 176)
[Pie chart: education level of people aged 25-34 in U.S.A. in 1998 (ref: p. 177)]
[Bar graph: Education Level, 25-34 Year Olds in 1998 (ref: p. 177). Vertical axis: Percent (0 to 35); categories: Less H.S., H.S., Some College, Bachelor's, Advanced]

For ordinal, interval or ratio (quantitative) data use a line graph or histogram.
Line Graph: generally plots data over time. Ex., average cost of regular unleaded gas, 1990-2000. (Ref: p. 181)

To evaluate data in line graphs:
1. Look for the overall pattern (trend).
2. Look for deviations from the pattern.
3. Look for seasonal variation that repeats itself each year. Check whether the seasonal variation has been removed by seasonal adjustment. Ex., unemployment rates in January are adjusted to remove the expected increase due to fewer jobs after holiday sales jobs end.

Making good graphs:
1. Keep it simple.
2. Put titles, labels and legends. Titles go above the graph, labels indicate the axes, and legends explain the data if there is more than one variable.
3. Only put the graphic of the data in the plotting area. Don’t clutter with gridlines or fancy graphics.

3. Make a line graph for the following data on the Annual Canadian Consumer Price Index (1986 = 100)

Year                1993   1994   1995   1996   1997
Clothing/Footwear   130.8  131.8  131.8  131.3  133
Transportation      125.7  131.3  138.1  143.5  147.9

Chapter 11 Displaying Distributions with Graphs
Use histograms for ordinal, interval or ratio scale data (quantitative/continuous data).
Making a histogram:
Step 1: Divide the range of the data into classes of equal width. Rule of thumb: use the square root of the number of scores to determine the number of classes. Usually, no fewer than 4 and no more than 9. Don’t use this rule of thumb when there is an established way to divide the data into classes: ex., grades.
Step 2: Count the number of individuals in each class.
Step 3: Draw the histogram.
Horizontal axis is for the variable under study; vertical axis is for frequency. Each bar represents a class. Use bars of equal width.

Ex., Some of the grades in one of my Intro to Psych sections many semesters ago (Total = 30):
72 81 63 64 76 71 85 60 62 81
60 43 84 65 71 84 77 60 70 63
83 81 41 46 75 72 63 90 23 63

Step 1: Divide the range of the data into equal classes.
The grades range from 23 to 90; we can divide the data into 8 classes: 20-29; 30-39; 40-49; 50-59; 60-69; 70-79; 80-89; 90-99.

Step 2: Count the number of individuals in each class.

Classes       20-29  30-39  40-49  50-59  60-69  70-79  80-89  90-99  Total
Individuals   1      0      3      0      10     8      7      1      30

Step 3: Draw the histogram.
[Histogram: Introduction to Psychology Grades. Horizontal axis: Grades (20 to 100); vertical axis: Frequency (0 to 10)]

Interpreting Histograms
1. Must use judgment to determine the number of classes.
2. Outliers: extreme deviations that fall outside of the pattern.
3. Look for the overall pattern and deviations.
4. To find the overall pattern, ignore outliers; look for the center of the distribution and its spread.
5. Describe the shape as simply as possible.
a) Symmetrical: right and left sides are nearly mirror images.
b) Skewed: can be skewed to the right (trails off to the right) or skewed to the left (trails off to the left).

[Histogram skewed to the left: Points in Football Games. Horizontal axis: points (0 to 21); vertical axis: Frequency (0 to 7)]

Comparing Bar Graph and Histogram
[Bar graph: Education Level, 25-34 Year Olds in 1998. Vertical axis: Percent (0 to 35); categories: Less H.S., H.S., Some College, Bachelor's, Advanced]
[Histogram: Highway Gas Mileage for Midsize Cars 2000. Horizontal axis: Mileage (21 to 33); vertical axis: Frequency (0 to 10)]

Chapter 12 Describing Distributions with Numbers
We will see:
A) Median, quartiles, five number summary, boxplots. These involve ordering data and positions. A boxplot is a graphic (picture) of the five number summary.
B) Mean, standard deviation. These do not involve position.
Measures of central tendency are: median, mean.
Measures of spread are: quartiles, standard deviation.
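
The three histogram steps can also be sketched in code. The following Python sketch is an illustration only, not part of the course material: it counts the Intro to Psych grades listed above into the same classes of width 10 and prints a rough text histogram. The variable names and the text-based drawing are assumptions made for the example.

```python
from collections import Counter

# Grades from the Intro to Psych example (30 scores).
grades = [
    72, 81, 63, 64, 76, 71, 85, 60, 62, 81,
    60, 43, 84, 65, 71, 84, 77, 60, 70, 63,
    83, 81, 41, 46, 75, 72, 63, 90, 23, 63,
]

CLASS_WIDTH = 10  # Step 1: classes of equal width (20-29, 30-39, ..., 90-99)

# Step 2: count the individuals in each class, keyed by the class's lower bound.
counts = Counter((grade // CLASS_WIDTH) * CLASS_WIDTH for grade in grades)

# Step 3: draw one bar per class, including the empty classes.
for lower in range(20, 100, CLASS_WIDTH):
    frequency = counts[lower]
    print(f"{lower}-{lower + CLASS_WIDTH - 1}: {'#' * frequency} ({frequency})")
```

The counts should match the class table in Step 2, with the empty 30-39 and 50-59 classes still shown so that the gaps in the distribution stay visible.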
A) Median, quartiles, five number summary and boxplots.
Median: the midpoint of an arranged (ordered from smallest to largest) distribution of data. The 50th percentile. (Percentile: ranking out of 100.)
Calculating the median:
1. Arrange scores from smallest to largest.
2. Use the formula (n + 1)/2 to find the location of the median.
3a. If you have an odd number of scores, the formula will lead you to the median score.
Ex. 1: 2 3 4 5 6 7 8 9 10
Formula: (n + 1)/2 = (9 + 1)/2 = 5 (location of median). Count 5 scores and we get 6. 6 is the median score.
Ex. 2: 2 3 4 5 5 5 6 7 8 8 8
Formula: (n + 1)/2 = (11 + 1)/2 = 6 (location of median). Count 6 scores and we get 5. As there are many 5’s, we must indicate (underline) which 5 is the median score.
3b. If we have an even number of scores, then the formula brings us in between two numbers.
Ex.: 12 13 14 15 16 17
Formula: (n + 1)/2 = (6 + 1)/2 = 3.5 (location of median). Count 3.5 scores and we are between 14 and 15. Then find the average of these two scores: (14 + 15)/2 = 14.5, the median score. This is indicated by putting a mark between the 14 and the 15:
12 13 14 ~ 15 16 17

B) Mean and Standard Deviation
Mean: an average of scores. Symbol: $\bar{x}$, pronounced ‘x-bar’. Sum of scores divided by number of cases.
Ex., 1 2 3 4: 10/4 = 2.5
Sensitive to outliers. Ex., 1 2 3 40: 46/4 = 11.5

Standard Deviation
The most frequently used expression of spread/variability.
It is a measure of the average spread of scores from the mean.
Small standard deviations involve a set of scores that are close to the mean.
Large standard deviations involve a set of scores that are further away from the mean.
It is influenced by outliers (the mean is used to calculate the standard deviation).

The Standard Deviation Formula
$$\text{St. Dev.} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
$\sum$ = sum of everything in parentheses
$x$ = each score
$\bar{x}$ = mean
$n$ = number of scores

To calculate the standard deviation (S.D. or St. Dev.):
Step 1. Find the mean.
Step 2. Find the distance of each score from the mean.
Step 3. Square each result to get rid of negatives.
Step 4. Add up the squared deviations (from the mean).
Step 5. Divide by n - 1. This gives the variance.
Step 6. Find the square root. This gives the St. Dev.

Example: Data set: 1 2 4 6 7
1. Find the mean: 20/5 = 4

2. Deviation     3. Squared Deviation
1 - 4 = -3       3 x 3 = 9
2 - 4 = -2       2 x 2 = 4
4 - 4 = 0        0
6 - 4 = 2        2 x 2 = 4
7 - 4 = 3        3 x 3 = 9
4. Total         26

5. Divide the sum of the squared deviations by n - 1: 26/(5 - 1) = 6.5. This is the variance.
6. Take the square root of the variance: square root of 6.5 = 2.55. This is the standard deviation.

Use medians when there are outliers. Ex., income.
Use means and standard deviations when the distribution appears symmetrical. Ex., test grades, performance on athletic variables that are measured in time.

6. We have used 2 sets of data (7 2 2 1 3 4 5 6 and 7 2 2 1 3 4 50 6) to determine five number summaries and standard deviations. Using the numbers, show the effects of outliers.

CHAPTER 13 NORMAL DISTRIBUTIONS
When a graph depicts the proportion of scores instead of the frequency of scores, it is called a density graph. The proportions add up to 1 (100%). When the density graph is smoothed into a line, it is called a density curve.
• The mean is further towards the tail of the distribution as it takes into account the size of those scores (ex., outliers).
• The median depicts position in a distribution of data only; it is not affected by the more extreme scores.
[Figure: a normal curve and a curve skewed to the right]

Normal Curves/Normal Distributions:
• The most important curve in Social Science and Commerce statistics.
• Many biological variables fall on a normal curve. Ex., height.
• Many psychological variables are ‘forced’ into a normal curve. Ex., I.Q., some psychological inventories.
• Many sociological/economic variables don’t fall into a normal curve. Ex., income, education.

Features of Normal Curves (Normal Distributions):
1. Given the mean and standard deviation, we can draw the normal curve.
2. The mean is the center of the distribution; it cuts it in half. This is also the median or 50th percentile.
3. The curve is symmetrical; one side of the mean mirrors the other.
4. The standard deviation determines the shape of the curve. The smaller the standard deviation, the closer the scores are to one another and the ‘taller’ the curve.

The standard deviation breaks the normal curve into segments that reflect the percent of scores in the set of scores. The 50th percentile is at st. dev. zero.
[Figure: normal curve; horizontal axis: standard deviations from the mean]

The 68-95-99.7 Rule
• 68% of all scores fall between -1 and +1 standard deviation.
• 95% of all scores fall between -2 and +2 standard deviations.
• 99.7% of all scores fall between -3 and +3 standard deviations.
• As the tails of the normal curve do not touch the horizontal axis, we cannot determine the number of standard deviations for 100% of the scores.
• This is to leave room for extreme outliers.
Ex., Women’s height. Mean = 65", St. Dev. = 2.5"

The position of an individual score, expressed in standard deviations from the mean, is called the standard score. The standard score is also known as a z-score. The standard score allows us to determine the percentile rank of that score.

What is the percentile rank of a woman who is 68" tall?
For this we need the Standard Score formula:
$$\text{St. Sc.} = \frac{\text{score} - \text{mean}}{\text{St. Dev.}} = \frac{68 - 65}{2.5} = 1.2$$
A standard score of 1.2 means this woman is 1.2 standard deviations above the mean. It translates into a percentile ranking of 88.49 using Table B (p. 552).

INTRODUCTION TO STATISTICAL INFERENCE
Statistical inference is a technique to make decisions regarding the probability that the population would behave in the same way as the sample. As it is based on probability, the rules of probability must be followed. Therefore, the assumptions which must be met are:
1) Randomness: the predictable pattern of outcomes after very many trials.
1a) If samples are chosen randomly, then the pattern of outcomes is a normal distribution. This is called a sampling distribution.
2) We assume the mean of the normal distribution reflects the mean of the population parameter.

Statistical inference helps us determine how confident we are about where a result falls on the sampling distribution in two ways:
1. Confidence Intervals: how confident we are that our sample’s result captured the population parameter within a certain range (margin of error).
2. Tests of significance: we make a claim about the population and use the sample’s results to test that claim. We want to determine the probability of our claim being right.

CHAPTER 21 WHAT IS A CONFIDENCE INTERVAL?
A confidence interval estimates a population parameter from a sample statistic at a certain level of confidence. Here confidence means the probability of being right. We also referred to it as a Confidence Statement.
We take the sample’s statistic (data) and estimate what the population’s answer would be. This involves how sure we are (confidence level) and the margin of error (the margin where we believe the population’s answer falls).
We will cover how to develop Confidence Statements for:
A) Data given in percents/proportions.
B) Data given in means.
(The only difference is a change of formula.)

$p$ = population parameter
$\hat{p}$ = statistic (result) from a sample
Take any statistic and estimate the probability (confidence level) of it capturing the population parameter within a certain margin of scores (margin of error).

A) When the statistic is given in percents or proportions.
The formula to find a confidence interval for any level of confidence is:
$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
$\hat{p}$ = sample statistic (proportion or percent)
$z^*$ = z score (standard score)
$n$ = number of subjects in the sample

Example: Mayor Villeneuve is two weeks from election day. He wants to know his chances of winning the election. A polling company asks 1000 people who they would vote for if the election were held today and 57% say Villeneuve. Villeneuve wants to be 90% confident that he will win.
$$.57 \pm 1.64 \sqrt{\frac{.57(1 - .57)}{1000}} = .57 \pm .0257$$
or 57% plus and minus 2.6%.

The margin of error is 2.6%. By subtracting it from and adding it to the percent of people who said they would vote for Mayor Villeneuve (57%), we find the range of scores within which we are 90% confident the population parameter lies.

Confidence Statement
Mayor Villeneuve can be 90% confident that between 54.4% and 59.6% of all voters will vote for him if the election were held today. (The "all" reflects the population parameter.)
The confidence statement is the whole sentence; the margin of error (plus or minus 2.6%) gives the range between 54.4% and 59.6%; the confidence level is 90%.

CHAPTER 9 - RELATIONSHIPS: SCATTERPLOTS AND CORRELATIONS
Scatterplots: involve the relationship between two or more quantitative (ordinal, interval, ratio) variables measured on the same individuals/objects. (For our course, we will deal with two variables.)
The graph that depicts this relationship is called a scatterplot.
Sometimes, scatterplots have an explanatory variable (on the horizontal axis) and a response variable (on the vertical axis). The explanatory variable is the independent variable. The response variable is the dependent variable.
Each dot in a scatterplot reflects two pieces of information (variables) about an individual. In this example, the individuals are countries.
The graph depicts the relationship between gross domestic product per person and longevity. (p. 271)

Some scatterplots have no explanatory and response variables; only the relationship between two variables. Ex., the Archaeopteryx: the femur (leg bone) and humerus (arm bone); the size of one does not ‘explain’ or ‘contribute’ to the size of the other. (p. 274)

This scatterplot has a definite shape: as one variable increases, the other tends to increase. This is called a positive association.
[Scatterplot: Association between Ice Cream Sales and Temperature. Horizontal axis: Temperature; vertical axis: Ice Cream Sales]

When one variable decreases and the other increases, it is called a negative association.
[Scatterplot: Ice Cream Price and Sales. Horizontal axis: Price; vertical axis: Sales]

When there is no relationship between the change in one variable and the change in another variable, there is no association.
[Scatterplot: Ice Cream Sales and TV Violence. Horizontal axis: TV Violence Ratings; vertical axis: Ice Cream Sales]

To examine a scatterplot:
1. Look at the overall pattern and any important deviations.
2. Describe the scatterplot using the form, direction and strength of the relationship.
3. Look for outliers.
4. The closer the data are to forming a straight line, the stronger the association.
Ex., the Archaeopteryx: there is a strong positive association between the size of the femur and the humerus, with no outliers.

When the association between two variables is expressed mathematically, it is called a correlation.
Features of Correlations
1. It is expressed as r.
2. The range is from -1.00 to +1.00.
3. -1.00 is a perfect negative correlation; +1.00 is a perfect positive correlation. These are never seen with real data. Zero is no correlation: there is no relationship between the variables.
4. Correlations use standard scores, so we can compute them for any two variables (they don’t have to be in the same unit of measurement).
5. Correlations measure the strength of straight-line (linear) associations between variables.
6. Correlations are affected by outliers. The more data there is, the less an outlier will influence the correlation.

Correlations between:   Are considered:
.8 - 1.00               Very Strong
.6 - .79                Strong
.4 - .59                Moderate
.2 - .39                Weak
0.0 - .19               Very Weak

Causation: the reason something occurs; what makes it happen. Requires experimental research designs where there is a great deal of control of all variables.
Philosophically, causation requires a ‘leap of faith’ from excluding all other possible explanations to granting the independent variable the power to have caused the behavior.

a) Simple Causation: A causes B to happen. Very rare in real life.
Ex., paying students $250 to get 80%+ in a course. This would increase the number of students who get 80%+. If everything else is kept constant, we could say that the $250 had an effect on students’ behavior; it caused an increase in grades.
[Diagram: A → B]

b) Common Response: A causes B and C. Changes in two variables are caused by a third, common, variable.
Ex., July is the season for highest ice cream sales; July is also the month where the most people drown. Ice cream does not cause drowning; the warm weather increases both sales and drownings.
[Diagram: A → B and A → C]

c) Confounding Response: we know two variables cause a change in a third but we don’t know the ‘weight’ of each variable.
Ex., a person smokes and drinks too much. The heart is affected; we know that both contribute but do not know how much each contributes. We need to do experimental research to ‘sort out’ the influences. This helps to isolate each variable’s effect on the heart.

When experimentation is not possible, we can approach causation if the following conditions are met:
1. The association between two variables is strong.
2. The association between two variables is consistent.
3. The alleged cause precedes the effect.
4. The alleged cause is plausible.
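
The hand calculations in these notes can be checked with a short script. The following is a minimal sketch in Python, not part of the original course material. It reuses the numbers from the worked examples (the data set 1 2 4 6 7, the woman who is 68 inches tall, and the Villeneuve poll). For the correlation, it uses a small made-up set of temperatures and ice cream sales that is purely hypothetical, since the notes give no data for that figure. The function names are illustrative assumptions, and Table B is replaced here by the normal distribution in Python's standard library.

```python
import math
import statistics


def standard_score(score, mean, st_dev):
    """Number of standard deviations a score lies from the mean (z-score)."""
    return (score - mean) / st_dev


def proportion_interval(p_hat, n, z_star):
    """Confidence interval for a proportion: p_hat +/- z* sqrt(p_hat(1 - p_hat)/n)."""
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


def correlation(xs, ys):
    """Correlation r as the average product of standard scores (dividing by n - 1)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    products = [
        standard_score(x, mean_x, sd_x) * standard_score(y, mean_y, sd_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / (len(xs) - 1)


# Standard deviation example from the notes (statistics.stdev divides by n - 1).
st_dev = statistics.stdev([1, 2, 4, 6, 7])

# Standard score and percentile rank for a woman who is 68 inches tall.
z = standard_score(68, 65, 2.5)
percentile = 100 * statistics.NormalDist().cdf(z)

# 90% confidence interval for the Villeneuve poll, using z* = 1.64 as in the notes.
low, high, margin = proportion_interval(0.57, 1000, 1.64)

# Hypothetical data, invented for illustration only.
temperature = [10, 12, 14, 16, 18, 20]
ice_cream_sales = [2, 3, 5, 6, 8, 9]
r = correlation(temperature, ice_cream_sales)

print(f"St. Dev. = {st_dev:.2f}")
print(f"Standard score = {z:.1f}, percentile = {percentile:.2f}")
print(f"Margin of error = {margin:.4f}, interval = {low:.3f} to {high:.3f}")
print(f"r = {r:.3f}")
```

Run as written, the first two lines of output should reproduce the standard deviation of 2.55, the standard score of 1.2 and the percentile rank of 88.49 from the notes, and the margin of error comes out at about 2.6 percentage points. The correlation function divides by n - 1 to match the standard deviation formula used in the notes; the made-up data give a strong positive r only because they were chosen to rise together.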