Quantitative Data Analysis
Chapter 8

What is Quantitative Analysis?
• Quantitative analysis is a scientific approach to answering questions.
• Raw data are processed and manipulated, resulting in meaningful information.
Raw Data → Quantitative Analysis → Meaningful Information

Quantitative data analysis
Making sense of the numbers for meaningful interpretation. It involves:
1. Organizing the data
2. Doing the calculations
3. Interpreting the information
4. Explaining limitations

What are the Options for Summarizing Distributions?
• Measures of Central Tendency:
  • Mode
  • Median
  • Mean

What are the Options for Summarizing Distributions?
• Measures of Variation:
  • Range
  • Interquartile range
  • Variance
  • Standard deviation

The Mode
The most frequent value in a distribution.
[Figure: "Respondent's Religious Preference (GSS94)"; y-axis: Count; x-axis: RS RELIGIOUS PREFERENCE (Protestant, Catholic, Jewish, None, Other)]
In a distribution of Americans' religious affiliations, Protestant Christian is the most frequently occurring value, that is, the largest single group.

The Median
The position average, or the point that divides the distribution in half (the 50th percentile).

HIGHEST YEAR OF SCHOOL COMPLETED
          Value   Frequency   Percent   Valid Percent   Cumulative Percent
Valid     0             4       0.1         0.1               0.1
          1             1       0.0         0.0               0.2
          2             3       0.1         0.1               0.3
          3             6       0.2         0.2               0.5
          4            12       0.4         0.4               0.9
          5            15       0.5         0.5               1.4
          6            19       0.6         0.6               2.0
          7            29       1.0         1.0               3.0
          8           109       3.6         3.6               6.6
          9            85       2.8         2.8               9.5
          10          102       3.4         3.4              12.9
          11          168       5.6         5.6              18.5
          12          929      31.0        31.1              49.6
          13          277       9.3         9.3              58.9
          14          321      10.7        10.7              69.6
          15          146       4.9         4.9              74.5
          16          433      14.5        14.5              89.0
          17           97       3.2         3.2              92.2
          18          119       4.0         4.0              96.2
          19           46       1.5         1.5              97.8
          20           64       2.1         2.1              99.9
          DK            3       0.1         0.1             100.0
          Total      2988      99.9       100.0
Missing   NAP           4       0.1
Total                2992     100.0

The median in a frequency distribution is determined by identifying the value corresponding to a cumulative percentage of 50. In this table the cumulative percentage is 49.6 at 12 years and 58.9 at 13 years, so the 50th percentile falls at 13 years.

The Mean
The mean is just the arithmetic average.
Mean = Sum of values of cases / Number of cases

The Mean, cont'd
For example, to calculate the mean value of eight cases, we add the values of all the cases and divide by the number of cases (N):
(28 + 117 + 42 + 10 + 77 + 51 + 64 + 55) / 8 = 444 / 8 = 55.5

Measures of Variation
It is important to know that the median household income in the United States is a bit over $40,000 a year, but we also need to know the variation in income: incomes range from zero up to hundreds of millions of dollars. Measures of variation capture how widely or densely spread income (for instance) is.

Measures of Variation
• Four popular measures of variation for quantitative variables are the range, the interquartile range, the variance, and the standard deviation (which is the single most popular measure of variability).

The Range
The simplest measure of variation, calculated as the highest value in a distribution minus the lowest value, plus 1:
Range = Highest value - Lowest value + 1
It is often important to report the range of a distribution, to identify the whole range of possible values that might be encountered.

The Range, cont'd
Say that you surveyed 10 people and asked them how many times they saw the movie Star Wars, and their answers looked like this:
Number of times respondent saw Star Wars: 0 2 2 3 4 4 5 20 2 1
The range for "times respondent saw Star Wars" is 20 - 0 + 1 = 21. However, since the range can be drastically altered by just one exceptionally high or low value (termed an outlier), it is not a good summary measure for most purposes.

Interquartile Range
The interquartile range avoids the problem created by outliers by showing the range where most cases lie. Quartiles are the points in a distribution corresponding to the first 25% of the cases, the first 50% of the cases, and the first 75% of the cases.

Interquartile Range
• In the Star Wars example (number of times respondents saw Star Wars):
• The first 25% of cases fall within the range of 0 to 1.75 times.
• The second quartile falls within the range of 1.75 to 2.5 times.
• The third quartile falls within 2.5 to 4.25 times.
• The last quartile is between 4.25 and 20 times.

Interquartile Range, cont'd
The interquartile range is the difference between the first quartile and the third quartile (plus 1). In our Star Wars example, the interquartile range is 4.25 - 1.75 + 1 = 3.50.

The Variance
By statistical definition, the variance is the average squared deviation of each case from the mean:
• You take each case's distance from the mean,
• square that number,
• and take the average of all such numbers.

Variance
• Takes into account the amount by which each case differs from the mean.
• Is affected by outliers, such as the person who saw Star Wars 20 times.
• Is mainly useful for computing the standard deviation, which comes next.

The Standard Deviation
The standard deviation is simply the square root of the variance. It is the square root of the average squared deviation of each case from the mean:
$$s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{N}}$$
Symbol key: $\bar{Y}$ = mean; $N$ = number of cases; $\sum$ = sum over all cases; $Y_i$ = value of case $i$ on variable $Y$; $\sqrt{\ }$ = square root.

Standard Deviation
Standard deviation has mathematical properties that make it the preferred measure of variability in many cases, particularly when a variable is normally distributed.
[Figure: histogram of Scores; Std. Dev = 12.67, Mean = 75.0, N = 25.00]
A graph of a normal distribution looks like a bell, with one "hump" in the middle, centered around the population mean, and the number of cases tapering off on both sides of the mean.

Normal Distribution
A normal distribution is symmetric: if you folded it in half at its center (at the population mean), the two halves would match perfectly.
If a variable is normally distributed, about 68% of the cases (roughly two-thirds) will lie within 1 standard deviation of the distribution's mean, and 95% of the cases will lie within 1.96 standard deviations of the mean.

Normal Distribution
• The normal curve is a tool a statistician can use to tell how far the sample is likely to be off from the overall population, i.e., how big a "margin of error" there is likely to be in his or her poll.

Different Statistics for Different Data
                      Nominal   Ordinal   Interval/Ratio
Mode                     X         X            X
Median                             X            X
Mean                               X            X
Range                              X            X
Interquartile Range                             X
Variance                                        X
Standard Deviation                              X

Relationships between variables
• Crosstabulation (cross tabs) displays the distribution of one variable for each category of another variable.
• Also known as a bivariate distribution.
• Cross tabs are presented first with frequencies and then with percentages.

Crosstabulation of Voting in 2000 by Family Income: Cell Counts and Percentages

FAMILY INCOME: CELL COUNTS
Voting          <$20,000   $20,000-$34,999   $35,000-$59,999   $60,000+
Voted              178           239               364            761
Did not vote       182           135               168            193
Total (n)         (360)         (374)             (532)          (954)

FAMILY INCOME: PERCENTAGES
Voting          <$20,000   $20,000-$34,999   $35,000-$59,999   $60,000+
Voted              49%           64%               68%            80%
Did not vote       51%           36%               32%            20%
Total             100%          100%              100%           100%

Source: General Social Survey, 2004. Weighted.

Summary statistics describe particular features of a distribution and facilitate comparison among distributions. The next step is to test for associations . . .

Which calculation do I use?
It depends upon what you want to know.
Do you want to know how many individuals checked each answer?
→ Frequency
Do you want the proportion of people who answered in a certain way?
→ Percentage
Do you want the average number or average score?
→ Mean
Do you want the middle value in a range of values or scores?
→ Median
Do you want to show the range in answers or scores?
→ Range
Do you want to compare one group to another?
→ Cross tab
Do you want to report changes from pre to post?
→ Change score
Do you want to show the degree to which a response varies from the mean?
→ Standard deviation
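
Worked example (a sketch, not part of the original slides)
Several of the calculations named above (mode, median, mean, range, interquartile range, variance, standard deviation) can be tried out on the Star Wars answers with Python's standard statistics module. The function name describe is hypothetical. The "+ 1" in the range and interquartile range follows the convention used in these slides. The "exclusive" quantile method is an assumption: it was picked because it reproduces the quartile values quoted in the slides (1.75, 2.5, 4.25), and other software may use a different rule and give slightly different quartiles.

```python
import statistics


def describe(values):
    """Summarize a list of numbers with the measures named in the slides.

    `describe` is an illustrative helper, not a function from the slides.
    """
    # Quartile rule is an assumption: "exclusive" happens to reproduce
    # the quartiles quoted in the slides for the Star Wars data.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "mode": statistics.mode(values),          # most frequent value
        "median": statistics.median(values),      # 50th percentile
        "mean": statistics.mean(values),          # sum of values / number of cases
        # Slide convention: highest value - lowest value + 1
        "range": max(values) - min(values) + 1,
        "quartiles": (q1, q2, q3),
        # Slide convention: third quartile - first quartile + 1
        "iqr": q3 - q1 + 1,
        # Population forms (divide by N), matching the slide formula
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }


# Number of times each of the 10 respondents saw Star Wars (from the slides)
star_wars = [0, 2, 2, 3, 4, 4, 5, 20, 2, 1]
summary = describe(star_wars)

print("range:", summary["range"])          # 21, as in the slides
print("quartiles:", summary["quartiles"])  # (1.75, 2.5, 4.25)
print("IQR:", summary["iqr"])              # 3.5, as in the slides
print("variance:", summary["variance"])
print("std dev:", summary["std_dev"])

# The eight-case mean example from the slides
eight_cases = [28, 117, 42, 10, 77, 51, 64, 55]
print("mean of eight cases:", statistics.mean(eight_cases))  # 55.5
```

Dropping the respondent who answered 20 and calling describe again is a quick way to see how strongly that one outlier affects the range, the variance, and the standard deviation, while the median barely moves.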