Chapter 3: Statistics for Describing, Exploring, and Comparing Data

Chapter Problem: A common belief is that women talk more than men. Is that belief founded in fact, or is it a myth? Data Set 8 in Appendix B includes word counts from different sample groups. The results provided by the researchers show that the sample mean for males is 15,668.5 words per day (sample size 186), and the sample mean for females is 16,215.0 words per day (sample size 210).

Chapter 3-1 Overview and 3-2.1 Measures of Center

3-1 Overview: Discuss the characteristics of a data set: CVDOT (center, variation, distribution, outliers, time).

Statistics:
• Descriptive statistics: summarize or describe the characteristics of a data set; Chapters 2 and 3 discuss the fundamental principles of descriptive statistics.
• Inferential statistics: use sample data to make inferences (or generalizations) about a population; this is the focus of later chapters.

3-2 Measures of Center

Objective: Discuss the characteristic "center", the mean and median of a data set, and the effect of outliers on the mean and median.

Definition: A measure of center is a value at the center or middle of a data set.

Definition: The arithmetic mean of a set of values is the sum of the data values divided by the total number n of values. We call it the "mean" (meaning arithmetic mean), and the sample mean is denoted by $\bar{x}$ (x-bar):

$\bar{x} = \frac{\sum x}{n}$ = (sum of all sample values) / (number of sample values)

3-2.2 Mean, Median & Notations

The mean is relatively reliable: because it takes every data value into account, means of samples drawn from the same population don't vary as much as other measures of center. A disadvantage is that the mean is sensitive to every value, especially outliers.
Notation:
• $\sum$ (Greek letter sigma) denotes the sum of a set of values
• x is the variable for an individual data value
• n is the number of values in a sample
• N is the number of values in a population
• $\bar{x} = \frac{\sum x}{n}$ is the mean of a set of sample values
• $\mu = \frac{\sum x}{N}$ is the mean of all values in a population

The median $\tilde{x}$ (x-tilde) of a data set is the measure of center that is the middle value when the original data values are arranged in ascending order.
• If the number of values is odd, the median is the value exactly in the middle of the list.
• If the number of values is even, the median is the mean (average) of the two middle values.
Note: The median is not affected by outliers.

3-2.3 Example of mean and median

Example 1: Monitoring Lead in Air
To find the median, first arrange the data in ascending order.
• The data are 5.4, 1.10, 0.42, 0.73, 0.48, 1.10. The mean is 1.538 μg/m³.
• Sorted: 0.42, 0.48, 0.73, 1.10, 1.10, 5.4 (6 values). The median is (0.73 + 1.10)/2 = 0.915 μg/m³.
• With one more value: 0.42, 0.48, 0.66, 0.73, 1.10, 1.10, 5.4 (7 values). The median is 0.73 μg/m³.
These examples show that the median is not sensitive to extreme values; the median is often used for data sets with a few extreme values.

Example 2: Find the mean and median of the word counts from 5 men: 27,531; 15,684; 5,638; 27,997; and 25,433. (Answer: mean 20,456.6; median 25,433.) Then find the mean and median when an additional count of 8,077 words is included.
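The answers to these two examples can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; it uses the standard-library `statistics` module, and the variable names are only illustrative.

```python
import statistics

# Lead-in-air measurements from the example above
lead = [5.4, 1.10, 0.42, 0.73, 0.48, 1.10]

# Word counts from the 5 men in Example 2
words = [27531, 15684, 5638, 27997, 25433]

# Mean: sum of the values divided by the number of values
lead_mean = statistics.mean(lead)
# Median: with 6 values, the average of the two middle values after sorting
lead_median = statistics.median(lead)
print(round(lead_mean, 3), round(lead_median, 3))

# With 5 values the median is the single middle value
print(statistics.mean(words), statistics.median(words))

# Include the additional count of 8,077 words (now 6 values)
words_plus = words + [8077]
print(round(statistics.mean(words_plus), 1), statistics.median(words_plus))
```

Note how the single large value 5.4 pulls the mean of the lead data well above the median.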
(Answer with 8,077 included: mean 18,393.3; median 20,558.5.)

3-2.3 Mode and Midrange

The mode of a data set is the value that occurs most frequently.
• Bimodal: two values occur with the same greatest frequency; each one is a mode.
• Multimodal: more than two values occur with the same greatest frequency; each one is a mode.
• No mode: no value is repeated.

The midrange is the measure of center calculated as the average of the minimum and maximum values of the data set:

Midrange = (max + min)/2

• Example from lead in the air: Midrange = (5.4 + 0.42)/2 = 2.910 μg/m³
• Example from word counts: 27,531; 15,684; 5,638; 27,997; 25,433; Midrange = ?

The midrange is rarely used, since it uses only the maximum and minimum and is therefore too sensitive to those extremes. It is, however, easy to compute, and it is one of the values that can define the "center" of a data set. The midrange is different from the mean.

Round-off rule: Carry one more decimal place than is present in the original set of values.

3-2.4 Examples

Example 1: Comparison of ages of Best Actresses and Best Actors

            Best Actresses   Best Actors
Mean        35.7             43.9
Median      33.5             42
Mode        35               41 and 42
Midrange    50.5             52.5

What do these data tell us? The measures of center suggest that Best Actresses are younger than Best Actors. In Ch 9, we will discuss methods for determining whether such differences are statistically significant.

Example 2: Find the mean, median, mode, and midrange for the randomly selected cans of Coke: 12.3, 12.1, 12.2, 12.3, 12.2. (Answer: mean 12.22; median 12.20; modes 12.2 and 12.3; midrange 12.2.)

3-2.5 More Examples

The following examples identify a major reason why the mean and median are not always meaningful statistics that accurately and effectively serve as measures of center.
Find the mean and median of each of the following:
• Zip codes: 12601, 90210, 02116, 76177, 19102
• Ranks of stress level from different jobs: 2, 3, 1, 7, 9
• Survey respondents coded as 1 (Democrat), 2 (Republican), 3 (Liberal), 4 (Conservative), or 5 (other)
• Mean salaries of secondary school teachers from the 50 states: $37,200, $49,400, $40,000, …, $37,800. The mean of these values is $42,210, but is this the mean salary of all secondary school teachers in the U.S.? Why or why not?

The last example does not take into account the number of secondary school teachers in each state. The mean for all secondary school teachers in the U.S. is $45,200, not $42,210.

3-2.6 Mean from a Frequency Distribution

The mean from a frequency distribution is defined by Formula 3-2:

$\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$

where x is the class midpoint and f is the frequency for that class.

Example:

Age of actress   Frequency f   Class midpoint x   f · x
21 - 30          28            25.5               714
31 - 40          30            35.5               1065
41 - 50          12            45.5               546
51 - 60          2             55.5               111
61 - 70          2             65.5               131
71 - 80          2             75.5               151
Totals           76                               2718

$\bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{2718}{76} = 35.8$

You can use a TI calculator: enter the midpoints in L1 and the frequencies in L2, then calculate the mean.

3-2.7 Weighted Mean

Weighted mean: used when the values have different degrees of importance (weights w); it is defined by Formula 3-3:

$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$

Example: weighted mean of 3 test scores (85, 90, 75) with weights Test 1: 20%, Test 2: 30%, Test 3: 50%:

$\bar{x} = \frac{\sum (w \cdot x)}{\sum w} = \frac{(20 \times 85) + (30 \times 90) + (50 \times 75)}{20 + 30 + 50} = \frac{8150}{100} = 81.5$

3-2.8 Best measure of Center?
Use the mean most often, then the median.

Measures of center:
• Mean: find the sum of all values, then divide by the number of values. (1) It is sensitive to extreme values. (2) Sample means tend to vary less than other measures of center.
• Median: sort the data. With an odd number of values, the median is the value in the exact middle; with an even number of values, add the 2 middle numbers, then divide by 2. The median is a good choice if there are some extreme values.
• Mode: the value that occurs most frequently. The mode is good for data at the nominal level of measurement.
• Midrange: (max + min)/2. The midrange is rarely used.

3-2.9 Skewness of data

Definition: A distribution of data is skewed if it is not symmetric and extends more to one side than the other.
• Skewed to the left (negatively skewed; longer left tail): the mean and the median are to the left of the mode.
• Skewed to the right (positively skewed; longer right tail): the mean and the median are to the right of the mode.
• Symmetric (zero skewness): the left half of the histogram is roughly a mirror image of its right half, and mean = median = mode.

3-2.10 Summary and Homework #8 Section 3-2

We have learned the types of measures of center of a data set, the mean from a frequency distribution, the weighted mean, the best measure of center, and skewness. The mean and median cannot always be used to identify the shape of the distribution.

Question: What is the highest point of the graph, whether the distribution is symmetric or skewed?

HW #8, Pages 94-96, #5-17 odd, 33-34 (answer for 34: mean = 84.8, grade = B)

3-3.1 Measures of Variation

Objective 3-3: Learn the characteristics of variation, such as the standard deviation and variance. Learn how to use a data set to find the values of the range and standard deviation, how to interpret values of the standard deviation, and the reasons for using the standard deviation.

Definition: The range of a set of data is the difference between the maximum value and the minimum value.

Range = (maximum value) - (minimum value)

The range is not very useful, since it depends only on the maximum and minimum values (i.e.
it is extremely sensitive to extreme values).

Definition: The standard deviation of a set of sample values is a measure of the variation of the values about the mean.

Formula 3-4 (sample standard deviation):

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Formula 3-5 (shortcut formula for the sample standard deviation; the form used by calculators and computer programs):

$s = \sqrt{\frac{n \sum (x^2) - (\sum x)^2}{n(n - 1)}}$

3-3.2 Properties of Standard Deviation (S.D.)

1. S.D. is a measure of variation of all values from the mean.
2. S.D. is always ≥ 0. It is zero only when all of the data values are the same; larger S.D. values indicate greater amounts of variation.
3. S.D. can increase dramatically with the inclusion of one or more outliers.
4. The units of the S.D. s are the same as the units of the original values, e.g. minutes, feet, pounds, etc.

Procedure for finding the standard deviation with Formula 3-4:
1. Compute the mean $\bar{x}$.
2. Subtract the mean from each value: $x - \bar{x}$.
3. Square each difference: $(x - \bar{x})^2$.
4. Add all of the squares: $\sum (x - \bar{x})^2$.
5. Divide the total by n - 1 (i.e. one less than the number of values): $\frac{\sum (x - \bar{x})^2}{n - 1}$.
6. Find the square root of the result of step 5: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$.

Example: Find the standard deviation of the waiting times from the multiple lines. Those times (in minutes) are 1, 3, 14.

3-3.3 Standard Deviation of a Population

The standard deviation of a population uses the same formula as the sample standard deviation, except that it uses the population mean $\mu$ and divides by N (the population size). The population standard deviation is defined as

$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

Since we generally deal with sample data, we usually use Formula 3-4:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

3-3.4 Variance of a Sample and Population

Definition: The variance of a set of values is a measure of variation equal to the square of the standard deviation.
• Sample variance: $s^2$, the square of the sample standard deviation
• Population variance: $\sigma^2$, the square of the population standard deviation
• $s^2$ is called an unbiased estimator of the population variance $\sigma^2$

Example: Use the waiting times of 1 min, 3 min, and 14 min to find the variance of the waiting times.

Q: Is smaller variance better?
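The waiting-time example can be checked numerically. The following Python sketch (an illustration, not part of the original slides) applies Formula 3-4 and the shortcut Formula 3-5 to the times 1, 3, 14 and compares the results with the standard-library `statistics` module; the variable names are only illustrative.

```python
import math
import statistics

times = [1, 3, 14]  # waiting times in minutes, from the example above

n = len(times)
mean = sum(times) / n  # step 1: compute the mean

# Formula 3-4: square root of the sum of squared deviations divided by n - 1
s_def = math.sqrt(sum((x - mean) ** 2 for x in times) / (n - 1))

# Formula 3-5 (shortcut): uses only n, the sum of x, and the sum of x squared
sum_x = sum(times)
sum_x2 = sum(x * x for x in times)
s_short = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(s_def, s_short)                # both formulas give the same s
print(statistics.stdev(times))       # library sample standard deviation
print(statistics.variance(times))    # sample variance s^2 (divides by n - 1)
print(statistics.pstdev(times))      # population formula (divides by N)
```

Both formulas give s = 7.0 min, so the sample variance is 49 min². Treating the same three values as a whole population (dividing by N instead of n - 1) gives a smaller value, about 5.7 min.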
Note: The units of the variance are different from the units of the original data set (they are the squares of those units); the standard deviation has the same units as the data set.

Notation:
• s = sample standard deviation
• $s^2$ = sample variance
• $\sigma$ = population standard deviation
• $\sigma^2$ = population variance
• SD = standard deviation
• VAR = variance

Round-off rule: Carry one more decimal place than the original set of data for the final answer (don't round off in the middle of a calculation).

3-3.5 Why learn Standard Deviation and interpretation

The standard deviation measures the variation among values. A small standard deviation means the values are close together, while a large standard deviation means the values are spread farther apart.

Range Rule of Thumb to estimate standard deviation
• The range rule of thumb is used to roughly estimate the standard deviation. It is based on the principle that for most data sets the vast majority (such as 95%) of sample values lie within 2 standard deviations of the mean: $s \approx \frac{\text{range}}{4}$, where range = max - min.
• If the standard deviation s is known, we can use it to estimate the minimum and maximum "usual" sample values:
  Minimum "usual" value = (mean) - 2 × (standard deviation)
  Maximum "usual" value = (mean) + 2 × (standard deviation)

Example 1: IQ test: the mean is 100 and the S.D. is 15, so the minimum usual value is 70 and the maximum usual value is 130.
Interpretation: Based on these results, we expect typical IQ scores to fall between 70 and 130. How do you interpret an IQ of 65 or an IQ of 135?

Example 2: Pulse rates of women: the mean is 76 and the S.D. is 12.5, so the minimum usual value is 51 beats/min and the maximum usual value is 101 beats/min.
Interpretation: Typical pulse rates of women are from 51 to 101 beats/min. A pulse rate of 110 would be unusual, since 110 is outside those limits.

3-3.6 Empirical (or 68-95-99.7) Rule for data with normal distribution

Empirical rule: a data set having a distribution that is approximately bell-shaped has the following properties:
• About 68% of all values fall within 1 standard deviation of the mean, i.e. between (mean - s) and (mean + s)
• About 95% of all values fall within 2 standard deviations of the mean, i.e.
between (mean - 2s) and (mean + 2s)
• About 99.7% of all values fall within 3 standard deviations of the mean, i.e. between (mean - 3s) and (mean + 3s)

Example: IQ scores have a mean of 100 and a standard deviation of 15. What percentage of IQ scores are between 70 and 130?

3-3.7 Chebyshev's Theorem

The proportion (or fraction) of any data set lying within K standard deviations of the mean is always at least $1 - \frac{1}{K^2}$, where K > 1.

Example (IQ scores):
• When K = 2, at least 3/4 (75%) of all values lie within 2 standard deviations of the mean: at least 75% of people have an IQ between 70 and 130.
• When K = 3, at least 8/9 (about 89%) of all values lie within 3 standard deviations of the mean: at least 89% of people have an IQ between 55 and 145.

Comparison (IQ scores using the empirical rule):
• About 68% of people have an IQ between 85 and 115 (1 SD from the mean)
• About 95% of people have an IQ between 70 and 130 (2 SD from the mean)
• About 99.7% of people have an IQ between 55 and 145 (3 SD from the mean)

3-3.8 Coefficient of Variation in Different Populations

The coefficient of variation (CV) for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean:

Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$

Example: Heights and Weights of Men (Data Set 1 in Appendix B)
• For heights: mean $\bar{x}$ = 68.34 in, s.d. = 3.02 in
• For weights: mean $\bar{x}$ = 172.55 lb, s.d. = 26.33 lb
We want to compare the variation among heights to the variation among weights.
• Heights: CV = 4.42%; weights: CV = 15.26%
Interpretation: Heights have considerably less variation than weights. Does this make sense?

3-3.9 Summary and HW #9 (3-3)

Range rule of thumb:
• $s \approx \frac{\text{range}}{4}$
• Min (usual) = mean - 2s
• Max (usual) = mean + 2s

Empirical rule (only applicable to normal (bell-shaped) distributions):
• 68% within 1 S.D. means data values are between (mean - s) and (mean + s)
• 95% within 2 S.D.
means data values are between (mean - 2s) and (mean + 2s)
• 99.7% within 3 S.D. means data values are between (mean - 3s) and (mean + 3s)

Chebyshev's Theorem helps to approximate the values of a data set (it is applicable to any data set, but has limited usefulness).

HW #9 (3-3) Pp. 110-113, #5-11 odd, 17, 31-35 odd

3-4.1 Measure of Relative Standing and Boxplots

Objective: Learn the measures that can be used to compare values from the same or different data sets: z scores (including converting data values to z scores), quartiles, percentiles, and boxplots.

Definition: A z score (or standardized value) is the number of standard deviations that a given value x is above or below the mean.

For sample data: $z = \frac{x - \bar{x}}{s}$
For population data: $z = \frac{x - \mu}{\sigma}$
Round z to two decimal places.

Example: A man is 76.2 in tall and weighs 237.1 lb. Find the z scores for his height and weight (mean height = 68.34 in, s.d. = 3.02 in; mean weight = 172.55 lb, s.d. = 26.33 lb). His z score is 2.60 for height and 2.45 for weight.
Interpretation: The man's height is 2.60 standard deviations above the mean height, and his weight is 2.45 standard deviations above the mean weight. The height is more extreme than the weight.

Example: Lyndon Johnson, 75" (mean 71.5", S.D. 2.1"); Shaquille O'Neal, 85" (mean 80", S.D. 3.3"). Interpretation: ?

3-4.2 z-score and unusual values

• Using the range rule of thumb, a value is "unusual" if it is more than 2 S.D. from the mean: min (usual) = mean - 2s, and max (usual) = mean + 2s.
• Using z scores, a value is "unusual" if its z score is less than -2 or greater than +2:
  Ordinary values: -2 ≤ z score ≤ +2
  Unusual values: z score < -2 or z score > +2
• z scores are measures of position relative to the mean: a z score of +2 means 2 standard deviations above the mean, and a z score of -3 means 3 standard deviations below the mean.

Example: Over the past 30 years, heights of basketball players at Newport University have a mean of 74.5 in and a s.d. of 2.5 in. The latest recruit has a height of 79.0 in.
• Find the z score.
• Is the height of 79.0 in unusual among the heights of players over the past 30 years? Why or why not?
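The z score calculation and the "unusual" cutoff can be sketched in Python as follows. This is only an illustration, not part of the original slides; the function names `z_score` and `is_unusual` are hypothetical helpers introduced here.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def is_unusual(z):
    """A value is unusual if its z score is less than -2 or greater than +2."""
    return z < -2 or z > 2


# Basketball recruit: height 79.0 in, mean 74.5 in, s.d. 2.5 in
z_recruit = round(z_score(79.0, 74.5, 2.5), 2)
print(z_recruit, is_unusual(z_recruit))   # 1.8 False

# Man's weight from 3-4.1: 237.1 lb, mean 172.55 lb, s.d. 26.33 lb
z_weight = round(z_score(237.1, 172.55, 26.33), 2)
print(z_weight, is_unusual(z_weight))     # 2.45 True
```

Under this rule the recruit's height is within the ordinary range, while the man's weight in 3-4.1 is unusual.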
3-4.3 Percentiles

Definition: Percentiles are one type of quantiles (or fractiles), which partition data into groups with roughly the same number of values in each group.

Percentiles are measures of location. There are 99 percentiles, denoted by P1, P2, P3, …, P99, which divide a set of data into 100 groups with about 1% of the values in each group.

Example: The 50th percentile, denoted by P50, has about 50% of the data values below it and about 50% of the data values above it; the 50th percentile is the same as the median.

The formula for the percentile of a value x is (round the result to the nearest whole number):

$\text{percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$

Going the other way, the location of the kth percentile is

$L = \frac{k}{100} \cdot n$ (equivalently, $k = \frac{L}{n} \cdot 100$)

where
• n = total number of values in the data set
• k = percentile being used
• L = locator that gives the position of a value

Example: Find the percentile for the value of $29 million in Table 3-4 in the text (values in millions of dollars):

4.5 5 6.5 7 20 20 29 30 35 40 40 41 50 52 60 65 68 68 70 70 70 72 74 75 80 100 113 116 120 125 132 150 160 200 225

3-4.4 Converting from kth percentile to the corresponding data value

1. Sort the data (arrange the data from low to high).
2. Compute L = (k/100)n, where n = number of values and k = percentile.
3. Is L a whole number?
   • Yes: the value of Pk is the mean of the values at the Lth location and the (L+1)th location.
   • No: change L by rounding it up to the next whole number; the value of Pk is then the Lth value, counting from the lowest.

Example: Find the 17th percentile of the previous data set.

3-4.5 Example: Setting Speed Limits

The table lists the recorded speeds (in miles per hour), randomly selected on Highway 405:

68 68 72 73 65 74 73 72 68 65
65 73 66 71 68 74 66 71 65 73
59 75 70 56 66 75 68 75 62 72
60 73 61 75 58 74 60 73 58 75

• Find the 85th percentile of the listed speeds.
• Given that speed limits are usually rounded to a multiple of 5, what speed limit is suggested by these data? Explain your choice.
• Does the existing speed limit on Highway 405 conform to the 85th percentile rule (i.e.
the speed limit is set so that 85% of drivers are at or below the speed limit)?

3-4.6 Quartiles

Definition: Quartiles are measures of location, denoted by Q1, Q2, and Q3, which divide a data set into four groups with about 25% of the values in each group (percentiles divide the data into 100 groups).

The three quartiles Q1, Q2, and Q3 divide the sorted data values into 4 equal parts:
• Q1 (first quartile) separates the bottom 25% from the rest
• Q2 (second quartile) separates the bottom 50% from the rest (Q2 is also the median and the 50th percentile)
• Q3 (third quartile) separates the bottom 75% from the rest

Related statistics:
• Interquartile range (IQR) = Q3 - Q1
• Semi-interquartile range = (Q3 - Q1)/2
• Midquartile = (Q3 + Q1)/2
• 10-90 percentile range = P90 - P10

Example: Find the values of Min, Q1, Q2, Q3, Max, and IQR for the movie budgets.

3-4.8 5-Number Summary and Boxplot

Definition: For a set of data, the 5-number summary consists of (1) the minimum, (2) Q1, (3) Q2 (the median), (4) Q3, and (5) the maximum.

A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line extending from the minimum to the maximum and a box with lines drawn at Q1, the median, and Q3 (summary and example in 3-4.9).

A boxplot is useful for revealing:
• The center of the data
• The spread of the distribution of the data
• The presence of outliers

An outlier is a value that is located far away from almost all of the other values, an extreme value that falls outside the general pattern. A data value x is an outlier if x - Q3 > 1.5 × IQR or Q1 - x > 1.5 × IQR.

An outlier can have a dramatic effect on:
• The mean
• The standard deviation
• The scale of the histogram, so that the true nature of the distribution is totally obscured

3-4.9 Procedures for Constructing a Boxplot, HW #10

• Find the 5-number summary: the min, Q1, median, Q3, and the max.
• Construct a scale with values that include the min and max data values.
• Construct a box (rectangle) extending from Q1 to Q3, and draw a line in the box at the median value.
• Draw lines extending outward from the box
to the min and max data values.

Example on the board: use the movie budgets to find the 5-number summary.

Boxplots don't show as much detailed information as histograms or stem-and-leaf plots, so they are not the best choice when dealing with a single data set; but they are great for comparing different data sets (use the same scale).

Do women really talk more than men? Use the 5-number summary:

         Min    Q1      Q2      Q3      Max
Men      695    10009   14290   20565   47016
Women    1674   11010   15917   20571   40055

Read Table 3-3, Comparison of word counts of men and women, for the "mean", "median", "midrange", "range", and "S.D."

Example: Here are measured reaction times (in seconds) in a test of driving skills: 2.4, 2.5, 2.8, 2.0, 2.4, 2.9, 3.2, 3.5, 2.7, 2.7, 2.8, 2.6. Find the five-number summary.

HW #10: P. 127-128, #1, 5-7, 9, 13, 15, 19, 23, 27

Review (1)

Ch 1: You should be able to do the following:
• Distinguish between a population and a sample, and between a parameter and a statistic
• Understand the importance of good experimental design, including the control of variable effects, replication, and randomization
• Recognize the importance of good sampling methods in general, and of a simple random sample in particular
• Understand that if sample data are not collected in an appropriate way, the data may be completely useless

Ch 2: You should be able to do the following:
• Summarize data by constructing a frequency distribution or relative frequency distribution
• Visually display the nature of the distribution by constructing a histogram or relative frequency histogram
• Investigate important characteristics of a data set by creating visual displays, such as a frequency polygon, dotplot, stemplot, Pareto chart, pie chart, scatterplot, or time-series graph
• Understand and interpret those results

Review (3) - Continued

You should be able to:
• Calculate measures of center by finding the mean and median
• Calculate measures of variation by finding the standard deviation, variance, and range
• Understand and interpret the standard deviation by using tools such as the range rule of thumb
• Compare
individual values by using z scores, quartiles, or percentiles, and identify outliers
• Investigate and explore the spread of data, the center of the data, and the range of values by constructing a boxplot
• Understand and interpret those results, such as the standard deviation as a measure of how much data vary, and use the standard deviation to distinguish between values that are usual and unusual

Examples

Always consider certain key factors:
• Context of the data
• Source of the data
• Sampling method
• Measures of center
• Measures of variation
• Distribution
• Outliers
• Changing patterns over time
• Conclusion
• Practical implications
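Worked check for 3-4.4, 3-4.5, and 3-4.9: the procedure for converting a kth percentile to a data value (compute the locator L = (k/100)n, then either average two neighboring values or round L up) can be sketched in Python. This is a minimal illustration, not part of the original slides; the function name `percentile_value` is a hypothetical helper, and k is assumed to be a whole number from 1 to 99. Statistical software often uses other quartile conventions, so library results may differ slightly from this textbook procedure.

```python
def percentile_value(data, k):
    """Value of the k-th percentile P_k using the locator L = (k/100) * n."""
    values = sorted(data)                    # arrange the data from low to high
    n = len(values)
    # Integer arithmetic: L = whole + remainder/100, so no float round-off in L
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is a whole number: mean of the L-th and (L+1)-th values (1-based)
        return (values[whole - 1] + values[whole]) / 2
    # L is not whole: round L up, i.e. take the (whole + 1)-th value (1-based)
    return values[whole]


# Speeds (mi/h) from 3-4.5
speeds = [68, 68, 72, 73, 65, 74, 73, 72, 68, 65,
          65, 73, 66, 71, 68, 74, 66, 71, 65, 73,
          59, 75, 70, 56, 66, 75, 68, 75, 62, 72,
          60, 73, 61, 75, 58, 74, 60, 73, 58, 75]

# Movie budgets (millions of dollars) from 3-4.3
budgets = [4.5, 5, 6.5, 7, 20, 20, 29, 30, 35, 40, 40, 41, 50, 52, 60,
           65, 68, 68, 70, 70, 70, 72, 74, 75, 80, 100, 113, 116, 120,
           125, 132, 150, 160, 200, 225]

# Reaction times (seconds) from 3-4.9
reaction = [2.4, 2.5, 2.8, 2.0, 2.4, 2.9, 3.2, 3.5, 2.7, 2.7, 2.8, 2.6]

p85 = percentile_value(speeds, 85)     # L = 34 is whole: mean of 34th and 35th values
p17 = percentile_value(budgets, 17)    # L = 5.95 rounds up to 6: the 6th value

# Five-number summary: min, Q1 = P25, Q2 = P50, Q3 = P75, max
five_number = (min(reaction),
               percentile_value(reaction, 25),
               percentile_value(reaction, 50),
               percentile_value(reaction, 75),
               max(reaction))

print(p85, p17)
print([round(v, 2) for v in five_number])
```

With this procedure the 85th percentile of the speeds is 74 mi/h and the 17th percentile of the movie budgets is $20 million.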