Chapter 4
Numerical Methods for Describing Data

Parameter
Suppose we want to know the MEAN length of all the fish in Lake Lewisville ...
• Fixed value about a population
• Typically unknown
Is this a value that is known? Can we find it out? At any given point in time, how many values are there for the mean length of fish in the lake?

Statistic
Suppose we want to know the MEAN length of all the fish in Lake Lewisville. What can we do to estimate this unknown parameter?
• Value calculated from a sample

Measures of Central Tendency
• Mode: the observation that occurs the most often
  – Can be more than one mode
  – If all values occur only once, there is no mode
  – Not used as often as mean & median

Measures of Central Tendency
Median: the middle value of the data; it divides the observations in half.
To find it, list the observations in numerical order. Then
sample median = the single middle value if n is odd
sample median = the average of the two middle values if n is even
where n = sample size.

Suppose we catch a sample of 5 fish from the lake. The lengths of the fish (in inches) are listed below. Find the median length of fish.
3 4 5 8 10
The numbers are in order & n is odd, so find the middle observation. The median length of fish is 5 inches.

Suppose we caught a sample of 6 fish from the lake. The median length is …
3 4 5 6 8 10
The numbers are in order & n is even, so find the middle two observations (5 and 6). Now, average these two values. The median length is 5.5 inches.

Measures of Central Tendency
Mean is the arithmetic average.
– Use $\mu$ to represent a population mean (a parameter); $\mu$ is the lower case Greek letter mu
– Use $\bar{x}$ to represent a sample mean (a statistic)
Formula: $\bar{x} = \frac{\sum x}{n}$
$\Sigma$ is the capital Greek letter sigma; it means to sum the values that follow.

Suppose we caught a sample of 6 fish from the lake. Find the mean length of the fish.
3 4 5 6 8 10
To find the mean length of fish, add the observations and divide by n.
$\bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 10}{6} = 6$

Now find how each observation deviates from the mean.
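The median and mean calculations above, and the deviations that the last prompt asks for, can be checked with a short Python sketch. It assumes only the standard-library `statistics` module; the variable names (`lengths`, `x_bar`, `deviations`) are illustrative and do not come from the slides.

```python
from statistics import mean, median

# Fish lengths (inches) from the five-fish and six-fish examples.
five_fish = [3, 4, 5, 8, 10]
lengths = [3, 4, 5, 6, 8, 10]

median_five = median(five_fish)   # n is odd: the single middle value
median_six = median(lengths)      # n is even: average of the two middle values
x_bar = mean(lengths)             # sum of the observations divided by n

# Deviation of each observation from the sample mean.
deviations = [x - x_bar for x in lengths]

print("median (5 fish):", median_five)
print("median (6 fish):", median_six)
print("mean (6 fish):", x_bar)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```

The deviations sum to zero here, which is the balance-point property of the mean.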
x                 3    4    5    6    8    10    Sum
$x - \bar{x}$    -3   -2   -1    0    2     4      0

The first deviation is 3 - 6 = -3; this is the deviation from the mean. Find the rest of the deviations from the mean.
What is the sum of the deviations from the mean? 0
Will this sum always equal zero? YES
The mean is considered the balance point of the distribution because it "balances" the positive and negative deviations.

Imagine a ruler with pennies placed at 3", 4", 5", 6", 8" and 10". To balance the ruler on your finger, you would need to place your finger at the mean of 6.
The mean is the balance point of a distribution.

What happens to the median & mean if the length of 10 inches was 15 inches?
3 4 5 6 8 15
The median is 5.5. The mean is 6.833.
What happened?

What happens to the median & mean if the 15 inches was 20?
3 4 5 6 8 20
The median is 5.5. The mean is 7.667.
What happened?

Statistics that are not affected by extreme values are said to be resistant.
Is the median resistant? YES
Is the mean resistant? NO

Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.) Calculate the mean and median.
3 6 5 4 6 7 10 5 6 9 7 9 7 8 8 7 4 6 5 8
Mean = 6.5 Median = 6.5
Look at the placement of the mean and median in this symmetrical distribution.

Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.) Calculate the mean and median.
3 6 5 4 6 12 10 5 15 3 7 4 3 8 3 13 4 11 5 9
Mean = 6.8 Median = 5.5
Look at the placement of the mean and median in this skewed distribution.

Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish. (Use a class width of 1.) Calculate the mean and median.
3 6 5 4 6 9 10 10 10 9 7 9 10 10 8 7 9 10 5 8
Mean = 7.75 Median = 8.5
Look at the placement of the mean and median in this skewed distribution.

Recap:
• In a symmetrical distribution, the mean and median are equal.
• In a skewed distribution, the mean is pulled in the direction of the skewness.
• In a symmetrical distribution, you should report the mean!
• In a skewed distribution, the median should be reported as the measure of center!

Trimmed mean
Purpose is to remove outliers from a data set.
To calculate a trimmed mean:
• Multiply the percent to trim by n
• Truncate that many observations from BOTH ends of the distribution (when listed in order)
• Calculate the mean with the shortened data set

Find the mean of the following set of data.
12 14 19 20 22 24 25 26 26 50
Mean = 23.8
Find a 10% trimmed mean.
10%(10) = 1, so remove one observation from each side!
14 19 20 22 24 25 26 26
$\bar{x}_T = \frac{176}{8} = 22$

What values are used to describe categorical data?
Suppose that each person in a sample of 15 cell phone users is asked if he or she is satisfied with the cell phone service. What would be the possible responses?
Here are the responses:
Y N Y Y N N Y Y Y N Y Y Y N N
Find the sample proportion of the people who answered "yes":
$\hat{p} = \frac{\text{number of successes}}{n}$ ($\hat{p}$ is pronounced p-hat)
$\hat{p} = \frac{9}{15} = 0.6$
60% of the sample was satisfied with their cell phone service.
The population proportion is denoted by the letter p.

Why is the study of variability important?
• There is variability in virtually everything (Does this can of soda contain exactly 12 ounces?)
• Allows us to distinguish between usual & unusual values
• Reporting only a measure of center doesn't provide a complete picture of the distribution.
[Figure: dotplots A, B, and C, each on a scale from 20 to 70]
What are the mean and median of these three graphs?

Measures of Variability
The simplest numeric measure of variability is range.
Range = largest observation - smallest observation
[Figure: dotplots A, B, and C, each on a scale from 20 to 70]
What is the range of these data sets?
The first two data sets have a range of 50 (70 - 20) but the third data set has a much smaller range of 10.

Measures of Variability
How would a dotplot look if the average deviation was 0?
What does it mean to have an average deviation of 0?
[Figure: dotplot on a scale from 1 to 5]

Measures of Variability
Another measure of the variability in a data set uses the deviations from the mean, $x - \bar{x}$.
What is a deviation from the mean? What is the mean of this distribution? 45
[Figure: dotplot A on a scale from 20 to 70]

Measures of Variability
Remember the sample of 6 fish that we caught from the lake ...
They were the following lengths: 3", 4", 5", 6", 8", 10"
The mean length was 6 inches. Recall that we calculated the deviations from the mean. What was the sum of these deviations?
Can we find an average deviation? What can we do to the deviations so that we could find an average?

The estimated average of the squared deviations is called the variance.
Population variance is denoted by $\sigma^2$ and is divided by n.
Sample variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The denominator n - 1 is the degrees of freedom (explained later).

Remember the sample of 6 fish that we caught from the lake . . . Find the variance of the length of the fish.
Finding the average of the deviations would always equal 0! What could we do so that we would be able to find an average deviation? First square the deviations.

x      $x - \bar{x}$    $(x - \bar{x})^2$
3      -3               9
4      -2               4
5      -1               1
6       0               0
8       2               4
10      4               16
Sum     0               34

What is the sum of the deviations squared? 34. Divide this by 5.
$s^2 = \frac{34}{5} = 6.8$

Measures of Variability
The square root of variance is called standard deviation.
A typical deviation from the mean is the standard deviation.
$s^2 = 6.8$ inches$^2$, so $s = 2.608$ inches
The fish in our sample typically deviate from the mean of 6 by about 2.608 inches.
The most commonly used measures of center and variability are the mean and standard deviation, respectively.

Calculation of standard deviation of a sample:
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population standard deviation is denoted by $\sigma$ (where n is used in the denominator).

Degrees of Freedom (df)
• The number of independent observations that are free to vary
However, once five of the six fish lengths occur, the sixth value is no longer free to vary.
It MUST be a specific value in order for the deviations from the mean to have a sum of zero. For example, in the sample of 6 fish where the mean is 6 inches, five of the values are free to be any possible length of fish, and the sixth is then fixed by the other five.
Thus, out of a sample of n, n - 1 observations are free to vary.

Measures of Variability
Interquartile range (IQR) is the range of the middle half of the data.
Lower quartile (Q1) is the median of the lower half of the data
Upper quartile (Q3) is the median of the upper half of the data
IQR = Q3 - Q1

The Chronicle of Higher Education (2009-2010 issue) published the accompanying data on the percentage of the population with a bachelor's or higher degree in 2007 for each of the 50 states and the District of Columbia.
21 27 35 25 22 26 27 30 38 32
25 26 24 25 26 29 19 29 31 24
33 30 22 19 22 34 35 24 24 28
30 35 29 27 26 17 26 20 27 30
25 47 20 23 23 23 26 27 34 25
34
Find the interquartile range for this set of data.

First put the data in order & find the median.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
Median = 26
Find the lower quartile (Q1) by finding the median of the lower half: Q1 = 24
Find the upper quartile (Q3) by finding the median of the upper half: Q3 = 30
IQR = 30 - 24 = 6

Which measure(s) of variability (spread) is/are resistant? Only the IQR!

Wolf Stat Company Activity
How do the mean and standard deviation change with linear transformations?

Linear transformation rule
• When adding a constant to a random variable, the mean changes but not the standard deviation.
• When multiplying a random variable by a constant, both the mean and the standard deviation change.

An appliance repair shop charges a $30 service call to go to a home for a repair. It also charges $25 per hour for labor.
From past history, the average length of repairs is 1 hour 15 minutes (1.25 hours) with standard deviation of 20 minutes (1/3 hour). Including the charge for the service call, what are the mean and standard deviation for the charges for labor?
$\mu = 30 + 25(1.25) = 61.25$ (dollars)
$\sigma = 25\left(\frac{1}{3}\right) \approx 8.33$ (dollars)

Stat Land Game Activity
How do you combine the mean and standard deviation of two independent random variables?

Rules for Combining two variables
• To find the mean for the sum (or difference), add (or subtract) the two means
• To find the standard deviation of the sum (or difference), ALWAYS add the variances, then take the square root.
$\mu_{a+b} = \mu_a + \mu_b$
$\mu_{a-b} = \mu_a - \mu_b$
$\sigma_{a \pm b} = \sqrt{\sigma_a^2 + \sigma_b^2}$ (if the variables are independent)

Bicycles arrive at a bike shop in boxes. Before they can be sold, they must be unpacked, assembled, and tuned (lubricated, adjusted, etc.). Based on past experience, the times for each setup phase are independent with the following means & standard deviations (in minutes). What are the mean and standard deviation for the total bicycle setup times?

Phase        Mean    SD
Unpacking     3.5    0.7
Assembly     21.8    2.4
Tuning       12.3    2.7

$\mu_T = 3.5 + 21.8 + 12.3 = 37.6$ minutes
$\sigma_T = \sqrt{0.7^2 + 2.4^2 + 2.7^2} \approx 3.680$ minutes

Another graph: Boxplots
What are some advantages of boxplots?
• Ease of construction
• Convenient handling of outliers
• Construction is not subjective (as it is for histograms)
• Used with medium or large size data sets (n > 10)
• Useful for comparative displays

Boxplots
When to use: univariate numerical data
Use for moderate to large data sets. Don't use with data sets of n < 10.
The five-number summary is the minimum value, first quartile, median, third quartile, and maximum value.
How to construct a Skeleton Boxplot
– Calculate the five number summary
– Draw a horizontal (or vertical) scale
– Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)
– Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation
To describe:
– comment on the center, spread, and shape of the distribution and on any unusual features

Remember the data on the percentage of the population with a bachelor's or higher degree in 2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
First draw a scale. Draw a box from Q1 to Q3. Draw a line for the median. Draw lines for the whiskers.
[Figure: skeleton boxplot; axis labeled "Percentages", scale from 10 to 50]

Modified boxplots
To display outliers:
• Identify mild & extreme outliers
  – An observation is an outlier if it is more than 1.5(iqr) away from the nearest quartile, that is, below $Q_1 - 1.5(\text{iqr})$ or above $Q_3 + 1.5(\text{iqr})$.
  – An outlier is extreme if it is more than 3(iqr) away from the nearest quartile, that is, below $Q_1 - 3(\text{iqr})$ or above $Q_3 + 3(\text{iqr})$.
• Whiskers extend to the largest (or smallest) data observation that is not an outlier
Modified boxplots are generally preferred because they provide more information about the data distribution.

Remember the data on the percentage of the population with a bachelor's or higher degree in 2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
First, draw the scale, the box, and the line for the median.
Next calculate the fences for the outliers:
24 - 1.5(6) = 15
30 + 1.5(6) = 39
There is one outlier at the upper end of the distribution, 47%, but none at the lower end.
Draw lines for the whiskers. Place a solid dot for the outlier.
To describe: The distribution of the percent of the population with a bachelor's degree or higher for the U.S. states and the District of Columbia is positively skewed with an outlier at 47%. The median percentage is 26% with a range of 30%.
Is the outlier extreme?
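The fence calculation for the degree data can be sketched in Python as follows. This is an illustration under stated assumptions, not the slides' own procedure: the helper `quartiles` is hypothetical, and it assumes the quartiles are the medians of the lower and upper halves, with the middle value left out of both halves when n is odd.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using medians of the lower and upper halves.

    Assumption: when n is odd, the middle value belongs to neither half.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

# Percent with a bachelor's degree or higher, 50 states and D.C. (sorted).
pct = [17, 19, 19, 20, 20, 21, 22, 22, 22, 23,
       23, 23, 24, 24, 24, 24, 25, 25, 25, 25,
       25, 26, 26, 26, 26, 26, 26, 27, 27, 27,
       27, 27, 28, 29, 29, 29, 30, 30, 30, 30,
       31, 32, 33, 34, 34, 34, 35, 35, 35, 38,
       47]

q1, med, q3 = quartiles(pct)
iqr = q3 - q1

# Fences for outliers (1.5 iqr) and for extreme outliers (3 iqr).
mild_fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
extreme_fences = (q1 - 3 * iqr, q3 + 3 * iqr)

outliers = [x for x in pct if x < mild_fences[0] or x > mild_fences[1]]
extreme = [x for x in pct if x < extreme_fences[0] or x > extreme_fences[1]]

# Whiskers stop at the most extreme observations that are not outliers.
inside = [x for x in pct if mild_fences[0] <= x <= mild_fences[1]]
whiskers = (min(inside), max(inside))

print("Q1, median, Q3:", q1, med, q3)
print("fences:", mild_fences, "extreme fences:", extreme_fences)
print("outliers:", outliers, "extreme outliers:", extreme)
print("whisker ends:", whiskers)
```

With this data the only observation outside the fences is 47, and it lies inside the extreme fences.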
Fence for extreme outliers: 30 + 3(6) = 48. Since 47 < 48, the outlier at 47% is not extreme.
[Figure: modified boxplot; axis labeled "Percentages", scale from 10 to 50]

Symmetrical boxplots
Notice that all 3 boxplots are identical, but their corresponding histograms are very different. Can you determine the number of modes from a boxplot?

Approximately symmetrical boxplot
Notice that the range of the lower half and the range of the upper half of this distribution are approximately equal, so we can say that it is approximately symmetrical.

Skewed boxplot
However, the ranges of the two halves of this distribution are definitely different sizes, so it is skewed in the direction of the longer side.

The 2009-2010 salaries of NBA players published on the web site hoopshype.com were used to construct the comparative boxplot of salary data for five teams. Discuss the similarities and differences.

Normal Curve
• Bell-shaped, symmetrical, unimodal curve
• Transition points between cupping upward and downward occur at $\mu \pm \sigma$
• As the standard deviation increases, the curve flattens and spreads
• As the standard deviation decreases, the curve gets taller and thinner

Let's use our calculator to graph some normal curves.
Put the following into your calculator: (Window: x: [0,20] & y: [0,0.3])
Y1: normalpdf(X,10,2)
Y2: normalpdf(X,10,1.5)
Y3: normalpdf(X,10,3)
What happens?

What's my area?
Input the following command into a graphing calculator in order to graph a normal curve with a mean of 20 and standard deviation of 3.
Y1 = normalpdf(X,20,3) (Window x: [10,30] y: [0,0.2])
Use the command 2nd trace, 7 to find the area under the curve for the following: (Round to 3 decimal places.)
Lower limit: 17   Upper limit: 23   Area: ________
Lower limit: 14   Upper limit: 26   Area: ________
Lower limit: 11   Upper limit: 29   Area: ________

What's my area?
Graph a normal curve with a mean of 50 and standard deviation of 5.
Y1 = normalpdf(X,50,5) (x: [30,70] y: [0,0.1])
Find the area under the curve for the following:
Lower limit: 45   Upper limit: 55   Area: ________
Lower limit: 40   Upper limit: 60   Area: ________
Lower limit: 35   Upper limit: 65   Area: ________
What pattern do you notice?

Interpreting Center & Variability
Empirical Rule
Can ONLY be used with distributions that are mound shaped!
• Approximately 68% of the observations are within 1 standard deviation of the mean
• Approximately 95% of the observations are within 2 standard deviations of the mean
• Approximately 99.7% of the observations are within 3 standard deviations of the mean

The height of male students at PWSH is approximately normally distributed with a mean of 71 inches and standard deviation of 2.5 inches.
a) What percent of the male students are shorter than 66 inches? About 2.5%
b) Taller than 73.5 inches? About 16%
c) Between 66 & 73.5 inches? About 81.5%

Measures of Relative Standing
Z-score
A z-score tells us how many standard deviations the value is from the mean.
$z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
It is one example of a standardized score.

What do these z-scores mean?
-2.3: 2.3 standard deviations below the mean
1.8: 1.8 standard deviations above the mean
-4.3: 4.3 standard deviations below the mean

Sally is taking two different math achievement tests with different means and standard deviations. The mean score on test A was 56 with a standard deviation of 3.5, while the mean score on test B was 65 with a standard deviation of 2.8. Sally scored a 62 on test A and a 69 on test B. On which test did Sally score better?
Z-score on test A: $z = \frac{62 - 56}{3.5} = 1.714$
Z-score on test B: $z = \frac{69 - 65}{2.8} = 1.429$
She did better on test A.

Measures of Relative Standing
Percentiles
A percentile is a value in the data set where r percent of the observations fall AT or BELOW that value.
In addition to weight and length, head circumference is another measure of health in newborn babies.
The National Center for Health Statistics reports the following summary values for head circumference (in cm) at birth for boys.

Percentile                 5      10     25     50     75     90     95
Head circumference (cm)    32.2   33.2   34.5   35.8   37.0   38.2   38.6

What percent of newborn boys had head circumferences greater than 37.0 cm? 25%
10% of newborn boys have head circumferences bigger than what value? 38.2 cm
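The relative-standing calculations in this section (Sally's z-scores and the percentile questions) can be sketched in Python. The function names `z_score`, `percent_above`, and `value_with_percent_above` are hypothetical helpers introduced for illustration; the lookups only work for values that appear in the percentile table and do not interpolate.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

# Sally's scores: test A (mean 56, sd 3.5) and test B (mean 65, sd 2.8).
z_a = z_score(62, 56, 3.5)
z_b = z_score(69, 65, 2.8)
better_test = "A" if z_a > z_b else "B"

# Head circumference (cm) at birth for boys, keyed by percentile.
head_circ = {5: 32.2, 10: 33.2, 25: 34.5, 50: 35.8, 75: 37.0, 90: 38.2, 95: 38.6}

def percent_above(value_cm):
    """Percent of observations above a tabulated value (100 minus its percentile)."""
    for pctile, cm in head_circ.items():
        if cm == value_cm:
            return 100 - pctile
    raise ValueError("value is not in the table")

def value_with_percent_above(percent):
    """Tabulated value that the given percent of observations exceed."""
    return head_circ[100 - percent]

print("z on test A:", round(z_a, 3))
print("z on test B:", round(z_b, 3))
print("better relative score on test", better_test)
print("percent above 37.0 cm:", percent_above(37.0))
print("value exceeded by 10%:", value_with_percent_above(10), "cm")
```

Comparing z-scores rather than raw scores is what makes the two tests comparable, since each score is measured in standard deviations from its own test's mean.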