Handout 1 - UTRGV Faculty Web

Sec 3.1 Measures of Central Tendency

The population arithmetic mean, $\mu$ (pronounced "mew"), is computed using all the individuals in a population. The population mean is a parameter.

If $x_1, x_2, \ldots, x_N$ are the $N$ observations of a variable from a population, then the population mean, $\mu$, is

$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$$

If $x_1, x_2, \ldots, x_n$ are the $n$ observations of a variable from a sample, then the sample mean, $\bar{x}$, is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

The median of a variable is the value that lies in the middle of the data when arranged in ascending order. We use $M$ to represent the median.

Steps in Finding the Median of a Data Set (after arranging the data in ascending order):
• If the number of observations is odd, then the median is the data value exactly in the middle of the data set. That is, the median is the observation that lies in the $(n + 1)/2$ position.
• If the number of observations is even, then the median is the mean of the two middle observations in the data set. That is, the median is the mean of the observations that lie in the $n/2$ position and the $n/2 + 1$ position.

Example: The following data represent the travel times (in minutes) to work for all seven employees of a start-up web development company.
23, 36, 23, 18, 5, 26, 43
Determine the median of these data.

Example: Find the mean and median. Use the mean and median to identify the shape of the distribution. Verify your result by drawing a histogram of the data. The following data represent the asking price of homes for sale in Lincoln, NE.
The mean asking price is $168,320 and the median asking price is $148,700. Because the mean is noticeably larger than the median, we would conjecture that the distribution is skewed right.

The mode of a variable is the most frequent observation of the variable that occurs in the data set. A set of data can have no mode, one mode, or more than one mode. If no observation occurs more than once, we say the data have no mode.

Sec 3.2 Measures of Dispersion

The range, $R$, of a variable is the difference between the largest data value and the smallest data value.
Range = $R$ = Largest Data Value $-$ Smallest Data Value

The population standard deviation is symbolically represented by $\sigma$ (lowercase Greek sigma).

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

where $x_1, x_2, \ldots, x_N$ are the $N$ observations in the population and $\mu$ is the population mean. The short-cut formula is as follows:

$$\sigma = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}}$$

The sample standard deviation, $s$, of a variable is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

where $x_1, x_2, \ldots, x_n$ are the $n$ observations in the sample and $\bar{x}$ is the sample mean. The short-cut formula is as follows:

$$s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}}$$

Example: The following data represent the travel times (in minutes) to work for all seven employees of a start-up web development company.
23, 36, 23, 18, 5, 26, 43
Compute the population standard deviation of these data.

The Empirical Rule: If a distribution is roughly bell shaped, then
• Approximately 68% of the data will lie within 1 standard deviation of the mean. That is, approximately 68% of the data lie between $\mu - 1\sigma$ and $\mu + 1\sigma$.
• Approximately 95% of the data will lie within 2 standard deviations of the mean. That is, approximately 95% of the data lie between $\mu - 2\sigma$ and $\mu + 2\sigma$.
• Approximately 99.7% of the data will lie within 3 standard deviations of the mean. That is, approximately 99.7% of the data lie between $\mu - 3\sigma$ and $\mu + 3\sigma$.

Note: We can also use the Empirical Rule based on sample data, with $\bar{x}$ used in place of $\mu$ and $s$ used in place of $\sigma$.

Example: The following data represent the serum HDL cholesterol of the 54 female patients of a family doctor.
(a) Compute the population mean and standard deviation.
(b) Draw a histogram to verify the data are bell-shaped.
(c) Determine the percentage of all patients that have serum HDL within 3 standard deviations of the mean according to the Empirical Rule.
(d) Determine the percentage of all patients that have serum HDL between 34 and 69.1 according to the Empirical Rule.
(e) Determine the actual percentage of patients that have serum HDL between 34 and 69.1.
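The travel-time example used in Sec 3.1 and Sec 3.2 can be checked numerically. The following is a minimal Python sketch, not part of the original handout; it relies only on the standard `statistics` and `math` modules and treats the seven employees as the whole population. The short-cut form used in the cross-check is the standard rearrangement $\sqrt{(\sum x_i^2 - (\sum x_i)^2/N)/N}$; the variable names are illustrative.

```python
import math
import statistics

# Travel times (minutes) for all seven employees, copied from the example.
travel_times = [23, 36, 23, 18, 5, 26, 43]
N = len(travel_times)

mean = statistics.mean(travel_times)        # population mean, mu
median = statistics.median(travel_times)    # middle value of the sorted data
modes = statistics.multimode(travel_times)  # list of the most frequent value(s)
data_range = max(travel_times) - min(travel_times)

# Population standard deviation from the definition (divide by N, not n - 1).
sigma = statistics.pstdev(travel_times)

# The same quantity from the short-cut formula, as a cross-check.
sum_x = sum(travel_times)
sum_x2 = sum(x * x for x in travel_times)
sigma_shortcut = math.sqrt((sum_x2 - sum_x ** 2 / N) / N)

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"mode   = {modes}")
print(f"range  = {data_range}")
print(f"sigma  = {sigma:.2f} (short-cut: {sigma_shortcut:.2f})")
```

Because the data describe every employee, `pstdev` (division by N) is the appropriate call here; for sample data, `statistics.stdev` divides by n - 1 instead.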
Chebyshev's Inequality: For any data set or distribution (regardless of the shape of the distribution), at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the observations lie within $k$ standard deviations of the mean, where $k$ is any number greater than 1. That is, at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the data lie between $\mu - k\sigma$ and $\mu + k\sigma$ for $k > 1$.

Note: We can also use Chebyshev's Inequality based on sample data.

Sec 3.3 Measures of Central Tendency and Dispersion from Grouped Data

Approximate the Mean of a Variable from a Frequency Distribution

Population Mean: $$\mu = \frac{\sum x_i f_i}{\sum f_i}$$

Sample Mean: $$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$$

where $x_i$ is the midpoint or value of the $i$th class,
$f_i$ is the frequency of the $i$th class, and
$n$ is the number of classes.

Approximate the Standard Deviation of a Variable from a Frequency Distribution

Population Standard Deviation: $$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}}$$

Sample Standard Deviation: $$s = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}}$$

where $x_i$ is the midpoint or value of the $i$th class and
$f_i$ is the frequency of the $i$th class.

Example: The National Survey of Student Engagement is a survey that (among other things) asked first-year students at liberal arts colleges how much time they spend preparing for class each week. The results from the 2007 survey are summarized below. Approximate the standard deviation of the number of hours spent preparing for class each week.

The weighted mean, $\bar{x}_w$, of a variable is found by multiplying each value of the variable by its corresponding weight, adding these products, and dividing this sum by the sum of the weights.

$$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$$

where $w_i$ is the weight of the $i$th observation and
$x_i$ is the value of the $i$th observation.

Example: Bob goes to the "Buy the Weigh" Nut store and creates his own bridge mix. He combines 1 pound of raisins, 2 pounds of chocolate-covered peanuts, and 1.5 pounds of cashews. The raisins cost $1.25 per pound, the chocolate-covered peanuts cost $3.25 per pound, and the cashews cost $5.40 per pound. What is the cost per pound of this mix?

Sec 3.4 Measures of Position

The z-score represents the distance that a data value is from the mean in terms of the number of standard deviations.
Population z-score: $$z = \frac{x - \mu}{\sigma}$$

Sample z-score: $$z = \frac{x - \bar{x}}{s}$$

The z-score is unitless. It has mean 0 and standard deviation 1.

Example: The mean height of males 20 years or older is 69.1 inches with a standard deviation of 2.8 inches. The mean height of females 20 years or older is 63.7 inches with a standard deviation of 2.7 inches. Data are based on information obtained from the National Health and Examination Survey. Who is relatively taller, Kevin Garnett or Candace Parker?
Kevin Garnett's height is 4.96 standard deviations above the mean for males. Candace Parker's height is 4.56 standard deviations above the mean for females. Because his z-score is larger, Kevin Garnett is relatively taller.

The $k$th percentile, denoted $P_k$, of a set of data is a value such that $k$ percent of the observations are less than or equal to the value.

Example: Interpret the percentile.
The Graduate Record Examination (GRE) is a test required for admission to many U.S. graduate schools. The University of Pittsburgh Graduate School of Public Health requires a GRE score no less than the 70th percentile for admission into their Human Genetics MPH or MS program. Interpret this admissions requirement.
In general, the 70th percentile is the score such that 70% of the individuals who took the exam scored worse, and 30% of the individuals scored better. In order to be admitted to this program, an applicant must score as high as or higher than 70% of the people who take the GRE. Put another way, the individual's score must be in the top 30%.

Quartiles divide data sets into fourths, or four equal parts.
• The 1st quartile, denoted $Q_1$, divides the bottom 25% of the data from the top 75%. Therefore, the 1st quartile is equivalent to the 25th percentile.
• The 2nd quartile, denoted $Q_2$, divides the bottom 50% of the data from the top 50% of the data, so that the 2nd quartile is equivalent to the 50th percentile, which is equivalent to the median.
• The 3rd quartile, denoted $Q_3$, divides the bottom 75% of the data from the top 25% of the data, so that the 3rd quartile is equivalent to the 75th percentile.
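The z-score comparison from the height example earlier in this section can be sketched in a few lines of Python. This is an illustration only: the handout states the resulting z-scores (4.96 and 4.56) but not the two heights, so the values 83 inches and 76 inches below are assumptions, back-calculated from those z-scores. The helper name `z_score` is likewise hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Means and standard deviations (inches) as given in the example.
MALE_MEAN, MALE_SD = 69.1, 2.8
FEMALE_MEAN, FEMALE_SD = 63.7, 2.7

# ASSUMED heights in inches: not stated in the handout, chosen to be
# consistent with the z-scores it reports.
garnett_height = 83.0
parker_height = 76.0

z_garnett = z_score(garnett_height, MALE_MEAN, MALE_SD)
z_parker = z_score(parker_height, FEMALE_MEAN, FEMALE_SD)

print(f"Garnett z = {z_garnett:.2f}")
print(f"Parker  z = {z_parker:.2f}")
print("Relatively taller:", "Garnett" if z_garnett > z_parker else "Parker")
```

Each height is compared with its own reference population, which is what makes the two z-scores comparable even though the raw heights come from different distributions.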
Example: A group of Brigham Young University-Idaho students (Matthew Herring, Nathan Spencer, Mark Walker, and Mark Steiner) collected data on the speed of vehicles traveling through a construction zone on a state highway, where the posted speed was 25 mph. The recorded speeds of 14 randomly selected vehicles are given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
Find and interpret the quartiles for speed in the construction zone.

Step 1: The data are already in ascending order.
Step 2: There are $n = 14$ observations, so the median, or second quartile, $Q_2$, is the mean of the 7th and 8th observations: $(32 + 33)/2$. Therefore, $M = 32.5$.
Step 3: The median of the bottom half of the data is the first quartile, $Q_1$.
20, 24, 27, 28, 29, 30, 32
The median of these seven observations is 28. Therefore, $Q_1 = 28$. The median of the top half of the data (33, 34, 36, 38, 39, 40, 40) is the third quartile, $Q_3$. Therefore, $Q_3 = 38$.

• 25% of the speeds are less than or equal to the first quartile, 28 miles per hour, and 75% of the speeds are greater than 28 miles per hour.
• 50% of the speeds are less than or equal to the second quartile, 32.5 miles per hour, and 50% of the speeds are greater than 32.5 miles per hour.
• 75% of the speeds are less than or equal to the third quartile, 38 miles per hour, and 25% of the speeds are greater than 38 miles per hour.

The interquartile range, IQR, is the range of the middle 50% of the observations in a data set. That is, the IQR is the difference between the third and first quartiles and is found using the formula

IQR = $Q_3 - Q_1$

Example: A group of Brigham Young University-Idaho students (Matthew Herring, Nathan Spencer, Mark Walker, and Mark Steiner) collected data on the speed of vehicles traveling through a construction zone on a state highway, where the posted speed was 25 mph.
The recorded speeds of 14 randomly selected vehicles are given below:
20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
Determine and interpret the interquartile range of the speed data.
With $Q_1 = 28$ and $Q_3 = 38$, IQR = $38 - 28 = 10$. The range of the middle 50% of the speeds of cars traveling through the construction zone is 10 miles per hour.

Suppose a 15th car travels through the construction zone at 100 miles per hour. How does this value impact the mean, median, standard deviation, and interquartile range?

Checking for Outliers by Using Quartiles
Step 1: Determine the first and third quartiles of the data.
Step 2: Compute the interquartile range.
Step 3: Determine the fences. Fences serve as cutoff points for determining outliers.
Lower Fence = $Q_1 - 1.5(\text{IQR})$
Upper Fence = $Q_3 + 1.5(\text{IQR})$
Step 4: If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier.

Sec 3.5 The Five-Number Summary and Boxplots

The five-number summary of a set of data consists of the smallest data value, $Q_1$, the median, $Q_3$, and the largest data value. We organize the five-number summary as follows:
Minimum, $Q_1$, $M$, $Q_3$, Maximum

Drawing a Boxplot
Step 1: Determine the lower and upper fences.
Lower Fence = $Q_1 - 1.5(\text{IQR})$
Upper Fence = $Q_3 + 1.5(\text{IQR})$
where IQR = $Q_3 - Q_1$
Step 2: Draw a number line long enough to include all the necessary values. Insert vertical lines at $Q_1$, $M$, and $Q_3$. Enclose these vertical lines in a box.
Step 3: Label the lower and upper fences.
Step 4: Draw a line from $Q_1$ to the smallest data value that is larger than the lower fence. Draw a line from $Q_3$ to the largest data value that is smaller than the upper fence. These values are called adjacent values and these lines are called whiskers.
Step 5: Any data values less than the lower fence or greater than the upper fence are outliers and are marked with an asterisk (*).

Example: Every six months, the United States Federal Reserve Board conducts a survey of credit card plans in the U.S.
The following data are the interest rates charged by 10 credit card issuers randomly selected for the July 2005 survey. Construct a boxplot of the data.

Step 1: The interquartile range (IQR) is 14.4% $-$ 12% = 2.4%. The lower and upper fences are:
Lower Fence = $Q_1 - 1.5(\text{IQR}) = 12 - 1.5(2.4) = 8.4\%$
Upper Fence = $Q_3 + 1.5(\text{IQR}) = 14.4 + 1.5(2.4) = 18.0\%$
Step 2: Draw the boxplot.

Use a boxplot and quartiles to describe the shape of a distribution.
The interest rate boxplot indicates that the distribution is skewed left.
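The quartile, IQR, and fence calculations from Sec 3.4 and Sec 3.5 can be sketched in Python as follows. This is a minimal illustration rather than the handout's own procedure in code: the function names `quartiles` and `fences` are hypothetical, and the rule for an odd number of observations (leave the median out of both halves) is an assumption, since the handout only works an even-sized example. Statistical software often uses other quartile conventions and may return slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Assumption: for odd n, the middle observation belongs to neither half.
    upper = xs[half:] if n % 2 == 0 else xs[half + 1:]
    return median(lower), median(xs), median(upper)

def fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Construction-zone speeds (mph) from the quartile example.
speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
q1, m, q3 = quartiles(speeds)
print("Q1, M, Q3 =", q1, m, q3, " IQR =", q3 - q1)

# Add the 15th car at 100 mph and check it against the new fences.
speeds_15 = speeds + [100]
q1_15, m_15, q3_15 = quartiles(speeds_15)
low_15, high_15 = fences(q1_15, q3_15)
outliers = [x for x in speeds_15 if x < low_15 or x > high_15]
print("Fences with 15 cars:", low_15, high_15, " outliers:", outliers)

# Credit card example: only Q1 = 12 and Q3 = 14.4 are given in the text.
low_rate, high_rate = fences(12, 14.4)
print(f"Interest-rate fences: {low_rate:.1f}% and {high_rate:.1f}%")
```

For the 15-car data, the median and IQR barely move while the 100 mph value falls well above the upper fence, which illustrates why the median and IQR are described as resistant to extreme observations.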
