Lecture 2

Instruction: Measures of Central Tendency

This lecture discusses numerical descriptive measures, beginning with three statistics of numerical data sets called measures of central tendency.

A measure of central tendency is a statistic that assigns a numerical value as representative of an entire data set.

One measure of central tendency is the arithmetic mean. The symbol X-bar, $\bar{X}$, denotes the arithmetic mean of a sample set. The arithmetic mean, defined below, can be thought of as the average.

For a given numerical set of data $S = \{X_1, X_2, \ldots, X_n\}$ with n elements, the arithmetic mean of the set is given by the formula:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}.$$

The arithmetic mean has three significant characteristics. First, changing the value of any score, or adding to the data set a new score not equal to the mean, will change the mean of the data set. Second, if some constant value c is added to each value in the data set, the mean changes to $\bar{X} + c$. Third, if each value in the data set is multiplied by some constant value c, the mean changes to $c \cdot \bar{X}$.

A second measure of central tendency is the median. The median, defined below, is the midpoint of the distribution of the data set.

For data set S arranged in ascending order, the median is the value that divides the data set in half, so that 50% of the data are equal to or less than the median. If $(n + 1)/2$ is an integer, it equals the position of the median. If $(n + 1)/2$ is not an integer, the position of the median is the midpoint between the score in the $n/2$ position and the score in the $(n + 2)/2$ position.

If $n(S)$ is odd for a sample S of non-rounded data arranged in ascending order, the median is the middle number in S. If $n(S)$ is even for a sample S of non-rounded data arranged in ascending order, the median is the mean of the two middle numbers.

The third measure of central tendency is the mode. The mode, defined below, is the most common number in a numerical data set.
For data set S with some frequency $f_k$ greater than any other frequency $f_j$, the mode is the value with the greatest frequency.

According to the definition above, there is no mode in a numerical data set in which the frequencies of all the data values are equal. If, however, at least one frequency is greater than some other frequency, the data set has a mode, and the mode equals the data value (or values) with the greatest frequency. Data sets with multiple modes are said to be multimodal. Data sets with two modes are said to be bimodal.

Consider a sample V = {6, 5, 2, 12, 1, 3, 2, 4, 0, 4, 13, 6, 6, 7, 1, 6}. To find the three measures of central tendency, we must find the arithmetic mean, the median, and the mode. Arranging the data set in ascending order will help identify frequencies and the median: V = {0, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 12, 13}. The data point 6 appears in the data set the most (has the greatest frequency), so the mode equals 6. The arithmetic mean equals the ratio of the sum of the data points to the number of data points as computed below.

$$\bar{X} = \frac{0 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 5 + 6 + 6 + 6 + 6 + 7 + 12 + 13}{16} = \frac{78}{16} = 4.875$$

Since $n(V)$ is even, the median equals the mean of the two middle numbers (the 8th and 9th scores) as computed below.

$$\text{median} = \frac{4 + 5}{2} = \frac{9}{2} = 4.5$$

In summary, for the given data set V, we have the three measures of central tendency: mean = 4.875, median = 4.5, and mode = 6.

Consider a larger set of data S displayed by the frequency distribution below.

X:  22  23  24  25  26  27  28  29  30  31  32  33
f:   5   3   7   1   1   2   4  10   4   1   1   1

Since the frequency distribution organizes the data set, finding the three measures of central tendency is not much more difficult for S than it was for V, even though $n(S) > n(V)$. Note that $n(S) = \sum f = 5 + 3 + 7 + 1 + 1 + 2 + 4 + 10 + 4 + 1 + 1 + 1 = 40$. To find the mode, select the data value with the greatest frequency, which is 29.
To find the median, start by calculating its position: $(40 + 1)/2 = 20.5$. Since the position of the median is 20.5, the median equals the average of the 20th and 21st values in the data set arranged in ascending order: $(28 + 28)/2 = 28$. To find the arithmetic mean, calculate the ratio of the sum of the data points to the number of data points as below, where each sum runs over the distinct values X in the frequency distribution.

$$\bar{X} = \frac{\sum [f \cdot X]}{\sum f} = \frac{5 \cdot 22 + 3 \cdot 23 + 7 \cdot 24 + 25 + 26 + 2 \cdot 27 + 4 \cdot 28 + 10 \cdot 29 + 4 \cdot 30 + 31 + 32 + 33}{40} = \frac{1070}{40} = 26.75$$

In summary, for the given data set S, we have the three measures of central tendency: mean = 26.75, median = 28, and mode = 29.

Instruction: Measures of Dispersion

This portion of the lecture discusses four statistics of numerical data sets called measures of dispersion. Consider the two samples below, each with the same mean and median.

A = {47, 50, 53}
B = {0, 50, 100}

For both sets, $\bar{X} = 50$. For sample A, the mean is a good estimate for any score found in the set, but the mean is not a good estimate for any score found in sample B. The scores in sample B are spread farther apart than those in sample A. Sample B is said to have greater variability. Statistics that measure the magnitude of variability are called measures of dispersion.

A measure of dispersion is a statistic that assigns a numerical value to describe the variability of a data set.

Variability refers to the spread of a data set. A measure of dispersion measures how spread out or how widely dispersed a set of data is. One particular measure of dispersion is the range. The range, defined below, is the distance between the largest and smallest values in a sample.

The range is the difference between the largest and smallest values in a sample.

The range of set A above equals six because 53 − 47 = 6. The range of set B above equals 100 because 100 − 0 = 100.

A second measure of dispersion is the sample variance. To discuss variance, we must first discuss a deviation and the squares of deviations.
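As a check on the two worked examples above (sample V and the frequency distribution S), the following is a minimal Python sketch and not part of the original lecture. The names `expand` and `central_tendency` are illustrative choices made here, and the sketch assumes Python 3.8 or later, where `statistics.multimode` is available.

```python
from statistics import mean, median, multimode

# Sample V from the lecture, as raw scores.
V = [6, 5, 2, 12, 1, 3, 2, 4, 0, 4, 13, 6, 6, 7, 1, 6]

# Frequency distribution S from the lecture: value -> frequency.
S_FREQ = {22: 5, 23: 3, 24: 7, 25: 1, 26: 1, 27: 2,
          28: 4, 29: 10, 30: 4, 31: 1, 32: 1, 33: 1}


def expand(freq):
    """Turn a value -> frequency table into a sorted list of raw scores."""
    return [x for x in sorted(freq) for _ in range(freq[x])]


def central_tendency(scores):
    """Return (mean, median, modes) for a list of numerical scores.

    Following the lecture's definition, the list of modes is empty when
    every distinct value occurs equally often.
    """
    modes = multimode(scores)  # every value tied for the greatest frequency
    if len(modes) == len(set(scores)):
        modes = []  # all frequencies are equal, so there is no mode
    return mean(scores), median(scores), modes


print(central_tendency(V))
print(central_tendency(expand(S_FREQ)))
```

For V this gives a mean of 4.875, a median of 4.5, and a single mode of 6; for S it gives a mean of 26.75, a median of 28, and a single mode of 29, matching the summaries above. Expanding the frequency table into raw scores is the simplest approach for a table this small; for very large frequencies, working directly with the weighted sum would avoid building the long list.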
Deviation equals signed distance from the mean. A deviation score equals $X_i - \bar{X}$.

According to the definition above, scores below the mean have negative deviations, scores above the mean have positive deviations, and scores equal to the mean have zero deviation. The table below shows the deviations for set A.

$X_i$    $X_i - \bar{X}$
47       -3
50        0
53        3

While deviations can be positive or negative depending on the position of the respective score, the squares of deviations, $(X_i - \bar{X})^2$, are never negative. To calculate sample variance, we must calculate the deviation of each score in the sample as above as well as the square of each deviation as below.

$X_i$    $X_i - \bar{X}$    $(X_i - \bar{X})^2$
47       -3                 9
50        0                 0
53        3                 9

The population variance equals the mean of the squares of the deviations. The sample variance is an estimate of the population variance, given by the formula below.

The sample variance, denoted var or $S^2$, equals the ratio:

$$\text{var} = S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$

The sample variance for sample A = {47, 50, 53} is calculated below.

$$\text{var} = S^2 = \frac{9 + 0 + 9}{3 - 1} = \frac{18}{2} = 9$$

The third measure of dispersion is the standard deviation, which equals the square root of the variance.

The sample standard deviation, denoted S, is a distance from the mean that equals the square root of the variance:

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}.$$

The sample standard deviation measures the typical or standard distance of scores in the sample from the mean. According to the definition above, widely dispersed samples have large standard deviations. Indeed, the larger the sample's standard deviation, the more widely dispersed are the elements in the sample. The standard deviation of sample A = {47, 50, 53} is given here: $S = \sqrt{9} = 3$.

The standard deviation has two key characteristics. First, adding a constant to each score in a sample will not change the standard deviation.
Thus, if A* = {46, 49, 52}, which is sample A with one subtracted from each score, then S = 3. Second, multiplying each score by a constant causes the standard deviation to be multiplied by the same constant (by its absolute value, if the constant is negative). Thus, if A* = {94, 100, 106}, which is sample A with each score doubled, then S = 6.

A fourth measure of dispersion is the coefficient of variation, a relative measure always expressed as a percentage. The coefficient of variation measures the scatter in the data relative to the mean.

The coefficient of variation, denoted CV, equals the quotient of the standard deviation and the mean expressed as a percent:

$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%.$$

Reconsider our sample from above, A = {47, 50, 53}, for which S = 3 and $\bar{X} = 50$. The coefficient of variation for sample A is 6% as calculated below.

$$CV = \left(\frac{3}{50}\right) \cdot 100\% = 0.06 \cdot 100\% = 6\%$$

Instruction: Distributions

This lecture discusses types of distributions plus an interesting use of the standard deviation. Lecture 4.2 discussed frequency distribution graphs. This lecture discusses some general types of shapes of frequency distributions.

One general shape of frequency distribution graphs includes symmetrical distributions. With symmetrical distributions, a vertical line can be drawn through the middle in such a way that one side of the distribution is an exact mirror image of the other, as shown below in Figures A and B. Figure B demonstrates a bimodal symmetrical distribution.

[Figure A: symmetrical distribution; vertical axis f]

[Figure B: bimodal symmetrical distribution; vertical axis f]

Here the term bimodal means that the two most frequent data points (or classes) have the same frequency, which is how we will use the term bimodal in this course. Bimodal can also refer to non-symmetrical distributions with two distinct peaks on either side of the center of the distribution.

Another general shape of frequency distribution graphs includes skewed distributions. Skewed distributions tend to form graphs that rise up toward one end of the range of scores. These distributions often taper off gradually at the opposite end.
The tapering end is called the tail. Figure C below demonstrates a positively skewed distribution. The modifier "positively" derives from the fact that the tail points in the positive direction. Figure D below demonstrates a negatively skewed distribution.

[Figure C: positively skewed distribution; vertical axis f]

[Figure D: negatively skewed distribution; vertical axis f]

Finally, a frequency distribution graph can be uniform (or rectangular). Uniform distributions form a rectangle because all the data points (or classes) have equal frequencies. Figure E below demonstrates a uniform distribution.

[Figure E: uniform distribution; vertical axis f]

It is easy to imagine how challenging it would be to construct a frequency distribution graph for a population, because populations tend to be large data sets and recording measurements and frequencies for the entire group would be cumbersome. It is sometimes easier, however, to construct relative frequency graphs for populations. Using statistical procedures applied to samples, researchers can sometimes infer information about the relative frequencies of populations. In such cases, the distributions are outlined with smooth curves. Figure P below displays a symmetrical relative frequency distribution for a population.

[Figure P: symmetrical relative frequency distribution for a population, drawn as a smooth curve]

Instruction: Chebyshev's Theorem

The Russian statistician Pafnuti Chebyshev discovered a useful fact, given below, that applies to all distributions regardless of their shape.

Chebyshev's Theorem states that the fraction of any data set lying within k standard deviations of the mean, where k > 1, is at least:

$$\frac{k^2 - 1}{k^2}.$$

This theorem tells us that at least 75% of the scores in a data set lie within two standard deviations of the mean as calculated below.

$$\frac{2^2 - 1}{2^2} = \frac{3}{4} = 75\%$$

Instruction: Measures of Position

This lecture discusses measures of position. The median discussed in Lecture 4.3 is an example of a measure of position.
Since the median is the middle number (or average of the two middle numbers), the median is a measure of position that reveals which score occupies the center of the distribution.

A measure of position is a statistic that reveals a score's position in the distribution of a data set.

Percentiles are measures of position that reveal what percent of the scores equal or fall below a given score. Percentiles divide the distribution of the data set into one hundred parts.

A percentile is a numerical value assigned to a given score that indicates what percent of the scores in the data set equal or fall below the given score.

According to the definition above, a percentile is a position. The bottom percentile is zero. The top percentile is 99. The score with zero percentile is the lowest score. The score with the 99th percentile is the greatest score.

To calculate the percentile of $X_i$, divide the number of scores less than $X_i$ by n, where n is the total number of scores in the data set, and convert to a percent. To find the position of a score that occupies a given percentile, multiply the decimal equivalent of the percentile by n and move to the next integer above the product.

Consider data set A = {22, 21, 14, 20, 19, 27, 17, 22, 26, 24}. Arranging the data set in ascending order reveals that eight scores fall below 26:

A = {14, 17, 19, 20, 21, 22, 22, 24, | 26, 27}

Accordingly, a score of 26 corresponds to the 80th percentile as calculated below.

$$\frac{8}{10} = 0.8 = 80\%$$

To find the position of the score in the 40th percentile for set A, we multiply 40% by the cardinal number of the set to get 4, then move to the next integer, 5.

$$0.4 \times 10 = 4 \Rightarrow 5$$

Hence, the fifth score in the ordered data set occupies the 40th percentile, so 21 corresponds to the 40th percentile.

Another statistical measure of position is the quartile, defined below. A quartile divides the distribution into quarters.
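The two percentile procedures worked above for set A can be sketched in Python as follows. This is an illustrative sketch rather than part of the original lecture: the function names `percentile_of` and `score_at_percentile` are invented here, and the second function assumes the percentile is given as a number from 0 to 99.

```python
# Data set A from the lecture's percentile example.
A = [22, 21, 14, 20, 19, 27, 17, 22, 26, 24]


def percentile_of(score, data):
    """Percent of the scores in data that are strictly less than score."""
    below = sum(1 for x in data if x < score)
    return 100 * below / len(data)


def score_at_percentile(p, data):
    """Score occupying the p-th percentile (0 <= p <= 99) by the lecture's rule.

    Multiply the percentile (as a decimal) by n, then move to the next
    integer above the product; that integer is a 1-based position in the
    data arranged in ascending order.
    """
    ordered = sorted(data)
    position = (p * len(ordered)) // 100 + 1  # floor of the product, plus one
    return ordered[position - 1]


print(percentile_of(26, A))
print(score_at_percentile(40, A))
```

For set A, the first call gives 80.0 and the second gives 21, in agreement with the worked example. The same rule places the lowest score at percentile zero and the greatest score at the 99th percentile.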
The quartiles, denoted $Q_1$, $Q_2$, and $Q_3$, are the three numbers that occupy the 25th, 50th, and 75th percentiles respectively. The second quartile, $Q_2$, equals the median. The first quartile, $Q_1$, equals the median of the scores that fall below $Q_2$. The third quartile, $Q_3$, equals the median of the scores that fall above $Q_2$.

For a given sample, the three quartiles together with the lowest value (zero percentile) and the greatest value (99th percentile) act as a set of five numbers called the five-number summary of a data set. These five numbers are used to create a box plot (or box-and-whisker plot), defined below.

A box plot is a graphical display that uses a rectangle and two line segments to summarize a data set. The entire display hovers over a number line. The rectangle extends from the first quartile to the third quartile and is divided into two parts by a vertical line segment drawn over the median (second quartile). From the left and right sides of the rectangle, two line segments called whiskers extend to the least and greatest scores respectively.

Consider the data set B = {1, 2, 3, 6, 6, 7, 8, 8, 8, 9, 9, 11, 11, 12, 17}. The display below is a box plot representing data set B.

[Box plot of data set B, with $Q_1$, $Q_2$, and $Q_3$ marked]

The box plot conveys the central tendency, the location of the middle half of the data, the dispersion, and the skewness. The location of the median shows the central tendency. The rectangle reveals the middle half of the data. The reach of the whiskers exposes the range, and the symmetry or non-symmetry of the box and whiskers displays the skewness.

The most important measure of position is the Z-score, defined below.

A Z-score is a numerical value assigned to a raw score that measures the distance between the raw score and the mean in standard deviations. For a given sample A with mean $\bar{X}$ and standard deviation S, the Z-score of some raw score $X_i$ in A is given by

$$Z_i = \frac{X_i - \bar{X}}{S}.$$

Since a Z-score is the ratio of a raw score's deviation from the mean to the standard deviation, Z-scores assigned to raw scores below the mean will be negative while those assigned to raw scores above the mean will be positive.

If we recall Chebyshev's Theorem, we see the significance of a Z-score. Chebyshev's Theorem states that at least $(k^2 - 1)/k^2$ of the data of any distribution falls within k standard deviations of the mean. Since a Z-score equals a number of standard deviations, a raw score's Z-score can be substituted for k. Using Chebyshev's Theorem with k = 4, we know that at least $(4^2 - 1)/4^2 = 15/16 = 93.75\%$ of the data falls between the data point with a Z-score of -4 and the data point with a Z-score of +4. As a consequence, we note that any data point with a Z-score smaller than -4 or greater than +4 is fairly atypical of the data set.

Imagine a doctor examining a child of a certain age whose weight has a Z-score of -4.2. The doctor knows immediately that most children in a comparable population or sample have a greater weight. Accordingly, the doctor has statistical evidence to warrant expensive medical tests to see if there is some underlying medical cause for the child's low weight.

Chebyshev's Theorem applies to any distribution of data. A future lecture discusses a particular type of distribution that imbues Z-scores with even more significance.

Instruction: Measuring the Strength of Linear Relationships

A scatter plot like the one shown below displays the relationship between two numerical variables.

[Scatter plot: blood alcohol concentration (mg%) on the vertical axis, from 0 to 90, versus t (hours) on the horizontal axis, from 0 to 4]

In the scatter plot above, there appears to be a linear relationship between hours elapsed and the alcohol concentration in a patient's bloodstream. The sample covariance measures the strength of such linear relationships.
The sample covariance, denoted cov(X, Y), measures the strength of the linear relationship between variables X and Y and is given by

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} \left[ (X_i - \bar{X})(Y_i - \bar{Y}) \right]}{n - 1}.$$

The covariance is limited in its usefulness because, unlike the coefficient of correlation, it does not provide a relative strength of the relationship it purports to measure.

The coefficient of correlation, denoted r, measures the relative strength of a linear relationship between two numerical variables. The values of the coefficient of correlation range from negative one to positive one. If r = -1, the relationship is said to be a perfect negative correlation and the points in the scatter diagram will all fall in a straight line with a negative slope. If r = 1, the relationship is said to be a perfect positive correlation and the points in the scatter diagram will all fall in a straight line with a positive slope. If r = 0, there is no linear relationship between the two variables. In general, the smaller the difference between one and $|r|$, the stronger the linear relationship between the two variables.

The sample coefficient of correlation, denoted r, measures the relative strength of the linear relationship between variables X and Y and is given by

$$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$$

where

$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} \left[ (X_i - \bar{X})(Y_i - \bar{Y}) \right]}{n - 1}, \quad S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \quad \text{and} \quad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}.$$

Assignment 2

Problems

#1 Find three measures of central tendency and three measures of dispersion for the data displayed in the frequency distribution below. Assume the data belong to a sample.

Value:      2  4  6  8
Frequency:  5  1  8  4

#2 Two friends, Frick and Frack, who take different history classes, took their midterm exams on the same day. Frick's score was 86 while Frack's score was 78. Use a complete sentence to identify which student did relatively better, given the class data shown below.
         Class mean   Class standard deviation
Frick    73           8
Frack    69           5

#3 Consider a distribution where the mean is seventy and the standard deviation is eight. At least what fraction of the values are between 54 and 86?

#4 Construct a box plot for the data set below.

46 69 61 56 59 79 64 61 63 62
67 64 66 52 70 67 69 59 83 71
75 64 66 88 51 67 56 71 59 70
62 58 63 79 65 68 66 55 68 72

#5 An efficiency expert has developed the JSI, a test measuring job satisfaction of civil service clerks. The following information reflects data collected from a random sample of ten civil service clerks. Use a complete sentence and a relevant statistic to discuss the effectiveness of the JSI.

Job Satisfaction Index (JSI):      92  32  56  20  72  16  56  76  80  48
Absences for year JSI was taken:    8  14  10  14   6  17   8  12   7  15
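The dispersion, Z-score, covariance, and correlation formulas from this lecture can be sketched in Python as follows. This is an illustrative sketch rather than part of the original notes: the function names are invented here, and pairing the lecture's sample A with sample B as (X, Y) values is an assumption made only to give the covariance and correlation formulas something to work on.

```python
import math

# Samples A and B from the dispersion portion of the lecture.
# Treating them as paired (X, Y) observations is an assumption for illustration.
A = [47, 50, 53]
B = [0, 50, 100]


def arithmetic_mean(xs):
    """Arithmetic mean: sum of the scores divided by n."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """S^2: sum of squared deviations from the mean, divided by n - 1."""
    m = arithmetic_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def sample_std(xs):
    """S: square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


def coefficient_of_variation(xs):
    """CV: standard deviation divided by the mean, expressed as a percent."""
    return 100 * sample_std(xs) / arithmetic_mean(xs)


def z_score(x, xs):
    """Z: deviation of x from the sample mean, in standard deviations."""
    return (x - arithmetic_mean(xs)) / sample_std(xs)


def sample_covariance(xs, ys):
    """cov(X, Y): sum of products of paired deviations, divided by n - 1."""
    mx, my = arithmetic_mean(xs), arithmetic_mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """r: covariance divided by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


print(sample_variance(A), sample_std(A), coefficient_of_variation(A))
print(z_score(53, A))
print(sample_covariance(A, B), correlation(A, B))
```

With A = {47, 50, 53}, the sketch reproduces the lecture's values of 9 for the variance, 3 for the standard deviation, and 6% for the coefficient of variation. Because each score in B is an exact linear function of the matching score in A with a positive slope, the correlation comes out as 1, a perfect positive correlation.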