Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CHAPTET 3 Data Description OUTLINE: Introduction. 3-1 Measures of Central Tendency. 3-2 Measures of Variation. 3-3 Measures of Position. 3-4 Exploratory Data Analysis 3-1 MEASURES OF CENTRAL TENDENCY: Mean, Median, Mode, Midrange, Weighted Mean. Statistic: Is a characteristic or measure obtained by using all the data values from a sample. Parameter: Is a characteristic or measure obtained by using all the data values from a specific population. THE MEAN: The mean is the sum of the values, divided by the total number of values. x The symbol for the sample mean: X X n X 1 X 2 X 3 ............. X n n Where n: no. of val. In sample. The symbol for the population mean: X N X 1 X 2 X 3 ............. X n N Where N: no. of val. In population. EXAMPLE 3-1: The data represent the number of days off per year for a sample of individuals selected from nine different countries. Find the mean 20, 26, 40, 36, 23, 42, 35, 24, 30 PROPERTIES OF THE MEAN Uses all data values. Varies less than the median or mode Used in computing other statistics, such as the variance Unique, usually not one of the data values Cannot be used with open-ended classes Affected by extremely high or low values, called outliers THE MEDIAN: The median (MD) is the midpoint of the data array. *** The median is found by arranging the data in order and selecting the middle number *** Odd number of values The median will be one of the data values in the middle. Even number of values The median will be the average of two data values. EXAMPLE 3-4: The number of rooms in the seven hotels in downtown Pittsburgh is: 713, 300, 618, 595, 311, 401, and 292.. Find the median EXAMPLE 3-6: The number of tornadoes that have occurred in the United States over an 8-year period follows. 684, 764, 656, 702, 856, 1133, 1132, 1303 Find the median PROPERTIES OF THE MEDIAN Gives the midpoint Used when it is necessary to find out whether the data values fall into the upper half or lower half of the distribution. Can be used for an open-ended distribution. Affected less than the mean by extremely high or extremely low values. •The cost of four toys in a certain toy shop is given: 15$, 40$, 30$, 1350$ measure of central tendency should be used: a) b) c) d) mean median mode range THE MODE: The value that occurs most often in a data set is called the mode. No mode When no data value occurs more than once. Unimodal A data set that has only one mode. Multimodal Bimodal A data set that has two mode. A data set that has more than two mode. EXAMPLE 3-9: Find the mode of the signing bonuses of eight NFL players for a specific year. The bonuses in millions of dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10 EXAMPLE 3-10: Find the mode for the number of coal employees per county for 10 selected counties in southwestern Pennsylvania. 110, 731, 1031, 84, 20, 118, 1162, 1977, 103, 752 PROPERTIES OF THE MODE Used when the most typical case is desired Easiest average to compute Can be used with nominal data (Qualitative data) Not always unique or may not exist * The ………. Is a measure of central tendency should be used when the data are qualitative. a) b) c) d) median mean mode range ----------------------------------------------------------------------* When the data are categorical (red, blue, black, white) the measure of central tendency can be used: a) b) c) d) median mean mode range •If the number of books in a sample of 5-box are as follow : 11, 18, 2, 2, 7, 7, 2, 5 the set is said to have ………. a) b) c) d) Multimodal. Unimodal. Zero. Bimodal. THE MIDRANGE: The midrange is defined as the sum of the lowest and highest values in the data set, divided by 2. The symbol of midrange is MR. lowest .value highest .value L.v H .v MR 2 2 EXAMPLE 3-15: In the last two winter seasons, the city of Brownsville, Minnesota, reported these numbers of water-line breaks per month.. 2, 3, 6, 8, 4, 1 Find the midrange PROPERTIES OF THE MIDRANGE Easy to compute. Gives the midpoint. Affected by extremely high or low values in a data set. THE WEIGHTED MEAN: Find the Weighted Mean is used when the values in a data set are not equally represented. w1 X 1 w2 X 2 ....... wn X n wX Xw w1 w2 ....... wn w w1 ,w2 , …., wn are the weights And x1 , x2 , ….., xn are the values. Where EXAMPLE 3-17: A student received the following grades. Find the corresponding GPA. Course Credits, Grade, English Composition 3 A (4 points) Introduction to Psychology 3 C (2 points) Biology 4 B (3 points) Physical Education 2 D (1 point) SUMMARY OF MEASURES OF CENTRAL TENDENCY. Measure Mean Definition Sum of values, divided by total no. of values Median Middle point in data array Mode Most frequent data value Midrange L.V plus H.V ,divided by 2 Symbol , x MD MR y Positively skewed Mode Median Mean Mean > MD> Mode y x Negatively skewed Mean Mode Median Mean < MD< Mode x Symmetric distribution Mode Median Mean = MD = D 3-2 MEASURES OF VARIATION: How Can We Measure Variability? oRange oVariance oStandard Deviation oCoefficient of Variation RANGE The range is the difference between the highest and lowest values in a data set. R Highest Lowest EXAMPLE 3-18/19: Two experimental brands of outdoor paint are tested to see how long each will last before fading. Six cans of each brand constitute a small population. The results (in months) are shown. Find the mean and range of each group. Brand A Brand B 10 35 60 45 50 30 30 35 40 40 20 25 VARIANCE & STANDARD DEVIATION The variance is the average of the squares of the distance each value is from the mean. The standard deviation is the square root of the variance. The standard deviation is a measure of how spread out your data are. The population variance is 2 The population standard deviation is X X n 1 2 s X X n 1 2 N n X X 2 s 2 The sample standard deviation is 2 N X The sample variance is s2 X 2 n n 1 2 USES OF THE VARIANCE AND STANDARD DEVIATION To determine the spread of the data. To determine the consistency of a variable. To determine the number of data values that fall within a specified interval in a distribution (Chebyshev’s Theorem). Used in inferential statistics. EXAMPLE 3-21: Find the variance and standard deviation for the data set for Brand A paint. 10, 60, 50, 30, 40, 20 Months, X-µ 10 60 50 30 40 20 2 X n 1750 6 291.7 1750 2 MEASURES OF VARIATION: VARIANCE & STANDARD DEVIATION (SAMPLE COMPUTATIONAL MODEL) Is mathematically equivalent to the theoretical formula. Saves time when calculating by hand Does not use the mean Is more accurate when the mean has been rounded. EXAMPLE 3-23: Find the variance and standard deviation for the amount of European auto sales for a sample of 6 years. The data are in millions of dollars. 11.2, 11.9, 12.0, 12.8, 13.4, 14.3 X 11.2 11.9 12.0 12.8 13.4 14.3 75.6 X 2 958.94 n X X 2 s 2 n n 1 2 COEFFICIENT OF VARIATION The coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. s CVAR 100% X Use CVAR to compare standard deviations when the units are different. EXAMPLE 3-25: The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two. •If the Cvar for the English final exam was 6.9% and Cvar for History final exam was 4.9%. Compare the variations: a) The English class was more variable. b) The History class was more variable. c) Cannote determine that. d) Both of classes has the same variation. For any bell shaped distribution. Approximately 68% of the data values will fall within one standard deviation of the mean . Approximately 95% of the data values will fall within two standard deviation of the mean . Approximately 99.7% of the data values will fall within three standard deviation of the mean . = 480 , S = 90 , approximately 68% Then the data fall between 570 and 390 = 480 , S = 90 , approximately 95% Then the data fall between 660 and 300 = 480 , S = 90 , approximately 99.7% Then the data fall between 750 and 210 When a distribution is bell-shaped, approximately what percentage of data values will fall within 1standard deviation of the mean? a) 90% b) 68% c) 86% d) 99% --------------------------------------------------------------------------------------- Math exam scores have a bell-shaped distribution with a mean of 100 and standard deviation of 15 . use the empirical rule to find the percentage of students with scores between 70 and 130? MEASURE OF POSITION • • • Standard score or z score Quartile Outlier z score or standard score (relative position) For Population For Sample If z score is negative the score is below the mean If z score is positive the score is above the mean Test A Test B X=38 X=94 = 40 = 100 S=5 S=10 Quartiles Smallest data value 25% Q1 25% Q2 25% Q3 25% largest data value 25% 50% 75% Quartiles divide the data set into 4 equal groups . Deciles are denoted Q1 , Q2 and Q3 with the corresponding percentiles being P25 , P50 and P75 . The median is the same as P50 or Q2 . PROCEDURE TABLE Finding Data Values Corresponding to Q1,Q2and Q3 . Step 1: Arrange the data in order from lowest to highest . Step 2: Find the median of the data values .This is the value for Q2 . Step 3: Find the median of the data values that fall below Q2.This is the value for Q1 . Step 4: Find the median of the data values that fall above Q2.This is the value for Q3. Outliers An outlier is an extremely high or an extremely low data value when compare with the rest of the data values . Step 1: Arrange the data in order and find Q1 and Q3. Step 2: Find the interquartile range IQR= Q3 - Q1 Step 3: Multiply the IQR by 1.5 . Step 4: Subtract the value obtained in step 3 form Q1 and add the value to Q3. Step 5: any value: smaller than Q1-1.5(IQR) or larger than Q3+1.5(IQR). outliers •Use the data below to find the outliers: 8, 4, 0, 2, 10, 15, 12, 22 a) 15 b) 22 c) No outlier d) 0 ---------------------------------------------------------------------------------------------------------•If a) b) c) d) Q1=4, IQR=40, Find Q3: 10 44 40 -4 * All the values in a dataset are between 60 and 88, except for one value of 97. That value 97 is likely to be -----•An outlier •The mean •The box plot •The range EXPLORATORY DATA ANALYSIS (EDA) A Box plot can be used to graphically represent the data set . For the box plot, The five –Number Summary : 1-lowest value of the data set . (min) 2-Q1. 3-the median(MD) Q2. 4-Q3. 5-the highest value of the data set . (max) PROCEDURE FOR CONSTRUCTING A BOXPLOT 1. 2. 3. 4. Find five -Number summary . Draw a horizontal axis with a scale such that it includes the maximum and minimum data value . Draw a box whose vertical sides go through Q1 and Q3,and draw a vertical line though the median . Draw a line from the minimum data value to the left side of the box and line from the maximum data value to the Q2 Q3 right side of the box. Q1 maximum minimum Real cheese 310 420 220 240 45 180 40 90 Cheese substitute 270 180 250 130 260 340 290 310 Compare the distributions using Box Plots? Step1: Find Q1,MD,Q3 for the Real cheese data 40 , 45 , 90 , 180 , 220 , 240 , 310 , 420 Q1 MD Q3 45 90 180 220 240 310 , , Q1 67.5 Q2 200 Q3 275 2 2 2 Step2:Find Q1 , MD and Q3 for the cheese substitute data. 130 , 180 , 250 , 260 , 270 , 290 , 310 , 340 Q1 MD Q3 260 270 180 250 Q1 215 , Q2 =265 2 2 290 310 300 , Q3 2 67.5 200 275 40 420 215 265 300 130 0 100 340 200 300 400 500 Right skewed (+ve skewed) Left skewed (-ve skewed) INFORMATION OBTAINED FROM A BOX PLOT From Box plot: The median(MD) is near the center. The distribution is symmetric. The median falls to left of the center The distribution is positively skewed (right skewed). The median falls to right of the center . The distribution is negatively skewed (left skewed). From Line: The lines are the same length. The distribution is symmetric. The right line is larger than the left line . The left line is larger than the right line . The distribution is positively skewed (Right skewed). . The distribution is negatively skewed (Left skewed). •The 2 3 4 5 6 0 4 2 2 1 following stem & leaf represent the scores for students on an statistics exam: 1 3 3 5 7 7 7 7 8 3 3 5 5 4 6 2 * Find the mode: a) 37 b) 23 c) 40 d) 62 * The range: a) 40 b) 42 c) 30 d)33 * The mean: a) 41.19 b) 43 c) 42.5 d) 20 * The skewed in this graph: a) Symmetric skewed b) positive skewed c) negative skewed •The 2 3 4 5 following stem & leaf represent the scores for students on an statistics exam: 0 1 4 0 1 2 2 2 * Find the mode: a) 42 b) 23 c) 40 d) 62 * The range: a) 40 b) 42 c) 30 d)32 * The mean: a) 41.19 b) 43 c) 42.5 d) 37.75 * The skewed in this graph: a) Symmetric skewed b) positive skewed c) negative skewed 0 2 4 * The IQR is …….. a) 4 b) 5 c) 10 d)0 * The distribution shape is …….. a) Positively skewed b) Negative skewed 6 8 10 * A characteristic or measure obtained by using the data values from a population is collide a ………. a) Statistic b) Quartile c) Percent d) Parameter ---------------------------------------------------------------------------------------------------------------------- * ……….. is the symbols of mean in the sample -------------------------------------------------------------------------------------------------------------------- * Find the mean of following data 10, 15, 12, 9, 2, 6 ? a) 12 b) 9 c) 54 d) 10.9 ----------------------------------------------------------------* What is the median of following data 5, 7, 10, 3, 8? * If the number of books in a sample of five boxes are follow 11 , 8, 2, 2, 7, 7, 2, 5 . Then the set is said to have a) Multimodal b) Unimodal c) Zero d) No mode ------------------------------------------------------------------------------------------------- * Find the midrange (MR) for the following data 7, -5, 2, 10, 15 ---------------------------------------------------------* If the mean of 5 values equals 64, then X ? --------------------------------------------------------* The measures of central tendency for the following data 1,3,9,11,2 are: Mean= …………… median =……………… mode= ……………..