Data Description
Chapter 3

Introduction
Different statistical methods can be used to summarize data:
Measures of central tendency: mean, median, mode, midrange
Measures of variation: range, variance, standard deviation
Measures of position: percentiles, quartiles, deciles

3.1 – Measures of Central Tendency
Statistic: characteristic or measure obtained by using the data values from a sample
Parameter: characteristic or measure obtained by using all the data values from a specific population
General rounding rule: do not round until the final answer is obtained. Round off to one more decimal place than the raw data.

Mean
Mean (arithmetic average): sum of all values divided by the total number of values
The symbol $\bar{X}$ represents the sample mean, so
\[ \bar{X} = \frac{\sum X}{n} \]
where n is the total number of values in the sample
The symbol $\mu$ (mu) represents the population mean, so
\[ \mu = \frac{\sum X}{N} \]
where N is the total number of values in the population

Examples
Example 3–1
The data represent the number of days off per year for a sample of individuals selected from nine different countries. Find the mean.
20 26 40 36 23 42 35 24 30

Example 3–3
Using the frequency distribution for Example 2–7, find the mean. The data represent the number of miles run during one week for a sample of 20 runners.

Finding the Mean for Grouped Data
Procedure
1. Make a table with: Column A = class, Column B = frequency $f$, Column C = midpoint $X_m$, Column D = $f \cdot X_m$
2. Find the midpoint of each class and place it in column C
3. Multiply the frequency by the midpoint for each class, and place the product in column D
4. Find the sum of column D
5. Divide the sum of column D by the sum of the frequencies in column B
The formula is:
\[ \bar{X} = \frac{\sum f \cdot X_m}{n} \]

Median
Median: midpoint of a data array (data arranged in order)
The symbol for the median is MD
It will be a specific data value or fall between two values
Examples: 3–4, 3–5, 3–6

Mode
Mode: value that occurs most often in a data set
A data set with only one value occurring with the greatest frequency is called unimodal
A data set with two values sharing the greatest frequency is called bimodal
A data set with more than two such values is called multimodal
When all values occur with the same frequency, the data set is said to have no mode
Examples: 3–9, 3–10, 3–11
Modal class: class with the largest frequency

Midrange
Midrange: sum of the lowest and highest values of the data set, divided by 2
The symbol is MR

Properties and Uses of Central Tendency
The Mean
1. Found by using all values of the data
2. Varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples
3. Used in computing other statistics, such as the variance
4. Unique and not necessarily one of the data values
5. Cannot be computed for data in a frequency distribution that has an open-ended class
6. Affected by extremely high or low values, called outliers, and may not be the appropriate average to use in these situations
The Median
1. Used to find the center or middle value of a data set
2. Used when it is necessary to find out whether data values fall into the upper half or lower half of the distribution
3. Used for an open-ended distribution
4. Affected less than the mean by extremely high or low values
The Mode
1. Used when the most typical case is desired
2. Easiest average to compute
3. Used when data are nominal
4. Not always unique
The Midrange
1. Easy to compute
2. Gives the midpoint
3. Affected by extremely high or low data values

Distribution Shapes
Positively skewed (right-skewed) distribution
Majority of data values fall to the left of the mean and cluster at the lower end of the distribution
The "tail" of the distribution is to the right
Mean is right of the median, mode is left of the median
Symmetric distribution
Data values are evenly distributed on both sides of the mean
Mean, median, and mode are the same and at the center of the distribution
Negatively skewed (left-skewed) distribution
Majority of data values fall to the right of the mean and cluster at the upper end of the distribution
Mean is left of the median, mode is right of the median

3.2 – Measures of Variation
To describe data sets accurately, statisticians must know more than the measures of central tendency

Range
Range: highest value minus lowest value
The symbol R is used for the range
R = highest value – lowest value
Example 3–19

Population Variance and Standard Deviation
Rounding rule: round to one more decimal place than that of the original data
Variance: average of the squares of the distance each value is from the mean
The symbol for the population variance is $\sigma^2$
The formula is:
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \]
Standard deviation: square root of the variance
The symbol is $\sigma$
The formula is:
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]

Example 3–22
Find the variance and standard deviation for the Brand B paint data, in months: 35, 45, 30, 35, 40, 25

Sample Variance and Standard Deviation
The symbol for the sample standard deviation is s
Sample variance:
\[ s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n-1)} \]
Sample standard deviation:
\[ s = \sqrt{s^2} = \sqrt{\frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n-1)}} \]

Example 3–23
Find the sample variance and standard deviation for the amount of European auto sales for a sample of 6 years shown. The data are in millions of dollars.
11.2 11.9 12.0 12.8 13.4 14.3

Uses of the Variance and Standard Deviation
1. Variances and standard deviations can be used to determine the spread of data. If the variance or standard deviation is large, the data are more dispersed. This information is useful in comparing two (or more) data sets to determine which is more (most) variable.
2. Measures of variance and standard deviation are used to determine the consistency of a variable.
3. Variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.
4. Variance and standard deviation are used quite often in inferential statistics.

Coefficient of Variation
Sometimes we need to compare standard deviations of data that are not in the same units
Coefficient of variation: denoted by CVar, the standard deviation divided by the mean
The result is expressed as a percentage
Samples:
\[ \text{CVar} = \frac{s}{\bar{X}} \cdot 100\% \]
Populations:
\[ \text{CVar} = \frac{\sigma}{\mu} \cdot 100\% \]

Examples
Example 3–25
The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.
Example 3–26
The mean for the number of pages of a sample of women's fitness magazines is 132, with a variance of 23; the mean for the number of advertisements of a sample of women's fitness magazines is 182, with a variance of 62. Compare the variations.

Chebyshev's Theorem
Developed by the Russian mathematician Chebyshev (1821–1894)
Specifies the proportions of the spread in terms of the standard deviation
Chebyshev's theorem: the proportion of values from a data set that fall within k standard deviations of the mean is at least $1 - 1/k^2$, where k is a number greater than 1
k is not necessarily an integer

Example 3–27
The mean price of houses in a certain neighborhood is $50,000, and the standard deviation is $10,000. Find the price range for which at least 75% of the houses will sell.

Empirical (Normal) Rule
Chebyshev's theorem applies to any distribution regardless of shape
When a distribution is bell-shaped (normal), the following statements are true
Empirical rule:
Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.

3.3 – Measures of Position
Measures exist for position or location within a data set
They include standard scores, percentiles, deciles, and quartiles

Standard score (z score)
Standard score: value obtained by subtracting the mean from the value and dividing the result by the standard deviation
The symbol for a standard score is z
Sample formula:
\[ z = \frac{X - \bar{X}}{s} \]
Population formula:
\[ z = \frac{X - \mu}{\sigma} \]
The z score represents the number of standard deviations that a data value falls above or below the mean
z > 0 (above mean), z < 0 (below mean), z = 0 (equal to mean)

Example 3–29
A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative positions on the two tests.

Percentiles
Percentiles: divide the data set into 100 equal groups
Percentile formula: the percentile corresponding to a given value X is computed by using the following formula:
\[ \text{Percentile} = \frac{(\text{number of data values below } X) + 0.5}{\text{total number of values}} \cdot 100\% \]

Example 3–32
A teacher gives a 20-point test to 10 students. The scores are shown here. Find the percentile rank of a score of 12.
18 15 12 6 8 2 3 5 20 10

Quartiles and Deciles
Quartiles: divide the distribution into four groups, separated by Q1, Q2, Q3
Q1 is the same as the 25th percentile, Q2 is the same as the 50th percentile or median, and Q3 is the same as the 75th percentile
Procedure for finding the data values corresponding to Q1, Q2, Q3
1. Arrange the data in order from lowest to highest
2. Find the median of the data values. This is Q2
3. Find the median of the data values that fall below Q2. This is Q1
4. Find the median of the data values that fall above Q2.
This is Q3

Example 3–36
Find Q1, Q2, Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18

IQR
Interquartile range (IQR): difference between Q3 and Q1
Range of the middle 50% of the data
Used to identify outliers and as a measure of variability in exploratory data analysis

Deciles
Deciles: divide the distribution into ten groups, denoted by D1, D2, … , D10
Found by using the same formula as for percentiles

Outliers
Outlier: extremely high or extremely low data value when compared with the rest of the data values
Can strongly affect the mean and standard deviation of a variable
Procedure for Identifying Outliers
1. Arrange the data in order and find Q1 and Q3
2. Find the interquartile range: IQR = Q3 – Q1
3. Multiply the IQR by 1.5
4. Subtract the value obtained in step 3 from Q1 and add the value to Q3
5. Check the data set for any data value that is smaller than Q1 – 1.5(IQR) or larger than Q3 + 1.5(IQR)

Example 3–37
Check the data set for outliers: 5, 6, 12, 13, 15, 18, 22, 50

3.4 – Exploratory Data Analysis
The Five-Number Summary and Boxplots
Boxplot (box and whisker plot): can be used to represent a data set graphically
The plot is built from five specific values called the five-number summary
1. Lowest value of the data set (minimum)
2. Q1
3. Median
4. Q3
5. Highest value of the data set (maximum)

Procedure for Constructing a Boxplot
1. Find the five-number summary for the data values
2. Draw a horizontal axis with a scale that includes the maximum and minimum data values
3. Draw a box whose vertical sides go through Q1 and Q3, and draw a vertical line through the median
4. Draw a line from the minimum data value to the left side of the box and a line from the maximum data value to the right side of the box

Example 3–38
The number of meteorites found in 10 states of the United States is 89, 47, 164, 296, 30, 215, 138, 78, 48, 39. Construct a boxplot for the data.

Information Obtained from a Boxplot
1. If the median is near the center of the box, the distribution is approximately symmetric
2. If the median falls to the left of the center of the box, the distribution is positively skewed
3. If the median falls to the right of the center of the box, the distribution is negatively skewed
4. If the lines are about the same length, the distribution is approximately symmetric
5. If the right line is longer than the left line, the distribution is positively skewed
6. If the left line is longer than the right line, the distribution is negatively skewed

Resistant statistic
The median and the IQR are less affected by outliers and are called resistant statistics
The mean and standard deviation are affected more by outliers and are called nonresistant statistics
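
The formulas and procedures in this chapter can be checked against the example data with a short script. What follows is a minimal sketch, not a definitive implementation: the helper names (`modes`, `population_variance`, `sample_variance`, `percentile_rank`, `quartiles`, `outliers`) are illustrative and do not come from the slides. It assumes each data set has at least two values, and that the lower and upper halves used for Q1 and Q3 are split by position in the sorted data, leaving out the middle value when the count is odd. Only Python's standard `math` and `statistics` modules are used.

```python
import math
import statistics as st


def modes(data):
    """Most frequent value(s); [] when every value occurs equally often."""
    counts = {x: data.count(x) for x in set(data)}
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # the "no mode" case described in the chapter
    return sorted(x for x, c in counts.items() if c == top)


def population_variance(data):
    """Average squared distance of each value from the mean."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)


def sample_variance(data):
    """Shortcut formula: [n * sum(X^2) - (sum X)^2] / [n * (n - 1)]."""
    n = len(data)
    return (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))


def percentile_rank(data, x):
    """((number of values below x) + 0.5) / (total number of values) * 100."""
    below = sum(1 for v in data if v < x)
    return (below + 0.5) * 100 / len(data)


def quartiles(data):
    """Q1, Q2, Q3 as medians of the lower half, the whole set, the upper half."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return st.median(lower), st.median(s), st.median(upper)


def outliers(data):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, _, q3 = quartiles(data)
    fence = 1.5 * (q3 - q1)
    return [x for x in data if x < q1 - fence or x > q3 + fence]


# Example 3-1: mean, median, mode, midrange of days off per year
days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]
mean_days = sum(days_off) / len(days_off)
median_days = st.median(days_off)
midrange_days = (min(days_off) + max(days_off)) / 2

# Example 3-22: population standard deviation of the Brand B paint data
paint = [35, 45, 30, 35, 40, 25]
sigma = math.sqrt(population_variance(paint))

# Example 3-23: sample standard deviation of European auto sales
sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
s = math.sqrt(sample_variance(sales))

# Example 3-25: coefficients of variation, as percentages
cvar_sales = 5 / 87 * 100
cvar_commissions = 773 / 5225 * 100

# Example 3-27: Chebyshev's theorem with k = 2 (at least 75% of values)
k = 2
proportion = 1 - 1 / k ** 2
price_range = (50_000 - k * 10_000, 50_000 + k * 10_000)

# Example 3-29: z scores on the two tests
z_calculus = (65 - 50) / 10
z_history = (30 - 25) / 5

# Example 3-32: percentile rank of a score of 12
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
rank_12 = percentile_rank(scores, 12)

# Examples 3-36 and 3-37: quartiles and outliers
data = [15, 13, 6, 5, 12, 50, 22, 18]
q1, q2, q3 = quartiles(data)
flagged = outliers(data)

# Example 3-38: five-number summary for the meteorite data
meteorites = [89, 47, 164, 296, 30, 215, 138, 78, 48, 39]
mq1, mq2, mq3 = quartiles(meteorites)
five_number = (min(meteorites), mq1, mq2, mq3, max(meteorites))

print(round(mean_days, 1), median_days, modes(days_off), midrange_days)
print(round(sigma, 1), round(s, 2), round(cvar_sales, 1), round(cvar_commissions, 1))
print(proportion, price_range, z_calculus, z_history, rank_12)
print((q1, q2, q3), flagged, five_number)
```

Run as written, the quartile example gives Q1 = 9, Q2 = 14, Q3 = 20 and flags 50 as the only outlier, and the score of 12 lands at the 65th percentile. Library routines such as `statistics.quantiles` use interpolation rules that can differ from the median-of-halves procedure in the slides, which is why the sketch builds the quartiles by hand.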