Chapter 2: Describing Distributions with Numbers

Descriptive Statistics
Describe the important characteristics of a set of measurements.

Analyzing Data
Find the center of a data set, describe its variation, and describe its shape.

Two Basic Measures of Center

Mean ($\bar{x}$) (arithmetic mean, or average): Found by adding the data values and dividing the total by the number of data values.

Sample mean: $\bar{x} = \frac{\sum x}{n}$
Population mean: $\mu = \frac{\sum x}{N}$

Median ($M$): The middle value when the original data values are arranged in increasing (or decreasing) order; the center of the ordered data.

Round-off Rule: Carry one more decimal place than is present in the original set of values.

Ex. 1: 17, 19, 21, 18, 20, 18, 19, 20, 20
Mean = 172/9 ≈ 19.1; median = 19 (the 5th of the 9 ordered values).

Ex. 2: 17, 19, 21, 18, 20, 18, 19, 20, 20, 21
Mean = 193/10 = 19.3; median = (19 + 20)/2 = 19.5 (midway between the 5th and 6th of the 10 ordered values).

Comparing the mean and median
The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is usually farther out in the long tail than the median.

Percentiles: Position measures used in educational and health-related fields to indicate the position of an individual within a group. The $P$th percentile splits the ordered data so that about $P\%$ of the values lie below it and about $(100 - P)\%$ lie above it.

Median $= P_{50} = Q_2$: The 50th percentile, denoted $P_{50}$, has about 50% of the data values below it and about 50% of the data values above it.

Measuring variability: the quartiles
First quartile ($Q_1$): also called the lower quartile or the 25th percentile, $P_{25}$.
Second quartile ($Q_2$): also called the median or the 50th percentile, $P_{50}$.
Third quartile ($Q_3$): also called the upper quartile or the 75th percentile, $P_{75}$.

Measuring variability: the five-number summary and boxplots
Boxplot: a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile $Q_1$, the median, and the third quartile $Q_3$.

5-Number Summary and Boxplot
1. Minimum data value
2. First quartile ($Q_1 = P_{25}$): At least 25% of the sorted values are less than or equal to $Q_1$, and at least 75% of the values are greater than or equal to $Q_1$.
3. Second quartile ($Q_2 = P_{50}$): Same as the median; separates the bottom 50% of the sorted values from the top 50%.
4. Third quartile ($Q_3 = P_{75}$): At least 75% of the sorted values are less than or equal to $Q_3$, and at least 25% of the values are greater than or equal to $Q_3$.
5. Maximum data value

Locating the quartiles: if a quartile sits at position $r$ in the sorted data, then $r$ satisfies

$Q_1$: $\frac{r}{\text{total number of values}} = 0.25 \Rightarrow r = 0.25 \times (\text{total number of values})$
$Q_2$: $\frac{r}{\text{total number of values}} = 0.50 \Rightarrow r = 0.50 \times (\text{total number of values})$
$Q_3$: $\frac{r}{\text{total number of values}} = 0.75 \Rightarrow r = 0.75 \times (\text{total number of values})$

If $r$ is a whole number: the percentile is the value midway between the $r$th and the $(r + 1)$th values.
If $r$ is not a whole number: round $r$ up to the next larger whole number and use the value at that position.

Ex. Construct a boxplot of the data set: 34, 36, 39, 43, 51, 53, 62, 63, 73, 79

Minimum data value ⇒ 34
$Q_1 = P_{25}$ ⇒ $\frac{r}{10} = 0.25$ ⇒ $r = 2.5$ ⇒ the 3rd value ⇒ 39
$Q_2 = P_{50}$ ⇒ $\frac{r}{10} = 0.50$ ⇒ $r = 5$ ⇒ the value midway between the 5th and 6th values ⇒ $\frac{51 + 53}{2} = 52$
$Q_3 = P_{75}$ ⇒ $\frac{r}{10} = 0.75$ ⇒ $r = 7.5$ ⇒ the 8th value ⇒ 63
Maximum data value ⇒ 79

Spotting suspected outliers (median as the center)
Use the median and the interquartile range (IQR) to analyze the data.

Interquartile Range: $\text{IQR} = Q_3 - Q_1$

Outliers with IQR
Lower fence: $Q_1 - 1.5 \cdot \text{IQR}$
Upper fence: $Q_3 + 1.5 \cdot \text{IQR}$
A value below the lower fence or above the upper fence is a suspected outlier.

Measuring spread: variance and the standard deviation
These tools describe the variation in the data.

Range = (maximum data value) − (minimum data value)
Variance ($s^2$): The average of the squared distances of the values from the mean.
Standard Deviation ($s$): A measure of how much the data values deviate from the mean; the square root of the variance.
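Returning to the five-number summary and the IQR fences: the position rule for quartiles can be sketched in Python as below. This is a minimal illustration, not part of the original notes. The function names `percentile`, `five_number_summary`, and `fences` are made up for this sketch, and `percentile` assumes an integer P with 0 < P < 100. Statistical software often uses other interpolation rules, so library results may differ slightly from this hand method.

```python
def percentile(values, p):
    """p-th percentile (integer, 0 < p < 100) by the position rule in these notes."""
    data = sorted(values)
    n = len(data)
    # r = (p / 100) * n, computed with integers so "is r whole?" is exact
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # r is a whole number: midway between the r-th and (r + 1)-th values
        return (data[whole - 1] + data[whole]) / 2
    # r is not whole: round up to position whole + 1 (index `whole`, 0-based)
    return data[whole]


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


def fences(values):
    """Lower and upper fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [34, 36, 39, 43, 51, 53, 62, 63, 73, 79]
summary = five_number_summary(data)
lower, upper = fences(data)
suspected = [x for x in data if x < lower or x > upper]

print(summary)        # (34, 39, 52.0, 63, 79)
print(lower, upper)   # 3.0 99.0
print(suspected)      # []
```

For the boxplot example, IQR = 63 − 39 = 24, so the fences are 3 and 99 and no value is flagged as a suspected outlier.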
Formulas for the mean, variance, and standard deviation

              Mean                           Variance                                      Standard deviation
--------------------------------------------------------------------------------------------------------------------------
Sample        $\bar{x} = \frac{\sum x}{n}$   $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$    $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population    $\mu = \frac{\sum x}{N}$       $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$       $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$

Ex. 5, 7, 1, 2, 4

Range = (maximum data value) − (minimum data value) = 7 − 1 = 6

Mean: $\bar{x} = \frac{\sum x}{n} = \frac{5 + 7 + 1 + 2 + 4}{5} = \frac{19}{5} = 3.8$

Steps
Step 1: Compute the mean $\bar{x}$.
Step 2: Subtract the mean from each individual value: $(x - \bar{x})$.
Step 3: Square each of the deviations obtained in Step 2: $(x - \bar{x})^2$.
Step 4: Add all of the squares obtained in Step 3: $\sum (x - \bar{x})^2$.
Step 5: Divide the total from Step 4 by $n - 1$, which is 1 less than the number of sample values. The result is the variance.
Step 6: Find the square root of the result of Step 5. The result is the standard deviation.

  x     (x − x̄)    (x − x̄)²     x²
-------------------------------------
  5       1.2        1.44       25
  7       3.2       10.24       49
  1      −2.8        7.84        1
  2      −1.8        3.24        4
  4       0.2        0.04       16
-------------------------------------
 19       0.0       22.80       95

Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{22.80}{5 - 1} = 5.7$

Standard Deviation: $s = \sqrt{5.7} = 2.387 \approx 2.4$

Shortcut formula
Sample Variance: $s^2 = \frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$
Sample Standard Deviation: $s = \sqrt{\frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}}$

Round-Off Rule for Measures of Variation
When rounding the value of a measure of variation, carry one more decimal place than is present in the original set of data. Round only the final answer, not values in the middle of a calculation.

Properties of the Standard Deviation (1)
1. $n - 1$ is called the degrees of freedom.
2. $s$ measures variability about the mean and should be used only when the mean is chosen as the measure of center.
3. $s$ is always greater than or equal to zero. $s = 0$ only when there is no variability, which happens only when all observations have the same value. Otherwise, $s > 0$.
4. As the observations become more variable about their mean, $s$ gets larger.
5. $s$ has the same units of measurement as the original observations. For example, if you measure weight in kilograms, both the mean $\bar{x}$ and the standard deviation $s$ are also in kilograms. This is one reason to prefer $s$ to the variance $s^2$, which would be in squared kilograms.
6. Like the mean $\bar{x}$, $s$ is not resistant. A few outliers can make $s$ very large.

Properties of the Standard Deviation (2)
The standard deviation measures the variation of all data values from the mean.

Data values close together ⇒ a small standard deviation
Data values spread farther apart ⇒ a larger standard deviation

Ex. 4.2, 3.5, 3.2, 4.0, 4.1    S.D.: 0.430116
Ex. 5, 7, 1, 2, 4              S.D.: 2.387467

Spotting suspected outliers (mean as the center)
Use the mean and the standard deviation to analyze the data.

Range Rule of Thumb: For many data sets, the vast majority (such as 95%) of sample values lie within two standard deviations of the mean.

Minimum "usual" value = mean − 2 × standard deviation
Maximum "usual" value = mean + 2 × standard deviation
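The six-step variance calculation, the shortcut formula, and the range rule of thumb can be checked with a short Python sketch. This is an illustration under stated assumptions rather than part of the original notes. The function names `sample_variance`, `sample_variance_shortcut`, and `usual_range` are hypothetical, and all of them use the sample (n − 1) formulas.

```python
import math


def sample_variance(values):
    """Steps 1-5: mean, deviations, squares, sum, divide by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Shortcut formula: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    return (n * sum(x * x for x in values) - sum(values) ** 2) / (n * (n - 1))


def usual_range(values):
    """Range rule of thumb: mean - 2s and mean + 2s."""
    mean = sum(values) / len(values)
    s = math.sqrt(sample_variance(values))
    return mean - 2 * s, mean + 2 * s


data = [5, 7, 1, 2, 4]
s2 = sample_variance(data)
s = math.sqrt(s2)            # Step 6: square root of the variance
low, high = usual_range(data)

print(round(s2, 4), round(s, 6))        # 5.7 2.387467
print(round(low, 3), round(high, 3))    # -0.975 8.575
```

Both variance functions agree on 5.7 for this data set. Every value from 1 to 7 falls inside the "usual" interval, so the rule of thumb flags nothing here.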