Asian School of Business
PG Programme in Management (2005-06)
Course: Quantitative Methods in Management I
Instructor: Chandan Mukherjee

Session 2: Summarising a Distribution

Modern/EDA Terminology      Classical Terminology
Cluster, Level, Centre      Central Tendency, Location
Scatter, Spread             Dispersion
Shape                       Skewness
Tails                       Kurtosis

Numerical Summaries (Descriptive Statistics)

Feature    Mean based summary    Order based summary
Level      Arithmetic Mean       Median
Spread     Standard Deviation    Midspread

The order based summaries are resistant to extreme values, i.e. they are not unduly influenced by a small part of the data. That is why they are called Resistant Summaries.

• Both the mean based and the order based summaries of the spread of a distribution (Standard Deviation and Midspread) are scale dependent: they change with the units in which the data are measured.
• That is why we need to neutralise the scale effect by dividing each by its respective summary of the Centre.
• Coefficient of Variation = Standard Deviation / Mean
• Relative Midspread = Midspread / Median

Definitions

Variance = Average Squared Distance from the Mean

$\text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$, where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

Serial No.    Data (X)    X - Mean    (X - Mean)^2
1                6           -4            16
2                7           -3             9
3                8           -2             4
4                9           -1             1
5               20           10           100
Total           50            0           130
Average         10                         26 = Variance

Standard Deviation (SD) = Square root of Variance = $\sqrt{26} \approx 5.099$

Coefficient of Variation = SD / Mean = 5.099 / 10 = 0.5099 (or 50.99%)

Median = The value that divides the ordered data values into two equal halves

To compute the Median:
• Sort the data in ascending order
• Divide the number of observations (data values) by 2
• If the result is an integer, say 9, then the median is the average of the 9th and 10th observations
• If the result is not an integer, say 9.5, then round it up to the next integer, i.e. 10 in this case. The 10th observation is the median
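The mean based summaries of the five observations tabulated above (6, 7, 8, 9, 20) can be checked with a short Python sketch. It assumes the divisor n used in the "Average" row of the table (not n - 1); the variable names are illustrative only.

```python
import math

# The five observations from the variance table
data = [6, 7, 8, 9, 20]

n = len(data)
mean = sum(data) / n                      # 50 / 5 = 10.0
deviations = [x - mean for x in data]     # -4, -3, -2, -1, 10
squared = [d ** 2 for d in deviations]    # 16, 9, 4, 1, 100

# Divisor n (not n - 1), matching the "Average" row of the table
variance = sum(squared) / n               # 130 / 5 = 26.0
sd = math.sqrt(variance)                  # about 5.099
cv = sd / mean                            # about 0.5099, i.e. roughly 51%

print(f"mean={mean}, variance={variance}, sd={sd:.3f}, cv={cv:.4f}")
```

The single large value (20) contributes 100 of the 130 units of squared distance, which is the sense in which mean based summaries are not resistant.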
Example: Finding the Median

The 52 observations, sorted in ascending order:

  0.07    1.99    3.25    6.93    8.57    9.15   13.95   33.97   34.46   34.52
 36.33   39.15   39.67   40.40   40.55   40.67   44.87   46.08   48.43   50.11
 51.02   51.97   54.16   55.07   57.99   60.25   63.36   67.95   70.93   82.20
 94.43   99.41  102.51  113.70  114.09  119.17  121.25  126.28  126.54  128.40
133.02  141.80  150.11  156.63  162.75  193.56  200.19  220.06  282.59  302.75
405.21  445.63

52 / 2 = 26 is an integer, so:
Median = Average of 26th and 27th observations = (60.25 + 63.36)/2 = 61.80

Quartiles = The three values that divide the ordered observations into four equal parts

25% of the observations lie below the First (Lower) Quartile
50% of the observations lie below the Second (Middle) Quartile
75% of the observations lie below the Third (Upper) Quartile

The Second or Middle Quartile is the Median

To compute the Lower and the Upper Quartile:
• Sort the data in ascending order
• Divide the total number of observations by 4
• If the result is an integer, say 12, then take the average of the 12th and the 13th observations counted from the lowest observation as the Lower Quartile. Similarly, take the average of the 12th and the 13th observations counted from the highest observation as the Upper Quartile
• If the result is not an integer, say 12.7, then round it up to the next integer, i.e. 13 in this case. The Lower Quartile is the 13th observation from the lowest, and the Upper Quartile is the 13th observation from the highest

Example: Finding the Lower & Upper Quartiles

Using the same 52 sorted observations, 52 / 4 = 13 is an integer, so:
Lower Quartile = average of the 13th and 14th observations from the lowest = (39.67 + 40.40)/2 = 40.04
Upper Quartile = average of the 13th and 14th observations from the highest = (126.54 + 128.40)/2 = 127.47
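The counting rules for the median and the quartiles can be sketched in Python as below. The helper name `value_by_rule` and its parameters are hypothetical, chosen only for this illustration. Library quantile routines may use different interpolation conventions and give slightly different answers.

```python
# The 52 sorted observations from the example
data = [
    0.07, 1.99, 3.25, 6.93, 8.57, 9.15, 13.95, 33.97, 34.46, 34.52,
    36.33, 39.15, 39.67, 40.40, 40.55, 40.67, 44.87, 46.08, 48.43, 50.11,
    51.02, 51.97, 54.16, 55.07, 57.99, 60.25, 63.36, 67.95, 70.93, 82.20,
    94.43, 99.41, 102.51, 113.70, 114.09, 119.17, 121.25, 126.28, 126.54, 128.40,
    133.02, 141.80, 150.11, 156.63, 162.75, 193.56, 200.19, 220.06, 282.59, 302.75,
    405.21, 445.63,
]


def value_by_rule(values, parts, from_highest=False):
    """Apply the counting rule: divide n by `parts` (2 for the median,
    4 for a quartile). If the result is an integer k, average the k-th
    and (k+1)-th observations; otherwise round up and take that single
    observation. Counting starts from the lowest value unless
    `from_highest` is True. (Hypothetical helper, for illustration.)"""
    ordered = sorted(values, reverse=from_highest)
    k, remainder = divmod(len(ordered), parts)
    if remainder == 0:
        # k-th and (k+1)-th observations are at 0-based indices k-1 and k
        return (ordered[k - 1] + ordered[k]) / 2
    # n/parts rounded up is k+1, which sits at 0-based index k
    return ordered[k]


median = value_by_rule(data, 2)
lower_quartile = value_by_rule(data, 4)
upper_quartile = value_by_rule(data, 4, from_highest=True)

print(f"median={median:.3f}, LQ={lower_quartile:.3f}, UQ={upper_quartile:.3f}")
```

Before rounding to two decimals, the median is 61.805 and the Lower Quartile is 40.035; the slides report these as 61.80 and 40.04.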
Midspread = Upper Quartile - Lower Quartile
The range that holds the middle 50% of the observations

Relative Midspread = Midspread / Median = (127.47 - 40.04)/61.80 = 1.41

Five Number Summary

Five numbers can comprehensively summarise the features of a distribution without being unduly affected by a small part of the data:
• Minimum (MN)
• Lower Quartile (LQ)
• Median (MD)
• Upper Quartile (UQ)
• Maximum (MX)

Five Number Summary is Comprehensive: The Grand Summary of a Distribution
[Figure: diagram of a distribution with the Lower Tail and Upper Tail labelled]

Identifying the Extreme Values: Who Are the Outliers?

Here is a thumb rule (based on theory):
Step = 1.5 times Midspread
Lower Fence = Lower Quartile - Step
Upper Fence = Upper Quartile + Step
All observations below the Lower Fence are Negative Outliers
All observations above the Upper Fence are Positive Outliers

Cotton & Blended Yarn Companies: Five Number Summary & Fences

MN               0.07
LQ              40.04
MD              61.80
UQ             127.47
MX             445.63
Midspread       87.43
Step           131.14
Lower Fence    -91.10
Upper Fence    258.61

Box & Whisker Plot
[Figure: box and whisker plot of Gross Fixed Asset for Cotton & Blended Yarn Companies, labelled with MN (0.07), Lower Fence, LQ, MD, UQ, Upper Fence and MX (445.63); the observations beyond the Upper Fence are marked "Outliers!"]

Comparing Two Distributions of Gross Fixed Asset: Fabrics and Yarn Companies
[Figure: Fabrics and Yarn distributions side by side; vertical axis: GF Asset (Crores), 0 to 2000]

Summary of the Points
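As a recap in code, the fence rule for the Cotton & Blended Yarn data could be sketched as follows. This is a minimal sketch that takes the quartiles reported in the slides (40.04 and 127.47) as given inputs; the variable names are illustrative. Because the Step is not rounded to 131.14 here, the fences come out as about -91.105 and 258.615 rather than the slides' -91.10 and 258.61.

```python
# The 52 sorted observations (Gross Fixed Asset, Cotton & Blended Yarn Companies)
data = [
    0.07, 1.99, 3.25, 6.93, 8.57, 9.15, 13.95, 33.97, 34.46, 34.52,
    36.33, 39.15, 39.67, 40.40, 40.55, 40.67, 44.87, 46.08, 48.43, 50.11,
    51.02, 51.97, 54.16, 55.07, 57.99, 60.25, 63.36, 67.95, 70.93, 82.20,
    94.43, 99.41, 102.51, 113.70, 114.09, 119.17, 121.25, 126.28, 126.54, 128.40,
    133.02, 141.80, 150.11, 156.63, 162.75, 193.56, 200.19, 220.06, 282.59, 302.75,
    405.21, 445.63,
]

# Quartiles and median as reported in the slides (assumed inputs)
lower_quartile = 40.04
median = 61.80
upper_quartile = 127.47

midspread = upper_quartile - lower_quartile    # about 87.43
step = 1.5 * midspread                         # about 131.145
lower_fence = lower_quartile - step            # about -91.105
upper_fence = upper_quartile + step            # about 258.615

# Observations outside the fences are flagged as outliers
negative_outliers = [x for x in data if x < lower_fence]
positive_outliers = [x for x in data if x > upper_fence]

five_number_summary = (min(data), lower_quartile, median, upper_quartile, max(data))

print("Five number summary:", five_number_summary)
print(f"Fences: {lower_fence:.2f} to {upper_fence:.2f}")
print("Negative outliers:", negative_outliers)
print("Positive outliers:", positive_outliers)
```

With these inputs, the four largest observations (282.59, 302.75, 405.21 and 445.63) lie above the Upper Fence, and no observation lies below the Lower Fence.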