Lecture 4
DESCRIPTIVE STATISTICS: Numerical summaries
BUSINESS STATISTICS, Advanced Educational Program
Reading materials: Chap 4 (Keller)

Outline: Measures of center and spread
• Measures of center:
  – Mean, median, mode
  – Selection of measures of location
• Measures of dispersion (spread):
  – Range, quartile range, quartile deviation, variance, standard deviation
• Empirical rule (general case: Chebyshev's law)
• Coefficient of variation
• Coefficient of skewness

Measures of center
• A measure of center or location shows where the center of the data is
• Three most useful measures of location:
  § Arithmetic mean/average
  § Median
  § Mode

Arithmetic mean from raw data
• Arithmetic mean from population: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
• Arithmetic mean from sample: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Where:
  $X_i$, $x_i$ - the value of each item
  $N$, $n$ - total number of items

Arithmetic mean from frequency table
• Apply this formula for the sample: $\bar{x} = \frac{\sum_{i=1}^{k} x_i f_i}{\sum_{i=1}^{k} f_i}$
Where:
  $x_i$ - the value of class i
  $f_i$ - frequency of class i
  $k$ - number of classes

Advantages and disadvantages of arithmetic mean
• Advantages:
  – Easy to understand and calculate
  – The value of every item is included => representative of the whole data set
• Disadvantages:
  – Sensitive to outliers

Mean is sensitive to outliers
• Sample: (43; 38; 37; ...; 27; 34) => $\bar{x} = 33.5$
• Contaminated sample: (43; 38; 37; ...; 27; 1934) => $\bar{x} = 71.5$

Median
• Median is the value of the observation located in the middle of the ordered data set
• If the data has an odd number of observations:
  – Middle observation: the $\frac{n+1}{2}$th
  – $\text{Median} = x_{\left(\frac{n+1}{2}\right)}$
• If the data has an even number of observations:
  – There are two observations located in the middle, the $\frac{n}{2}$th and the $\left(\frac{n}{2}+1\right)$th
  – $\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$

Calculate median from raw data
Steps to find median:
1. Arrange the observations in order of size (normally ascending order)
2. Find the number of observations and hence the middle observation
3. The median is the value of the middle observation

Example
• E.g. 1. Raw data: 11, 11, 13, 14, 17 => find median
• E.g. 2. Raw data: 11, 11, 13, 14, 16, 17 => find median

Advantages and disadvantages of median
• Advantages:
  – Easy to understand and calculate
  – Not affected by outlying values => can be used when the mean would be misleading
• Disadvantages:
  – It is the value of one observation => fails to reflect the whole data set
  – Not easy to use in other analysis

Mode
• Mode is the value which occurs most frequently in the data set
Steps to find mode:
1. Draw a frequency table for the data
2. Identify the mode as the most frequent value

Example to calculate mode
  Value   Frequency
  8       3
  12      7
  16      12
  17      8
  19      5

Mean, median and mode in normal and skewed distributions
[Figure]

Bimodal and multimodal data
[Figure: Bimodal (two modes); Multimodal (several modes)]

Which measure of centre is best?
• Mean is generally the most commonly used
• Mean is sensitive to extreme values
• If the data are skewed or extreme values are present, the median is better, e.g. real estate prices
• Mode is generally best for categorical data
  – e.g. restaurant service quality (ordinal, below): the mode is "Very good"

  Rating         # customers
  Excellent      20
  Very good      50
  Good           30
  Satisfactory   12
  Poor           10
  Very poor      6

Measures of dispersion (variability)
• Measures of dispersion tell you how spread out the other values of the distribution are from the central tendency
• Measures of dispersion:
  – The range, quartile range, and quartile deviation
  – Variance and standard deviation

Why do we need measures of dispersion?
• Two data sets of midterm marks of 5 students:
  – First set: 100, 40, 40, 35, 35 => Mean: 50
  – Second set: 70, 55, 50, 40, 35 => Mean: 50
Ø Which mean (first or second) is more reliable?
• We need to know the spread of the other values around the central tendency; this is especially important in analysing the stock market.
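The two sets of midterm marks above share a mean of 50 but differ in spread. A minimal sketch of that comparison follows, using Python's standard `statistics` module. Treating the five marks as a sample (so the standard deviation divides by n - 1) is an assumption made here for illustration; the slides do not say which version they intend at this point.

```python
import statistics

# Midterm marks of 5 students, as given in the slides.
first_set = [100, 40, 40, 35, 35]
second_set = [70, 55, 50, 40, 35]

for name, marks in (("First set", first_set), ("Second set", second_set)):
    mean = statistics.mean(marks)
    # Assumption: sample standard deviation (divides by n - 1).
    sd = statistics.stdev(marks)
    print(f"{name}: mean = {mean}, sample s.d. = {sd:.2f}")
```

Both means are 50, but the first set has roughly twice the standard deviation of the second (about 28.1 versus 13.7), so its mean describes the individual marks less well.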
Range
• Range is the difference between the largest and smallest value => sort the data before computing the range
• Formula: Range = maximum value - minimum value
• Advantages of range: easy to calculate for ungrouped data.
• Disadvantages:
  – Takes into account only two values
  – Affected by one or two extreme values
  – More difficult to calculate for grouped data

Quartiles
• Quartiles are defined as the values of observations which are a quarter of the way through the data
  – Q1 - the first quartile: the value below which 25% of observations fall
  – Q2 - the second quartile: the median (50% of the observations fall below)
  – Q3 - the third quartile: the value below which 75% of observations fall

Quartile range and quartile deviation
• Quartile range = $Q_3 - Q_1$
• Quartile deviation = $\frac{Q_3 - Q_1}{2}$
• Advantages of quartile deviation (semi-interquartile range): less affected by extreme values
• Disadvantages: takes into account only 50% of the data

Variance
• Variance from population: $\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
• Variance from sample: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Advantages:
  – Takes into account all values
  – Easy to interpret the result
• Disadvantages: the unit of variance (the square of the data's unit) has no direct meaning

Standard deviation (σ)
• Standard deviation (S.D) is the square root of variance
• S.D from population: $\sigma = \sqrt{\sigma^2}$
• S.D from sample: $s = \sqrt{s^2}$
• Advantages:
  – Overcomes the disadvantage of the meaningless unit of variance
  – The most widely used measure of dispersion (the bigger its value => the more spread out the data are)

Application of this in finance
• The variance (or S.D) of an investment can be used as a measure of risk, e.g. on profits/return.
• Larger variance => larger risk
• Usually, higher rate of return, higher risk

Example – 2 funds over 10 years (1)
• Rates of return (%):
  A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
  B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
• $\bar{x}_A = 16\%$, $\bar{x}_B = 12\%$
• $s_A^2 = 280.34\ (\%)^2$, $s_B^2 = 99.37\ (\%)^2$
• Which fund would you invest in?
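The summary figures for the two funds can be checked with a short sketch like the one below. It uses only Python's standard `statistics` module and the sample formulas from the variance slide (division by n - 1). The helper name `summarize` is hypothetical, introduced here for illustration.

```python
import statistics

# Annual rates of return (%) for the two funds, copied from the example.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]


def summarize(returns):
    """Return (mean, sample variance, sample standard deviation)."""
    mean = statistics.mean(returns)
    variance = statistics.variance(returns)  # divides by n - 1
    sd = statistics.stdev(returns)           # square root of the sample variance
    return mean, variance, sd


mean_a, var_a, sd_a = summarize(fund_a)
mean_b, var_b, sd_b = summarize(fund_b)

print(f"Fund A: mean = {mean_a:.2f}%, variance = {var_a:.2f} (%)^2, s.d. = {sd_a:.2f}%")
print(f"Fund B: mean = {mean_b:.2f}%, variance = {var_b:.2f} (%)^2, s.d. = {sd_b:.2f}%")
```

The standard deviation is printed alongside the variance because it is in the same unit as the returns (%), which makes the two funds easier to compare as a measure of risk.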
Example – 2 funds over 10 years (2)
• The choice depends on how risk-averse you are:
  – Fund A: higher risk, but also higher average rate of return.

Empirical rule or the law of 3σ
• For a normal (symmetrical, bell-shaped) distribution:
  – 68.26% of all observations fall within 1 standard deviation of the mean, i.e. in the range: $(\bar{x} - 1s) \leftrightarrow (\bar{x} + 1s)$
  – 95.45% of all observations fall within 2 standard deviations of the mean, i.e. in the range: $(\bar{x} - 2s) \leftrightarrow (\bar{x} + 2s)$
  – 99.73% of all observations fall within 3 standard deviations of the mean, i.e. in the range: $(\bar{x} - 3s) \leftrightarrow (\bar{x} + 3s)$

Meaning of the law of 3σ
• Convert z-score to probability (next lecture)
• Identify outliers

Boxplot
Here is the boxplot of the height of international students studying at UNSW.
[Figure: Boxplot of Height. Vertical axis: Height (150 to 200). Labelled parts, top to bottom: whisker, upper quartile, median, lower quartile, whisker; the box spans the lower to upper quartile.]

Boxplots
• Need MEDIAN and QUARTILES to create a boxplot
• MEDIAN = middle of observations, i.e. ½ way through the observations
• QUARTILES = mark the quarter points of the observations, i.e. ¼ (Q1) and ¾ (Q3) of the way through the data [positions (n+1)/4 and 3(n+1)/4]
• INTERQUARTILE RANGE (IQR) = Q3 - Q1
• Whiskers: max length is 1.5*IQR; they stretch from the box to the furthest data point (within this range)
• Points further out from the box are marked with stars and called outliers

Shapes of Boxplots
A boxplot shows:
• Skewness/symmetry
• Modality
• Range
[Figure: Boxplot of Symmetric, Positive skew, Negative skew, Bimodal. Vertical axis: Data (-5.0 to 5.0).]

Coefficient of skewness (C of S)
• This measures the shape of the distribution
• There are several measures of skewness.
• Below is a common one: Pearson's coefficient of skewness.
  Coefficient of skewness = 3 × (mean - median) / standard deviation, i.e. $\text{C of S} = \frac{3(\bar{x} - \text{median})}{s}$
• If C of S is nearly +1 or -1, the distribution is highly skewed
• If C of S is positive => the distribution is skewed to the right (positive skew)
• If C of S is negative => the distribution is skewed to the left (negative skew)

Activity 1
• Summary statistics of two data sets are as follows:

                       Set 1: Ages of students     Set 2: Wages of staff
                       studying at UNSW
  Mean                 22.4839                     294.3
  Median               21                          292.5
  Standard deviation   6.3756                      125.93

• Compute Pearson's coefficient of skewness for these data sets and describe the shapes of their distributions.

Distribution shapes
[Figure: two histograms. Frequency vs age: skewed to the right. Frequency vs wages: nearly normal.]

Investigating the relationship between variables
• Methods:
  – Table: Cross-table
  – Charts:
    o Multiple bar chart
    o Scatterplot (mentioned in lecture 8)

Cross-table
• A cross-table is used to investigate the relationship between two categorical variables, or discrete variables with few values.
• EX: use the gss.sav data file to explore the relationship between internet use and degree
• Note:
  – Need to identify the dependent and independent variables.
  – Know how to calculate row and column percentages
  – Rule of thumb: independent variable in rows and dependent variable in columns

Multiple bar chart
• We can use a multiple bar chart to explore the relationship between variables.
• The skill is to know how to draw the chart
• EX: use the gss.sav data file to explore the relationship between internet use, age, and degree
[Figure: the resulting multiple bar chart]
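The row and column percentages mentioned in the cross-table notes can be sketched in plain Python as below. The counts are hypothetical, invented purely for illustration; they are not taken from gss.sav. The function names `row_percentages` and `column_percentages` are likewise made up for this sketch. Following the rule of thumb, the independent variable (degree) is in the rows and the dependent variable (internet use) is in the columns.

```python
# Hypothetical counts, invented for illustration only (NOT from gss.sav).
# Rows: independent variable (degree); columns: dependent variable (internet use).
counts = {
    "No degree": {"Uses internet": 30, "Does not use": 70},
    "Degree": {"Uses internet": 80, "Does not use": 20},
}


def row_percentages(table):
    """Express each cell as a percentage of its row total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


def column_percentages(table):
    """Express each cell as a percentage of its column total."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: 100 * n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)

for row, cells in row_pct.items():
    print(row, {col: round(p, 1) for col, p in cells.items()})
```

With the independent variable in the rows, row percentages are usually the ones to compare: each row sums to 100%, so the share of internet users can be read directly across the degree groups.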