LESSON TWO: DESCRIPTIVE STATISTICS

2.1 Summary statistics for raw data: mean, quartiles and mode

The mean or average value for a set of raw data

To describe the distribution of a characteristic in a sample or in a population we can use frequency distributions and their graphical representations. However, it is often necessary to describe the data numerically, for example by a value that is typical of the bulk of the data. Such a figure, calculated from the data, is called a summary statistic. It is usually a mean (or average value), a median or a mode.

Formula of the mean value (average) for raw data (only for quantitative variables):

$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$

Properties of the average

1) The sum of the values of a variable taken by a set of statistical units is equal to the average multiplied by the number of units:

$\sum_{i=1}^{N} x_i = N\mu$

2) The sum of the differences between the values taken by the variable and the average, taken with their sign (not in absolute value), is zero.
3) The sum of the squared differences between the values taken by the variable and a constant c is a minimum when c is equal to the average.

The median, quartiles and percentiles

The median

For distributions that are asymmetric or contain extreme values (outliers) it is better to use the median as a measure of central tendency; it can also be calculated for ordinal variables. The median can be computed only if the variable is at least ordinal, in the sense that it is possible to rank the values of the variable, and consequently the corresponding statistical units, in increasing or decreasing order. Once this is done, the position of each unit within the ordered array can be evaluated.

The median or second quartile, Q2, is the value of the middle item in an ordered set of data. Q2 is the value of the item that is at or nearest the 0.5(N+1)th position in the ordered data.
The lower quartile, Q1, is the item whose position in the ordered list is at or nearest the 0.25(N+1)th.
The upper quartile, Q3, is the item whose position in the ordered list is at or nearest the 0.75(N+1)th.

Property of the median. For a quantitative variable X, the sum of the absolute differences $\sum_i |x_i - c|$ between the x_i and a constant c is a minimum when c is equal to the median.

Calculation of the median for odd and even numbers of data values: when N is odd, the median is the middle value of the ordered data; when N is even, the 0.5(N+1)th position falls between the two middle values and, for a quantitative variable, the median is usually taken as their average.

Percentiles

Percentiles divide the ordered data into hundredths.

The mode

The mode is the value that occurs most often.
a) There may be no mode, or there may be several.
b) The mode may be a value near the beginning, middle or end of the data. In other words, it can be anywhere and therefore may not be representative.

See exercise 2.1, pages 68-70, Bradley.

2.2 Summary statistics for grouped data: mean, quartiles and mode

Grouping assumptions: it is assumed that the values within each interval vary uniformly between the lowest and highest value in the interval. Hence the average value of the data in any interval is the mid-interval value, which is used to represent the group numerically.

Calculation of the mean for grouped data:

$\mu = \frac{\sum_i f_i x_i}{N}$

where μ is the mean, i is the ordered position of the interval, x_i is the mid value of the interval, f_i is the frequency of the interval and N is the total number of observations (the sum of the f_i).

See example 2.2, pages 74-75, Bradley.

Quartiles and percentiles for grouped data

When the data are sorted into a frequency table, they are ordered from the lowest to the highest values in blocks or intervals. Approximate estimates of the quartiles are made by identifying the intervals containing the items that are 25%, 50% and 75% of the way through the ordered data. A more accurate estimate is obtained from the formula:

$Q_m = L_{Q_m} + \frac{mN/4 - cf}{f_{Q_m}} \cdot w$

where
L_{Q_m} is the lower limit of the interval containing Q_m,
f_{Q_m} is the frequency of the interval containing Q_m,
w is the width of the interval containing Q_m,
cf is the cumulative frequency up to, but not including, the Q_m interval.

Weighted average

$\frac{\sum_i w_i x_i}{\sum_i w_i}$

where the w_i are called the weights and reflect the relative importance of each x_i.
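The grouped-data formulas and the weighted average above can be sketched in Python. This is only an illustrative sketch: the function names (weighted_mean, grouped_mean, grouped_quartile), the intervals, the frequencies and the marks are hypothetical and do not come from Bradley. The 50/30/20 weights mirror an examination marking scheme.

```python
def weighted_mean(values, weights):
    """Weighted average: sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


def grouped_mean(intervals, freqs):
    """Mean for grouped data: mid-interval values weighted by the frequencies."""
    mids = [(lo + hi) / 2 for lo, hi in intervals]
    return weighted_mean(mids, freqs)


def grouped_quartile(intervals, freqs, m):
    """Q_m = L + ((m*N/4 - cf) / f) * w, for m = 1, 2, 3."""
    n = sum(freqs)
    target = m * n / 4
    cf = 0  # cumulative frequency up to, but not including, the current interval
    for (lo, hi), f in zip(intervals, freqs):
        if f > 0 and cf + f >= target:
            # lo is the lower limit L, f the interval frequency, hi - lo the width w
            return lo + (target - cf) / f * (hi - lo)
        cf += f
    raise ValueError("no interval contains the requested quartile")


# Hypothetical frequency table: four intervals of width 10, N = 50
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 15, 20, 10]

print(grouped_mean(intervals, freqs))         # 22.0
print(grouped_quartile(intervals, freqs, 1))  # 15.0
print(grouped_quartile(intervals, freqs, 2))  # 22.5
print(grouped_quartile(intervals, freqs, 3))  # 28.75

# Hypothetical marks weighted 50%, 30% and 20%
print(weighted_mean([60, 70, 80], [50, 30, 20]))  # 67.0
```

Writing grouped_mean as a call to weighted_mean makes explicit that the frequencies play the role of the weights.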
The mean for grouped data is a weighted mean in which the weights are the frequencies.

Examples: in a statistics examination the overall mark is calculated as follows: 50% for the written paper, 30% for the practical examination and 20% for homework. Another example is the consumer price index.

2.3 Measures of dispersion for raw data

Range
Variance
Standard deviation
Interquartile range
Semi-interquartile range (quartile deviation)

Example: ages of two groups of tourists
P: 48, 50, 52, 51, 49
Q: 2, 88, 76, 31, 64, 39, 50

The range is the difference between the highest and the lowest value in the data set.

Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

where μ is the mean of the data. When the variance is calculated from sample data, the denominator is n - 1, where n is the sample size.

The interquartile range is the difference between the upper and lower quartiles, Q3 - Q1. The semi-interquartile range, also called the quartile deviation, is the interquartile range divided by 2, that is (Q3 - Q1)/2.
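As a rough illustration, the measures of dispersion can be computed for the two tourist groups P and Q as follows. The function names are hypothetical, and the quartile rule assumes that a position falling exactly halfway between two items is rounded up; the notes do not specify how such ties are resolved.

```python
import math


def mean(data):
    return sum(data) / len(data)


def data_range(data):
    """Difference between the highest and the lowest value."""
    return max(data) - min(data)


def variance(data, sample=False):
    """Sum of squared deviations from the mean, divided by N (or n - 1 for a sample)."""
    mu = mean(data)
    squared_deviations = sum((x - mu) ** 2 for x in data)
    divisor = len(data) - 1 if sample else len(data)
    return squared_deviations / divisor


def std_dev(data, sample=False):
    return math.sqrt(variance(data, sample))


def quartile(data, m):
    """Item at or nearest the m*(N+1)/4-th position of the ordered data.

    Assumption: a position exactly halfway between two items rounds up.
    """
    ordered = sorted(data)
    position = int(m * (len(ordered) + 1) / 4 + 0.5)  # 1-based position
    position = min(max(position, 1), len(ordered))    # stay inside the list
    return ordered[position - 1]


P = [48, 50, 52, 51, 49]
Q = [2, 88, 76, 31, 64, 39, 50]

for name, ages in (("P", P), ("Q", Q)):
    iqr = quartile(ages, 3) - quartile(ages, 1)
    print(name,
          "mean:", mean(ages),
          "range:", data_range(ages),
          "variance:", variance(ages),
          "std dev:", std_dev(ages),
          "IQR:", iqr,
          "quartile deviation:", iqr / 2)
```

Both groups have a mean age of 50, but the range is 4 for P and 86 for Q, which is why a measure of dispersion is needed alongside the mean.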