Chapter I
Introduction to Statistics

1.1 BASIC CONCEPTS OF STATISTICS

1.1.1 Definition of statistics

It is not precisely known how the word "statistics" originated. Most people believe that the term has been derived from the Latin word "status", meaning political state. Others believe that it originated from the Italian word "statista", the French word "statistique", or the German word "statistik". This background suggests that the term statistics has its origin in ancient times. The science of statistics has developed gradually, and its field of application has widened with the passage of time. Its continued use and importance are considered indispensable in all spheres of life.

Statistics is a branch of scientific methodology. It deals with the collection, classification, description, and interpretation of data through scientific procedures. It is difficult to define statistics in a few words, since its dimension, scope, function, use, and importance are constantly changing with time. No formal definition has emerged so far, and perhaps no definition is universally accepted.

Statistics are simply the facts and figures about any phenomenon or event, whether it relates to population, production, income, expenditure, sales, births or deaths, or any other quantitative measure. The word statistics can thus denote numerical facts themselves (descriptive statistics) as well as the subject that studies them (inferential statistics). A few formal definitions of statistics are given below:

Statistics are measurements, enumerations or estimates of natural or social phenomena systematically arranged so as to exhibit their interrelationships (Conor, 1937).

By statistics we mean aggregates of facts affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner for a pre-determined purpose, and placed in relation to each other (Secrist, 1933).

Statistics is the science, pure and applied, creating, developing and applying techniques by which the uncertainty of inductive inferences may be evaluated (Steel et al., 1997).

1.1.2 Classification of statistics

i) Pure Statistics or Mathematical Statistics: Pure Statistics deals with the theory of statistics. Research tools are usually developed in this branch of statistics. These tools are then applied to specific problems in the different fields of Applied Statistics, such as physics and chemistry, anthropology and biology. Pure Statistics not only creates new tools but also tries to perfect existing ones by placing them on more and more rigorous foundations, and in this process it depends heavily on recent developments in Pure Mathematics.

ii) Applied Statistics: Applied Statistics deals with the application of statistical methods to specific problems. It continues to find new uses for the existing tools and for the new tools that are being created. Pure and Applied Statistics interact and thus enrich each other: Pure Statistics continues to develop newer and stronger tools for Applied Statistics, while Applied Statistics continues to present new and challenging problems to Pure Statistics.

1.1.3 Characteristics of statistics

Statistics possesses the following characteristic features:

1) Statistics deals with populations or aggregates of individuals rather than with individuals alone.
This means that statistics does not deal with a single figure, since a single figure is incapable of providing any information beyond itself.

2) Statistics is concerned with the reduction of data, or with obtaining correct facts from figures.

3) Statistics deals with variation.

4) Statistics deals only with numerically specified populations. Qualitative statements, such as fair, good, medium, or poor, are not statistics unless they can be expressed in numerical form.

5) Statistics deals with populations which occur in nature and are subject to a large number of random forces. This means that statistics are not the effect of a single factor. For example, the weight of an individual depends primarily on his or her age and height, and the production of rice depends on land fertility, irrigation, input use, and so on.

6) Statistics should be collected to a reasonable standard of accuracy, and in a systematic and scientific manner. The collected statistics should make it possible to arrive at decisions regarding the population of interest.

7) The logic used in statistical inference is inductive, so statistical inferences are uncertain. We draw inferences about the population from the information contained in a sample, which is only a part of the population; we thus pass from the particular to the general, and there is a chance of our conclusions being untrue.

8) Statistics should be obtained for pre-determined purposes. There should be a clear, well-defined and unambiguous statement of the purpose of data collection.

9) Statistics collected should allow comparison with other data. Statistical data should be collected with a view to comparison with data of a similar nature collected in different settings. Otherwise, no conclusion can be drawn regarding their quality, usefulness or importance, and they cannot be used for the decision-making purpose for which they were collected.

10) Statistical results may lead to fallacious conclusions if quoted out of context or manipulated.

1.1.4 Population and sample

The essential purpose of statistics is to describe the numerical properties of populations and to draw inferences about populations from samples. In statistics, the concepts of population and sample are of immense importance. A population is a complete set of individuals, objects, or measurements having some common characteristic, whereas a sample is a subset or part of the population selected to represent the population. For example, a sample of size n = 2500 individuals was selected randomly from a population of size N = 60 million to arrive at a decision regarding the preference for a prime ministerial candidate in a country.

1.2 DESCRIPTIVE STATISTICS

1.2.1 Central tendency

There are two obvious features of a data set that can be characterized in a simple form and yet give a very meaningful description: central tendency and dispersion. Central tendency is measured by averages; these describe the point about which the various observed values cluster. There are several different measures of central tendency. Each is an indicator of what a typical value is, but each employs a different definition of 'typical'. These measures are collectively called statistical averages. The purpose of a statistical average is to represent the central value of a distribution and also to afford a basis of comparison with other distributions of a similar nature.
Among the several averages, the most commonly used are the mean, the median and the mode.

1.2.1.1 Mean

Arithmetic mean: The arithmetic mean is the most commonly used central value of a distribution; it is often referred to simply as the mean. The arithmetic mean is the sum of a set of observations, positive, negative or zero, divided by the number of observations. If we have $n$ real numbers $x_1, x_2, x_3, \ldots, x_n$, their arithmetic mean, denoted by $\bar{x}$, can be expressed as

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Clearly, the mean $\bar{x}$ may be positive, negative, or even zero, depending on the values included in its computation.

Computing the arithmetic mean for grouped data: When the arithmetic mean is computed from a grouped distribution, the mid-point of each class is taken as the representative value of that class. The various mid-values are multiplied by their respective class frequencies, the products are added, and the sum of the products is then divided by the total number of observations to obtain the arithmetic mean. Symbolically, if $z_1, z_2, z_3, \ldots, z_k$ are the mid-values and $f_1, f_2, f_3, \ldots, f_k$ are the corresponding frequencies, where the subscript $k$ stands for the number of classes, then the mean $\bar{z}$ is

$$\bar{z} = \frac{\sum f_i z_i}{\sum f_i}$$

Geometric mean: The geometric mean is defined as the positive $n$th root of the product of $n$ observations. Symbolically,

$$G = (x_1 x_2 x_3 \cdots x_n)^{1/n}$$

This average is used when dealing with observations each of which bears an approximately constant ratio to the preceding one, e.g., in averaging rates of growth (increase or decrease) of a statistical population. If the $n$ non-zero and positive variate-values $x_1, x_2, \ldots, x_n$ occur $f_1, f_2, \ldots, f_n$ times, respectively, then the geometric mean $G$ of the set of observations is defined by

$$G = \left( x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n} \right)^{1/N} = \left[ \prod_{i=1}^{n} x_i^{f_i} \right]^{1/N}, \qquad \log G = \frac{1}{N} \sum_{i=1}^{n} f_i \log x_i, \qquad \text{where } N = \sum_{i=1}^{n} f_i$$

Thus the logarithm of the geometric mean is the weighted mean of the values $\log x_i$, the weights being the frequencies $f_i$.

There are several other measures of average, such as the harmonic mean, weighted arithmetic mean, quadratic mean, trimmed mean, and trimean, which are occasionally used.

1.2.1.2 Median

The median is that value for which 50% of the observations, when arranged in order of magnitude (either ascending or descending), lie on each side. The implication of this definition is that the median is the middle value of the observations, such that the number of observations above it is equal to the number of observations below it.

Computing the median for raw data: Suppose a family has seven members whose ages in years are 12, 7, 2, 34, 17, 21 and 19. To compute the median of these numbers, we arrange them in either ascending or descending order; in either ordering, the middle value is the median. Arranging them in both orders, the series is

Ascending order: 2, 7, 12, 17, 19, 21 and 34.
Descending order: 34, 21, 19, 17, 12, 7 and 2.

The middle value in either ordering is the fourth value, which in this case is 17. How would you deal with the problem when the number of observations is even?
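As a computational aside (ours, not part of the original text; the function names are illustrative), the following Python sketch shows how the arithmetic mean, the geometric mean and the median can be computed. It also answers the question just posed: when the number of observations is even, the two middle values are averaged.

```python
import math

def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Positive n-th root of the product of n positive observations,
    computed via logarithms to avoid overflow."""
    n = len(values)
    return math.exp(sum(math.log(v) for v in values) / n)

def median(values):
    """Middle value of the ordered observations; for an even number
    of observations, the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                 # odd n: the (n + 1)/2-th ordered value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

ages = [12, 7, 2, 34, 17, 21, 19]          # family ages from the example above
print(arithmetic_mean(ages))                # 16.0
print(median(ages))                         # 17, as in the worked example
print(median([39, 40, 41, 42, 43, 44, 47, 55, 58, 61]))   # 43.5 for an even n
print(geometric_mean([2, 4, 8]))            # 4.0
```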
In general, if $x_1, x_2, x_3, \ldots, x_n$ constitute a series of $n$ observations of the variable $X$ arranged in order of magnitude and the number of observations is odd, the median, henceforth abbreviated $M_e$, is the value occurring at the centre of the series, namely the $\frac{1}{2}(n+1)$th observation. Symbolically,

$$M_e = x_{\frac{1}{2}(n+1)}$$

If $n$ is even, the median is given by

$$M_e = \frac{1}{2}\left( x_{\frac{n}{2}} + x_{\frac{n}{2}+1} \right)$$

The following steps are involved in the computation of the median from ungrouped data:

• List the observations in order of magnitude.
• Count the number of observations. This is $n$.
• The median is the value that corresponds to observation number $\frac{1}{2}(n+1)$ if $n$ is odd, and the value half-way between observations $\frac{n}{2}$ and $\frac{n}{2}+1$ if $n$ is even.

Example: The weights of 11 mothers in kg were recorded as follows:

47, 44, 42, 41, 58, 52, 55, 39, 40, 43 and 61

To obtain the median weight, we arrange the values in ascending order. When we do so, the series becomes 39, 40, 41, 42, 43, 44, 47, 52, 55, 58 and 61. Since $n$ is odd, the median is the value occupying position $\frac{1}{2}(n+1) = \frac{1}{2}(11+1) = 6$, i.e. the 6th observation. On counting, the 6th observation is 44, and hence it is the median. If the series becomes 39, 40, 41, 42, 43, 44, 47, 55, 58 and 61, in which case $n = 10$, an even number, the median is the average of the 5th and 6th observations, namely $\frac{1}{2}(43 + 44) = 43.5$.

Computing the median for grouped data: The algebraic expression for the median of a grouped frequency distribution is

$$M_e = L_o + \frac{h}{f_o}\left( \frac{n}{2} - F \right)$$

where
$L_o$ = lower class boundary of the median class
$h$ = width of the median class
$f_o$ = frequency of the median class
$F$ = cumulative frequency of the pre-median class

The following steps are involved in computing the median from grouped data:

• Compute the less-than-type cumulative frequencies.
• Determine $n/2$, one-half of the total number of cases.
• Locate the median class, the first class for which the cumulative frequency exceeds $n/2$.
• Determine the lower limit of the median class. This is $L_o$.
• Sum the frequencies of all classes prior to the median class. This is $F$.
• Determine the frequency of the median class. This is $f_o$.
• Determine the class width of the median class. This is $h$.

Now you have all the quantities needed to compute the median. Putting them into the above formula, you can calculate the median (a computational sketch follows the discussion of the mode below).

1.2.1.3 Mode

The mode is the value of a distribution for which the frequency is maximum. In other words, the mode is the value of a variable which occurs with the highest frequency. If a population consists of 75 percent Hindus, 15 percent Muslims and 10 percent followers of other religions, the modal category is Hindu, the category with the most people. To determine the value of the mode for a grouped frequency distribution, it is necessary to identify the modal class in which the mode is located. In general, the modal class is the one with the highest frequency in the distribution. Once the modal class is identified, the next step is to locate the mode within the class; the mid-point of the modal class is usually taken as the modal value of the distribution.
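The grouped-median procedure above is easy to mechanize. The sketch below (ours, in Python; the class boundaries and frequencies are hypothetical illustration data, not from the text) implements $M_e = L_o + \frac{h}{f_o}\left(\frac{n}{2} - F\right)$ by scanning the cumulative frequencies for the median class.

```python
def grouped_median(classes):
    """Median of a grouped frequency distribution.

    `classes` is a list of (lower_boundary, upper_boundary, frequency)
    tuples with contiguous class boundaries.
    Implements  Me = Lo + (h / fo) * (n/2 - F).
    """
    n = sum(f for _, _, f in classes)
    half = n / 2
    cum = 0                                   # cumulative frequency F of the pre-median classes
    for lower, upper, freq in classes:
        if cum + freq >= half:                # first class whose cumulative frequency reaches n/2
            return lower + (upper - lower) / freq * (half - cum)
        cum += freq
    raise ValueError("empty distribution")

# hypothetical grouped data: class boundaries and frequencies
marks = [(9.5, 19.5, 4), (19.5, 29.5, 9), (29.5, 39.5, 16),
         (39.5, 49.5, 8), (49.5, 59.5, 3)]
print(grouped_median(marks))    # 33.875 here: n = 40, F = 13, fo = 16, h = 10
```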
1.2.2 Measures of location

Measures allied to the median include the quartiles, deciles and percentiles, because they too are based on position in a series of observations. These measures are referred to as measures of location rather than measures of central tendency, since they describe the position of one score relative to the others rather than the data set as a whole.

1.2.2.1 Quartiles

Quartiles are the variate values which divide the total frequency into four equal parts. There are three quartiles in a data series, usually denoted by $Q_1$, $Q_2$ and $Q_3$. $Q_2$ is identical with the median. $Q_1$ and $Q_3$ are the values at or below which one-fourth and three-fourths of all items in the series fall, respectively. For a grouped frequency distribution, the method of estimating the first and third quartiles is similar to that of estimating the median:

$$Q_i = L_i + \frac{h}{f_i}\left( \frac{in}{4} - F \right), \qquad i = 1, 2, 3$$

where
$L_i$ = lower limit of the $i$th quartile class
$n$ = total number of observations in the distribution
$h$ = class width of the $i$th quartile class
$f_i$ = frequency of the $i$th quartile class
$F$ = cumulative frequency of the class prior to the $i$th quartile class

Table 1.1 Distribution of 70 students according to the marks they obtained in a class test

Marks   No. of students   Cumulative frequency
40      6                 6
43      11                17
51      19                36
55      17                53
60      13                66
63      4                 70
Total   70                -

To obtain $Q_1$ and $Q_3$, we cumulate the frequencies as shown in the third column of the table above. Since $n/4 = 70/4 = 17.5$ is not an integer, the first quartile is the 18th observation (the next higher integer above 17.5). From the cumulative frequencies, $Q_1$ is 51. Since $3n/4 = 52.5$, $Q_3$ is the 53rd value, which is 55.

1.2.2.2 Percentiles

The $p$th percentile of a data set is a value such that at least $p$ percent of the items take on this value or less and at least $(100 - p)$ percent of the items take on this value or more. Percentiles are thus the values which divide the distribution into 100 equal parts. There are therefore 99 percentiles in a distribution, conveniently denoted by $p_1, p_2, p_3, \ldots, p_{99}$. In terms of percentiles, the median is the 50th percentile, so that $p_{50} = Q_2 = M_e$. The 25th and 75th percentiles are the first and third quartiles, respectively.

Admission test scores for colleges and universities are frequently reported in terms of percentiles. For example, suppose an applicant has a raw score of 54 in the oral portion of an admission test. If the raw score of 54 corresponds to the 70th percentile, then approximately 70 percent of the students scored less than this individual and approximately 30 percent scored better.

With ungrouped data, the percentile takes either the value half-way between two observations or the value of one of the observations, depending on whether $pn/100$ is an integer or not. Consider the observations 11, 14, 17, 23, 27, 32, 40, 49, 54, 59, 71 and 80. To determine the 29th percentile, $p_{29}$, we note that $\frac{29 \times 12}{100} = 3.48$, which is not an integer. Thus the next higher integer, 4, determines the 29th percentile value. On inspection, $p_{29} = 23$.
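A small Python sketch (ours, not from the text) of the ungrouped-percentile rule just described; it reproduces $p_{29} = 23$ for the twelve observations above.

```python
import math

def percentile_ungrouped(values, p):
    """p-th percentile of raw data using the rule described in the text:
    if p*n/100 is not an integer, take the observation at the next higher
    integer position; if it equals an integer k, average the k-th and
    (k+1)-th ordered observations."""
    x = sorted(values)
    n = len(x)
    pos = p * n / 100
    if pos != int(pos):
        return x[math.ceil(pos) - 1]       # positions are 1-based
    k = int(pos)
    return (x[k - 1] + x[k]) / 2

data = [11, 14, 17, 23, 27, 32, 40, 49, 54, 59, 71, 80]
print(percentile_ungrouped(data, 29))   # 23, matching the worked example
print(percentile_ungrouped(data, 50))   # 36.0 = (32 + 40)/2, the median
```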
Percentiles for grouped data: The $i$th percentile of a grouped distribution of $n$ observations may be obtained from the following formula:

$$p_i = L_i + \frac{h}{f_i}\left( \frac{in}{100} - F \right), \qquad i = 1, 2, 3, \ldots, 99$$

where
$L_i$ = lower limit of the $i$th percentile class
$f_i$ = frequency of the $i$th percentile class
$h$ = width of the class interval
$F$ = cumulative frequency of the class prior to the $i$th percentile class

Table 1.2 Number of births to women by current age

Age in years   Number of births   Cumulative number of births
14.5-19.5      677                677
19.5-24.5      1908               2585
24.5-29.5      1737               4332
29.5-34.5      1040               5362
34.5-39.5      294                5656
39.5-44.5      91                 5747
44.5-49.5      16                 5763
All ages       5763               -

As an illustration, the 30th percentile for the distribution is determined from $\frac{in}{100} = \frac{30 \times 5763}{100} = 1728.9$. Looking at the cumulative frequencies in the table, we find that this value falls in the class 19.5-24.5. The other required values are $L_{30} = 19.5$, $f_{30} = 1908$, $h = 5$, and $F = 677$. Hence,

$$p_{30} = 19.5 + \frac{5}{1908}(1728.9 - 677) = 22.25$$

Percentile rank: The percentile rank of any score or observation is defined as the percentage of cases in a distribution that fall at or below that score. For a grouped distribution, the following formula is used to compute the percentile rank (PR):

$$PR = \frac{F + f_i \left( \dfrac{X - L_i}{h} \right)}{n} \times 100$$

where
$F$ = cumulative frequency of the class below the percentile class
$f_i$ = frequency of the percentile class
$X$ = score for which the percentile rank is desired
$L_i$ = lower limit of the percentile class
$h$ = class width of the percentile class
$n$ = total number of observations

Let us obtain the percentile rank for an age of 22.25 years for the data in Table 1.2. Here, $X = 22.25$, $F = 677$, $f_i = 1908$, $L_i = 19.5$, $h = 5$ and $n = 5763$:

$$PR = \frac{677 + 1908\left( \dfrac{22.25 - 19.5}{5} \right)}{5763} \times 100 = 30$$

Hence the percentile rank is 30%. This implies that of the 5763 births in the study area, approximately 1729 (30%) occurred to women aged 22.25 years or below, and the remaining 70% occurred to women above this age.

1.2.2.3 Deciles

When a distribution is divided into ten equal parts, each division point is called a decile. Thus, there are 9 deciles in a distribution, denoted by $D_1, D_2, \ldots, D_9$. Obviously, $D_5 = M_e = p_{50}$.

Example: Compute the 6th decile ($D_6$) for the distribution in Table 1.1. Here, $\frac{in}{10} = \frac{6n}{10}$, which for $n = 70$ equals 42. This being an integer, the $\frac{42 + 43}{2}$th = 42.5th observation gives the 6th decile. By inspection of the distribution, $D_6 = 55$.

For grouped data, the formula for $D_i$ is

$$D_i = L_i + \frac{h}{f_i}\left( \frac{in}{10} - F \right), \qquad i = 1, 2, 3, \ldots, 9$$

where
$L_i$ = lower limit of the $i$th decile class
$h$ = width of the class interval
$f_i$ = frequency of the $i$th decile class
$F$ = cumulative frequency of the class prior to the $i$th decile class
$n$ = total number of observations

Example: Obtain $D_4$ for the distribution in Table 1.2. Here, $\frac{in}{10} = \frac{4 \times 5763}{10} = 2305.2$, $L_4 = 19.5$, $f_4 = 1908$ and $F = 677$. Hence,

$$D_4 = L_4 + \frac{h}{f_4}\left( \frac{4n}{10} - F \right) = 19.5 + \frac{5}{1908}(2305.2 - 677) = 23.8$$

The value of 23.8 for the fourth decile implies that of the total births that occurred among the women, 4 out of every 10 (40%) occurred at an age of 23.8 years or before.
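The grouped percentile, decile and percentile-rank formulas can be wrapped in one helper. The following sketch (ours; the function names are illustrative) reproduces the worked values from Table 1.2, up to rounding.

```python
def grouped_quantile(classes, fraction):
    """Value below which `fraction` of a grouped distribution lies.
    fraction = i/100 gives the i-th percentile, i/10 the i-th decile,
    i/4 the i-th quartile.  Implements  L + (h/f) * (fraction*n - F)."""
    n = sum(f for _, _, f in classes)
    target = fraction * n
    cum = 0
    for lower, upper, freq in classes:
        if cum + freq >= target:
            return lower + (upper - lower) / freq * (target - cum)
        cum += freq

def percentile_rank(classes, x):
    """Percentage of cases at or below x:  100 * [F + f*(x - L)/h] / n."""
    n = sum(f for _, _, f in classes)
    cum = 0
    for lower, upper, freq in classes:
        if lower <= x < upper:
            return 100 * (cum + freq * (x - lower) / (upper - lower)) / n
        cum += freq

births = [(14.5, 19.5, 677), (19.5, 24.5, 1908), (24.5, 29.5, 1737),
          (29.5, 34.5, 1040), (34.5, 39.5, 294), (39.5, 44.5, 91),
          (44.5, 49.5, 16)]                        # Table 1.2
print(round(grouped_quantile(births, 0.30), 2))    # 22.26 -> the 30th percentile (22.25 in the text)
print(round(grouped_quantile(births, 0.40), 2))    # 23.77 -> the 4th decile (23.8 in the text)
print(round(percentile_rank(births, 22.25), 1))    # 30.0
```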
1.2.3 Measures of variability or dispersion

A measure of dispersion is concerned with the scatter of a data set about its average. The dispersion of a distribution reveals how the observations are spread out or scattered on each side of the centre. Measuring the dispersion, scatter, or variation of a distribution is as important as locating its central tendency. Small dispersion indicates high uniformity of the observations in the distribution; absence of dispersion indicates perfect uniformity, which arises when all observations in the distribution are identical. If this were the case, describing any single observation would suffice.

A measure of dispersion serves two purposes. First, it is one of the most important quantities used to characterize a frequency distribution. Second, it affords a basis of comparison between two or more frequency distributions. The study of dispersion owes its importance to the fact that various distributions may have exactly the same average yet differ substantially in their variability. Frequently used measures of dispersion are the range, inter-quartile range, mean deviation, variance, and standard deviation.

1.2.3.1 Range

The simplest and crudest measure of dispersion is the range, defined as the difference between the largest and the smallest values in the distribution. If $x_1, x_2, \ldots, x_n$ are the values of $n$ observations in a sample, then the range $R$ of the variable $X$ is given by

$$R(x_1, x_2, \ldots, x_n) = \max\{x_1, x_2, \ldots, x_n\} - \min\{x_1, x_2, \ldots, x_n\}$$

In other words, if the $x$ values are arranged in ascending order such that $x_1 < x_2 < \cdots < x_n$, then $R = x_n - x_1$. More compactly, $R = L - S$, where $L$ is the largest and $S$ the smallest of the observations.

Example: For the set of observations 90, 110, 20, 51, 210 and 190, the largest value is 210 and the smallest value is 20. The range is therefore $R = 210 - 20 = 190$.

For grouped data, the range is taken as the difference between the upper class limit of the highest class and the lower class limit of the lowest class.

1.2.3.2 Special range

Although the range is meaningful, it is of limited use because of its marked instability, particularly when it is based on a small sample. If there is one extreme value in a distribution, the range will appear large, when in fact removal of that value may reveal an otherwise compact distribution with very low dispersion. Since the range is subject to the undue influence of erratic extreme values, it can be expected that if such values are excluded, the range of the remaining items will be a more useful measure. One such measure is the 10-to-90 percentile range. It is established by excluding the highest and the lowest 10 percent of the items, and is the difference between the largest and the smallest values of the remaining 80 percent of the items. If $P_{10-90}$ stands for the 10-to-90 percentile range, then

$$P_{10-90} = P_{90} - P_{10}$$

where $P_{90}$ and $P_{10}$ are the 90th and 10th percentiles of the distribution, respectively.
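For completeness, a brief sketch (ours) of the range and the 10-to-90 percentile range in Python. NumPy's percentile function interpolates between order statistics, so on small samples its percentiles may differ slightly from those obtained with the counting rules used earlier in this chapter; the second data set is hypothetical.

```python
import numpy as np

values = np.array([90, 110, 20, 51, 210, 190])     # example from the text

data_range = values.max() - values.min()           # R = L - S
print(data_range)                                   # 190

# 10-to-90 percentile range on a larger (hypothetical) sample
sample = np.array([12, 15, 17, 18, 20, 21, 23, 26, 30, 55])
p10, p90 = np.percentile(sample, [10, 90])
print(p90 - p10)   # 17.8 with NumPy's default interpolation: the spread of the middle 80%
```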
1.2.3.3 Quartile deviation

A measure similar to the special range is the inter-quartile range $Q$, the difference between the third quartile $Q_3$ and the first quartile $Q_1$:

$$Q = Q_3 - Q_1$$

The inter-quartile range is frequently reduced to the semi-interquartile range, known as the quartile deviation (QD), by dividing it by 2:

$$QD = \frac{Q_3 - Q_1}{2}$$

This measure is more meaningful than the range because it is not based on the two extreme values. Nevertheless, both the 10-to-90 percentile range and the quartile deviation have serious shortcomings. First, they do not take into consideration the values of all items: $P_{10-90}$ is not affected by the distribution of the items above $P_{90}$ and below $P_{10}$, and QD is not affected by the distribution of the items above $Q_3$ and below $Q_1$. Moreover, they remain positional measures, failing to measure the scatter of the observations relative to a typical value. In addition, they do not enter into any of the higher mathematical relationships that are basic to inferential statistics. To remove these shortcomings, measures are needed that reflect the deviation of each and every observation from the average.

1.2.3.4 Mean deviation

For data clustered near the central value, the differences of the individual observations from their typical value tend to be small. Accordingly, to obtain a measure of the total variation in the data, it is appropriate to find an average of these differences; the resulting value is called the mean (or average) deviation. The mean deviation is the arithmetic mean of the absolute deviations of the individual observations from the central value of a series. The 'typical value' may be the arithmetic mean, the median, the mode, or even an arbitrary value. The median is sometimes preferred as the typical value in computing the average deviation, because the sum of the absolute deviations from the median is smaller than that from any other value. In practice, however, the arithmetic mean is generally used. If the distribution is symmetrical, the mean is identical with the median and the same average deviation is obtained.

If a grouped frequency distribution is constructed, as is usually done with large samples, the average deviation is

$$MD(\bar{x}) = \frac{\sum_{i=1}^{k} f_i\,|x_i - \bar{x}|}{n}$$

where
$MD(\bar{x})$ = average deviation about the mean
$k$ = number of classes
$x_i$ = mid-point of the $i$th class
$f_i$ = frequency of the $i$th class
$n = \sum_{i=1}^{k} f_i$

1.2.3.5 Variance and standard deviation

Instead of ignoring the signs of the deviations from the mean, as in the computation of the average deviation, each deviation may be squared and the results added. The sum of squares can be regarded as a measure of the total dispersion of the distribution. Dividing this sum by the total number of observations gives the average of the squared deviations, a measure called the variance of the distribution. If the observations are all from a population, the resulting variance is referred to as the population variance. The variance of the population observations $x_1, x_2, \ldots, x_N$, commonly designated by $\sigma^2$, is

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

where $\mu$ is the mean of all the observations and $N$ is the total number of observations in the population.

Because of the squaring, the variance is expressed in squared units (e.g., kg², km², dollar²) and not in the original units (kg, km, dollar). It is therefore necessary to take the positive square root to restore the original unit. The measure of dispersion thus obtained is called the population standard deviation and is usually denoted by $\sigma$.
Thus

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{\operatorname{variance}(x)}$$

By definition, therefore, the standard deviation is the positive square root of the mean of the squared deviations of the observations from their arithmetic mean.

If $x_1, x_2, \ldots, x_n$ represent a set of sample observations of size $n$, the sample variance, denoted by $s^2$, is

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where $\bar{x}$ is the mean of all the sample observations. The square root of the sample variance is the sample standard deviation, denoted by $s$.

When the observations $x_1, x_2, \ldots, x_k$ are paired with their corresponding frequencies $f_1, f_2, \ldots, f_k$ in the form $\{x_i, f_i\}$ of a frequency distribution, the formulas above, which are based on ungrouped data, must be modified:

$$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1}$$

where
$f_i$ = frequency of the $i$th observation
$x_i$ = value of the $i$th observation
$n = \sum f_i$

and $x_i$ is the mid-value of the $i$th class if the frequencies are presented by class intervals.

Example: Let us illustrate the computation of the variance and standard deviation using the following ungrouped data on family size.

Table 1.3 Number of household members in 10 families

Family No.   1   2   3   4   5   6   7   8   9   10
Size (xi)    3   3   4   4   5   5   6   6   7   7

The quantities needed for computing the variance and standard deviation are shown in the table below:

Family No.   xi    xi - x̄   (xi - x̄)²   xi²
1            3     -2        4            9
2            3     -2        4            9
3            4     -1        1            16
4            4     -1        1            16
5            5      0        0            25
6            5      0        0            25
7            6      1        1            36
8            6      1        1            36
9            7      2        4            49
10           7      2        4            49
Total        50     0        20           270

Here $\bar{x} = \sum x_i / n = 50/10 = 5$, and thus

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} = 2.22, \qquad s = \sqrt{2.22} = 1.49$$

Example: The computation of the variance and standard deviation for grouped data is shown below.

Table 1.4 Computation of variance and standard deviation for grouped data

xi      fi    fi·xi   fi·xi²   xi - x̄   (xi - x̄)²   fi(xi - x̄)²
3       2     6       18       -3        9            18
5       3     15      75       -1        1            3
7       2     14      98        1        1            2
8       2     16      128       2        4            8
9       1     9       81        3        9            9
Total   10    60      400       -        -            40

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{60}{10} = 6, \qquad s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{40}{9} = 4.44$$

If the divisor $n$ is used instead of $n - 1$, then $s^2 = 4.0$, underestimating the variance by 0.44. For large $n$, this discrepancy tends to disappear.
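A short Python sketch (ours; the function names are illustrative) of the mean deviation, the sample variance with divisor $n - 1$, and the grouped-data variance; it reproduces the results of Tables 1.3 and 1.4.

```python
import math

def sample_variance(x):
    """Unbiased sample variance with divisor n - 1."""
    n = len(x)
    m = sum(x) / n
    return sum((v - m) ** 2 for v in x) / (n - 1)

def mean_deviation(x):
    """Average absolute deviation about the arithmetic mean."""
    m = sum(x) / len(x)
    return sum(abs(v - m) for v in x) / len(x)

def grouped_variance(values, freqs):
    """Sample variance of a frequency distribution {x_i, f_i}."""
    n = sum(freqs)
    m = sum(f * v for v, f in zip(values, freqs)) / n
    return sum(f * (v - m) ** 2 for v, f in zip(values, freqs)) / (n - 1)

family = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]            # Table 1.3
print(sample_variance(family))                      # 2.22...
print(math.sqrt(sample_variance(family)))           # 1.49...
print(mean_deviation(family))                        # 1.2

print(grouped_variance([3, 5, 7, 8, 9], [2, 3, 2, 2, 1]))   # 4.44..., as in Table 1.4
```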
1.2.3.6 Relative measures of dispersion

The measures of dispersion presented so far are absolute measures, in the sense that they are expressed in the same units as the original data, such as dollars, metres, or kilograms. When two sets of data are expressed in different units, the absolute measures are not comparable. Even with identical units of measurement, the individual values of one distribution may differ so widely from those of another (such as the salary of a manager versus the wage of a worker) that the average, and the deviations of the items from that average, may be widely different in magnitude between the two distributions. These differences may arise entirely from the inherent differences between the averages of the two distributions, and because of this, the absolute magnitude of the deviations cannot be used to compare the variation of the distributions. So, to compare the extent of variation of different distributions, whether they have differing or identical units of measurement, it is necessary to consider measures that express the absolute deviation in some relative form.

These measures are usually expressed as coefficients and are pure numbers, independent of the unit of measurement. They are:

1) Coefficient of variation
2) Coefficient of mean deviation
3) Coefficient of range
4) Coefficient of quartile deviation

Coefficient of variation: The standard deviation discussed above is an absolute measure of dispersion. The corresponding relative measure, proposed by Karl Pearson, is the coefficient of variation (CV), which measures the relative variability in a data set. When the means of the data sets differ considerably, comparing the standard deviations alone does not give an accurate picture of the relative variability; the coefficient of variation overcomes this difficulty. It presents the spread of a distribution relative to the mean of the same distribution, and is computed as the ratio of the standard deviation to the mean:

$$CV = \frac{s_x}{\bar{x}}$$

The CV is usually expressed as a percentage, in which case $CV = \frac{s_x}{\bar{x}} \times 100$. Thus a CV of 33 percent implies that the standard deviation of the sample values is 33 percent of the mean of the same distribution.

As an illustration of the use of the CV as a descriptive statistic, suppose that we wish to find out whether height is more variable than weight in the same population. For this purpose we have, for instance, the following data obtained from 150 children in a community:

           Mean       SD        CV
Height     40 inch    5 inch    0.125
Weight     10 kg      2 kg      0.20

Since the coefficient of variation for weight is greater than that for height, we would conclude that weight has more variability than height in this population.

Coefficient of mean deviation: Another relative measure is the coefficient of mean deviation. As the mean deviation can be computed from the mean, the median, the mode, or any arbitrary value, a general formula for the coefficient of mean deviation is

$$\text{Coefficient of mean deviation from } A = \frac{\text{Mean deviation from } A}{A} \times 100 = \frac{MD(A)}{A} \times 100$$

where $A$ is the mean, median, mode, or any other arbitrary value; the choice depends on the type of average used in computing the mean deviation.

Coefficient of range: The coefficient of range is the relative measure corresponding to the range and is obtained from

$$\text{Coefficient of range} = \frac{L - S}{L + S} \times 100$$

where $L$ and $S$ are respectively the largest and the smallest observations in the data set.

Coefficient of quartile deviation: The coefficient of quartile deviation is computed from the first and third quartiles using

$$\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100$$
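The relative measures are simple ratios, as the following sketch (ours) illustrates using the height/weight summary figures, the range example, and the Table 1.1 quartiles from earlier in the chapter.

```python
def cv(sd, mean):
    """Coefficient of variation: SD relative to the mean, as a percentage."""
    return sd / mean * 100

def coefficient_of_range(L, S):
    """(L - S) / (L + S) * 100 for the largest and smallest observations."""
    return (L - S) / (L + S) * 100

def coefficient_of_quartile_deviation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1) * 100."""
    return (q3 - q1) / (q3 + q1) * 100

# height vs. weight comparison from the text (mean, SD pairs)
print(cv(5, 40))    # 12.5 %  -> height
print(cv(2, 10))    # 20.0 %  -> weight: relatively more variable

print(coefficient_of_range(210, 20))               # 82.6...  (range example data)
print(coefficient_of_quartile_deviation(51, 55))   # 3.77...  (Q1 and Q3 from Table 1.1)
```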
1.2.4 Shape characteristics of a distribution

1.2.4.1 Skewness

The term skewness refers to a lack of symmetry. The lack of symmetry in a distribution is always judged with reference to a normal distribution, which is always symmetrical. A lack of symmetry produces an asymmetric distribution; in such cases we call the distribution skewed, or say that skewness is present. Skewness may be either positive or negative: when the skewness of a distribution is positive (negative), the distribution is called positively (negatively) skewed. Absence of skewness makes a distribution symmetrical. It is important to emphasize that the skewness of a distribution cannot be determined simply by inspection.

a) Symmetrical distribution: This type of distribution is known as the normal or Gaussian distribution. One would obtain such a distribution with data such as heights, weights, and examination scores. For a symmetrical distribution, Mean = Median = Mode.

Figure 1.1 Location of mean, median and mode in a symmetrical distribution

b) Positively skewed distribution: In this distribution, the long tail to the right indicates the presence of extreme values at the positive end of the distribution, which pulls the mean to the right. Such distributions occur with data such as family size, female age at marriage, and wages of employees. For a positively skewed distribution, Mean > Median > Mode.

Figure 1.2 Location of mean, median and mode in an asymmetrical (positively skewed) distribution

c) Negatively skewed distribution: In a negatively skewed distribution, the mean is pulled in the negative direction. Reaction times in an experiment, daily maximum temperatures for a month in winter, and similar data result in negatively skewed distributions. For a negatively skewed distribution, Mean < Median < Mode; the frequency curve is the mirror image of the one in Figure 1.2, with the long tail to the left.

Measures of skewness: In studying the skewness of a distribution, the first thing we want to know is whether the distribution is positively or negatively skewed; the second is the degree of skewness. The simplest measure is Pearson's coefficient of skewness:

$$\text{Pearson's coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$

If Mean > Mode, the skewness is positive; if Mean < Mode, the skewness is negative; if Mean = Mode, the skewness is zero.

In many instances the mode cannot be uniquely defined and the above formula cannot be applied. It has been observed that for a moderately skewed distribution the following relationship holds:

$$\text{Mean} - \text{Mode} = 3(\text{Mean} - \text{Median})$$

Using this relation, Pearson's coefficient of skewness assumes the modified form

$$\text{Pearson's coefficient of skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$

Another measure of skewness, due to Bowley, is defined in terms of the quartiles. Since in a symmetrical distribution the first quartile $Q_1$ and the third quartile $Q_3$ are equidistant from the median $Q_2$, any difference between these distances is a reasonable basis for measuring skewness. Thus, in terms of the three quartiles $Q_1$, $Q_2$ and $Q_3$, Bowley's quartile coefficient of skewness is

$$\text{Quartile coefficient of skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

This is evidently a pure number lying between -1 and +1, and it is zero for a symmetrical distribution.

If $Q_3 - Q_2 = Q_2 - Q_1$, the quartile skewness is 0 and the distribution is symmetrical.
If $Q_3 - Q_2 > Q_2 - Q_1$, the quartile skewness is positive and the distribution is positively skewed.
If $Q_3 - Q_2 < Q_2 - Q_1$, the quartile skewness is negative and the distribution is negatively skewed.
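The two coefficients of skewness can be computed as follows (a sketch of ours in Python; the family-size data are hypothetical, and NumPy's percentile interpolation may give slightly different quartiles than the counting rules used earlier in this chapter).

```python
import numpy as np

def pearson_skewness(x):
    """Modified Pearson coefficient: 3 * (mean - median) / SD."""
    x = np.asarray(x, dtype=float)
    return 3 * (x.mean() - np.median(x)) / x.std(ddof=1)

def bowley_skewness(x):
    """(Q3 + Q1 - 2*Q2) / (Q3 - Q1); always lies between -1 and +1."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return (q3 + q1 - 2 * q2) / (q3 - q1)

family_sizes = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 9]   # hypothetical, with a long right tail
print(pearson_skewness(family_sizes))   # > 0 : positively skewed
print(bowley_skewness(family_sizes))    # > 0 : positively skewed
```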
1.2.4.2 Kurtosis

There is considerable variation among symmetrical distributions; for instance, they can differ markedly in their peakedness. This is what we call kurtosis. Kurtosis is the degree of peakedness of a distribution, usually taken in relation to a normal distribution. A curve having a relatively higher peak than the normal curve is known as leptokurtic. On the other hand, if the curve is more flat-topped than the normal curve, it is called platykurtic. The normal curve itself, being neither too peaked nor too flat-topped, is called mesokurtic.

Figure 1.3 Illustration of kurtosis

Measures of kurtosis: The most important measure of kurtosis, based on the second and fourth moments, is $\beta_2$, defined as

$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$

where $\mu_2$ and $\mu_4$ are, respectively, the second and fourth moments about the mean. This measure is a pure number and is always positive. For the normal distribution, $\beta_2 = 3$. When $\beta_2$ is greater than 3, the curve is more peaked than the normal curve and is leptokurtic; when $\beta_2$ is less than 3, the curve is less peaked than the normal curve and is platykurtic. In other words,

If $\beta_2 - 3 > 0$, the distribution is leptokurtic.
If $\beta_2 - 3 < 0$, the distribution is platykurtic.
If $\beta_2 - 3 = 0$, the distribution is mesokurtic.
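A sketch (ours) of the moment measure $\beta_2$; applied to simulated samples it returns a value near 3 for normal data and a smaller value for flat-topped (uniform) data, as the classification above predicts.

```python
import numpy as np

def beta2(x):
    """Moment measure of kurtosis: beta_2 = mu_4 / mu_2**2,
    where mu_2 and mu_4 are the 2nd and 4th moments about the mean."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    mu2 = np.mean(d ** 2)
    mu4 = np.mean(d ** 4)
    return mu4 / mu2 ** 2

rng = np.random.default_rng(1)
normal_like = rng.normal(size=100_000)
flat = rng.uniform(-1, 1, size=100_000)
print(beta2(normal_like))   # close to 3  -> mesokurtic
print(beta2(flat))          # about 1.8   -> platykurtic (flatter than the normal curve)
```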
1.3 DATA EXPLORATION WITH GRAPHICAL MEANS

In addition to presenting statistical data in tabular form and through descriptive statistics, one can present data through visual aids, that is, graphs and diagrams. Such presentation gives a visual impression of the entire data set, and the information presented is therefore easily understood. When frequency distributions are constructed primarily to condense large sets of data into an easily digested form, graphical and diagrammatic presentations are preferred. The most common forms of graphs and diagrams are the bar diagram, pie chart, histogram, line diagram, scatter diagram, frequency polygon, and ogive. Bar diagrams and pie charts are usually constructed for categorical data, and the others for interval-scale data.

1.3.1 Bar diagram

A bar diagram, also known as a bar chart, is a form of presentation in which the frequencies are represented by rectangles separated along the horizontal axis and drawn as bars of convenient width. A bar diagram consists of horizontal or vertical bars of equal width and of lengths proportional to the magnitudes the bars represent. In presenting the bars there is no need for a continuous scale.

Example: Health personnel from 150 rural health centres were asked how frequently they had visited their respective areas during the last week. The responses were recorded as rarely, occasionally, frequently, and never. The following table displays the frequency of responses in each category:

Table 1.5 Relative frequency distribution of health professional data

Response       Frequency   Relative frequency
Frequently     49          0.327
Occasionally   71          0.473
Rarely         24          0.160
Never          6           0.040
Total          150         1.00

The vertical and horizontal bar diagrams constructed from these data are shown in the figures below.

Figure 1.4 Vertical bar diagram for health centre visit data

Figure 1.5 Horizontal bar diagram for health centre visit data

Component bar diagram: A component bar diagram is a good device for displaying categorical data. In such a diagram, the total values as well as the various components constituting the total are shown. Each part of the bar represents one component, while the whole bar represents the total value. The component parts are coloured or shaded differently to make them distinct.

Example: Given below are the populations of two regions by sex. Display them with a component bar diagram.

Region   Population in '000         Percent of population
         Male        Female         Male    Female
A        11228       10637          51.3    48.7
B        17634       16306          52.0    48.0

Figure 1.6 Component bar diagram for population data

Multiple bar chart: Multiple bar charts are frequently used to present statistical data. They are primarily used to compare two or more characteristics corresponding to a common variate value. Multiple bar charts consist of grouped bars whose lengths are proportional to the magnitudes of the characteristics. The bars of a multiple chart are usually placed adjacent to each other without any space between them; different shading or colours can be used to distinguish one group of bars from another. Data such as population values for different regions, literacy rates by sex, or volume of exports by type of product can be represented by a multiple bar chart.

Example: Given below are the education levels of the female population of Bangladesh by administrative division. Display them with a multiple bar chart.

             Percent of females with
Division     No education   Primary education   Secondary education
Barisal      43.9           34.4                21.7
Chittagong   41.8           37.0                21.2
Dhaka        45.9           35.3                18.8
Khulna       39.6           41.2                19.2
Rajshahi     48.5           37.8                13.7
Sylhet       52.6           36.1                11.3

Figure 1.7 Multiple bar diagram for the education level data

1.3.2 Pie chart

A pie chart, also known as a pie diagram, is an effective way of presenting percentage parts when the whole quantity is taken as 100. It is a useful device for presenting categorical data. The pie chart consists of a circle sub-divided into sectors whose areas are proportional to the various parts into which the whole quantity is divided. The sectors may be shaded or coloured differently to show their individual contributions to the whole.

Table 1.6 Health centre visit data for constructing a pie diagram

Response     Frequency   Relative frequency (%)   Angle of the sector
Frequent     49          32.7                     117.6
Occasional   71          47.3                     170.4
Rare         24          16.0                     57.6
Never        6           4.0                      14.4
Total        150         100.0                    360.0

Figure 1.8 Simple pie diagram

Figure 1.9 Three-dimensional pie diagram
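The text produces its charts with SPSS; as an alternative sketch (ours, using Python's matplotlib), the bar diagram and pie diagram for the health centre visit data of Tables 1.5 and 1.6 can be drawn as follows.

```python
import matplotlib.pyplot as plt

responses = ["Frequently", "Occasionally", "Rarely", "Never"]
counts = [49, 71, 24, 6]                     # Table 1.5 / 1.6

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.bar(responses, counts)                   # simple vertical bar diagram
ax1.set_ylabel("No. of visits")
ax1.set_title("Bar diagram of health centre visits")

ax2.pie(counts, labels=responses, autopct="%1.1f%%")   # pie chart with percentage labels
ax2.set_title("Pie diagram of health centre visits")

plt.tight_layout()
plt.show()
```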
1.3.3 Histogram

The most common form of graphical representation of a frequency distribution is the histogram. A histogram is constructed by placing the class boundaries on the horizontal axis of a graph and the frequencies on the vertical axis. Each class is shown on the graph by drawing a rectangle whose base is the class boundary and whose height is the corresponding class frequency. When the class widths have to be unequal because of some particular feature of the data set, the method of constructing the histogram should be modified accordingly.

Example: The observed relative humidity (%) at a certain location over a period of 100 days is given below. Construct a histogram from these data.

61.5  77    60.5  53    58.5  41    71    48.5  40.5  46
73    78    55    50    59    36.5  68    59    38.5  56.5
70.5  72.5  50.5  68    55.5  39.5  63    62    48.5  60
66    72    47.5  62    65    38.5  54.5  51    29.5  34.5
64.5  62    65    42.5  65    44.5  71.5  54.5  34.5  43
68.5  65    63.5  53    53.5  50.5  84.5  52    34    41.5
75.5  66.5  51    48    46    39    73    60    48.5  31
63.5  50.5  54.5  66    48.5  41.5  57.5  51    37    30.5
62    58.5  55.5  62.5  47    57.5  56    36.5  40.5  50
68    61.5  58    54.5  52.5  69.5  51    40.5  55.5  49.5

Enter the data directly into an SPSS data file, or copy them from another compatible file into an SPSS data file, and name the variable 'rh'. To obtain a histogram using the SPSS package, first click on the Analyze menu in the data file, then click on Explore under Descriptive Statistics. Bring the variable 'rh' into the Dependent List: box. In the Display options, select Plots. Then click the Plots... icon; a new window called Explore: Plots appears. Select None in the Boxplots box and Histogram in the Descriptive box. Then click Continue and OK to obtain the histogram shown below.

Figure 1.10 Histogram of the relative humidity data (Mean = 54.8, Std. Dev. = 12.01, N = 100)

1.3.4 Stem and leaf plot

The stem and leaf plot is a graphical technique for representing quantitative data that can be used to examine the shape of a frequency distribution, the range of the values, the points of concentration of the values, and the presence of any extreme values or outliers. Compared with the other graphical techniques presented so far, the stem and leaf plot is an easy and quick way of displaying data. A stem and leaf plot gives a histogram-like picture of a frequency distribution.

Example: Use a stem and leaf plot to display the following marks obtained by 20 students in a statistics test:

84  17  38  45  47  53  76  54  75  22  66  65  55  54  51  33  39  19  54  72

Solution: The lowest score is 17 and the highest score is 84. For stem and leaf plots, classes must be of equal length. We will use the first or leading digit (tens) of each score as the stem and the trailing digit (units) as the leaf. For example, for the score 84 the leading digit is 8 and the trailing digit is 4. In a stem and leaf plot, the leading digit (stem) determines the row in which the score is placed, and the trailing digits of the scores are then written in the appropriate rows. In this way, each score is recorded in the stem and leaf plot.

Frequency   Stem & Leaf
 2.00       1 .  79
 1.00       2 .  2
 3.00       3 .  389
 2.00       4 .  57
 6.00       5 .  134445
 2.00       6 .  56
 3.00       7 .  256
 1.00       8 .  4

Stem width: 10
Each leaf: 1 case(s)

You can obtain a stem and leaf plot with the SPSS software. Click Analyze on the menu bar, then click Explore under Descriptive Statistics. Bring the variable, say 'mark' in this case, for which you would like a stem and leaf plot into the Dependent List: box. In the Display options, select Plots. Then click the Plots... icon; a new window called Explore: Plots appears. Select None in the Boxplots box and Stem-and-leaf in the Descriptive box. Then click Continue and OK to obtain the stem and leaf plot shown above.
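As an alternative to the SPSS procedure, the stem and leaf display above can be reproduced with a few lines of Python (a sketch of ours); a histogram such as Figure 1.10 can be drawn in the same spirit with matplotlib's hist function.

```python
from collections import defaultdict

marks = [84, 17, 38, 45, 47, 53, 76, 54, 75, 22,
         66, 65, 55, 54, 51, 33, 39, 19, 54, 72]

# group trailing digits (leaves) by leading digit (stem)
stems = defaultdict(list)
for m in sorted(marks):
    stems[m // 10].append(m % 10)

print("Frequency   Stem | Leaf")
for stem in sorted(stems):
    leaves = "".join(str(d) for d in stems[stem])
    print(f"{len(leaves):>9}   {stem:>4} | {leaves}")
```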
1.3.5 Frequency polygon

A frequency polygon provides an alternative to the histogram for graphically presenting the distribution of a continuous variable. The class mid-values are placed on the horizontal axis and the frequencies on the vertical axis. However, instead of using rectangles as in the histogram, points are plotted directly above the class mid-points at heights corresponding to the class frequencies, and the points are then joined. Classes of zero frequency are added at each end of the frequency distribution so that the frequency polygon touches the horizontal axis at both ends of the graph; this makes the frequency polygon a closed figure.

Example: The weekly expenditure in dollars by 80 students in a certain city is shown below. Construct a frequency polygon using these data.

Expenditure   No. of students
4.5-9.5       8
9.5-14.5      29
14.5-19.5     27
19.5-24.5     12
24.5-29.5     4
Total         80

Figure 1.11 Frequency polygon

The histogram and the frequency polygon are equally good techniques for presenting continuous data. The histogram is more often used when a single distribution is presented, while the frequency polygon is largely used for comparing two or more distributions.

Ogive (cumulative frequency polygon): A graph of the cumulative frequency distribution or the cumulative relative frequency distribution is called an ogive. An ogive can be of either the less-than or the more-than type.

1.3.6 Scatter diagram

Scatter diagrams are useful for displaying information on two quantitative variables that are believed to be inter-related. Height and weight, age and height, income and expenditure, and rainfall and runoff are examples of data sets that are assumed to be related to each other and can be displayed by scatter diagrams.

Example: Given below are the ages in years at first marriage of 20 couples. Construct a scatter diagram for the two data sets.

Husband's age   Wife's age   Husband's age   Wife's age   Husband's age   Wife's age
19              15           39              32           21              14
26              17           39              30           26              19
27              19           26              19           28              19
29              21           33              27           29              21
36              28           39              34           37              29
40              36           25              21           31              27
35              32           40              38           -               -

Figure 1.12 Scatter diagram for the age at marriage data

1.3.7 Line graph

A line graph is particularly useful for numerical data that we wish to display in time-series form. Such data could be the production of jute in a region over a period of 20 years, the export of raw materials from a country over 40 years, the annual rainfall at a location over 100 years, or the daily evaporation from a lake over 50 years. The growth of the population of Bangladesh since 1901 is given in the table below.

Table 1.7 Census population of Bangladesh in millions

Year   Population
1901   28.9
1911   31.6
1921   33.2
1931   35.6
1941   42.0
1951   44.2
1961   55.2
1974   76.4
1981   89.9
1991   111.5

A line graph for these data is drawn in Figure 1.13. From the line graph one can easily see that the population of Bangladesh increased substantially from 1901 to 1991, but that the increase was not uniform throughout the period: the increase was slower up to 1941 and thereafter proceeded at a much higher rate than in the earlier period.

Figure 1.13 Time series plot of the total population in Bangladesh
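A closing sketch (ours, using matplotlib) draws the scatter diagram of the couples' ages and the line graph of the census population from Table 1.7.

```python
import matplotlib.pyplot as plt

# Husband's and wife's age at first marriage (20 couples, scatter diagram example)
husband = [19, 39, 21, 26, 39, 26, 27, 26, 28, 29, 33, 29, 36, 39, 37, 40, 25, 31, 35, 40]
wife    = [15, 32, 14, 17, 30, 19, 19, 19, 19, 21, 27, 21, 28, 34, 29, 36, 21, 27, 32, 38]

# Census population of Bangladesh in millions (Table 1.7)
years = [1901, 1911, 1921, 1931, 1941, 1951, 1961, 1974, 1981, 1991]
population = [28.9, 31.6, 33.2, 35.6, 42.0, 44.2, 55.2, 76.4, 89.9, 111.5]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(husband, wife)                 # scatter diagram of the paired ages
ax1.set_xlabel("Husband's age")
ax1.set_ylabel("Wife's age")

ax2.plot(years, population, marker="o")    # line graph of the time series
ax2.set_xlabel("Census year")
ax2.set_ylabel("Population (million)")

plt.tight_layout()
plt.show()
```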