Chapter I
Introduction to Statistics
1.1 BASIC CONCEPTS OF STATISTICS
1.1.1 Definition of statistics
It is not precisely known how the word "statistics" originated. Most people believe that the term has been derived from the Latin word "status", meaning political state. Others believe that the word originated from the Italian word "statista", the French word "statistique", or the German word "statistik". This background suggests that the term statistics has its origin in ancient times.
The science of statistics has developed gradually and its field of applications has
widened with the passage of time. Its continued use and importance are considered to
be indispensable in all spheres of life.
Statistics is a branch of scientific methodology. It deals with the collection,
classification, description, and interpretation of data through scientific procedures. It
is difficult to define statistics in a few words, since its dimension, scope, function,
use, and importance are constantly changing with time. No formal definition has
emerged so far and no definition is perhaps universally accepted. Statistics are simply
the facts and figures about any phenomenon or event - whether it relates to
population, production, income, expenditure, sales, birth or death, or any other
quantitative measures. Statistics can be defined as numerical facts (descriptive
statistics) and as a subject (inferential statistics). A few formal definitions of statistics
are given below:
Statistics are measurements, enumeration or estimates of natural or social
phenomena systematically arranged so as to exhibit their interrelationships
(Conor, 1937).
By statistics we mean aggregates of facts affected to a marked extent by a
multiplicity of causes, numerically expressed, enumerated or estimated
according to a reasonable standard of accuracy, collected in a systematic
manner for a pre-determined purpose, and placed in relation to each other
(Secrist, 1933).
Statistics is the science, pure and applied, creating, developing and applying
techniques, by which the uncertainty of inductive inferences may be evaluated
(Steel et al., 1997).
1.1.2 Classification of statistics
i) Pure Statistics or Mathematical Statistics: Pure Statistics deals with the theory of
statistics. Research tools are usually developed in this branch of statistics. These tools
are then applied to specific problems in different fields - such as physics and
chemistry, anthropology and biology, etc. - of Applied Statistics. Pure Statistics not
only creates new tools but also tries to perfect existing ones by placing them on more
and more rigorous foundations, and in this process, it depends much on recent
developments in Pure Mathematics.
ii) Applied Statistics: Applied Statistics deals with the application of statistical
methods to specific problems. It continues to find new uses for the existing tools and
for the new tools that are being created.
Pure and Applied Statistics interact and thus enrich each other. Pure Statistics
continues to develop newer and stronger tools for Applied Statistics, while Applied
Statistics continues to present new and challenging problems for Pure Statistics.
1.1.3 Characteristics of statistics
Statistics possess the following characteristic features:
1) Statistics deals with populations or aggregates of individuals, rather than with individuals alone.
This means that statistics does not deal with a single figure, since a single figure is incapable of providing any information beyond itself.
2) Statistics is concerned with the reduction of data, or with obtaining correct facts from figures.
3) Statistics deals with variation.
4) Statistics deals only with numerically specified populations.
Qualitative statements, such as fair, good, medium, or poor, are not statistics unless they can be expressed in numerical form.
5) Statistics deals with populations which occur in nature and are subject to a large number of random forces.
This means that statistics are not the effect of a single factor. For example, the weight of an individual depends primarily on his/her age and height, and the production of rice depends on land fertility, irrigation, input use, etc.
6) Statistics collected should be of reasonable standards of accuracy.
Statistics should be collected in a systematic and scientific manner, and the collected statistics should enable decisions to be reached regarding the population of interest.
7) The logic used in statistical inference is inductive, so statistical inferences are uncertain.
We draw inferences about the population from the information contained in a sample, which is only a part of the population. We thus pass from the particular to the general, so there is a chance of our conclusions being untrue.
8) Statistics should be obtained for pre-determined purposes.
There should be a clear, well-defined and unambiguous statement regarding the purpose of data collection.
9) Statistics collected should allow comparison with other data.
Statistical data should be collected with a view to comparison with data of a similar nature collected in different settings. Otherwise, no conclusion can be drawn regarding their quality, usefulness or importance, and they cannot be used for the decision-making purpose for which they were collected.
10) Statistical results might lead to fallacious conclusions if quoted out of context or manipulated.
1.1.4 Population and sample
The essential purpose of statistics is to describe the numerical properties of populations and to draw inferences about populations from samples. In statistics, the concepts of population and sample are of immense importance.
A population is a complete set of individuals, objects, or measurements having some
common characteristics, whereas a sample is a subset or part of the population
selected to represent the population. For example, a sample of size n = 2500
individuals was selected randomly from a population of size N = 60 million to arrive
at a decision regarding the preference of a prime ministerial candidate in a country.
1.2 DESCRIPTIVE STATISTICS
1.2.1 Central tendency
There are two obvious features of the data that can be characterized in a simple form
and yet give a very meaningful description: central tendency and dispersion. The
central tendency is measured by averages; these describe the point about which the
various observed values cluster.
There are several different measures of central tendency. Each is an indicator of what
a typical value is, but each employs a different definition of ‘typical’. These measures
are collectively called statistical averages. The purpose of a statistical average is to
represent the central value of a distribution and also to afford a basis of comparison
with other distributions of similar nature. Among the several averages, the most
commonly used averages are Mean, Median and Mode.
1.2.1.1 Mean
Arithmetic mean: The arithmetic mean is the most commonly used central value of a distribution. It is also referred to simply as the mean. The arithmetic mean is the sum of a set of observations, positive, negative or zero, divided by the number of observations. If we have n real numbers x_1, x_2, x_3, ..., x_n, their arithmetic mean, denoted by \bar{x}, can be expressed as:
\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}
We can also write the mean \bar{x} as follows:
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
Clearly, the mean \bar{x} may be positive, negative, or even zero depending on the nature of the values included in its computation.
Computing the arithmetic mean for grouped data: When the arithmetic mean is computed from a grouped distribution, the mid-point of each class is taken as the representative value of that class. The various mid-values are multiplied by their respective class frequencies, the products are added, and the sum of the products is then divided by the total number of observations to obtain the arithmetic mean. Symbolically, if z_1, z_2, z_3, ..., z_k are the mid-values and f_1, f_2, f_3, ..., f_k are the corresponding frequencies, where the subscript k stands for the number of classes, then the mean \bar{z} is
\bar{z} = \frac{\sum f_i z_i}{\sum f_i}
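As a quick illustration of the two formulas above, the following Python sketch computes the mean for raw data and for grouped data from class mid-points; the data values are illustrative only.

```python
def mean_raw(values):
    """Arithmetic mean of ungrouped observations: sum of values over their count."""
    return sum(values) / len(values)

def mean_grouped(midpoints, frequencies):
    """Arithmetic mean of a grouped distribution: sum(f_i * z_i) / sum(f_i)."""
    total_frequency = sum(frequencies)
    weighted_sum = sum(f * z for z, f in zip(midpoints, frequencies))
    return weighted_sum / total_frequency

if __name__ == "__main__":
    print(mean_raw([3, 3, 4, 4, 5, 5, 6, 6, 7, 7]))         # 5.0
    print(mean_grouped([3, 5, 7, 8, 9], [2, 3, 2, 2, 1]))   # 6.0
```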
Geometric mean: The geometric mean is defined as the nth positive root of the product of n observations. Symbolically,
G = (x_1 x_2 x_3 \cdots x_n)^{1/n}
This average is used when dealing with observations each of which bears an approximately constant ratio to the preceding one, e.g., in averaging rates of growth (increase or decrease) of a statistical population. If the n non-zero and positive variate-values x_1, x_2, ..., x_n occur f_1, f_2, ..., f_n times, respectively, then the geometric mean G of the set of observations is defined by:
G = \left[ x_1^{f_1} x_2^{f_2} \cdots x_n^{f_n} \right]^{1/N} = \left[ \prod_{i=1}^{n} x_i^{f_i} \right]^{1/N}
\log G = \frac{1}{N} \sum_{i=1}^{n} f_i \log x_i, \quad \text{where } N = \sum_{i=1}^{n} f_i
Thus the logarithm of the geometric mean is the weighted mean of the values \log x_i with the frequencies f_i as weights.
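A small Python sketch of the simple and frequency-weighted geometric means defined above; the values are illustrative only.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive observations: (x1 * x2 * ... * xn) ** (1/n)."""
    n = len(values)
    return math.exp(sum(math.log(x) for x in values) / n)

def geometric_mean_grouped(values, freqs):
    """Weighted form: log G = (1/N) * sum(f_i * log x_i)."""
    big_n = sum(freqs)
    log_g = sum(f * math.log(x) for x, f in zip(values, freqs)) / big_n
    return math.exp(log_g)

if __name__ == "__main__":
    print(round(geometric_mean([2, 8]), 4))                  # 4.0
    print(round(geometric_mean_grouped([2, 8], [3, 1]), 4))  # (2**3 * 8) ** 0.25 = 2.8284
```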
There are several other measures of averages, such as harmonic mean, weighted
arithmetic mean, quadratic mean, trimmed mean, and trimean, which are occasionally
used.
1.2.1.2 Median
The median is the value such that, when the observations are arranged in order of magnitude, either ascending or descending, 50% of them lie on each side of it. The implication of this definition is that the median is the middle value of the observations, such that the number of observations above it is equal to the number of observations below it.
Computing the median for raw data: Suppose a family has seven members whose ages in years are 12, 7, 2, 34, 17, 21 and 19. To compute the median of these numbers, we arrange them in either ascending or descending order. In either ordering, the middle value is the median. Arranging them in both orders, the series is
Ascending order: 2, 7, 12, 17, 19, 21 and 34.
Descending order: 34, 21, 19, 17, 12, 7 and 2.
The middle value in either ordering is the fourth value, which in this case is 17.
How would you deal with the problem when the number of observations is even?
In general, if x_1, x_2, x_3, ..., x_n constitute a series of n observations of the variable X arranged in order of magnitude and the number of observations is odd, the median, henceforth abbreviated M_e, is the number occurring in the centre of the series and is determined by taking the \frac{1}{2}(n+1)th value of the observations. Symbolically,
M_e = X_{(n+1)/2}
If n is even, the median is given by:
M_e = \frac{1}{2}\left( X_{n/2} + X_{n/2 + 1} \right)
The following steps are involved in the computation of the median from ungrouped data:
• List the observations in order of magnitude.
• Count the number of observations. This is n.
• The median is the value that corresponds to observation number \frac{1}{2}(n+1) if n is odd, and the average of the values at observation numbers \frac{n}{2} and \frac{n}{2}+1 if n is even.
Example: The weights of 11 mothers in kg were recorded as follows:
47, 44, 42, 41, 58, 52, 55, 39, 40, 43 and 61
To obtain the median weight, we arrange the values in ascending order. When we do so, the series becomes 39, 40, 41, 42, 43, 44, 47, 52, 55, 58 and 61. Since n is odd, the median is the value at observation number \frac{1}{2}(n+1) = \frac{1}{2}(11+1) = 6, i.e., the 6th observation. On counting, the 6th observation is 44, and hence it is the median. If the series were 39, 40, 41, 42, 43, 44, 47, 55, 58 and 61, in which case n = 10, an even number, the median would be the average of the 5th and 6th observations. This value is \frac{1}{2}(43 + 44) = 43.5.
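The ungrouped-data rule can be written directly in Python; a minimal sketch using the mothers' weights from the example above.

```python
def median(values):
    """Median of raw data: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

if __name__ == "__main__":
    weights = [47, 44, 42, 41, 58, 52, 55, 39, 40, 43, 61]
    print(median(weights))                                     # 44
    print(median([39, 40, 41, 42, 43, 44, 47, 55, 58, 61]))    # 43.5
```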
Computing the median for grouped data: The algebraic expression for the median of a grouped frequency distribution is:
M_e = L_o + \frac{h}{f_o}\left( \frac{n}{2} - F \right)
where,
L_o = Lower class boundary of the median class
h = Width of the median class
f_o = Frequency of the median class
F = Cumulative frequency of the pre-median class
The following steps are involved in computing the median from grouped data:
• Compute the less-than-type cumulative frequencies.
• Determine n/2, one-half of the total number of cases.
• Locate the median class, the first class for which the cumulative frequency is more than n/2.
• Determine the lower limit of the median class. This is L_o.
• Sum the frequencies of all classes prior to the median class. This is F.
• Determine the frequency of the median class. This is f_o.
• Determine the class width of the median class. This is h.
Now you have all the quantities needed to compute the median. Putting them into the above formula, you can calculate the median.
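A minimal Python sketch of these steps; the class boundaries and frequencies used in the example are illustrative, not from the text.

```python
def grouped_median(boundaries, freqs):
    """Median of a grouped distribution: L_o + (h / f_o) * (n/2 - F).

    boundaries: class boundaries [b_0, b_1, ..., b_k] for k classes.
    freqs: the k class frequencies.
    """
    n = sum(freqs)
    half = n / 2
    cum = 0                                             # cumulative frequency F below the median class
    for i, f in enumerate(freqs):
        if cum + f >= half:                             # first class whose cumulative frequency reaches n/2
            lower = boundaries[i]                       # L_o
            width = boundaries[i + 1] - boundaries[i]   # h
            return lower + (width / f) * (half - cum)
        cum += f

if __name__ == "__main__":
    # Illustrative classes 10-20, 20-30, 30-40, 40-50 with frequencies 4, 6, 8, 2
    print(grouped_median([10, 20, 30, 40, 50], [4, 6, 8, 2]))   # 30.0
```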
1.2.1.3 Mode
The mode is the value of a distribution for which the frequency is maximum. In other words, the mode is the value of a variable which occurs with the highest frequency. If a population consists of 75 percent Hindus, 15 percent Muslims, and 10 percent followers of other religions, the modal category is Hindu, since it has the most people.
To determine the value of the mode for a group frequency distribution, it is necessary
to identify the modal class, in which the mode is located. In general, a modal class is
the one with the highest frequency of the distribution. Once the modal class is
identified, the next step is to locate the mode within that class. The mid-point of the modal class is usually taken as the modal value of the distribution.
1.2.2 Measures of location
Measures that are allied to the median include the quartiles, deciles and percentiles, because they are also based on position in a series of observations. These measures are referred to as measures of location rather than measures of central tendency, as they describe the position of one score relative to the others rather than the whole set of data.
1.2.2.1 Quartiles
Quartiles are the variate values which divide the total frequency into four equal parts. There are three quartiles in a data series, usually denoted by Q_1, Q_2 and Q_3. Q_2 is identical with the median. Q_1 and Q_3 are the values at or below which one-fourth and three-fourths of all items in a series fall, respectively. For a grouped frequency distribution, the method of estimating the first and third quartiles is similar to that of estimating the median:
Q_i = L_i + \frac{h}{f_i}\left( \frac{in}{4} - F \right), \quad i = 1, 2, 3
where,
L_i = Lower limit of the ith quartile class
n = Total number of observations in the distribution
h = Class width of the ith quartile class
f_i = Frequency of the ith quartile class
F = Cumulative frequency of the class prior to the ith quartile class
Table 1.1 Distribution of 70 students according to the marks they obtained in a class test
Marks    No. of students    Cumulative frequency
40       6                  6
43       11                 17
51       19                 36
55       17                 53
60       13                 66
63       4                  70
Total    70                 -
To obtain Q_1 and Q_3, we cumulate the frequencies as shown in the third column of the table above. Since n/4 = 70/4 = 17.5 is not an integer, the first quartile will be the 18th observation (the next higher integer after 17.5). From the cumulative frequencies, Q_1 is 51. Since 3n/4 = 52.5, Q_3 will be the 53rd value, which is 55.
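A small Python sketch of this positional rule applied to the Table 1.1 marks; the 'next higher integer' rule is the one used in the text.

```python
import math

def quartile_from_frequencies(values, freqs, i):
    """i-th quartile (i = 1, 2, 3) of a discrete frequency table using the position rule i*n/4."""
    n = sum(freqs)
    position = i * n / 4
    target = math.ceil(position) if position != int(position) else int(position)
    cum = 0
    for value, f in zip(values, freqs):
        cum += f
        if cum >= target:
            return value

if __name__ == "__main__":
    marks = [40, 43, 51, 55, 60, 63]
    counts = [6, 11, 19, 17, 13, 4]
    print(quartile_from_frequencies(marks, counts, 1))   # 51
    print(quartile_from_frequencies(marks, counts, 3))   # 55
```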
1.2.2.2 Percentiles
The p-th percentile of a data set is a value such that at least p percent of the items take on this value or less and at least (100 - p) percent of the items take on this value or more. Percentiles are thus the values which divide the distribution into 100 equal parts. There are therefore 99 percentiles in a distribution, conveniently denoted by p_1, p_2, p_3, ..., p_99. In terms of percentiles, the median is the 50th percentile. This means that p_50 = Q_2 = M_e. The 25th and 75th percentiles are the first and third quartiles, respectively.
Admission test scores for colleges and universities are frequently reported in terms of
percentiles. For example, suppose an applicant has a raw score of 54 in the oral
portion of an admission test. If the raw score of 54 corresponds to the 70th percentile,
it is easily seen that approximately 70 percent of the students had a score less than this
individual and approximately 30 percent scored better.
With ungrouped data, the percentile takes either the value half-way between two observations or the value of one of the observations, depending on whether pn/100 is an integer or not. Consider the observations 11, 14, 17, 23, 27, 32, 40, 49, 54, 59, 71 and 80. To determine the 29th percentile, p_{29}, we note that \frac{1}{100}(29 \times 12) = 3.48, which is not an integer. Thus the next higher integer, 4, determines the 29th percentile value. On inspection, p_{29} = 23.
Percentiles for grouped data: The ith percentile of a grouped distribution of n observations may be obtained using the following formula:
p_i = L_i + \frac{h}{f_i}\left( \frac{in}{100} - F \right), \quad i = 1, 2, 3, \ldots, 99
where,
L_i = Lower limit of the ith percentile class
f_i = Frequency of the ith percentile class
h = Width of the class interval
F = Cumulative frequency of the class prior to the ith percentile class
Table 1.2 Number of births to women by current age
Age in years    Number of births    Cumulative number of births
14.5-19.5       677                 677
19.5-24.5       1908                2585
24.5-29.5       1737                4332
29.5-34.5       1040                5362
34.5-39.5       294                 5656
39.5-44.5       91                  5747
44.5-49.5       16                  5763
All ages        5763                -
As an illustration, the 30th percentile for the distribution is determined from \frac{in}{100} = (30 \times 5763)/100 = 1728.9. Looking at the cumulative frequencies in the table, we find that this value falls in the range 19.5-24.5. The other required values are:
L_{30} = 19.5, f_{30} = 1908, h = 5, and F = 677
Hence, p_{30} = 19.5 + \frac{5}{1908}(1728.9 - 677) = 22.25
Percentile rank: The percentile rank of any score or observation is defined as the percentage of cases in a distribution that fall at or below that score. For a grouped distribution, the following formula is used to compute the percentile rank (PR):
PR = \frac{F + f_i \left( \frac{X - L_i}{h} \right)}{n} \times 100
where,
F = Cumulative frequency of the class below the percentile class
f_i = Frequency of the percentile class
X = Score for which the percentile rank is desired
L_i = Lower limit of the percentile class
h = Class width of the percentile class
n = Total number of observations
Let us obtain the percentile rank for an age of 22.25 years for the data in Table 1.2.
Here, X = 22.25, F = 677, f_i = 1908, L_i = 19.5, h = 5 and n = 5763.
PR = \frac{677 + 1908\left( \frac{22.25 - 19.5}{5} \right)}{5763} \times 100 = 30
Hence the percentile rank is 30%. This implies that of the 5763 births in the study area, approximately 1729 (30%) occurred to women aged 22.25 years or below, and the remaining 70% occurred to women above this age.
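The grouped percentile and percentile-rank formulas translate directly into code; a minimal sketch using the Table 1.2 birth data.

```python
def grouped_percentile(boundaries, freqs, p):
    """p-th percentile of a grouped distribution: L + (h / f) * (p*n/100 - F)."""
    n = sum(freqs)
    target = p * n / 100
    cum = 0
    for i, f in enumerate(freqs):
        if cum + f >= target:
            lower, width = boundaries[i], boundaries[i + 1] - boundaries[i]
            return lower + (width / f) * (target - cum)
        cum += f

def percentile_rank(boundaries, freqs, x):
    """Percentile rank of a score x: 100 * (F + f * (x - L) / h) / n."""
    n = sum(freqs)
    cum = 0
    for i, f in enumerate(freqs):
        lower, upper = boundaries[i], boundaries[i + 1]
        if lower <= x <= upper:
            return 100 * (cum + f * (x - lower) / (upper - lower)) / n
        cum += f

if __name__ == "__main__":
    bounds = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5, 49.5]
    births = [677, 1908, 1737, 1040, 294, 91, 16]
    print(round(grouped_percentile(bounds, births, 30), 2))   # 22.26 (the text rounds to 22.25)
    print(round(percentile_rank(bounds, births, 22.25), 1))   # 30.0
```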
1.2.2.3 Deciles
When a distribution is divided into ten equal parts, each division is called a decile. Thus, there are 9 deciles in a distribution, denoted by D_1, D_2, ..., D_9. Obviously, D_5 = M_e = P_{50}.
Example: Compute the 6th decile (D_6) for the distribution in Table 1.1.
Here, in/10 = 6n/10. For n = 70, this quantity is 42. This being an integer, the average of the 42nd and 43rd observations, i.e., the 42.5th observation, will be the 6th decile. By inspection of the distribution, D_6 = 55.
For grouped data, the formula for D_i is
D_i = L_i + \frac{h}{f_i}\left( \frac{in}{10} - F \right), \quad i = 1, 2, 3, \ldots, 9
where,
L_i = Lower limit of the ith decile class
h = Width of the class interval
f_i = Frequency of the ith decile class
F = Cumulative frequency of the class prior to the ith decile class
n = Total number of observations
Example: Obtain D_4 for the distribution in Table 1.2.
Here, \frac{in}{10} = \frac{4 \times 5763}{10} = 2305.2, L_4 = 19.5, f_4 = 1908 and F = 677.
Hence, D_4 = L_4 + \frac{h}{f_4}\left( \frac{4n}{10} - F \right) = 19.5 + \frac{5}{1908}(2305.2 - 677) = 23.8
The value of 23.8 for the fourth decile implies that, of the total births that occurred among the women, 4 out of every 10 (40%) occurred at an age of 23.8 years or earlier.
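Since D_i is simply the (10i)-th percentile, the decile formula can be coded the same way; a minimal self-contained sketch using the Table 1.2 data.

```python
def grouped_decile(boundaries, freqs, i):
    """i-th decile of a grouped distribution: L + (h / f) * (i*n/10 - F)."""
    n = sum(freqs)
    target = i * n / 10
    cum = 0
    for k, f in enumerate(freqs):
        if cum + f >= target:
            lower, width = boundaries[k], boundaries[k + 1] - boundaries[k]
            return lower + (width / f) * (target - cum)
        cum += f

if __name__ == "__main__":
    bounds = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5, 49.5]
    births = [677, 1908, 1737, 1040, 294, 91, 16]
    print(round(grouped_decile(bounds, births, 4), 2))   # 23.77 (the text rounds to 23.8)
```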
1.2.3 Measures of variability or dispersion
The measure of dispersion is concerned with the scatter of a data set about its average.
The dispersion of a distribution reveals how the observations are spread out or
scattered on each side of the center. To measure the dispersion, scatter, or variation of
a distribution is as important as to locate the central tendency. If the dispersion is
small, it indicates high uniformity of the observations in the distribution. Absence of
dispersion in the data indicates perfect uniformity. This situation arises when all
observations in the distribution are identical. If this were the case, description of any
single observation would suffice. A measure of dispersion appears to serve two
purposes. First, it is one of the most important quantities used to characterize a
frequency distribution. Second, it affords a basis of comparison between two or more
frequency distributions. The study of dispersion bears its importance from the fact
that various distributions may have exactly the same averages, but substantial
differences in their variability.
Frequently used measures of dispersion are the range, inter-quartile range, mean
deviation, variance, and standard deviation.
1.2.3.1 Range
The simplest and crudest measure of dispersion is the range. This is defined as the difference between the largest and the smallest values in the distribution. If x_1, x_2, ..., x_n are the values of n observations in a sample, then the range (R) of the variable X is given by:
R(x_1, x_2, \ldots, x_n) = \max\{x_1, x_2, \ldots, x_n\} - \min\{x_1, x_2, \ldots, x_n\}
In other words, if the x values are arranged in ascending order such that x_1 < x_2 < \cdots < x_n, then R = x_n - x_1. More compactly, R = L - S, where L is the largest value and S is the smallest value of the observations.
Example: For a set of observations 90, 110, 20, 51, 210 and 190, the largest value is
210 and the smallest value is 20. The range is then R = 210 − 20 = 190 .
For grouped data, the difference between the upper class limit of the highest class and the lower class limit of the lowest class is taken as the range.
1.2.3.2 Special range
Although the range is meaningful, it is of little use because of its marked instability,
particularly when the range is based on a small sample. Imagine, if there is one
extreme value in a distribution, the range of the values will appear to be large, when
in fact, removal of this value may reveal an otherwise compact distribution with
extremely low dispersion. Since the range is subject to the undue influence of erratic
extreme values, it can be expected that if such values are excluded, the range of
remaining items may be a more useful measure. One such measure is the 10 to 90
percentile range. It is established by excluding the highest and the lowest 10 percent
of the items, and is the difference between the largest and the smallest values of the
remaining 80 percent of the items. If P_{10-90} stands for the 10 to 90 percentile range, then
P_{10-90} = P_{90} - P_{10}
where P_{90} and P_{10} are the 90th and 10th percentiles of the distribution, respectively.
1.2.3.3 Quartile deviation
A measure similar to the special range is the inter-quartile range (Q). It is the difference between the third quartile (Q_3) and the first quartile (Q_1). Thus
Q = Q_3 - Q_1
The inter-quartile range is frequently reduced to the semi-interquartile range, known as the quartile deviation (QD), by dividing it by 2. Thus
QD = \frac{Q_3 - Q_1}{2}
This measure is more meaningful than the range because it is not based on two
extreme values.
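A minimal sketch of these measures for a small raw data set; the quartiles here use the simple position rule applied earlier in the chapter, and the data are illustrative.

```python
import math

def position_quantile(values, fraction):
    """Value at position ceil(fraction * n) in the ordered data (simple position rule)."""
    ordered = sorted(values)
    position = fraction * len(ordered)
    index = math.ceil(position) if position != int(position) else int(position)
    return ordered[index - 1]

def dispersion_summary(values):
    """Range, inter-quartile range and quartile deviation of raw data."""
    q1 = position_quantile(values, 0.25)
    q3 = position_quantile(values, 0.75)
    return {"range": max(values) - min(values),
            "inter_quartile_range": q3 - q1,
            "quartile_deviation": (q3 - q1) / 2}

if __name__ == "__main__":
    print(dispersion_summary([90, 110, 20, 51, 210, 190]))
    # {'range': 190, 'inter_quartile_range': 139, 'quartile_deviation': 69.5}
```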
Both the 10 to 90 percentile range and the quartile deviation have serious shortcomings. First of all, they do not take into consideration the values of all items. For example, P_{10-90} is not affected by the distribution patterns of the items above P_{90} and below P_{10}, and QD is not affected by the distribution of the items above Q_3 and below Q_1. Moreover, they remain positional measures, failing to measure the scatter of the observations relative to the typical value. In addition, they do not enter into any of the higher mathematical relationships that are basic to inferential statistics. To get rid of these shortcomings, there is a need for measures that reflect the deviation of each and every observation from the average.
1.2.3.4 Mean deviation
For data clustered near the central value, the differences of the individual observations
from their typical value will tend to be small. Accordingly, to obtain a measure of the
total variation in the data, it is appropriate to find an average of these differences. The
resulting value will be called mean/average deviation.
The mean deviation is an average of absolute deviations of individual observations
from the central value of a series.
The mean deviation is computed as the arithmetic mean of the absolute values of the
deviations from the 'typical value' of a distribution. The 'typical value' may be the
arithmetic mean, median, mode, or even an arbitrary value. The median is sometimes preferred as the typical value in computing the average deviation, because the sum of the absolute values of the deviations from the median is smaller than that from any other value.
In practice, however, the arithmetic mean is generally used. If the distribution is
symmetrical, the mean is identical with the median and the same average deviation is
obtained.
If a grouped frequency distribution is constructed, as is usually done with large samples, the average deviation is
MD(\bar{x}) = \frac{\sum_{i=1}^{k} f_i \left| x_i - \bar{x} \right|}{n}
where,
MD(\bar{x}) = Average deviation about the mean
k = Number of classes
x_i = Mid-point of the ith class
f_i = Frequency of the ith class
n = \sum_{i=1}^{k} f_i
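A minimal sketch of the mean deviation about the mean, for raw and for grouped data; the values are illustrative.

```python
def mean_deviation(values):
    """Mean deviation about the arithmetic mean for ungrouped data."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

def mean_deviation_grouped(midpoints, freqs):
    """Mean deviation about the mean for grouped data: sum(f * |x - mean|) / n."""
    n = sum(freqs)
    m = sum(f * x for x, f in zip(midpoints, freqs)) / n
    return sum(f * abs(x - m) for x, f in zip(midpoints, freqs)) / n

if __name__ == "__main__":
    print(mean_deviation([3, 3, 4, 4, 5, 5, 6, 6, 7, 7]))            # 1.2
    print(mean_deviation_grouped([3, 5, 7, 8, 9], [2, 3, 2, 2, 1]))  # 1.8
```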
1.2.3.5 Variance and standard deviation
Instead of ignoring the signs of deviations from the mean, as in the computation of the average deviation, they may each be squared and then the results added. The sum of squares can be regarded as a measure of the total dispersion of the distribution. By dividing the sum by the total number of observations, we obtain the average of the squared deviations, a measure called the variance of the distribution.
If the observations are all from a population, the resulting variance is referred to as a population variance. The variance of a population of observations x_1, x_2, \ldots, x_N, commonly designated by \sigma^2, is
\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}
where, µ is the mean of all the observations and N is the total number of
observations in the population.
Because of the operation of squaring, the variance is expressed in squared units (e.g., kg², km², dollar², etc.) and not in the original units (e.g., kg, km, dollar, etc.). It is therefore necessary to extract the positive square root to restore the original unit. The measure of dispersion thus obtained is called the population standard deviation and is usually denoted by \sigma. Thus
\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{\text{variance}(x)}
Thus, by definition, the standard deviation is the positive square root of the mean-square deviation of the observations from their arithmetic mean.
If x_1, x_2, \ldots, x_n represent a set of sample observations of size n, the sample variance, denoted by s^2, is expressed as
s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}
where \bar{x} is the mean of all the sample observations.
The square root of the sample variance is the sample standard deviation. It is denoted by s.
When the observations x_1, x_2, \ldots, x_k are paired with their corresponding frequencies f_1, f_2, \ldots, f_k in the form \{x_i, f_i\} to form a frequency distribution, the formula for computing the variance and standard deviation must be modified, since the formulas above are based on ungrouped data:
s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1}
where,
f_i = frequency of the ith observation
x_i = value of the ith observation
n = \sum f_i
x_i will be the mid-value of the ith class if the frequency distribution is presented by class intervals.
Example: Let us illustrate the computational process of variance and standard
deviation using the following ungrouped data on family size.
Table 1.3 Number of household members in 10 families
Family No.    1    2    3    4    5    6    7    8    9    10
Size (x_i)    3    3    4    4    5    5    6    6    7    7
The quantities to be calculated for computing the variance and standard deviation are shown in the table below:
Family No.    x_i    x_i - \bar{x}    (x_i - \bar{x})^2    x_i^2
1             3      -2               4                    9
2             3      -2               4                    9
3             4      -1               1                    16
4             4      -1               1                    16
5             5      0                0                    25
6             5      0                0                    25
7             6      1                1                    36
8             6      1                1                    36
9             7      2                4                    49
10            7      2                4                    49
Total         50     0                20                   270
Here, \bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5, and thus
s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{20}{9} = 2.22, \text{ giving } s = \sqrt{2.22} = 1.49
Example:
Table 1.4 Computation of variance and standard deviation for grouped data
x_i     f_i    f_i x_i    f_i x_i^2    x_i - \bar{x}    (x_i - \bar{x})^2    f_i (x_i - \bar{x})^2
3       2      6          18           -3               9                    18
5       3      15         75           -1               1                    3
7       2      14         98           1                1                    2
8       2      16         128          2                4                    8
9       1      9          81           3                9                    9
Total   10     60         400          -                -                    40
\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{60}{10} = 6 \quad \text{and} \quad s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{40}{9} = 4.44
If the divisor n is used instead of n - 1, s^2 = 4.0, underestimating the variance by 0.44. If n is large, this discrepancy tends to disappear.
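A minimal Python sketch of the sample variance and standard deviation for raw and grouped data, reproducing the two worked examples above.

```python
import math

def sample_variance(values):
    """Sample variance with divisor n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_grouped(midpoints, freqs):
    """Sample variance for grouped data: sum(f * (x - mean)^2) / (n - 1)."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    return sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs)) / (n - 1)

if __name__ == "__main__":
    family_sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]                    # Table 1.3
    print(round(sample_variance(family_sizes), 2))                   # 2.22
    print(round(math.sqrt(sample_variance(family_sizes)), 2))        # 1.49
    print(round(sample_variance_grouped([3, 5, 7, 8, 9], [2, 3, 2, 2, 1]), 2))  # 4.44 (Table 1.4)
```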
1.2.3.6 Relative measures of dispersion
The various measures of dispersion that have been presented so far are absolute
measures. The measures are absolute in the sense that they are expressed in the same
statistical units in which the original data are presented, such as dollar, meter,
kilogram, etc. When the two sets of data are expressed in different units, the absolute
measures are not comparable. Even with identical units of measurement, the individual values of one distribution may vary so widely (such as the salary of a manager versus the wage of a worker) that the average, and the deviations of the items from that average, may differ widely in magnitude from those of the other distribution. These differences may arise entirely from the inherent differences in the averages of the two distributions, and because of this, the absolute magnitude of the deviations cannot be used to compare the variation of the distributions. So, to compare the extent of variation of different distributions, whether having differing or identical units of measurement, it is necessary to consider measures that express the absolute deviation in some relative form. These measures are usually expressed as coefficients and are pure numbers, independent of the unit of measurement. The measures are:
1) Coefficient of variation
2) Coefficient of mean deviation
3) Coefficient of range
4) Coefficient of quartile deviation
Coefficient of variation: The standard deviation discussed above is an absolute measure of dispersion. The corresponding relative measure, proposed by Karl Pearson, is the coefficient of variation (CV), which attempts to measure the relative variability in a data set. When the means of data sets differ considerably, we do not get an accurate picture of the relative variability of the two sets just by comparing their standard deviations. The coefficient of variation overcomes this difficulty. It is a measure that presents the spread of a distribution relative to the mean of the same distribution.
The coefficient of variation is computed as the ratio of the standard deviation of the distribution to the mean of the same distribution. Symbolically,
CV = \frac{s_x}{\bar{x}}
The CV is usually expressed as a percentage, in which case CV = \frac{s_x}{\bar{x}} \times 100. Thus a value of 33 percent for the CV implies that the standard deviation of the sample is 33 percent of the mean of the same distribution.
As an illustration of the use of the CV as a descriptive statistic, suppose that we wish to obtain some insight into whether height is more variable than weight in the same population. For this purpose, we have the following data obtained from 150 children in a community:
         Height     Weight
Mean     40 inch    10 kg
SD       5 inch     2 kg
CV       0.125      0.20
Since the coefficient of variation for weight is greater than that for height, we would conclude that weight has more variability than height in this population.
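A small sketch of the CV comparison, using the summary figures from the table above.

```python
def coefficient_of_variation(std_dev, mean, as_percent=False):
    """CV = standard deviation / mean, optionally expressed as a percentage."""
    cv = std_dev / mean
    return cv * 100 if as_percent else cv

if __name__ == "__main__":
    height_cv = coefficient_of_variation(5, 40)    # 0.125
    weight_cv = coefficient_of_variation(2, 10)    # 0.20
    print(height_cv, weight_cv)
    print("weight is more variable" if weight_cv > height_cv else "height is more variable")
```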
Coefficient of mean deviation: Another relative measure is the coefficient of mean deviation. As the mean deviation can be computed from the mean, median, mode, or from
any arbitrary value, a general formula for computing coefficient of mean deviation
may be put as follows:
\text{Coefficient of mean deviation from } A = \frac{\text{Mean deviation from } A}{A} \times 100 = \frac{MD(A)}{A} \times 100
where, A is the mean, median, mode, or any other arbitrary value. The use of a
particular formula depends on the type of average used in computing the mean
deviation.
Coefficient of range: The coefficient of range is a relative measure corresponding to the range and is obtained by the following formula:
\text{Coefficient of range} = \frac{L - S}{L + S} \times 100
where, L and S are respectively the largest and the smallest observations in the data
set.
Coefficient of quartile deviation: The coefficient of quartile deviation is computed from the first and the third quartiles using the following formula:
\text{Coefficient of quartile deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \times 100
1.2.4 Shape characteristics of a distribution
1.2.4.1 Skewness
The term skewness refers to a lack of symmetry. The lack of symmetry in a distribution is always determined with reference to a normal distribution, which is always symmetrical. The lack of symmetry makes a distribution asymmetric, and in such cases we call the distribution skewed, or we say that skewness is present in the distribution. The skewness may be either positive or negative. When the skewness of a distribution is positive (negative), the distribution is called a positively (negatively) skewed distribution. Absence of skewness makes a distribution symmetrical. It is important to emphasize that the skewness of a distribution cannot be determined simply by inspection.
a) Symmetrical distribution: This type of distribution is known as normal or
Gaussian distribution. One would obtain such a distribution with data, such as height,
weight, and examination scores. For a symmetrical distribution, Mean = Median =
Mode.
Figure 1.1 Location of mean, median and mode in a symmetrical distribution
b) Positively skewed distribution: In this distribution, the long tail to the right
indicates the presence of extreme values at the positive end of the distribution. This
pulls the mean to the right. These distributions occur with data, such as family size,
female age at marriage, and wages of the employees. For a positively skewed
distribution, Mean > Median > Mode. The frequency curve would look as shown in Figure 1.2:
Figure 1.2 Location of mean, median and mode in an asymmetrical distribution
c) Negatively skewed distribution: In a negatively skewed distribution, the mean is
pulled in a negative direction. Reaction times for an experiment, daily maximum
temperature for a month in winter, etc., result in negatively skewed distributions. For
a negatively skewed distribution, Mean < Median < Mode. The frequency curve is the mirror image of the one in Figure 1.2, with the long tail to the left.
Measures of skewness: In studying the skewness of a distribution, the first thing we would like to know is whether the distribution is positively or negatively skewed. The second thing is to measure the degree of skewness. The simplest measure of skewness is Pearson's coefficient of skewness:
\text{Pearson's coefficient of skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}
If Mean > Mode, the skewness is positive.
If Mean < Mode, the skewness is negative.
If Mean = Mode, the skewness is zero.
In many instances, mode cannot be uniquely defined and the above formula cannot be
applied. It has been observed that for a moderately skewed distribution, the following
relationship holds:
Mean - Mode = 3(Mean - Median)
Using this relation, Pearson's coefficient of skewness takes the following modified form:
\text{Pearson's coefficient of skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}
Another measure of skewness, due to Bowley, is defined in terms of the quartile values. Since there is no difference between the distances of the first quartile Q_1 and the third quartile Q_3 from the median Q_2 in a symmetrical distribution, any difference in these distances is a reasonable basis for measuring skewness in a distribution. Thus, in terms of the three quartiles Q_1, Q_2 and Q_3, Bowley's quartile coefficient of skewness is
\text{Quartile coefficient of skewness} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}
This is evidently a pure number lying between -1 and +1, and it is zero for a symmetrical distribution.
• If Q_3 - Q_2 = Q_2 - Q_1, the quartile skewness = 0 and the distribution is symmetrical.
• If Q_3 - Q_2 > Q_2 - Q_1, the quartile skewness > 0 and the distribution is positively skewed.
• If Q_3 - Q_2 < Q_2 - Q_1, the quartile skewness < 0 and the distribution is negatively skewed.
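A minimal sketch of the two skewness coefficients; the quartile helper uses the simple position rule from earlier in the chapter, and the data are illustrative.

```python
import math
import statistics

def position_quantile(values, fraction):
    """Value at position ceil(fraction * n) in the ordered data (simple position rule)."""
    ordered = sorted(values)
    position = fraction * len(ordered)
    index = math.ceil(position) if position != int(position) else int(position)
    return ordered[index - 1]

def pearson_skewness(values):
    """Modified Pearson coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)

def bowley_skewness(values):
    """Bowley's quartile coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    q1 = position_quantile(values, 0.25)
    q2 = statistics.median(values)
    q3 = position_quantile(values, 0.75)
    return (q3 + q1 - 2 * q2) / (q3 - q1)

if __name__ == "__main__":
    data = [2, 3, 3, 4, 4, 5, 6, 8, 12]    # long right tail: both coefficients come out positive
    print(round(pearson_skewness(data), 2), round(bowley_skewness(data), 2))
```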
1.2.4.2 Kurtosis
There is considerable variation among symmetrical distributions. For instance, they can differ markedly in terms of peakedness. This is what we call kurtosis. Kurtosis is the degree of peakedness of a distribution, usually taken in relation to a normal distribution. A curve having a relatively higher peak than the normal curve is known as leptokurtic. On the other hand, if the curve is more flat-topped than the normal curve, it is called platykurtic. The normal curve itself is called mesokurtic, being neither too peaked nor too flat-topped.
Figure 1.3 Illustration of kurtosis
Measures of kurtosis: The most important measure of kurtosis, based on the second and fourth moments, is \beta_2, defined as:
\beta_2 = \frac{\mu_4}{\mu_2^2}
where \mu_2 and \mu_4 are, respectively, the second and fourth moments about the mean.
This measure is a pure number and is always positive.
For the normal distribution, \beta_2 = 3. When the value of \beta_2 is greater than 3, the curve is more peaked than the normal curve, and is leptokurtic. When the value of \beta_2 is less than 3, the curve is less peaked than the normal curve, and is platykurtic. In other words,
If \beta_2 - 3 > 0, the distribution is leptokurtic.
If \beta_2 - 3 < 0, the distribution is platykurtic.
If \beta_2 - 3 = 0, the distribution is mesokurtic.
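A small sketch of the moment measure \beta_2 = \mu_4 / \mu_2^2; the data are illustrative.

```python
def beta2_kurtosis(values):
    """Moment measure of kurtosis: beta_2 = mu_4 / mu_2**2, with moments about the mean."""
    n = len(values)
    mean = sum(values) / n
    mu2 = sum((x - mean) ** 2 for x in values) / n
    mu4 = sum((x - mean) ** 4 for x in values) / n
    return mu4 / mu2 ** 2

if __name__ == "__main__":
    sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
    b2 = beta2_kurtosis(sample)
    verdict = "leptokurtic" if b2 > 3 else "platykurtic" if b2 < 3 else "mesokurtic"
    print(round(b2, 2), verdict)
```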
1.3 DATA EXPLORATION WITH GRAPHICAL MEANS
In addition to presenting statistical data through tabular form and descriptive statistics,
one can present the data through some visual aids. This refers to graphs and diagrams.
Such presentation gives visual impression of the entire data and therefore the
information presented is easily understood. When frequency distributions are
constructed primarily to condense large sets of data into an easy to digest form,
graphical and diagrammatic presentations are preferred. The most common forms of
graphs and diagrams are the bar diagram, pie chart, histogram, line diagram, scatter
diagram, frequency polygon, and ogive. Bar diagrams and pie charts are usually
constructed for categorical data and the others for interval scale data.
1.3.1 Bar diagram
A bar diagram, also known as a bar chart, is a form of presentation in which frequencies are represented by rectangles separated along the horizontal axis and drawn as bars of convenient width. A bar diagram consists of horizontal or vertical bars of equal width, with lengths proportional to the magnitudes they represent. In presenting the bars, there is no need for a continuous scale.
Example: Health personnel from 150 rural health centres were asked how frequently
they have visited their respective areas during the last one week. The responses were
recorded as rarely, occasionally, frequently, and never. The following table displays
the frequency of responses in each category:
Table 1.5 Relative frequency distribution of the health professional data
Response        Frequency    Relative frequency
Frequently      49           0.327
Occasionally    71           0.473
Rarely          24           0.160
Never           6            0.040
Total           150          1.000
The vertical and horizontal bar diagrams constructed from these data are shown in
figures below:
Figure 1.4 Vertical bar diagram for the health centre visit data
Figure 1.5 Horizontal bar diagram for the health centre visit data
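For readers working outside SPSS, a minimal matplotlib sketch reproduces the vertical bar diagram for the Table 1.5 responses (matplotlib is assumed to be available).

```python
import matplotlib.pyplot as plt

responses = ["Frequently", "Occasionally", "Rarely", "Never"]
counts = [49, 71, 24, 6]

fig, ax = plt.subplots()
ax.bar(responses, counts)            # use ax.barh(responses, counts) for a horizontal diagram
ax.set_ylabel("Number of responses")
ax.set_title("Health centre visits reported by 150 health personnel")
plt.show()
```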
Component bar diagram: A component bar diagram is a good device to display
categorical data. In such a diagram, the total values as well as the various components
constituting the total are shown. Each part of the bar represents each component,
while the whole bar represents the total value. The component parts are variously
colored or shaded to make them distinct.
Example: Given below are the population of two regions by sex. Display them by a
component bar diagram:
Region    Male ('000)    Female ('000)    Male (%)    Female (%)
A         11228          10637            51.3        48.7
B         17634          16306            52.0        48.0
Figure 1.6 Component bar diagram for the population data
Multiple bar chart:
Multiple bar charts are frequently used to present statistical
data. They are primarily used to compare two or more characteristics corresponding to
a common variate value. Multiple bar charts are grouped bars, whose lengths are
proportional to the magnitude of the characteristics. The bars of a multiple chart are
usually put adjacent to each other without allowing any space between them.
Different shading or color can be used to distinguish one group of bars from other
groups. Data on population values for different regions, literacy rates by sex, volume
of exports by type of production, etc., can be represented by a multiple bar chart.
Example: Given below is the education level of female population of Bangladesh by
administrative division. Display them with a multiple bar chart.
                Percent of females with
Division        No education    Primary education    Secondary education
Barisal         43.9            34.4                 21.7
Chittagong      41.8            37.0                 21.2
Dhaka           45.9            35.3                 18.8
Khulna          39.6            41.2                 19.2
Rajshahi        48.5            37.8                 13.7
Sylhet          52.6            36.1                 11.3
Figure 1.7 Multiple bar diagram for the education level data
1.3.2 Pie chart
A pie chart, also known as a pie diagram, is an effective way of presenting percentage
parts when the whole quantity is taken as 100. This is a useful device for presenting
categorical data. The pie chart consists of a circle sub-divided into sectors, whose
areas are proportional to the various parts into which the whole quantity is divided.
The sectors may be shaded or colored differently to show their individual
contributions to the whole.
Table 1.6 Health centre visit data for constructing a pie diagram
Response      Frequency    Relative frequency (%)    Angle of the sector (degrees)
Frequent      49           32.7                      117.6
Occasional    71           47.3                      170.4
Rare          24           16.0                      57.6
Never         6            4.0                       14.4
Total         150          100.0                     360.0
Figure 1.8 Simple pie diagram
Figure 1.9 Three-dimensional pie diagram
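The sector angles in Table 1.6 are simply 360 degrees multiplied by the relative frequencies; a minimal sketch that computes them and draws the pie (matplotlib assumed).

```python
import matplotlib.pyplot as plt

labels = ["Frequent", "Occasional", "Rare", "Never"]
freqs = [49, 71, 24, 6]

angles = [360 * f / sum(freqs) for f in freqs]
print([round(a, 1) for a in angles])              # [117.6, 170.4, 57.6, 14.4]

plt.pie(freqs, labels=labels, autopct="%.1f%%")   # sector areas proportional to the frequencies
plt.title("Health centre visit data")
plt.show()
```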
1.3.3 Histogram
The most common form of graphical representation of a frequency distribution is the
histogram. A histogram is constructed by placing the class boundaries on the
horizontal axis of a graph and the frequencies on the vertical axis. Each class is shown
on the graph by drawing a rectangle whose base spans the class boundaries and whose height is the corresponding frequency for the class. When the class widths are required to be unequal because of some particular feature of the data set, the method of constructing the histogram should be modified accordingly.
Example: The observed relative humidity (%) at a certain location for a period of 100
days is given below. Construct a histogram from these data.
61.5  77    60.5  53    58.5  41    71    48.5  40.5  46
73    78    55    50    59    36.5  68    59    38.5  56.5
70.5  72.5  50.5  68    55.5  39.5  63    62    48.5  60
66    72    47.5  62    65    38.5  54.5  51    29.5  34.5
64.5  62    65    42.5  65    44.5  71.5  54.5  34.5  43
68.5  65    63.5  53    53.5  50.5  84.5  52    34    41.5
75.5  66.5  51    48    46    39    73    60    48.5  31
63.5  50.5  54.5  66    48.5  41.5  57.5  51    37    30.5
62    58.5  55.5  62.5  47    57.5  56    36.5  40.5  50
68    61.5  58    54.5  52.5  69.5  51    40.5  55.5  49.5
Enter the data either directly into a SPSS data file or copy from another compatible
file to a SPSS data file. Give the name of the variable as 'rh'. To obtain a histogram
using the SPSS package, first click on the analyze menu bar in the data file, then click
on explore in the descriptive statistics. Bring the variable 'rh' in the Dependent List:
box. In the Display options, select Plots. Then click the Plots... icon; a new window
called Explore: Plots appears. Select None in the Boxplots box and Histogram in the
Descriptive box. Then click on Continue and OK to obtain the histogram as shown
below.
Figure 1.10 Histogram of the relative humidity data (SPSS output: Mean = 54.8, Std. Dev. = 12.01, N = 100)
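For readers not using SPSS, an equivalent histogram can be drawn with matplotlib; a minimal sketch with the 100 humidity values entered directly.

```python
import matplotlib.pyplot as plt

rh = [61.5, 77, 60.5, 53, 58.5, 41, 71, 48.5, 40.5, 46,
      73, 78, 55, 50, 59, 36.5, 68, 59, 38.5, 56.5,
      70.5, 72.5, 50.5, 68, 55.5, 39.5, 63, 62, 48.5, 60,
      66, 72, 47.5, 62, 65, 38.5, 54.5, 51, 29.5, 34.5,
      64.5, 62, 65, 42.5, 65, 44.5, 71.5, 54.5, 34.5, 43,
      68.5, 65, 63.5, 53, 53.5, 50.5, 84.5, 52, 34, 41.5,
      75.5, 66.5, 51, 48, 46, 39, 73, 60, 48.5, 31,
      63.5, 50.5, 54.5, 66, 48.5, 41.5, 57.5, 51, 37, 30.5,
      62, 58.5, 55.5, 62.5, 47, 57.5, 56, 36.5, 40.5, 50,
      68, 61.5, 58, 54.5, 52.5, 69.5, 51, 40.5, 55.5, 49.5]

# Class width of 5, roughly matching the SPSS output in Figure 1.10
plt.hist(rh, bins=range(25, 90, 5), edgecolor="black")
plt.xlabel("Relative humidity (%)")
plt.ylabel("Frequency")
plt.title("Histogram of the relative humidity data")
plt.show()
```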
1.3.4 Stem and leaf plot
A stem and leaf plot is a graphical technique for representing quantitative data that can be used to examine the shape of a frequency distribution, the range of the values, the point of concentration of the values, and the presence of any extreme values or outliers. Compared to the other graphical techniques presented thus far, the stem and leaf plot is an easy and quick way of displaying data. In a stem and leaf plot, a histogram-like picture of a frequency distribution is constructed.
Example: Use a stem and leaf plot to display the following marks obtained by 20
students in a statistics test:
84    17    38    45    47    53    76    54    75    22
66    65    55    54    51    33    39    19    54    72
Solution: The lowest score is 17 and the highest score is 84. For stem and leaf plots,
classes must be of equal lengths. We will use the first or leading digit (tens) of score
as the stem and the trailing digit (units) as the leaf. For example, for the score 84, the
leading digit is 8 and the trailing digit is 4. In a stem and leaf plot, a leading digit
(stem of score) determines the row in which the score is placed. The trailing digits for
a score are then written in the appropriate row. In this way, each score is recorded in
the stem and leaf plot.
Frequency    Stem &  Leaf
   2.00         1 .  79
   1.00         2 .  2
   3.00         3 .  389
   2.00         4 .  57
   6.00         5 .  134445
   2.00         6 .  56
   3.00         7 .  256
   1.00         8 .  4
Stem width:    10
Each leaf:     1 case(s)
You can obtain a stem and leaf plot with the SPSS software. Click on analyze on the
menu bar, then click on explore in the descriptive statistics. Bring the variable, say
'mark' in this case, for which you would like to obtain a stem and leaf plot under the
Dependent List: box. In the Display options, select Plots. Then click the Plots... icon;
a new window called Explore: Plots appears. Select None in the Boxplots box and
Stem-and-leaf in the Descriptive box. Then click on Continue and OK to obtain the
above stem and leaf plot.
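A minimal Python sketch that builds the same display from the 20 test marks, using single-digit leaves and a stem width of 10.

```python
from collections import defaultdict

marks = [84, 17, 38, 45, 47, 53, 76, 54, 75, 22,
         66, 65, 55, 54, 51, 33, 39, 19, 54, 72]

# Group the trailing digit (leaf) of each score under its leading digit (stem).
stems = defaultdict(list)
for score in sorted(marks):
    stems[score // 10].append(score % 10)

for stem in sorted(stems):
    leaves = "".join(str(leaf) for leaf in stems[stem])
    print(f"{len(leaves):5.2f}   {stem} .  {leaves}")
```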
1.3.5 Frequency polygon
A frequency polygon provides an alternative to the histogram for presenting graphically the distribution of a continuous variable. The presentation involves placing the class mid-values on the horizontal axis and the frequencies on the vertical axis. However, instead of drawing rectangles as in the histogram, a point is plotted directly above each class mid-point at a height corresponding to the frequency of that class, and the points are joined by straight lines. Classes of zero frequency are added at each end of the frequency distribution so that the frequency polygon touches the horizontal axis at both ends of the graph. This makes the frequency polygon a closed figure.
Example: Weekly expenditure in dollars by 80 students in a certain city is shown below. Construct a frequency polygon using these data.
Expenditure    No. of students
4.5-9.5        8
9.5-14.5       29
14.5-19.5      27
19.5-24.5      12
24.5-29.5      4
Total          80
Figure 1.11 Frequency polygon for the weekly expenditure data
The histogram and frequency polygon are equally good techniques for presenting
continuous data. The histogram is more often used when a single distribution is
presented, while the frequency polygon is largely used for comparison of two or more
distributions.
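A minimal matplotlib sketch of the frequency polygon for the expenditure data (matplotlib assumed); zero-frequency classes are added at each end so the polygon closes on the horizontal axis.

```python
import matplotlib.pyplot as plt

midpoints = [2, 7, 12, 17, 22, 27, 32]       # class mid-values, with an empty class at each end
frequencies = [0, 8, 29, 27, 12, 4, 0]

plt.plot(midpoints, frequencies, marker="o")
plt.xlabel("Mid value of weekly expenditure (dollars)")
plt.ylabel("No. of students")
plt.title("Frequency polygon")
plt.show()
```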
Ogive (Cumulative frequency polygon):
A graph of the cumulative frequency
distribution or cumulative relative frequency distribution is called an ogive. An ogive
can be either less than or more than type.
1.3.6 Scatter diagram
Scatter diagrams are useful for displaying information on two quantitative variables,
which are believed to be inter-related. Height and weight, age and height, income and
expenditure, rainfall and runoff are the examples of some of the data sets that are
assumed to be related to each other, which can be displayed by scatter diagrams.
Example: Given below is the age in years at the first marriage of 20 couples.
Construct a scatter diagram for the two data sets.
Husband's age  Wife's age    Husband's age  Wife's age    Husband's age  Wife's age
19             15            39             32            21             14
26             17            39             30            26             19
27             19            26             19            28             19
29             21            33             27            29             21
36             28            39             34            37             29
40             36            25             21            31             27
35             32            40             38            -              -
Figure 1.12 Scatter diagram for the age at marriage data
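A minimal matplotlib sketch of the scatter diagram for these 20 couples (matplotlib assumed).

```python
import matplotlib.pyplot as plt

husband = [19, 26, 27, 29, 36, 40, 35, 39, 39, 26, 33, 39, 25, 40, 21, 26, 28, 29, 37, 31]
wife =    [15, 17, 19, 21, 28, 36, 32, 32, 30, 19, 27, 34, 21, 38, 14, 19, 19, 21, 29, 27]

plt.scatter(husband, wife)
plt.xlabel("Husband's age at first marriage (years)")
plt.ylabel("Wife's age at first marriage (years)")
plt.title("Scatter diagram for the age at marriage data")
plt.show()
```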
1.3.7 Line graph
A line graph is particularly useful for numerical data if we wish to display the data in
the time series form. Such data could be the production of jute in a region for a period
of 20 years, the export of raw materials from a country for a period of 40 years, the
annual rainfall at a location for a period of 100 years, or the daily evaporation from a
lake for a period of 50 years.
The growth of population in Bangladesh since 1901 is given in the table below:
Table 1.7 Census population of Bangladesh in millions
Year    Population
1901    28.9
1911    31.6
1921    33.2
1931    35.6
1941    42.0
1951    44.2
1961    55.2
1974    76.4
1981    89.9
1991    111.5
A line graph for these data is drawn in Figure 1.13. From the line graph, one can easily observe that the population of Bangladesh increased substantially from 1901 to 1991, but the increase was not uniform throughout the period. Growth was slower up to 1941; thereafter the population increased at a much higher rate than in the previous period.
Figure 1.13 Time series plot of the total population in Bangladesh
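A minimal matplotlib sketch of this time series plot, using the Table 1.7 census figures (matplotlib assumed).

```python
import matplotlib.pyplot as plt

years = [1901, 1911, 1921, 1931, 1941, 1951, 1961, 1974, 1981, 1991]
population = [28.9, 31.6, 33.2, 35.6, 42.0, 44.2, 55.2, 76.4, 89.9, 111.5]

plt.plot(years, population, marker="o")
plt.xlabel("Census year")
plt.ylabel("Population (million)")
plt.title("Census population of Bangladesh, 1901-1991")
plt.show()
```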