Statistical Methods for Social Sciences
3(2-1)
Lecture 1
Index Number:
Concept of Index Number: According to Prof. Secrist, “Index Numbers are a series of
numbers by which changes in the magnitudes of a phenomena are measured from time to
time or from place to place”.
Suppose it is required to measure the general changes in the price of certain commodities.
Let the commodities be wheat, milk, eggs etc. We know that wheat is sold in Rs. per
maund, milk in Rs. per litre, eggs in Rs. per dozen. Obviously these commodities are not
comparable directly. In order to compare them, we compute the percentage change in
the price of each commodity from one period to another on the basis of the prices of
some selected date. These percentages are known as index numbers of the prices of
these commodities. The average of these percentages shows the general change in the
level of prices; this average is known as the wholesale price index number or general
index number.
Construction of Wholesale Price Index Numbers: The index number construction
involves the following steps;
1. Purpose of Index Number
2. Selection of Commodities
3. Selection of Prices
4. Choice of Base Period
5. Choice of Average
6. Choice of Proper Weights
1. Purpose of Index Number: Since all index numbers are not suitable for all
purposes, it is necessary to ascertain beforehand the purpose for which an index
number is to be constructed. For example, if it is required to study the effect of
increased prices on the labour class, cost of living index numbers should be
constructed, not wholesale price index numbers, which study variations in
general.
2. Selection of Commodities: It is not possible to include all the commodities
bought and sold in the market in the construction of index numbers, due to
financial and other difficulties; therefore, it is necessary to include only those
commodities which are most commonly used by the class of people for whom
the index numbers are being constructed. The selected commodities should
suit the tastes, habits, customs and requirements of that class of people. As regards
the number of commodities to be included, there is no hard and fast rule.
Normally, the larger the number of commodities, the smaller the chance of error
in the average obtained. Considering both economy and accuracy, the number of
commodities should be at least 23 for sensitive index numbers. The Pakistan
wholesale price index number of the Board of Economic Enquiry, Punjab,
Lahore includes 39 commodities. The British Board of Trade wholesale price
index number includes 200 commodities.
3. Selection of Prices: After selecting the commodities, price quotations should be
taken from prominent business houses, standard journals or magazines, from
different places at the same time. The price should not be quoted as so many units
per rupee but as so many rupees per unit. It is advisable to take wholesale prices
rather than retail prices. The price quotations should be taken daily, weekly,
fortnightly etc., depending upon the purpose of the index numbers.
4. Choice of Base Period: When the prices have been collected, the next step is to
reduce them into percentages or relatives by selecting a suitable base. It may be
either fixed base method or chain base method.
i. Fixed Base Method: In this method the base period is fixed and the prices of
subsequent years are expressed as relatives of the prices of the base year.
Method of calculating the price relatives by the Fixed Base Method:
In order to calculate the price relatives by the fixed base method, the price of
each year is divided by the price of the base year and this ratio is multiplied by 100.
Price relative for the current year = (Price for the current year/Price for the base
year) X 100
Symbolically it can be written as;
Y1 = (P1/Po) x 100
Y2 = (P2/Po) x 100
Y3= (P3/Po) x 100
Yn = (Pn/ Po) x 100
Where Po = Price for the base year
P1 = Price for the current year
Example 1. Compute the Index numbers by taking 1957 as the base year.
Years:           1955  1956  1957  1958  1959  1960  1961
Price of wheat:    14    15    16    17    18    19    20
Solution: Computation of Index Numbers
Year   Price   Index Number with 1957 = 100
1955    14     14/16 x 100 =  87.50
1956    15     15/16 x 100 =  93.75
1957    16     16/16 x 100 = 100.00
1958    17     17/16 x 100 = 106.25
1959    18     18/16 x 100 = 112.50
1960    19     19/16 x 100 = 118.75
1961    20     20/16 x 100 = 125.00
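As a sketch, the fixed base computation can be reproduced in a few lines of Python (the data are taken from Example 1; the function name is illustrative, not standard library code):

```python
# Fixed base method: divide each year's price by the base-year price, times 100.
prices = {1955: 14, 1956: 15, 1957: 16, 1958: 17, 1959: 18, 1960: 19, 1961: 20}

def fixed_base_relatives(prices, base_year):
    base_price = prices[base_year]
    return {year: round(p / base_price * 100, 2) for year, p in prices.items()}

index = fixed_base_relatives(prices, base_year=1957)
print(index[1955])  # 87.5
print(index[1961])  # 125.0
```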
ii. Chain Base Method: According to this method, the base is not fixed but
changes from year to year: the price of the preceding year is taken as the base
and the relatives are computed accordingly. If it is required to compute the price
relatives from 1955 to 1961, the price of 1955 is taken as 100; the relative for
1956 is then computed with 1955 as base, the relative for 1957 with 1956 as base,
and so on. The price relatives computed by this method are known as Link Relatives.
Method of Calculating the Link Relatives by the Chain Base Method. In order
to calculate link relatives by the chain base method, the price of the current year is
divided by the price of preceding year and this ratio is multiplied by 100.
Link Relative for the current year = (Price in the current year/ price in the
preceding year) x 100
Symbolically it can be written as;
L1 = (P1/P0) x 100
L2 = (P2/P1) x 100
L3 = (P3/P2) x 100
Ln = (Pn/Pn-1) x 100
Where P0 = Price in the first year
P1 = Price in the 2nd year
P2 = Price in the 3rd year, and so on.
Example 2. Use the following data of industrial production in Pakistan to
compare the annual fluctuations in Pakistan Industrial activity by the chain
base method.
Index Number of Industrial Production in Pakistan
Year:          1956  1957  1958  1959  1960  1961  1962  1963  1964
Index Number:   120   122   116   120   120   137   136   149   156
Solution: In order to construct the index numbers by the chain base method, we take
the relative for 1956 = 100 and then compute the other relatives, each with the
preceding year as base.
Year   Link Relative (Base year 1956 = 100)
1956   100
1957   122/120 x 100 = 101.67
1958   116/122 x 100 =  95.08
1959   120/116 x 100 = 103.45
1960   120/120 x 100 = 100.00
1961   137/120 x 100 = 114.17
1962   136/137 x 100 =  99.27
1963   149/136 x 100 = 109.56
1964   156/149 x 100 = 104.70
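The link relatives above can be computed programmatically; a minimal Python sketch (rounding to two decimals; the function name is illustrative):

```python
# Chain base method: each entry is expressed relative to the preceding one.
values = [120, 122, 116, 120, 120, 137, 136, 149, 156]  # 1956..1964

def link_relatives(series):
    rels = [100.0]  # the first year is taken as 100
    for prev, curr in zip(series, series[1:]):
        rels.append(round(curr / prev * 100, 2))
    return rels

print(link_relatives(values))
```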
5. Choice of Average: After computing the relatives, their average is taken to
get the required index number. Theoretically any average (such as the mean,
median, mode, geometric mean or harmonic mean) can be used in the construction
of an index number, but in practice the arithmetic mean and the geometric mean
are the most suitable, partly because the mean of relatives and the geometric
mean of prices are reversible. The arithmetic mean gives heavy weight to
commodities with high price relatives and light weight to those with low ones,
whereas the geometric mean gives more weight to small items and less weight to
large items.
Example 3. From the data given below, compute the index number of prices,
taking 1962 as base.
Commodity prices (in Rs.):
Year   Firewood   Soft coke   Kerosene oil   Matches
1962     3.25       2.50          0.22         0.06
1963     3.44       2.80          0.22         0.06
1964     3.50       2.00          0.25         0.06
1965     3.75       2.50          0.25         0.06
Solution: Calculation of Index Numbers (base 1962 = 100)
Year   Firewood                Soft coke               Kerosene oil            Matches   Total   Mean     Median
1962   100                     100                     100                     100       400     100.0    100.0
1963   3.44/3.25 x 100 = 106   2.80/2.50 x 100 = 112   100                     100       418     104.5    103.0
1964   3.50/3.25 x 100 = 108   2.00/2.50 x 100 = 80    0.25/0.22 x 100 = 114   100       402     100.5    104.0
1965   3.75/3.25 x 100 = 115   2.50/2.50 x 100 = 100   114                     100       429     107.25   107.0
Example 4. Construct Chain Indices for the following years, taking 1940 as
the base.
Year:    1940   1941   1942   1943   1944
Wheat:    2.8    3.4    3.6    4.0    4.2
Rice:    10.5   10.8   10.6   11.0   11.5
Maize:    2.7    3.2    3.5    3.8    4.0
Solution: Calculation of Chain Indices:
Year   Wheat                 Rice   Maize   Total   Average   Chain Index
1940   100                   100    100     300     100       100
1941   3.4/2.8 x 100 = 121   103    119     343     114       100 x 114/100 = 114.0
1942   3.6/3.4 x 100 = 106    98    109     313     104       114 x 104/100 = 118.6
1943   4.0/3.6 x 100 = 111   104    109     324     108       118.6 x 108/100 = 128.1
1944   4.2/4.0 x 100 = 105   105    105     315     105       128.1 x 105/100 = 134.5
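The chaining step (multiply each year's average link relative onto the previous chain index) can be sketched as follows; the averages are taken from the worked table and the function name is illustrative:

```python
# Chain indices: start at 100 and multiply forward by each average link relative.
avg_link_relatives = [100, 114, 104, 108, 105]  # 1940..1944

def chain_indices(avg_links):
    indices = [100.0]
    for link in avg_links[1:]:
        indices.append(round(indices[-1] * link / 100, 1))
    return indices

print(chain_indices(avg_link_relatives))  # [100.0, 114.0, 118.6, 128.1, 134.5]
```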
6. Choice of Proper Weights: When the average of the relatives is taken, we get the
required index number. While taking the average, all the commodities are treated
alike, whereas in actual practice some commodities are more important than
others, and as such they need weights in the construction of index numbers. Thus
weights are assigned to the commodities depending upon their relative importance.
Methods used in weighing the indices of prices:
i. Weighted Aggregate Method: According to this method, the current year’s
prices are multiplied by the base year quantities. The sum of the products so
obtained is then divided by the sum of products of the base year’s prices and base
year quantities.
Symbolically it can be written as;
Index Number for the current year = (Σ p1 qo/ Σ po qo) x 100
Where p1= Price for the current year
po = Price for the base year
qo = Quantity in the base year
Example 3. On analyzing bills of certain food items consumed at a club for the
years 1954 and 1955, the following tables are drawn up:
Items        1954 Price (po)   1955 Price (p1)   Quantity (qo)
Meat              2.2               2.0           100 seers
Fish              2.5               2.8            30 seers
Eggs              1.1               1.3            50 doz
Vegetables        0.7               0.7           100 seers
Fruit             0.3               0.2           150 Nos
Obtain an index by an appropriate method for indicating the relative change in
prices for 1955 over 1954.
Solution: Computation of the weighted index number for 1955 by the aggregate
expenditure method. Base 1954 = 100
Items        po    p1    qo    poqo   p1qo
Meat         2.2   2.0   100   220    200
Fish         2.5   2.8    30    75     84
Eggs         1.1   1.3    50    55     65
Vegetables   0.7   0.7   100    70     70
Fruit        0.3   0.2   150    45     30
Total                          465    449
Weighted Index Number = (Σ p1 qo/ Σ po qo) x 100 = 449/465 x 100 =96.56
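A minimal Python sketch of the weighted aggregate calculation, using the figures from this example (the dictionary layout and function name are illustrative):

```python
# Weighted aggregate method: (sum of p1*qo / sum of po*qo) x 100.
items = {  # item: (base price po, current price p1, base quantity qo)
    "Meat":       (2.2, 2.0, 100),
    "Fish":       (2.5, 2.8, 30),
    "Eggs":       (1.1, 1.3, 50),
    "Vegetables": (0.7, 0.7, 100),
    "Fruit":      (0.3, 0.2, 150),
}

def weighted_aggregate_index(items):
    num = sum(p1 * q0 for _, p1, q0 in items.values())
    den = sum(p0 * q0 for p0, _, q0 in items.values())
    return round(num / den * 100, 2)

print(weighted_aggregate_index(items))  # 96.56
```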
ii. Weighted Average of Relatives: In this method, the weights are the values of the
commodities in the base year, calculated from the aggregate expenditure on each
commodity. In order to calculate the aggregate expenditure of a commodity in the
base year, its quantity is multiplied by its price. The sum of the products of the price
relatives of the current year and the values of the base year is divided by the sum of
the weights (values). The resulting figure is the required index number for the
current year.
Symbolically it can be written as
Index Number for the current year = Σ IV / Σ V
Where I = Price relative of the current year = (p1/po) x 100
V = Value (weight) in the base year = po qo
Example 4. Calculate the weighted index number of the data given in
Example 3 by the method of weighted average of relatives.
Solution: Computation of the weighted index number for 1955 by the method of
weighted average of relatives. Base year 1954 = 100
Items        po    p1    Weights qo   Values V = po qo   I = p1/po x 100   I x V
Meat         2.2   2.0   100          220                 90.9             19998.0
Fish         2.5   2.8    30           75                112.0              8400.0
Eggs         1.1   1.3    50           55                118.2              6501.0
Vegetables   0.7   0.7   100           70                100.0              7000.0
Fruit        0.3   0.2   150           45                 66.7              3001.5
Total                                 465                                  44900.5
Weighted index number for the current year (1955) = Σ IV / Σ V = 44900.5/465 = 96.6
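A sketch of the weighted average of relatives in Python (same illustrative data layout as for the aggregate method). Note that with base-year values as weights, (p1/po x 100) x (po qo) reduces to p1 qo x 100, so this method gives the same index as the weighted aggregate method up to rounding:

```python
# Weighted average of relatives: sum(I*V)/sum(V) with I = p1/po*100, V = po*qo.
items = {  # item: (base price po, current price p1, base quantity qo)
    "Meat":       (2.2, 2.0, 100),
    "Fish":       (2.5, 2.8, 30),
    "Eggs":       (1.1, 1.3, 50),
    "Vegetables": (0.7, 0.7, 100),
    "Fruit":      (0.3, 0.2, 150),
}

def weighted_average_of_relatives(items):
    total_iv = sum((p1 / p0 * 100) * (p0 * q0) for p0, p1, q0 in items.values())
    total_v = sum(p0 * q0 for p0, _, q0 in items.values())
    return round(total_iv / total_v, 2)

print(weighted_average_of_relatives(items))  # 96.56
```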
Cost of Living Index Numbers: These are especially designed to study the effect of
changes in prices on the people as consumers.
Uses of Index Numbers: Following are some of the uses of Index Numbers:
1. Index numbers are used in the departments of Commerce, Meteorology, Labour
and Industries etc.
2. The insurance companies use index numbers for determining the probable
times of death or duration of life of those persons who are insured.
3. Index numbers work like a barometer, showing fluctuations in daily life, cost
of living, employment, public health etc.
4. Index numbers are useful in the study of changes in price levels over a period
of time.
5. Index numbers are of great use in forecasting. There are many forecasting
organizations which compile forecasting index numbers. Such index numbers
work well during periods of mild prosperity and depression but fail during
a severe depression.
6. Index numbers are of great help for studying physical changes over a period
of time. For example, various types of index numbers are compiled, such as the
index of industrial production and the index of factory production, which study
seasonal variations, trends etc.
7. Index numbers are of great help for the purpose of comparison among
different regions. For example, a Government may be interested in compiling
the cost of living index numbers of various regions within the country for the
purpose of establishing an equitable standard of living.
8. Index numbers are widely used by economists, social workers and
businessmen in order to measure changes in wages, prices, sales, stocks,
production and cost of living.
Lecture 2
Random Variable
The outcome of an experiment need not be a number, for example, the outcome when a
coin is tossed can be 'heads' or 'tails'. However, we often want to represent outcomes as
numbers. A random variable is a function that associates a unique numerical value with
every outcome of an experiment. The value of the random variable will vary from trial to
trial as the experiment is repeated.
There are two types of random variable - discrete and continuous.
A random variable has either an associated probability distribution (discrete random
variable) or probability density function (continuous random variable).
Examples
1. A coin is tossed ten times. The random variable X is the number of tails that are
noted. X can only take the values 0, 1, ..., 10, so X is a discrete random variable.
2. A light bulb is burned until it burns out. The random variable Y is its lifetime in
hours. Y can take any positive real value, so Y is a continuous random variable.
Expected Value
The expected value (or population mean) of a random variable indicates its average or
central value. Expected value gives a general impression of the behaviour of some
random variable without giving full details of its probability distribution (if it is discrete)
or its probability density function (if it is continuous).
Two random variables with the same expected value can have very different
distributions. There are other useful descriptive measures which affect the shape of the
distribution, for example variance.
The expected value of a random variable X is symbolised by E(X) or µ.
If X is a discrete random variable with possible values x1, x2, x3, ..., xn, and p(xi)
denotes P(X = xi), then the expected value of X is defined by:
E(X) = Σ xi p(xi)
where the elements are summed over all values of the random variable X.
If X is a continuous random variable with probability density function f(x), then the
expected value of X is defined by:
E(X) = ∫ x f(x) dx
where the integral is taken over the whole range of X.
Example
Discrete case : When a die is thrown, each of the possible faces 1, 2, 3, 4, 5, 6 (the xi's)
has a probability of 1/6 (the p(xi)'s) of showing. The expected value of the face showing
is therefore:
µ = E(X) = (1 x 1/6) + (2 x 1/6) + (3 x 1/6) + (4 x 1/6) + (5 x 1/6) + (6 x 1/6) = 3.5
Notice that, in this case, E(X) is 3.5, which is not a possible value of X.
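The die computation can be checked exactly with Python's fractions module (a sketch, not part of the text):

```python
# E(X) = sum of xi * p(xi) for a fair die, computed with exact fractions.
from fractions import Fraction

p = Fraction(1, 6)  # each face is equally likely
expected = sum(x * p for x in range(1, 7))
print(expected, float(expected))  # 7/2 3.5
```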
Variance
The (population) variance of a random variable is a non-negative number which gives an
idea of how widely spread the values of the random variable are likely to be; the larger
the variance, the more scattered the observations on average.
Stating the variance gives an impression of how closely concentrated round the expected
value the distribution is; it is a measure of the 'spread' of a distribution about its average
value.
Variance is symbolised by V(X), Var(X) or σ².
The variance of the random variable X is defined to be:
V(X) = E[(X - µ)²] = E(X²) - µ²
where µ = E(X) is the expected value of the random variable X.
Notes
a. the larger the variance, the further that individual values of the random variable
(observations) tend to be from the mean, on average;
b. the smaller the variance, the closer that individual values of the random variable
(observations) tend to be to the mean, on average;
c. taking the square root of the variance gives the standard deviation, i.e. SD(X) = √V(X);
d. the variance and standard deviation of a random variable are always non-negative.
See also sample variance.
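Continuing the die example, the variance can be computed from V(X) = E(X²) - [E(X)]² (a sketch with exact fractions):

```python
# V(X) = E(X^2) - [E(X)]^2 for a fair die.
from fractions import Fraction

p = Fraction(1, 6)
mu = sum(x * p for x in range(1, 7))        # E(X) = 7/2
ex2 = sum(x * x * p for x in range(1, 7))   # E(X^2) = 91/6
variance = ex2 - mu ** 2                    # 91/6 - 49/4 = 35/12
print(variance)  # 35/12
```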
Probability Distribution
The probability distribution of a discrete random variable is a list of probabilities
associated with each of its possible values. It is also sometimes called the probability
function or the probability mass function.
More formally, the probability distribution of a discrete random variable X is a function
which gives the probability p(xi) that the random variable equals xi, for each value xi:
p(xi) = P(X=xi)
It satisfies the following conditions:
a. 0 ≤ p(xi) ≤ 1 for every xi;
b. Σ p(xi) = 1, the sum being taken over all values xi.
Cumulative Distribution Function
All random variables (discrete and continuous) have a cumulative distribution function. It
is a function giving the probability that the random variable X is less than or equal to x,
for every value x.
Formally, the cumulative distribution function F(x) is defined to be:
F(x) = P(X ≤ x)
for -∞ < x < ∞.
For a discrete random variable, the cumulative distribution function is found by summing
up the probabilities as in the example below.
For a continuous random variable, the cumulative distribution function is the integral of
its probability density function.
Example
Discrete case : Suppose a random variable X has the following probability distribution
p(xi):
xi       0     1      2      3      4     5
p(xi)   1/32  5/32  10/32  10/32  5/32  1/32
This is actually a binomial distribution: Bi(5, 0.5) or B(5, 0.5). The cumulative
distribution function F(x) is then:
xi       0     1      2      3      4      5
F(xi)   1/32  6/32  16/32  26/32  31/32  32/32
F(x) does not change at intermediate values. For example:
F(1.3) = F(1) = 6/32
F(2.86) = F(2) = 16/32
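The cumulative sums and the step behaviour of F(x) can be reproduced directly (a sketch; the helper F is illustrative):

```python
# Build F(x) for the Bi(5, 0.5) example by accumulating the probabilities p(xi).
from fractions import Fraction
from itertools import accumulate

pmf = [Fraction(c, 32) for c in (1, 5, 10, 10, 5, 1)]  # x = 0, 1, ..., 5
cdf = list(accumulate(pmf))  # [1/32, 6/32, 16/32, 26/32, 31/32, 1]

def F(x):
    """Step function: constant between the integer support points."""
    if x < 0:
        return Fraction(0)
    return cdf[min(int(x), len(cdf) - 1)]

print(F(1.3) == F(1))  # True
print(F(2.86))         # 1/2, i.e. 16/32
```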
Probability Density Function
The probability density function of a continuous random variable is a function which can
be integrated to obtain the probability that the random variable takes a value in a given
interval.
More formally, the probability density function, f(x), of a continuous random variable X
is the derivative of the cumulative distribution function F(x):
f(x) = dF(x)/dx
Since F(x) = P(X ≤ x), it follows that:
P(a ≤ X ≤ b) = F(b) - F(a), the integral of f(x) from a to b.
If f(x) is a probability density function then it must obey two conditions:
a. the total probability for all possible values of the continuous random variable
X is 1: ∫ f(x) dx = 1, the integral being taken over the whole range of X;
b. the probability density function can never be negative: f(x) ≥ 0 for all x.
Discrete Random Variable
A discrete random variable is one which may take on only a countable number of distinct
values such as 0, 1, 2, 3, 4, ... Discrete random variables are usually (but not necessarily)
counts. If a random variable can take only a finite number of distinct values, then it must
be discrete. Examples of discrete random variables include the number of children in a
family, the Friday night attendance at a cinema, the number of patients in a doctor's
surgery, the number of defective light bulbs in a box of ten.
Continuous Random Variable
A continuous random variable is one which takes an infinite number of possible values.
Continuous random variables are usually measurements. Examples include height,
weight, the amount of sugar in an orange, the time required to run a mile.
Independent Random Variables
Two random variables X and Y say, are said to be independent if and only if the value of
X has no influence on the value of Y and vice versa.
The cumulative distribution functions of two independent random variables X and Y are
related by
F(x,y) = G(x).H(y)
where
G(x) and H(y) are the marginal distribution functions of X and Y for all pairs
(x,y).
Knowledge of the value of X does not affect the probability distribution of Y and vice
versa. Thus there is no relationship between the values of independent random variables.
For continuous independent random variables, their probability density functions are
related by
f(x,y) = g(x).h(y)
where
g(x) and h(y) are the marginal density functions of the random variables X and Y
respectively, for all pairs (x,y).
For discrete independent random variables, their probabilities are related by
P(X = xi ; Y = yj) = P(X = xi).P(Y=yj)
for each pair (xi,yj).
Probability-Probability (P-P) Plot
A probability-probability (P-P) plot is used to see if a given set of data follows some
specified distribution. It should be approximately linear if the specified distribution is the
correct model.
The probability-probability (P-P) plot is constructed using the theoretical cumulative
distribution function, F(x), of the specified model. The values in the sample of data, in
order from smallest to largest, are denoted x(1), x(2), ..., x(n). For i = 1, 2, ....., n, F(x(i))
is plotted against (i-0.5)/n.
Quantile-Quantile (QQ) Plot
A quantile-quantile (Q-Q) plot is used to see if a given set of data follows some specified
distribution. It should be approximately linear if the specified distribution is the correct
model.
The quantile-quantile (Q-Q) plot is constructed using the theoretical cumulative
distribution function, F(x), of the specified model. The values in the sample of data, in
order from smallest to largest, are denoted x(1), x(2), ..., x(n). For i = 1, 2, ....., n, x(i) is
plotted against F-1((i-0.5)/n).
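Both plots use the same plotting positions (i - 0.5)/n; a minimal sketch of how the horizontal coordinates are generated (the function name is illustrative):

```python
# Plotting positions (i - 0.5)/n used by both the P-P and Q-Q constructions.
def plotting_positions(n):
    return [(i - 0.5) / n for i in range(1, n + 1)]

print(plotting_positions(4))  # [0.125, 0.375, 0.625, 0.875]
```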
Normal Distribution
Normal distributions model (some) continuous random variables. Strictly, a Normal
random variable should be capable of assuming any value on the real line, though this
requirement is often waived in practice. For example, height at a given age for a given
gender in a given racial group is adequately described by a Normal random variable even
though heights must be positive.
A continuous random variable X, taking all real values in the range -∞ < x < ∞, is said to
follow a Normal distribution with parameters µ and σ² if it has probability density
function
f(x) = (1/(σ √(2π))) exp(-(x - µ)²/(2σ²))
We write X ~ N(µ, σ²).
This probability density function (p.d.f.) is a symmetrical, bell-shaped curve, centered at
its expected value µ. The variance is σ².
Many distributions arising in practice can be approximated by a Normal distribution.
Other random variables may be transformed to normality.
The simplest case of the normal distribution, known as the Standard Normal Distribution,
has expected value zero and variance one. This is written as N(0,1).
Poisson Distribution
Poisson distributions model (some) discrete random variables. Typically, a Poisson
random variable is a count of the number of events that occur in a certain time interval or
spatial area. For example, the number of cars passing a fixed point in a 5 minute interval,
or the number of calls received by a switchboard during a given period of time.
A discrete random variable X is said to follow a Poisson distribution with parameter m,
written X ~ Po(m), if it has probability distribution
P(X = x) = e^(-m) m^x / x!
where
x = 0, 1, 2, ...
m > 0.
The following requirements must be met:
a. the length of the observation period is fixed in advance;
b. the events occur at a constant average rate;
c. the number of events occurring in disjoint intervals are statistically independent.
The Poisson distribution has expected value E(X) = m and variance V(X) = m; i.e. E(X)
= V(X) = m.
The Poisson distribution can sometimes be used to approximate the Binomial
distribution with parameters n and p. When the number of observations n is large, and the
success probability p is small, the Bi(n,p) distribution approaches the Poisson distribution
with the parameter given by m = np. This is useful since the computations involved in
calculating binomial probabilities are greatly reduced.
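The quality of the approximation can be checked numerically; a sketch comparing Bi(100, 0.02) with Po(2), using only the standard library (function names are illustrative):

```python
# Compare the Binomial pmf with its Poisson approximation for large n, small p.
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, m):
    return exp(-m) * m**x / factorial(x)

n, p = 100, 0.02  # m = np = 2
for x in range(4):
    # the two columns agree to roughly two decimal places
    print(x, round(binom_pmf(x, n, p), 4), round(poisson_pmf(x, n * p), 4))
```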
Binomial Distribution
Binomial distributions model (some) discrete random variables.
Typically, a binomial random variable is the number of successes in a series of trials, for
example, the number of 'heads' occurring when a coin is tossed 50 times.
A discrete random variable X is said to follow a Binomial distribution with parameters n
and p, written X ~ Bi(n,p) or X ~ B(n,p), if it has probability distribution
P(X = x) = nCx p^x (1 - p)^(n - x)
where
x = 0, 1, 2, ..., n
n = 1, 2, 3, ...
p = success probability; 0 < p < 1
The trials must meet the following requirements:
a. the total number of trials is fixed in advance;
b. there are just two outcomes of each trial: success and failure;
c. the outcomes of all the trials are statistically independent;
d. all the trials have the same probability of success.
The Binomial distribution has expected value E(X) = np and variance V(X) = np(1-p).
Geometric Distribution
Geometric distributions model (some) discrete random variables. Typically, a Geometric
random variable is the number of trials required to obtain the first failure, for example,
the number of tosses of a coin until the first 'tail' is obtained, or a process where
components from a production line are tested, in turn, until the first defective item is
found.
A discrete random variable X is said to follow a Geometric distribution with parameter p,
written X ~ Ge(p), if it has probability distribution
P(X = x) = p^(x-1) (1 - p)
where
x = 1, 2, 3, ...
p = success probability; 0 < p < 1
The trials must meet the following requirements:
a. the total number of trials is potentially infinite;
b. there are just two outcomes of each trial: success and failure;
c. the outcomes of all the trials are statistically independent;
d. all the trials have the same probability of success.
The Geometric distribution has expected value E(X) = 1/(1 - p) and variance V(X) = p/(1 - p)².
The Geometric distribution is related to the Binomial distribution in that both are based
on independent trials in which the probability of success is constant and equal to p.
However, a Geometric random variable is the number of trials until the first failure,
whereas a Binomial random variable is the number of successes in n trials.
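The pmf above and its mean can be checked numerically (a sketch; truncating the sum at 2000 terms assumes the remaining tail is negligible, which holds for p = 0.7):

```python
# Geometric pmf P(X = x) = p^(x-1) * (1 - p); its mean should be 1/(1 - p).
def geometric_pmf(x, p):
    return p ** (x - 1) * (1 - p)

p = 0.7
mean = sum(x * geometric_pmf(x, p) for x in range(1, 2000))
print(round(mean, 4))  # close to 1/(1 - 0.7) = 3.3333
```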
Uniform Distribution
Uniform distributions model (some) continuous random variables and (some) discrete
random variables. The values of a uniform random variable are uniformly distributed
over an interval. For example, if buses arrive at a given bus stop every 15 minutes, and
you arrive at the bus stop at a random time, the time you wait for the next bus to arrive
could be described by a uniform distribution over the interval from 0 to 15.
A discrete random variable X taking n equally likely values x1, x2, ..., xn is said to
follow a discrete Uniform distribution if it has probability distribution
P(X = xi) = 1/n
for i = 1, 2, ..., n.
A discrete uniform distribution has equal probability at each of its n values.
A continuous random variable X is said to follow a Uniform distribution with parameters
a and b, written X ~ Un(a,b), if its probability density function f(x) = 1/(b - a) is constant
within a finite interval [a,b], and zero outside this interval (with a less than or equal to b).
The continuous Uniform distribution has expected value E(X) = (a + b)/2 and variance
V(X) = (b - a)²/12.
Central Limit Theorem
The Central Limit Theorem states that whenever a random sample of size n is taken
from any distribution with mean µ and variance σ², the sample mean X̄ will be
approximately normally distributed with mean µ and variance σ²/n. The larger the
value of the sample size n, the better the approximation to the normal.
This is very useful when it comes to inference. For example, it allows us (if the sample
size is fairly large) to use hypothesis tests which assume normality even if our data
appear non-normal. This is because the tests use the sample mean X̄, which the Central
Limit Theorem tells us will be approximately normally distributed.
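The theorem is easy to see in a small simulation; a sketch drawing sample means from a (non-normal) uniform distribution, where µ = 0.5 and σ² = 1/12 (the seed and sizes are arbitrary choices):

```python
# Sample means from Uniform(0, 1) cluster around mu with variance sigma^2 / n.
import random
import statistics

random.seed(42)
n = 30          # sample size
trials = 2000   # number of sample means to draw

means = [statistics.mean(random.uniform(0, 1) for _ in range(n))
         for _ in range(trials)]

print(round(statistics.mean(means), 2))      # close to mu = 0.5
print(round(statistics.variance(means), 4))  # close to (1/12)/30, about 0.0028
```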
Statistical Inference and Estimation
Statistical Inference: The process of drawing inference about a population on the basis
of information contained in a sample taken from the population is called statistical
inference.
Statistical inference is traditionally divided into two main branches: Estimation of
parameters and Testing of Hypothesis.
Estimation of Parameters: It is a procedure by which we obtain an estimate of the true
but unknown value of a population parameter by using the sample observations X1,
X2,…..,Xn. For example we may estimate the mean and the variance of population by
computing the mean and variance of the sample drawn from the population.
Testing of Hypothesis: It is a procedure which enables us to decide, on the basis of
information obtained from sampling, whether to accept or reject any specified statement.
Estimates and Estimators: An estimate is a numerical value of the unknown parameter
obtained by applying a rule or a formula called as estimator, to a sample X1, X2,…, Xn
of size n, taken from population.
For example, if X1, X2, ..., Xn is a random sample of size n from a population with mean
μ, then X̄ = (1/n) ΣXi is an estimator of μ, and x̄, the value of X̄ computed from a
particular sample, is an estimate of μ.
Kinds of Estimates
There are two kinds of estimates as;
1. Point Estimates
2. Interval Estimates
1. Point Estimates: When an estimate of unknown population parameter is
expressed by a single value, it is called point estimate.
2. Interval Estimates: An estimate expressed by a range of values within which true
value of the population parameters is believed to lie, is referred to as an interval
estimate.
Suppose we wish to estimate the average height of a very large group of students on the
basis of a sample. If we find the sample average height to be 64 inches, then 64 inches is
a point estimate of the unknown population mean. If, on the other hand, we state that the
true average height is a value between 62 and 66 inches, that is an interval estimate.
Example: A random sample of n=6 has the elements of 6, 10, 13, 14, 18 and 20.
Compute a point estimate of i) The population mean ii) The Population Standard
Deviation iii) The Standard Error of Mean.
i) The sample mean is
X̄ = (1/n) ΣXi = (6 + 10 + 13 + 14 + 18 + 20)/6 = 81/6 = 13.5
Thus the point estimate of the population mean μ is 13.5, and X̄ is the estimator.
ii) The sample standard deviation is
S = √[(1/n) Σ(Xi - X̄)²]
= √{[(6-13.5)² + (10-13.5)² + (13-13.5)² + (14-13.5)² + (18-13.5)² + (20-13.5)²]/6}
= 4.68
Thus the point estimate of the population standard deviation σ is 4.68, and S is the
estimator.
iii) When the sample size is less than 5% of the population size, the standard error of the
mean is
SX̄ = S/√n = 4.68/√6 = 1.91
Hence SX̄ is the estimator for σX̄, and 1.91 is the point estimate of the standard error of
the mean.
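The three point estimates can be reproduced directly (a sketch; note the divisor n rather than n - 1, matching the formula used in the text):

```python
# Point estimates for the n = 6 sample: mean, standard deviation, standard error.
from math import sqrt

data = [6, 10, 13, 14, 18, 20]
n = len(data)

mean = sum(data) / n                              # 13.5
s = sqrt(sum((x - mean) ** 2 for x in data) / n)  # about 4.68
se = s / sqrt(n)                                  # about 1.91

print(round(mean, 2), round(s, 2), round(se, 2))  # 13.5 4.68 1.91
```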
Testing of Hypothesis and Significance:
Hypothesis Testing: Hypothesis testing is a procedure which enables us to decide on the
basis of information obtained from sample data whether to accept or reject a statement.
For example, agriculturists may hypothesize that farmers who are aware of new
technology will be the most productive. With statistical techniques, we are able to decide
whether or not our theoretical hypothesis is confirmed by the empirical evidence.
Null Hypothesis: The null hypothesis (written as H0) is a statement framed in such a
way that there is no difference between two items. When we test the null hypothesis, we
determine a P value, which measures how likely data at least as extreme as those
observed would be if the null hypothesis were true.
Alternate Hypothesis: If it is unlikely that the null hypothesis is true, then we reject
the null hypothesis in favour of an alternate hypothesis (written as HA), which states
that the two items are not equal.
Simple and Composite Hypothesis: A statistical hypothesis that completely specifies the
distribution is called a Simple Hypothesis; otherwise it is called a Composite Hypothesis.
Descriptive Statistics: Figures associated with the number of births, the number of
employees, and other data that the average person calls "statistics".
Characteristic: to describe the characteristics of a population or sample.
Inferential Statistics: It is used to make an inference about a whole population from a
sample. For example, when a firm test-markets a new product in D.I.Khan, it wishes to
make an inference from these sample markets to predict what will happen throughout
Pakistan.
Characteristic: To generalize from sample to the population.
Determining Sample Size: Three factors are required to specify sample size.
1. The variance or heterogeneity of the population, i.e. the standard deviation (S). Only a small sample is required if the population is homogeneous. For example, predicting the average age of college students requires a smaller sample than predicting the average age of people visiting the zoo on a given Sunday afternoon.
2. The magnitude of acceptable error, E. It indicates how precise the estimate must be.
3. The confidence level, expressed through Z.
Sample size (n) = (ZS/E)²
Suppose a survey researcher studying expenditure on major crops wishes to have a 95% confidence level (for which the table value is Z = 1.96) and a range of error (E) of less than Rs. 2. The estimate of the standard deviation is Rs. 29.
n = (ZS/E)² = [(1.96 × 29)/2]² = 808
If the range of error (E) is acceptable at Rs. 4, the sample size is reduced.
n = (ZS/E)² = [(1.96 × 29)/4]² = 202
Thus doubling the range of acceptable error reduces the sample size to one quarter of its original size, and vice versa.
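The sample-size formula above is easy to check numerically. The short sketch below (the helper name is ours) computes n = (ZS/E)² and rounds up to the next whole observation, reproducing both figures from the expenditure example:

```python
import math

def sample_size(z: float, s: float, e: float) -> int:
    """Sample size needed so that a mean is estimated to within
    +/- e at the confidence level implied by z: n = (z*s/e)**2."""
    return math.ceil((z * s / e) ** 2)

# Figures from the expenditure survey: Z = 1.96, S = Rs. 29
print(sample_size(1.96, 29, 2))   # E = Rs. 2 -> 808
print(sample_size(1.96, 29, 4))   # E = Rs. 4 -> 202
```

Doubling E from 2 to 4 cuts n from 808 to 202, i.e. to roughly one quarter, as stated above.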
Figure 1: Sample size and error are inversely related (as the acceptable error grows, the required sample size shrinks).
Confidence Interval: In statistical terms, increasing the sample size decreases the width of the confidence interval at a given confidence level. When the standard deviation of the population is unknown, a confidence interval is calculated by using the following formula:
X̄ ± Z S/√n
The range of error E (the half-width of the interval, Z times the standard error of the mean) is
E = Z S/√n
If n increases, E is reduced, and vice versa.
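The inverse relation between n and E can be seen directly. A minimal sketch (the helper name is ours), using the same S = 29 as the sample-size example:

```python
import math

def confidence_interval(xbar, s, n, z=1.96):
    """Interval xbar +/- Z*S/sqrt(n); z = 1.96 gives a 95% level.
    E = Z*S/sqrt(n) is the half-width (range of error)."""
    e = z * s / math.sqrt(n)
    return xbar - e, xbar + e

# With S = 29, quadrupling n from 202 to 808 halves E (from ~4 to ~2)
low, high = confidence_interval(100, 29, 202)
print(round((high - low) / 2, 2))   # 4.0
low, high = confidence_interval(100, 29, 808)
print(round((high - low) / 2, 2))   # 2.0
```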
Level of Significance: The significance level of a test is the probability used as a standard for rejecting the null hypothesis H0 when H0 is assumed to be true. It is denoted by α; widely used levels of significance are 1% (0.01), 5% (0.05) and 10% (0.10).
Level of Confidence: The quantity 1 − α is called the level of confidence, i.e. 99% (0.99), 95% (0.95) or 90% (0.90).
Rejection and Acceptance Region: The possible results of an experiment can be
divided into two groups;
A. Results appearing consistent with the hypothesis.
B. Results leading us to reject the hypothesis.
Group A is called the acceptance region, while group B is called the rejection region or critical region. The dividing line between these two regions is determined by the level of significance (α). All possible values which a test statistic may assume can be divided into two mutually exclusive groups: one group (A) consisting of values which appear to be consistent with the null hypothesis, and the other (B) consisting of values which are unlikely to occur if H0 is true. For example, if the calculated value of the test statistic is higher than its table value, it falls in the rejection (critical) region and H0 is rejected; otherwise the value falls in the acceptance region and H0 is accepted.
Types of Errors: In testing a hypothesis, two types of errors can be committed:
A) we reject a hypothesis when it is in fact true;
B) we accept a hypothesis when it is actually false.
The former, i.e. rejection of H0 when it is true, is called a Type I Error, and the latter, i.e. acceptance of H0 when it is false, is called a Type II Error, as presented in the following table.

                    True Situation
Decision      H0 is true                       H0 is false
Accept H0     Correct decision                 Wrong decision (Type II Error)
Reject H0     Wrong decision (Type I Error)    Correct decision
Test Statistic: A test statistic is a function obtained from the sample data on which the decision to reject or accept H0 is based; in other words, it is a method that provides a basis for testing a null hypothesis. Examples are the t-test, Z-test, F-test, chi-square test, ANOVA, etc.
One tailed/sided and two tailed/sided: A test of a statistical hypothesis where the alternative hypothesis is one sided, such as
H0 : µ = µo
H1 : µ > µo or H1 : µ < µo
is called a one-tailed test. The critical region for H1 : µ > µo lies entirely in the right tail of the distribution (area α on the right, 1 − α to its left), and the critical region for H1 : µ < µo lies entirely in the left tail (area α on the left, 1 − α to its right).
A test of a statistical hypothesis where the alternative hypothesis H1 is two sided, such as
H0 : µ = µo
H1 : µ ≠ µo
is called a two-tailed test. Its critical region is split between both tails of the distribution, with area α/2 in each tail and 1 − α in the middle.
General Procedure for Testing Hypothesis:
The procedure for testing a hypothesis about a population parameter involves the
following six steps,
1. State your problem and formulate an appropriate null hypothesis H0 together with an alternative hypothesis H1, which is to be accepted when H0 is rejected.
2. Decide upon a significance level α of the test, which is the probability of rejecting the null hypothesis if it is true.
3. Choose an appropriate test statistic, and determine and sketch the sampling distribution of the test statistic, assuming H0 is true.
4. Determine the rejection or critical region in such a way that the probability of rejecting the null hypothesis H0, if it is true, is equal to the significance level α. The location of the critical region depends upon the form of H1; the significance level separates the acceptance region from the rejection region.
5. Compute the value of the test statistic from the sample data in order to decide whether to accept or reject the null hypothesis H0.
6. Formulate the decision rule as below:
a) Reject the null hypothesis H0 if the computed value of the test statistic falls in the rejection region, and conclude that H1 is true.
b) Accept the null hypothesis H0 otherwise. When a hypothesis is rejected, we can give a measure of the strength of the rejection by giving the P-value, the smallest significance level at which the null hypothesis would be rejected.
Example
A random sample of n = 25 values gives X̄ = 83. Can this sample be regarded as drawn from a normal population with mean µ = 80 and σ = 7?
Solution
i) We formulate our null and alternate hypotheses as
H0: µ = 80 and H1: µ ≠ 80 (two sided).
ii) We set the significance level at α = 0.05.
iii) The test statistic to be used is Z = (X̄ − µ)/(σ/√n), which under the null hypothesis is a standard normal variable.
iv) The critical region for α = 0.05 is │Z│ ≥ 1.96; the hypothesis will be rejected if the sample gives │Z│ ≥ 1.96.
v) We calculate the value of Z from the sample data:
Z = (83 − 80)/(7/√25) = 3/1.4 = 2.14
vi) Conclusion: Since our calculated value Z = 2.14 falls in the critical region, we reject our null hypothesis H0: µ = 80 and accept H1: µ ≠ 80. We conclude that the sample with X̄ = 83 cannot be regarded as drawn from the population with µ = 80.
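The Z computation in step v) can be reproduced in a few lines (the function name is ours):

```python
import math

def z_statistic(xbar, mu0, sigma, n):
    """One-sample Z statistic: Z = (xbar - mu0) / (sigma / sqrt(n)),
    for a test about a mean when sigma is known."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

z = z_statistic(83, 80, 7, 25)
print(round(z, 2))       # 2.14
print(abs(z) > 1.96)     # True -> reject H0 at alpha = 0.05
```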
Tests based on the Normal Distribution: Two parameters are used in this distribution: µ (the population mean) and σ² (the population variance). Let (x1, x2, ..., xn) be a sample from N(µ, σ²), a normal distribution. It is desired to test H0: µ = µo, where µo is some predetermined value of µ. Here two cases arise.
Case I, σ² known: When σ² is known, the sample mean X̄ is normally distributed with mean µ and variance σ²/n, so
Z = (X̄ − µ)/√(σ²/n) = (X̄ − µ)/(σ/√n)
where Z is a standard normal variable. Under H0,
Z = (X̄ − µo)/(σ/√n)
The critical region always depends on H1. We know that
H0 : µ = µo
A. Either H1 : µ ≠ µo (two-tailed test), or
B1. H1 : µ > µo, a one-tailed test (right-hand side), or B2. H1 : µ < µo, a one-tailed test (left-hand side).
In the two-tailed case, if │Z│ > Zα/2, reject H0; otherwise accept H0.
Example: A research worker is interested in testing the effect of a fertilizer on wheat production, which has an average yield of 40 kg/acre with σ² = 25. He selected at random 16 acres of land which were similar in all respects. The wheat was sown and the fertilizer was applied. The yields of the 16 plots were observed to be 40, 44, 43, 43, 41, 40, 41, 44, 42, 41, 42, 43, 46, 40, 38, 44. Test the claim that the yield of wheat is not increased by the fertilizer.
Answer: We formulate the hypotheses as
H0 : µ = µo (µo = 40 kg)
H1 : µ > µo (one-sided right-hand test)
The level of significance α is 0.05.
The test statistic to be used is Z = (X̄ − µo)/(σ/√n).
With the given values µo = 40 kg, X̄ = ∑X/n = 672/16 = 42 and σ² = 25,
Z = (42 − 40)/(5/√16) = 1.6
Critical region: Z > Zα = Z0.05 = 1.645. Since 1.6 < 1.645, the computed value does not fall in the critical region.
Conclusion: Hence we accept H0, which means that there is no effect of the fertilizer on the increase of wheat production.
Critical values of Z in the form of a table:

Level of Significance   Two-Tailed Test       One-Tailed Test
0.10                    ± 1.645 (± Zα/2)      ± 1.28 (± Zα)
0.05                    ± 1.96 (± Zα/2)       ± 1.645 (± Zα)
0.01                    ± 2.58 (± Zα/2)       ± 2.33 (± Zα)
Example: It is hypothesized that the average diameter of the leaves of a certain tree is 20.4 mm with a standard deviation of 2.0 mm. To check this supposition, we select a random sample of 16 leaves and find that their mean is 22 mm. Test whether the sample supports this hypothesis.
Answer: We formulate our hypotheses as
H0 : µ = µo (µo = 20.4 mm)
H1 : µ ≠ µo (two-sided test)
Level of significance α = 0.05.
The test statistic to be used is
Z = (X̄ − µ)/(σ/√n)
The known values are µo = 20.4 mm, σ = 2.0 mm, n = 16, X̄ = 22.0 mm.
Hence, by putting in the values, Z = (22 − 20.4)/(2/√16) = 3.2.
The critical region is │Z│ > Zα/2. As 3.2 > 1.96, H0 is rejected, so this sample does not support the hypothesis.
Note: In Case 2, if σ² is unknown but the sample is large, the same formula is applied with S in place of σ: Z = (X̄ − µ)/(S/√n) (under H0).
t-Distribution: The t-test is used to test hypotheses about means when the population variance is unknown. When σ² is unknown and n > 30, the Z-test can still be used with σ replaced by S. When n < 30 and the population standard deviation is unknown, the t-test is used instead of the Z-test. The t-statistic is expressed as
t = (X̄ − µo)/(S/√n)
with n − 1 degrees of freedom.
Example: Ten students are chosen at random from a normal population and their heights in inches are found to be 63, 63, 66, 67, 68, 69, 70, 70, 71, 71. In the light of the above data, is the mean height of the population 66 inches?
Answer: We formulate our hypothesis as;
H0 : µ = µo (µo = 66 inches)
H1 : µ ≠ µo (two tailed test)
Level of significance (α) is 0.05
While the test statistic is t = (X̄ − µo)/(S/√n) with n − 1 df
Computations

X       Dx = X − PM (PM = 68)   Dx²
63      −5                      25
63      −5                      25
66      −2                       4
67      −1                       1
68       0                       0
69       1                       1
70       2                       4
70       2                       4
71       3                       9
71       3                       9
Total   ∑Dx = −2                ∑Dx² = 82

X̄ = PM + ∑Dx/n = 68 + (−2/10) = 67.8
S² = {1/(n − 1)} {∑Dx² − (∑Dx)²/n} = (1/9){82 − (−2)²/10} = 9.066
S = √9.066 = 3.011
Now putting the values in the formula,
t = (X̄ − µo)/(S/√n) = (67.8 − 66)/(3.011/√10) with n − 1 df
t = 1.89 with 9 degrees of freedom
Here the critical region is │t│> tα/2(n − 1); since
│1.89│< t0.025(9) = 2.26,
H0 is accepted, so we conclude that the mean height of the population is 66 inches.
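The whole t computation can be checked with Python's statistics module, whose stdev uses the same n − 1 divisor as the formula above:

```python
import math
from statistics import mean, stdev

heights = [63, 63, 66, 67, 68, 69, 70, 70, 71, 71]
mu0 = 66

n = len(heights)
xbar = mean(heights)                  # 67.8
s = stdev(heights)                    # sample SD, n-1 divisor
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(xbar, 1), round(s, 3), round(t, 2))   # 67.8 3.011 1.89
print(abs(t) < 2.262)                 # True -> accept H0
```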
Testing the equality of two means: We test the equality of two means, or the difference of two means, when the population variances are equal (σ1² = σ2²) but unknown. Suppose X1, X2, ..., Xn1 and Y1, Y2, ..., Yn2 are two independent random small samples with means X̄ and Ȳ, drawn from two normal populations with means µ1 and µ2 and the same unknown population variance. We wish to test the hypothesis that the two population means are the same.
t = {(X̄ − Ȳ) − (µ1 − µ2)}/{sp √(1/n1 + 1/n2)}
Since the population standard deviation σ is unknown, we use the pooled standard deviation sp:
sp = √[{1/(n1 + n2 − 2)} [{∑X² − (∑X)²/n1} + {∑Y² − (∑Y)²/n2}]]
Example: In a test, two groups obtained the following marks:
X: 9, 11, 13, 11, 15, 9, 12, 14
Y: 10, 12, 10, 14, 9, 8, 10
Is there any difference between their population means?
Answer: Formulate the hypotheses as
H0: µ1 − µ2 = 0
H1: µ1 − µ2 ≠ 0
Level of significance α = 0.05.
The test statistic is t = (X̄ − Ȳ)/{sp √(1/n1 + 1/n2)} with n1 + n2 − 2 degrees of freedom.
Computations

X       Y       X²      Y²
9       10       81     100
11      12      121     144
13      10      169     100
11      14      121     196
15       9      225      81
9        8       81      64
12      10      144     100
14              196
∑X = 94  ∑Y = 73  ∑X² = 1138  ∑Y² = 785

By putting in the values,
sp = √[{1/(n1 + n2 − 2)} [{∑X² − (∑X)²/n1} + {∑Y² − (∑Y)²/n2}]] = √[(1/13){(1138 − 94²/8) + (785 − 73²/7)}] = 2.098
Hence t = (X̄ − Ȳ)/{sp √(1/n1 + 1/n2)} = (11.75 − 10.43)/{2.098 √(1/8 + 1/7)} = 1.22
Here the critical region is │t│> tα/2(n1 + n2 − 2); since
│1.22│< t0.025(13) = 2.16,
H0 is accepted, so we conclude that there is no difference between their population means.
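The pooled-variance computation can be verified numerically. Carrying the unrounded means through gives t ≈ 1.22 (hand calculations that round the means first come out slightly higher); the conclusion is the same either way:

```python
import math
from statistics import mean

x = [9, 11, 13, 11, 15, 9, 12, 14]
y = [10, 12, 10, 14, 9, 8, 10]

n1, n2 = len(x), len(y)
ssx = sum(v * v for v in x) - sum(x) ** 2 / n1   # corrected SS for x
ssy = sum(v * v for v in y) - sum(y) ** 2 / n2   # corrected SS for y
sp = math.sqrt((ssx + ssy) / (n1 + n2 - 2))      # pooled SD
t = (mean(x) - mean(y)) / (sp * math.sqrt(1 / n1 + 1 / n2))
print(round(sp, 3), round(t, 2))   # 2.098 1.22
print(abs(t) < 2.16)               # True -> accept H0 (13 df)
```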
Testing hypotheses about two means with paired observations:
Example: Ten young recruits were put through a physical training programme. Their weights were recorded before and after the training with the following results.

Recruit          1    2    3    4    5    6    7    8    9   10
Weight before  125  195  160  171  140  201  170  176  195  139
Weight after   136  201  158  184  145  195  175  190  190  145
Using α = 0.05, would you say that the programme affects the average weights of the recruits? Assume the distributions of weights before and after to be approximately normal.
Answer: We state our null and alternate hypotheses as
H0 : µD = 0
H1 : µD ≠ 0
Level of significance α = 0.05.
The test statistic under H0 is
t = d̄/(sd/√n) with n − 1 degrees of freedom.
Computations

Recruit   Weight before   Weight after   Difference di (after − before)   di²
1         125             136             11                              121
2         195             201              6                               36
3         160             158             −2                                4
4         171             184             13                              169
5         140             145              5                               25
6         201             195             −6                               36
7         170             175              5                               25
8         176             190             14                              196
9         195             190             −5                               25
10        139             145              6                               36
Total                                     ∑di = 47                        ∑di² = 673

d̄ = ∑di/n = 47/10 = 4.7
sd² = {1/(n − 1)} [∑di² − (∑di)²/n] = (1/9)[673 − (47)²/10] = 50.23
sd = 7.09
Now by putting the values in the formula,
t = d̄/(sd/√n) = 4.7/(7.09/√10) = 2.09
Here the critical region is │t│> tα/2(n − 1); since
│2.09│< t0.025(9) = 2.262,
H0 is accepted, so we conclude that the training programme does not affect the average weights of the recruits.
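The paired computation can be verified as follows (hand calculations that round sd to 7.09 first give 2.09; the unrounded t is about 2.10, with the same conclusion):

```python
import math

before = [125, 195, 160, 171, 140, 201, 170, 176, 195, 139]
after = [136, 201, 158, 184, 145, 195, 175, 190, 190, 145]

d = [a - b for a, b in zip(after, before)]        # paired differences
n = len(d)
dbar = sum(d) / n                                 # 4.7
sd = math.sqrt((sum(v * v for v in d) - sum(d) ** 2 / n) / (n - 1))
t = dbar / (sd / math.sqrt(n))
print(round(dbar, 1), round(sd, 2))   # 4.7 7.09
print(abs(t) < 2.262)                 # True -> accept H0 (9 df)
```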
Note: When the two samples are independent, the test statistic is
t = {(X̄ − Ȳ) − ∆}/{sp √(1/n1 + 1/n2)}
with n1 + n2 − 2 degrees of freedom.
Note: To test the significance of a correlation coefficient r by the t-test, the test statistic is
t = {r√(n − 2)}/√(1 − r²) with n − 2 df.
Chi-Square Test: A test of goodness of fit is a technique by which we test the hypothesis that the sample distribution is in agreement with a theoretical (hypothetical) distribution. Symbolically, it can be expressed as
X² = ∑(oi − ei)²/ei
where X² = chi-square, oi = observed values, ei = expected values.
Procedure:
1. State the null hypothesis H0, which is usually that the sample distribution agrees with the theoretical (hypothetical) distribution.
2. Set the level of significance, e.g. α = 0.05.
3. The test statistic is X² = ∑(oi − ei)²/ei.
4. Critical region: reject H0 if X²cal > X²0.05 with the appropriate degrees of freedom (for a contingency table, (r − 1)(c − 1)).
Example: The following table shows the academic condition of 100 people by sex. Is there any relationship between sex and academic condition?

Academic condition   Male   Female   Total
Strong                30      10       40
Poor                  20      40       60
Total                 50      50      100
Answer: We formulate the hypotheses as
H0: There is no relationship between sex and academic condition.
H1: There is a relationship between sex and academic condition.
Level of significance α = 0.05.
The test statistic is X² = ∑(oi − ei)²/ei with (r − 1)(c − 1) degrees of freedom.
Computations
The expected frequencies are eij = (row total × column total)/grand total:
e11 = (40 × 50)/100 = 20
e12 = (40 × 50)/100 = 20
e21 = (60 × 50)/100 = 30
e22 = (60 × 50)/100 = 30

Oij    eij    Oij − eij   (Oij − eij)²   (Oij − eij)²/eij
30     20      10          100            5
10     20     −10          100            5
20     30     −10          100            3.33
40     30      10          100            3.33
Total                                    16.66

X² = ∑(oi − ei)²/ei with (r − 1)(c − 1) degrees of freedom
By putting in the values, X² = 16.66.
Critical region: X²cal > X²0.05 with (r − 1)(c − 1) = 1 degree of freedom.
Since 16.66 > 3.84, H0 is rejected, which shows that there is a relationship between sex and academic condition.
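The expected frequencies and the X² statistic can be computed generically for any contingency table; exact arithmetic gives 16.67 rather than the truncated 16.66:

```python
observed = [[30, 10],     # Strong: male, female
            [20, 40]]     # Poor:   male, female

row_tot = [sum(r) for r in observed]          # [40, 60]
col_tot = [sum(c) for c in zip(*observed)]    # [50, 50]
grand = sum(row_tot)                          # 100

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / grand   # expected count
        chi2 += (o - e) ** 2 / e
print(round(chi2, 2))    # 16.67
print(chi2 > 3.84)       # True -> reject H0 (1 df)
```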
Example: Genetic theory states that children having one parent of blood type M and the other parent of blood type N will always be one of the three types M, MN, N, and that the proportions of the three types will on average be 1:2:1. A report says that out of 300 children having one M parent and one N parent, 30% were found to be of type M, 45% of type MN and the remainder of type N. Test whether the report is consistent with the genetic theory.
Answer: We formulate the hypotheses as
H0: The genetic theory is consistent with the report, i.e. the fit is good.
H1: The genetic theory is not consistent with the report, i.e. the fit is not good.
Level of significance α = 0.05.
The test statistic to be used is
X² = ∑(oi − ei)²/ei with (n − 1) degrees of freedom.
Computations:
O1 = (30 × 300)/100 = 90
O2 = (45 × 300)/100 = 135
O3 = (25 × 300)/100 = 75
e1 = (1 × 300)/4 = 75
e2 = (2 × 300)/4 = 150
e3 = (1 × 300)/4 = 75

Oi     ei     Oi − ei   (oi − ei)²   (oi − ei)²/ei
90     75      15        225          3
135    150    −15        225          1.5
75     75       0          0          0
Total                                 ∑(oi − ei)²/ei = 4.5

X² = ∑(oi − ei)²/ei with (n − 1) degrees of freedom
X² = 4.5
Critical region: X²cal > X²0.05 with (3 − 1) = 2 degrees of freedom, i.e. X² > 5.99.
Since 4.5 < 5.99, H0 is accepted, which shows that the genetic theory is consistent with the report, i.e. the fit is good.
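The goodness-of-fit computation for the 1:2:1 ratio is short enough to verify directly:

```python
n = 300
observed = [90, 135, 75]    # M, MN, N counts from the report
ratio = [1, 2, 1]           # proportions claimed by genetic theory

expected = [n * r / sum(ratio) for r in ratio]   # [75.0, 150.0, 75.0]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)          # 4.5
print(chi2 < 5.99)   # True -> accept H0 (2 df)
```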
Analysis of Variance (ANOVA): In simple words, the analysis of variance is defined as "a statistical device for partitioning the total variation into separate components that measure different sources of variation". We use the following terms in constructing an analysis of variance table:
1. Source of variation: A component of the experiment for which we calculate a sum of squares and a mean square.
2. Degrees of freedom: For a given set of conditions, the number of degrees of freedom is the total number of observations minus the number of restrictions imposed on the aggregate data.
3. Sum of squares: The sum of the squared deviations of each item from its mean, i.e. ∑(X − X̄)².
4. Mean square: A sum of squares divided by its respective degrees of freedom; it is also known as an estimate of variance (s²).
5. F ratio: The ratio of the treatment estimate of variance to the error estimate of variance is called the F ratio.
6. F tabulated: Fα(n1, n2), the table value of F at significance level α with n1 and n2 degrees of freedom.
The analysis of variance technique is applied under different criteria of classification, i.e. one-way (one-criterion) classification or two-way (two-criteria) classification.
ANOVA (TWO WAY WITHOUT INTERACTION):
ANOVA for a Randomized Block Design: To test for statistical significance in a randomized block design, the linear model for an individual observation is
Yij = µ + αj + ßi + εij
where
Yij = individual observation on the dependent variable
µ = grand mean
αj = jth treatment effect
ßi = ith block effect
εij = random error or residual
The statistical objective is to determine whether significant differences among treatment means and block means exist. This is done by calculating an F ratio for each source of effects.
Example: To illustrate the analysis of a Latin Square Design, consider an experiment in which the rows represent four different fertilizer treatments, the columns represent four years, and the letters A, B, C and D represent four varieties of wheat; yields are measured in kg per plot. It is assumed that the sources of variation do not interact. Using a 0.05 level of significance, test the hypotheses that:
a) H′0: There is no difference in the average yields of wheat when different kinds of fertilizers are used.
b) H″0: There is no difference in the average yields of wheat due to different years.
c) H‴0: There is no difference in the average yields of the four varieties of wheat.
Table: Yields of wheat in kg per plot

Fertilizer   1978    1979    1980    1981
T1           A 70    B 75    C 68    D 81
T2           D 66    A 59    B 55    C 63
T3           C 59    D 66    A 39    B 42
T4           B 41    C 57    D 39    A 55
Solutions:
Table: Yields of wheat in kg per plot

Fertilizer   1978    1979    1980    1981    Total
T1           A 70    B 75    C 68    D 81    294
T2           D 66    A 59    B 55    C 63    243
T3           C 59    D 66    A 39    B 42    206
T4           B 41    C 57    D 39    A 55    192
Total        236     257     201     241     935

1. a) H′0: α1 = α2 = α3 = α4 = 0
   b) H″0: β1 = β2 = β3 = β4 = 0
   c) H‴0: TA = TB = TC = TD = 0
2. a) H′1: at least one of the αi is not equal to zero
   b) H″1: at least one of the βi is not equal to zero
   c) H‴1: at least one of the Tk is not equal to zero
3. α = 0.05
4. Critical region: a) f1 > 4.76, b) f2 > 4.76, c) f3 > 4.76
5. Computations: From the table, we find the row, column and treatment totals to be
T1. = 294, T2. = 243, T3. = 206, T4. = 192
T.1 = 236, T.2 = 257, T.3 = 201, T.4 = 241
T..A = 223, T..B = 213, T..C = 247, T..D = 252
Hence SST = 70² + 75² + ... + 55² − 935²/16 = 2500
SSR = (294² + 243² + 206² + 192²)/4 − 935²/16 = 1557
SSC = (236² + 257² + 201² + 241²)/4 − 935²/16 = 418
SSTR = (223² + 213² + 247² + 252²)/4 − 935²/16 = 264
SSE = 2500 − 1557 − 418 − 264 = 261
Two-way Analysis of Variance (ANOVA) without interaction:

Source of variance   Sum of squares   Degrees of freedom   Mean square                   Computed f
Row means            SSR = 1557       r − 1 = 3            s1² = SSR/(r−1) = 519.00      f1 = s1²/s4² = 11.93
Column means         SSC = 418        c − 1 = 3            s2² = SSC/(c−1) = 139.33      f2 = s2²/s4² = 3.20
Treatments           SSTR = 264       r − 1 = 3            s3² = SSTR/(r−1) = 88.00      f3 = s3²/s4² = 2.02
Error                SSE = 261        (r−1)(r−2) = 6       s4² = SSE/{(r−1)(r−2)} = 43.5
Total                SST = 2500       15
Decisions:
a) Reject H′0 and conclude that a difference in the average yields of wheat exists when different kinds of fertilizers are used, as f1 = 11.93 while f0.05(3,6) = 4.76, so f1 > f0.05(3,6).
b) Accept H″0 and conclude that there is no difference in the average yields due to different years, as f2 = 3.20 while f0.05(3,6) = 4.76, so f2 < f0.05(3,6).
c) Accept H‴0 and conclude that there is no difference in the average yields of the four varieties of wheat, as f3 = 2.02 while f0.05(3,6) = 4.76, so f3 < f0.05(3,6).
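The Latin square sums of squares above can be verified numerically. Exact arithmetic gives f1 ≈ 11.92; the value 11.93 in the table comes from rounding the sums of squares to whole numbers first:

```python
# Rows = fertilizer treatments, columns = years, letters = varieties
yields = [[70, 75, 68, 81],
          [66, 59, 55, 63],
          [59, 66, 39, 42],
          [41, 57, 39, 55]]
variety = [["A", "B", "C", "D"],
           ["D", "A", "B", "C"],
           ["C", "D", "A", "B"],
           ["B", "C", "D", "A"]]

r = 4
all_vals = [v for row in yields for v in row]
cf = sum(all_vals) ** 2 / (r * r)                    # correction factor

sst = sum(v * v for v in all_vals) - cf
ssr = sum(sum(row) ** 2 for row in yields) / r - cf
ssc = sum(sum(col) ** 2 for col in zip(*yields)) / r - cf
tr = {}                                              # variety totals
for i in range(r):
    for j in range(r):
        tr[variety[i][j]] = tr.get(variety[i][j], 0) + yields[i][j]
sstr = sum(t * t for t in tr.values()) / r - cf
sse = sst - ssr - ssc - sstr

mse = sse / ((r - 1) * (r - 2))                      # error df = 6
f1, f2, f3 = (ssr / 3) / mse, (ssc / 3) / mse, (sstr / 3) / mse
print(round(f1, 2), round(f2, 2), round(f3, 2))
# close to the table's 11.93, 3.20, 2.02
```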
ANOVA (TWO WAY WITH INTERACTION) or Factorial Design:
There is considerable similarity between the factorial design and the one-way analysis of variance. The sum of squares for each of the treatment factors (rows and columns) is similar to the between-groups sum of squares in the single-factor model; that is, each treatment sum of squares is calculated by taking the deviation of the treatment means from the grand mean. In a two-factor experimental design, the linear model for an individual observation is
Yijk = µ + αj + ßi + Iij + εijk
where
Yijk = individual observation on the dependent variable
µ = grand mean
αj = jth effect of factor A (column treatment)
ßi = ith effect of factor B (row treatment)
Iij = interaction effect of factors A and B
εijk = random error or residual
Example: Use a 0.05 level of significance to test the following hypotheses:
a) H′0: There is no difference in the average yield of wheat when different kinds of fertilizers are used.
b) H″0: There is no difference in the average yield of the three varieties of wheat.
c) H‴0: There is no interaction between the different kinds of fertilizers and the different varieties of wheat.

Table: Yields of wheat in kg per plot (three plots per cell)

Fertilizer treatment   V1            V2            V3
T1                     64, 66, 70    72, 81, 64    74, 51, 65
T2                     65, 63, 58    57, 43, 52    47, 58, 67
T3                     59, 68, 65    66, 71, 59    58, 39, 42
T4                     58, 41, 46    57, 61, 53    53, 59, 38
Solutions:
Cell totals:

Fertilizer treatment   V1     V2     V3     Total
T1                     200    217    190    607
T2                     186    152    172    510
T3                     192    196    139    527
T4                     145    171    150    466
Total                  723    736    651    2110

1. a) H′0: α1 = α2 = α3 = α4 = 0
   b) H″0: β1 = β2 = β3 = 0
   c) H‴0: (αβ)11 = (αβ)12 = ... = (αβ)43 = 0
2. a) H′1: at least one of the αi is not equal to zero.
   b) H″1: at least one of the βj is not equal to zero.
   c) H‴1: at least one of the (αβ)ij is not equal to zero.
3. α = 0.05
4. Critical region: a) f1 > 3.01, b) f2 > 3.40, c) f3 > 2.51
5. Computations:
SST = total sum of squares = 64² + 66² + ... + 38² − 2110²/36 = 3779
SSR = row sum of squares = (607² + 510² + 527² + 466²)/9 − 2110²/36 = 1157
SSC = column sum of squares = (723² + 736² + 651²)/12 − 2110²/36 = 350
SS(RC) = sum of squares for interaction of rows and columns
= (200² + 186² + ... + 150²)/3 − (607² + 510² + 527² + 466²)/9 − (723² + 736² + 651²)/12 + 2110²/36
= (200² + 186² + ... + 150²)/3 − 124826 − 124019 + 123669 = 771
SSE = error sum of squares = SST − SSR − SSC − SS(RC) = 3779 − 1157 − 350 − 771 = 1501
Two-way Analysis of Variance (ANOVA) with interaction:

Source of variance   Sum of squares   Degrees of freedom   Mean square                          Computed f
Row means            SSR = 1157       r − 1 = 3            s1² = SSR/(r−1) = 385.66             f1 = s1²/s4² = 6.17
Column means         SSC = 350        c − 1 = 2            s2² = SSC/(c−1) = 175.00             f2 = s2²/s4² = 2.80
Interaction          SS(RC) = 771     (r−1)(c−1) = 6       s3² = SS(RC)/{(r−1)(c−1)} = 128.50   f3 = s3²/s4² = 2.05
Error                SSE = 1501       rc(n−1) = 24         s4² = SSE/{rc(n−1)} = 62.54
Total                SST = 3779       rcn − 1 = 35

Decisions:
a) Reject H′0 and conclude that a difference in the average yield of wheat exists when different kinds of fertilizers are used, as f1 = 6.17 while f0.05(3,24) = 3.01, so f1 > f0.05(3,24).
b) Accept H″0 and conclude that there is no difference in the average yield of the three varieties of wheat, as f2 = 2.80 while f0.05(2,24) = 3.40, so f2 < f0.05(2,24).
c) Accept H‴0 and conclude that there is no interaction between the different kinds of fertilizers and the different varieties of wheat, as f3 = 2.05 while f0.05(6,24) = 2.51, so f3 < f0.05(6,24).
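The two-factor sums of squares can likewise be verified. With exact arithmetic the F ratios come out to about 6.16, 2.79 and 2.05 (the small differences from the table are rounding of the sums of squares); all three decisions are unchanged:

```python
# data[t][v] = the three replicate yields for treatment t, variety v
data = [
    [[64, 66, 70], [72, 81, 64], [74, 51, 65]],   # T1
    [[65, 63, 58], [57, 43, 52], [47, 58, 67]],   # T2
    [[59, 68, 65], [66, 71, 59], [58, 39, 42]],   # T3
    [[58, 41, 46], [57, 61, 53], [53, 59, 38]],   # T4
]
r, c, n = 4, 3, 3
N = r * c * n

flat = [v for row in data for cell in row for v in cell]
cf = sum(flat) ** 2 / N                              # correction factor

sst = sum(v * v for v in flat) - cf
row_tot = [sum(sum(cell) for cell in row) for row in data]
col_tot = [sum(sum(data[i][j]) for i in range(r)) for j in range(c)]
ssr = sum(t * t for t in row_tot) / (c * n) - cf
ssc = sum(t * t for t in col_tot) / (r * n) - cf
ss_rc = (sum(sum(cell) ** 2 for row in data for cell in row) / n
         - ssr - ssc - cf)                           # interaction SS
sse = sst - ssr - ssc - ss_rc

msr, msc, msrc, mse = ssr / 3, ssc / 2, ss_rc / 6, sse / 24
print(round(msr / mse, 2), round(msc / mse, 2), round(msrc / mse, 2))
# close to the table's 6.17, 2.80, 2.05
```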