KENYA METHODIST UNIVERSITY
Department of Economics and Applied Statistics
STATISTICS AND BIOMETRICS
By
Prof. George K. King’oriah PhD., MBS.
P. O. Box 45240 NAIROBI - 00100
Kenya
Tel: (020) 2118443/2247987/2248172
SafariCom + 254 0725 – 751878 Celtel 0735 - 372326
Email: [email protected]
© KEMU SCHOOL OF BUSINESS AND ECONOMICS
GENERAL STATISTICS CURRICULUM OUTLINES
ELEMENTARY BUSINESS STATISTICS
The nature of variables, Sums and summation signs, Types of statistical data,
Samples, Sampling, Population and the universe. The functioning of factorials, frequency
distributions, Measures of central tendency like the mean, median, mode, proportion.
Measures of variability – Variance, and Standard deviation; and Standard errors.
Probability, Rules of probability, Permutations and combinations.
The binomial theorem and the approach to the normal curve. Random variables and probability distributions. The normal curve and the use of normal tables, the normal deviate "Z", the "t" distribution, the proportion "π" and the standard error of the proportion, confidence intervals and hypothesis testing, use of standard errors in statistics. Analysis of variance and the use of the "F" statistic; simple linear regression/correlation analysis; the Chi-square statistic (introduction).
INTERMEDIATE BUSINESS STATISTICS
Non-parametric statistics: Chi-square statistical analysis, Median test for one sample, Mann-Whitney test for two independent samples, Wilcoxon rank test for two matched samples, Kruskal-Wallis test for several independent samples; Rank order correlation, Spearman's rank correlation "ρs", Kendall's rank correlation coefficient, Kendall's partial tau "τ"; Biserial correlation. The Poisson distribution "λ", Bayes' Theorem, and posterior probability.
Non-linear regression and correlation, partial regression and correlation, Multiple regression by the Snedecor method, Multiple regression using linear equations and matrix algebra, significance testing for linear, non-linear and multiple regression and correlation parameters. Advanced analysis of variance, randomized block design and Latin squares.
ADVANCED BUSINESS AND APPLIED STATISTICS
Use of linear and non-linear regression/correlation analysis, use of multiple regression in the measurement of economic and non-economic variables, identification problems, Ordinary least squares and generalized least squares models and their use, models and model building, multicollinearity, heteroscedasticity, Two-stage least squares models, Maximum likelihood methods, Bayesian methods of estimation, use of dummy variables; Logit, Probit and Tobit models, Aggregation problems and their role in the estimation of structural variables, Index Numbers, Time Series Analysis, Autocorrelation, Analysis of seasonal fluctuations, cyclical movements and irregular fluctuations, Price Indices, Decision Theory and decision making.
Application of computer packages for data analysis (SPSS, Excel, E-Views, SHAZAM,
MINITAB, etc). Use of equation editors for statistical and mathematical presentations
STATISTICS AND BIOMETRICS
Purpose of the course
The course is designed to introduce the learner to the purpose and meaning of Statistics;
and to the use of statistics in research design, data collection, data analysis and research
reporting, especially in biometrics and agricultural research.
Objectives of the Course
By the end of the course the learner is expected to:
1. Understand and use sampling and sampling methods.
2. Understand the nature and use of Statistics.
3. Use Statistics for data collection, data analysis and report writing within the environment of biometric research.
4. Be able to participate in national and international fora and discussions which involve analysis and interpretation of statistical and biometric research data.
5. Be able to explain (in a simplified and understandable manner) the meaning of research data and findings to students, farmers, ordinary citizens, politicians and other stakeholders in the agricultural sector.
6. Be able to carry out biometric project evaluation, management and monitoring.
Course Description
Common concepts in Statistics, Graphic Presentation of Data, Measures of Central Tendency and Dispersion, Measures of Variability or Dispersion. Simple Rules of Probability, Counting Techniques, the Binomial Distribution. The Normal Distribution, the t distribution, the Proportion and the Normal Distribution. Sampling Methods, Hypothesis Testing, Confidence Intervals. The Chi-Square Distribution. Analysis of Variance. Linear Regression/Correlation. Partial Correlation. Logarithmic and Mathematical Transformations. Non-Linear Regression and Correlation. Non-parametric Statistics: the Median Test for One Sample, the Mann-Whitney Test for Two Independent Samples, the Wilcoxon Signed-Rank Test, the Kruskal-Wallis Test, Spearman's Rank Order Correlation. Use of statistical methods for measurement, data collection, analysis, and research reporting.
Course Outline
1. Statistics, Biostatistics, Measurement and Analysis
2. Basic Probability Theory
3. The Normal Distribution
4. Sampling Methods and Statistical Estimation
5. Confidence Intervals
6. Hypothesis Testing using Small and Large Samples
7. The Chi-Square Statistic and Testing for the Goodness of Fit
8. Analysis of Variance
9. Linear Regression and Correlation
10. Partial Regression and Multiple Regression
11. Non-Linear Regression and Correlation
12. Significance Testing using Regression and Correlation Methods
13. Non-Parametric Statistics
Teaching Methods
1. Classroom lectures and learner interaction in lectures and discussions
2. Distance study materials and frequent instructor supervision at distance learning centers
3. Students may be guided to design simple experiments and test various hypotheses, using various statistical and biometric methods.
Recommended Textbooks
Class Text Books
King’oriah, George K. (2004), Fundamentals of Applied Statistics, Jomo Kenyatta
Foundation, Nairobi.
Steel, Robert G.D. and James H. Torrie, (1980); Principles and Procedures of Statistics,
A Biometric Approach. McGraw Hill Book Company, New York.
Useful References (The most recent editions of these textbooks should be obtained)
Gibbons, Jean D. (1970), Non-Parametric Statistical Inference. McGraw-Hill Book Company, New York (N.Y.)
Keller, Gerald, Brian Warrack and Henry Bartel (1994), Statistics for Management and Economics. Duxbury Press, Belmont (California).
Levine, David M., David Stephan, et al. (2006), Statistics for Managers. Prentice Hall of India, New Delhi.
Pfaffenberger, Roger C. (1977), Statistical Methods for Business and Economics. Richard D. Irwin, Homewood, Illinois (U.S.A.)
Salvatore, Dominick, and Derrick Reagle (2002), Statistics and Econometrics. McGraw-Hill Book Company, New York (N.Y.)
Siegel, Sidney C. (1956), Non-Parametric Statistics for the Behavioral Sciences. McGraw-Hill Book Company, New York.
Snedecor, George W. and William G. Cochran (1967), Statistical Methods. Iowa University Press, Ames, Iowa (U.S.A.)
STATISTICS AND BIOMETRICS
By Prof. George K. King’oriah, B.A., M.I.S.K., M.Sc., Ph.D., M.B.S.

CHAPTER ONE: STATISTICS, BIOSTATISTICS, MEASUREMENT AND ANALYSIS
Objectives and uses of Statistics
Types of Statistics
Basic common concepts in Statistics
Graphic Presentation of Data
Measures of Central Tendency and Dispersion
Measures of variability or dispersion

CHAPTER TWO: BASIC PROBABILITY THEORY
Some basic terminology
Simple Rules of Probability
Counting Techniques
Binomial Distribution

CHAPTER THREE: THE NORMAL CURVE AS A PROBABILITY DISTRIBUTION
The Normal Distribution
Using Standard Normal Tables
The “t” distribution and Sample Data
The Proportion and Normal Distribution

CHAPTER FOUR: STATISTICAL INFERENCE USING THE MEAN AND PROPORTION
Sampling Methods
Hypothesis testing
Confidence Intervals
Type I and Type II errors
Confidence Interval using the proportion

CHAPTER FIVE: THE CHI-SQUARE STATISTIC AND ITS APPLICATIONS
The Chi-Square Distribution
Contingency Tables and Degrees of Freedom
Applications of the Chi-Square Test

CHAPTER SIX: ANALYSIS OF VARIANCE
Introduction
One-way Analysis of Variance
Two-Way Analysis of Variance

CHAPTER SEVEN: LINEAR REGRESSION AND CORRELATION
Introduction
Regression Equation of the Linear Form
The Linear Least Squares Line
Tests of Significance for Regression Coefficients
Coefficient of Determination and the Correlation Coefficient
Analysis of Variance for Regression and Correlation

CHAPTER EIGHT: PARTIAL REGRESSION, MULTIPLE LINEAR REGRESSION AND CORRELATION
Introduction
Partial Correlation
Computational techniques
Significance Tests
Analysis of Variance

CHAPTER NINE: NON-LINEAR CORRELATION AND REGRESSION
Logarithmic and other Mathematical Transformations
Non-Linear Regression and Correlation
Testing for Non-Linear Relationships
Significance Tests using Non-Linear Regression/Correlation

CHAPTER TEN: NON-PARAMETRIC STATISTICS
Need for Non-Parametric Statistics
Median Test for One Sample
Mann-Whitney Test for Two Independent Samples
Wilcoxon Signed-Rank Test
Kruskal-Wallis Test for Several Independent Samples
Spearman’s Rank Order Correlation
CHAPTER ONE:
STATISTICS, BIOSTATISTICS, MEASUREMENT AND ANALYSIS
Objectives and uses of Statistics
After working for some time in adult life, we are required to use statistics in our efforts to analyze data measured from real-life phenomena. The immediate question most of us ask is why statistics has not been necessary all this time, and why it is required now. The answer to this puzzle is that, in all the time we were not asked to use statistics, our lives may have been much simpler than now, and we did not require rigorous data analysis, rigorous proofs, or rigorous demonstration of data accuracy and veracity. Now that we are interested in all these things, I welcome you to one of the most powerful tools available to mankind for data collection, data analysis, and data reporting or presentation. This tool is largely a product of the technological developments of the 19th and 20th centuries. That is when it became necessary to be accurate with data of all kinds, because of the nature of the production mechanisms in the industries that sprang up as a result of the industrial revolution, in all current human endeavor, and in the information age (with its computers and allied equipment) at the dawn of the 21st century. In this regard, for example, we are never satisfied by reports from anybody that something is bigger or better than another. The technically minded man or woman of today needs to know how much bigger the object is, and how much better it is. This means that there must always be reliable data upon which to base our decision making. This is the nature of today’s life.
Think of the time your child comes home from school at the end of the term with a report form. He or she tells you that he or she was number ten in this term’s examination. Unlike our forefathers, we are not satisfied by the mere position that our child obtained in school. We also need to know the average mark that the child obtained, so that we can judge the actual academic strength of our child. We then consider the average, or the mean grade. This is when we are able to make a preliminary rating of our child’s ability. There are very few of us who have not done this, to ascertain whether our child is good material for admission to secondary schools, universities or other institutions of higher learning. This is one of the commonest and simplest uses of Statistics, which we shall be considering soon in this module. The average mark is called the mean grade or mean mark! Do you see how close Statistics is to our daily lives? Then it pays to learn statistics!
For a summary of how to begin reading this exciting subject, learners are referred to George K. King’oriah’s Fundamentals of Applied Statistics (2004, Jomo Kenyatta Foundation, Nairobi), Chapter One.
These days, statistical calculations are required in almost all human activities and
all academic disciplines. Advances in science and technology in all fields of study, all
production areas, and all areas of human endeavor have necessitated the use of statistics.
Some phenomena are easily quantified and can be easily measured. Other natural
phenomena are not that easy to measure. In all these cases, we resort to the use of
statistics so that we can know how to handle each case we come across. In this module
we shall learn how to analyze facts and figures, and to summarize many diffuse data
types using the tools that are available within the discipline of statistics.
We may therefore define statistics as a body of knowledge in the realm of applied mathematics, with its own ethics, terminology, content, theorems and techniques. Through rigorous study of this discipline we seek to master these theorems and analytical techniques and to make them a toolkit which we can use as second nature, in order to understand the universe around us and to make sure that we are not duped by con-people who push facts and figures at us, hoping we will believe them so that they can gain emotionally, politically, spiritually or materially from our inability to digest the data.
In this way we view the discipline of statistics as a tool for helping us in our daily
needs of data analysis and understanding. It may not be necessary to use the tool always,
but if it is available we can use it any time we need it. It is better to have it on stand-by,
than to miss it altogether. This way we become flexible in the understanding of our
universe.
Origins of Statistics
1. Statistics has many origins, but there are two chief sources of the discipline. One of the earliest sources of statistics is the Book of Numbers in the Holy Bible, when it was necessary to know how many of the children of Israel were proceeding to the Promised Land, which flowed with milk and honey.
Activity
Learners are requested to appreciate that the numbers given in the Book of Numbers are the facts and figures needed for governing any community or any state. This is where the name Statistics came from. Please read the Book of Numbers in the Old Testament of the Holy Bible to appreciate this.
One of the other most copiously recorded ancient events happened when Jesus Christ was born. Remember how the need for data within the ancient Roman Empire caused Caesar Augustus to order an enumeration of all the people in the Empire? When it reached Palestine it affected the Holy Family, who had to travel to Bethlehem so that the population count would be accurate for the purposes of the emperor. People had to be counted in their district of origin, not where they worked and lived. Upon this event, Jesus was born in Bethlehem.
The term “Statistics”
1. Heads of state and other officials of government need to know the characteristics of the populations they are serving, such as birth rates, death rates, numbers of livestock per person, the crop yields of farms in the areas they administer, and so on. These are the facts and figures of state - hence the name Statistics.
2. The second major source of statistics lay in the desire of kings and noblemen in Europe during classical and Baroque times (1600 - 1900 A.D.) to know their chances of winning in their gambling endeavors. This led to the development of probability theory and the study of games of chance, which is very much in use within the discipline of statistics today.
3. The word Statistics is also used to describe the subject matter of the discipline in which facts and figures and probability theory are used to collect, analyze and present data in a rigorous manner, so that we can be assisted to make informed decisions. Anybody studying Statistics today is interested in this meaning of the word. The techniques, terminology and methodology of applying facts and figures which are encountered in daily life - especially as a result of research or scientific investigation - are the subject matter of the discipline of statistics. In this regard, the uses and applications of the various algorithms which mankind has inherited from the study of mathematics form the discipline of Statistics.
Science and Scientific Method
During the age of discovery (1300 - 1800) people used to travel out of Europe, come back, and give accounts of the wonderful things they had seen and the wonderful events they had witnessed during their colorful travels. For a satirical example of this activity, we need only look at novels like Jonathan Swift’s Gulliver’s Travels (see any encyclopedia for this reference), and note how mysteriously the strange lands are described in that book. Reports are given of strangely gigantic races and dwarfish races. Scholars in Europe got fed up with con-people who, after taking a ship overseas, came back to report such fantastic events. (Jonathan Swift was not one of these.)
Sir Francis Bacon (1561 - 1626), among many other endeavors and achievements, pioneered the development of a method to avoid academic and factual conmanship, which he called the Novum Organum or “the new instrument of knowledge”. The essence of this method was to prescribe what must be done to ensure that what is reported by the traveler or the investigator is true, replicable, and verifiable. According to him, experiments or methods of study must be carefully outlined or defined. The expected results must be carefully stated. The way of ascertaining whether the expected results are correct must be carefully stated even before going to the field to collect data or attempting the required exploratory feat. The method of study or exploration must be replicable by all other persons in the same discipline of study. It is only after these rigorous steps that the person proposing the study may begin that study. Ethics dictate that he must explain why and when he fails to achieve his objective, and why he knows he has failed. If he succeeds, he must state succinctly why he has succeeded. This is the process which has been used to pioneer all the discoveries and inventions of mankind since that time (including computers and the marvels of the internet). The methodology involved is the so-called Scientific Method of investigation; and the discipline of Statistics is one of the tools used to ensure the success and rigor of this method.
Science is therefore a systematic discipline of studying the natural world around us. It is not merely looking through sophisticated equipment, or reading subjects like physics, chemistry and botany, which comprises science: science is a discipline of study or investigation. Later we shall learn how to set up study hypotheses and how to test them using the scientific method of investigation.
Biostatistics
There are physical and social sciences. Social sciences are involved in the investigation of human behavior and human activities using statistical and other investigative models. Economics, for example, is a social science. Physical sciences, on the other hand, are amenable to experimentation because the variables under investigation can be accurately measured and controlled. Biostatistics is a sub-discipline of statistics adapted to analyze the phenomena which occur in our environmental milieu and manifest themselves as biological substances or creatures. Their behavior, and the qualitative and quantitative data that result from their individual or collective manifestation, are amenable to study and investigation using statistical tools. These tools are what we call biostatistics. There is no great difference between biostatistics and other types of statistics. The only difference is in the specialized application of the broad statistical methodology to the investigation of those phenomena which arise in the biological environment.
Activity
Learners are requested to write a short essay on what science is; and on the
differences between physical and social sciences. Consult all the reference material that
may be available in your University library, especially books on Research Methodology.
Functions of Statistics
1. Statistics is not a method by which one can prove almost anything one wants to prove. Carefully laid rules of interpreting facts and figures help to limit the prevalence of this abuse in data analysis and data presentation. Although there is nothing that can completely prevent the abuse, statisticians are instructed against intellectual dishonesty and are cautioned against possible misuse of this tool. During the teaching of Statistics, ethical behavior is assiduously encouraged.
2. Statistics is not simply a collection of facts and figures. There would be no point in studying the subject if this were so, since such a collection would have no meaning once it has been accomplished.
3. Statistics is not a substitute for abstract or theoretical thinking. Theory cannot be replaced by statistics, but it is usually supplemented by statistics through accurate data collection, data analysis and data presentation.
4. Statistics does not seek to ignore the existence of odd or exceptional cases. It seeks to reveal such cases, and to facilitate their careful examination within the scope of the research at hand. In some instances it is through the investigation of the odd, the peculiar or the residual cases that we have come to understand the world better. Sir Alexander Fleming (1881 - 1955) discovered penicillin in 1928 because he became interested in the strange behavior and death of the bacteria he was culturing after mould accidentally contaminated his petri dishes. He had not set out to discover penicillin; he was doing other things. The world has been saved from millions of bacterial infections as a result of his investigating this strange, fortuitous event which took place in his bacteria-containing petri dishes.
5. Statistics is not a “fix-all” toolkit for every type of scientific investigation. It is one of the many tools available to the scientist for data analysis and problem solving. If statistics is not required, it is obviously foolish to force statistical analysis when other tools, like deductive and inductive reasoning, cartography, calculus, or historical analysis, will do. Likewise, one does not use advanced statistical techniques, simply to impress one’s peers, when simple ones are sufficient for the job. Always use the most effective and simplest statistical tool available, to avoid making a fool of yourself among other scholars. That simple and humble habit will also save time and money in any research environment.
Types of Statistics
Descriptive Statistics
Descriptive statistics are used to summarize and organize information so that the researcher or the investigator can draw conclusions from it. Information collected from research is sometimes too voluminous to be meaningful on its own. This information needs to be sifted and arranged to a point where the researcher or the investigator can see the pattern of trends and what lies within it. Information is sometimes summarized, but most often it is arranged in a comprehensive manner, in accordance with the subject matter of the research or the investigation in question. This arrangement involves computing such measures as means, ratios and standard deviations (to be explained later), among others. By so doing, we reduce massive amounts of data to manageable and understandable proportions. Note carefully that this may not necessarily involve summarizing the data collected. Where it does, we must be very careful that we do not lose any important data; if we do, our conclusions could be misleading.
Cautious interpretation is therefore necessary. In the event that some data must be omitted, the limitations of such omission must be carefully outlined, and the reasons why the omission was made must be exhaustively justified.
Inferential Statistics
This type of statistics is sometimes called inductive statistics, because the researcher induces (generalizes) conclusions about a whole population from the behavior of sample data after its analysis. Sometimes it is necessary to generalize on the basis of limited information because of difficult or constraining circumstances. Time and money may not allow exhaustive investigation of the subject matter and exhaustive data collection. The researcher therefore draws a sample representing a small proportion of the group under investigation, which he considers to have characteristics representative of the larger group. From this sample he can make inferences and deduce his conclusions. This is the most abused type of statistics, and the researcher or investigator must be careful not to misuse the information if he is interested in intellectual honesty and integrity.
Ethical requirements in biostatistics are as rigorous as in all other types of
statistics and other means of measurement, data analysis and compilation. All final
reports must be accurate to the best of the knowledge and ability of the investigator.
Basic common concepts in Statistics
Operational Definitions in Statistics
Population or the Universe: This is a term used by statisticians to refer to the totality of all the objects they are interested in studying at any one moment. For example, all students enrolled in the Biostatistics course at Kenya Methodist University in any one year (if this is the subject of our interest) comprise the population on which our study is centered. The word population generally means the actual number of people in the demographic sense. However, statisticians stretch the word to mean any group of objects, animate or inanimate, which is of interest to the investigator. The population is also called the universe, to connote the fact that it is all there is of the subjects having the characteristic of interest.
A sample: Sometimes it is not possible to study the whole population of the objects of our interest. It then becomes necessary to choose a few representative members of that population which can be expected to have adequate characteristics of the universe. Rigorous methods of sampling are used to ensure that all members are represented by the resulting sample, and that the sample has the representative characteristics of the entire population. Unless this is done, it is possible to come to erroneous conclusions through the use of an unrepresentative sample.
A variable: A variable is a phenomenon which changes with respect to the influence of other phenomena or objects. This changing aspect of the phenomenon is what makes it known as a variable, because it varies with respect to changes in other phenomena. There are two types of variables: dependent variables and independent variables. The former vary because of other phenomena; their magnitude depends on the magnitude or the presence of other variables - hence the term dependent variable. Independent variables are those phenomena which are capable of influencing other variables without themselves being influenced. They are outside the domain of the investigation, but they determine the result or the size of the dependent variables in the experiment or in the model. In the statement “the weight of an individual depends on the dietary habits of that individual” we have both a dependent and an independent variable. The weight of an individual is the dependent variable, while the dietary habits which influence that weight are the independent variable.
A model: This is a theoretical construction of all the expected interrelationships of the facts and figures which cause and affect any natural, physical or social phenomenon. Sometimes it is impossible to represent complex reality in a manner that can be easily understood and interpreted. A model is an attempt to construct a simplified explanation of that complex reality in a manner which can be easily understood, and which can generally be said to hold at all times. Models are often used to describe phenomena in the social sciences because the interaction of variables is complex and it is impossible to run controlled experiments to determine the nature of that interaction. For example, the habits of any consumer population can only be modeled, because we cannot lock members of the population in a laboratory for a long time to study their consumer habits.
An Experiment: This term will no doubt appeal to students of biostatistics. An experiment is an attempt to observe scientifically the physical characteristics of natural variables, and their interaction, in a laboratory or within some controlled environment. In so doing, it is possible to set up alternative but equally representative groups which are denied the influence of the independent variable, to ensure that the independent variable within the experiment really is the cause of the change in the variable under investigation. An experiment is possible in all physical and biological sciences, and students of biostatistics have no doubt performed some experiment or another during their academic careers. An experiment is more exact than a model. The standard of measurement can be made very precise, and the results are precisely determined within the experiment using mathematical tools.
A measurement is a precise quantification of phenomena, determined using some instrument or some precise method of observation. To measure is to quantify, or to determine precisely, the magnitude of any kind of variable. A measure is a quantity so determined.
An Index is an imprecise measurement which is only capable of indicating a trend. Indexes or indices are designed only where the procedure of measurement is able to yield an imperfect or an imprecise indicator. Examples of this are the consumer price index in economics, or the discomfort index in climatology and weather forecasting. An index is only an imperfect indicator of some underlying concept or variable which is not directly measurable.
Validity of an Index
This refers to the appropriateness of an index in measuring some underlying
concept. A valid index should be appropriate for such measurement, ideally, despite any
operational procedure used. Any change in operation in the procedure which changes the
index implies invalidity of the index. In practice it is difficult to find indices obeying this
rule precisely. This accounts for the many disagreements regarding the use and the results
of most indices.
A hypothesis is an educated guess about the nature of the relationship between variables, which is capable of being tested during the process of modeling or experimentation. We shall consider hypotheses later in the course of this module. In order for any hypothesis to be testable, valid measurements or indices must be used. Only operationally defined concepts make a hypothesis testable. Concepts which are not operationally defined, if they appear in any hypothesis, lead to endless debate as to their validity, and had better be avoided.
Summation signs
One of the most confusing features of statistics to a beginner is the summation sign. We are all used to individual counts of phenomena, but these compact instructions present some difficulty, because it is not every day that we are instructed in mathematical symbolism to sum quantities resulting from many different observations of different aspects of one phenomenon. Learners are requested to pay close attention and to learn the nature of these instructions. In fact I do not mind if a whole day is taken to learn this section, because then we guarantee ourselves an enjoyable time of tactical use of these instructions. Learners should therefore take their time, because this is the only hurdle we have to clear in the use of these symbols. For more work on these instructions refer to King’oriah (Jomo Kenyatta Foundation, 2004), and/or any other text on Statistics.
The Greek letter Sigma, written as “Σ”, is used to herald these instructions. Using this convenient symbol we are instructed to add all that is listed in front of it. For example, the instruction $\sum x_i$ means you add all observations of the characteristic “$i$”. In this regard, the subscript “$i$” is only a description of the nature of the objects to be added. This means we must add all the $x_1, x_2, \ldots, x_n$ observations in the subject population. The instruction $\sum_{i=1}^{n} x_i$ means we add all observations from the first observation to the last observation, which is the nth observation. In actual arithmetic terms it implies:

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \ldots + x_n.$$

From the above expression, we find that the summation sign “Σ” is surrounded by a curious arrangement: “$\sum_{i=1}^{n} x_i$”. At the bottom there is “$i = 1$”, and at the top there is “$n$”. The former instructs the analyst where to start the summation, and the latter where to end it. These two are called the indices of summation: “$i = 1$” is the running index, and “$n$” is the end of summation. Obviously $x_i$ is the variable designation of what is being summed.
Summation signs follow the simple operations of algebra. This means that in dealing with each expression we must follow all the rules of algebra, and we must remember the BODMAS rule when we encounter complicated expressions. Thus, in manipulating summation instructions we deal with all the expressions in the Brackets first, followed by the expressions indicated by “Of” (orders, such as powers), then Division, Multiplication, Addition and finally Subtraction - hence BODMAS.
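For learners who also use a computer, the following is a minimal sketch (not part of the module) of how the instruction $\sum_{i=1}^{n} x_i$ translates into ordinary arithmetic; the observation values are invented purely for illustration.

```python
# Minimal sketch: the summation instruction sum_{i=1}^{n} x_i in code.
# The observations below are invented for illustration only.
x = [4, 7, 2, 9, 5]   # x_1, x_2, ..., x_n with n = 5

total = 0
for xi in x:          # start at the first observation (i = 1) ...
    total += xi       # ... and keep adding until the nth observation

print(total)          # 27
print(sum(x))         # Python's built-in sum() gives the same result
```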
Activity
1. Solve the following expressions:

(a) $\sum_{i=1}^{n} (x_i + y_i)$

(b) $\sum_{i=1}^{n} x_i y_i$

(c) $\sum_{i=1}^{n} (k x_i)^2$

All these are explained in King’oriah (2004), Jomo Kenyatta Foundation, Nairobi, Chapter Two.
2. After a busy working day, five orange vendors were observed, special regard being paid to the number of oranges remaining unsold at the end of the day. The following table illustrates the results:

(a) Sum all the unsold oranges.
(b) Sum the squares of the unsold oranges.
(c) Find the square of the total number of unsold oranges.
(d) Explain the difference, if any, between the sum of the squares and the square of the total. Is there any difference?
Types of Statistical Data
All data measurements can be classified into the following categories:
(a) Nominal data
(b) Ordinal data
(c) Interval data
(d) Ratio data
These terms denote the nature of the data and the measurement level at which such data has been acquired.
Nominal Data
This is the weakest level of measurement. It entails classifying data qualitatively by name - hence the term “nominal”. For example, data may be labeled into the two categories “men” and “women”; these two categories can be known only by name. Meat can be classified as “fresh” and “stale”. Names like Caroline, Chege, Acheampong, Patel, Kadija, and so on, are classifications on the nominal scale. If you classify cats as “black” and “white”, you are measuring them using the nominal scale.
Analysis and manipulation of this data requires those statistical techniques which can handle names and nominal data. The Chi-Square statistic in Chapter Five of this module is one of the few that are available for this work.
Ordinal Data
This is the kind of data which is categorized using qualities that can be differentiated by size. In other words, the data is transitive, that is, it has magnitude and direction. Thus, data classified as big, bigger, biggest, or large, larger, largest, and similar qualities, is data which has been acquired and arranged in an ordinal manner. It is ordinal data, and the level of measurement for such data is ordinal. The Chi-Square statistic and Analysis of Variance (to be learned later), together with any other measures and statistics that can handle this type of data, are used to manipulate and make deductions from this kind of data.
Interval Data
This is data acquired through a process of measurement where equal measuring units are employed. Such data has magnitude and direction (it is transitive), and the size of the interval between each observation and the one above it is the same for all observations. This data therefore contains all the characteristics of nominal data and ordinal data. In addition, the scale of measurement moves uniformly in equal intervals up and down the respective sizes of the data - hence the name “interval” data. The only weakness with this kind of data is that the position of zero is not clear, unless it can be assumed. Thus data like 2001, 2002, 2003 and so on is interval data; the zero year can then be assumed to be 2001. Data like temperature readings have an absolute zero that is so far removed that it is not practical to find it and use it in everyday data manipulation. The same applies to time in hours or even in minutes, and so on. The statistics used for analysis are such measures as analysis of variance and regression-correlation. However, ratios are difficult to compute.
Ratio Data
This is the highest level of measurement, with transitivity (magnitude and direction), equal-interval qualities, and a zero that can be identified and used conveniently. It is possible to perform all mathematical manipulations on this data, whereas with the other types such manipulation is not possible because of the lack of a zero level. Division and computation of ratios between one group of observations and another are possible - hence the use of the word ratio. All the known statistical techniques are useful with this kind of data. This is the kind of data most people can handle with ease, because the observations are countable and divisible.
Statistical Generalization
After the above considerations we now come to the other meaning of the word “statistic”, which is a quantity computed from a sample and obtained after the manipulation of data using any known method of analysis. The characteristics of a sample are called statistics, while those of a population or a universe are called parameters.
The concept of Probability
Probability is used in statistics to measure how likely events are to occur. Many times, events of interest to ordinary people and researchers are observed to happen; but due to the nature of the circumstances surrounding their occurrence, one is not sure whether these events will be repeated. Over the years, since the gambling exercises of the Baroque era, mathematicians have developed methods of assessing the chance that any single event will happen. The probability that any event A will happen is computed as the ratio of the total number of successes (or favorable outcomes) to the number of all the observations in that population. Thus:

$$P(A) = \frac{r}{n},$$

where r = the number of favorable outcomes and n = the total number of trials.
Favorable outcomes need not be the most pleasant outcomes at all times. The term may be used to describe even highly unpleasant affairs. Consequently it is used to denote all the events which are of interest to the observer, however pleasant or unpleasant they may be. A farmer who is interested in the frequency of the failure of the short rains is dealing with an event that is very unpleasant for all farmers. However, since this is the event of his statistical interest, any failure of the short rains is a favorable event from his statistical point of view.
We shall deal with probability in greater detail in Chapter Two of this module.
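As a small, hedged illustration (the figures below are invented and not from the module), the rule $P(A) = r/n$ can be checked with a few lines of code:

```python
# Illustrative sketch of P(A) = r / n with invented figures.
favorable = 18   # r: e.g. seasons in which the short rains failed
trials = 60      # n: total number of seasons observed

p_a = favorable / trials
print(p_a)       # 0.3
```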
Activity
1. Compute the probability of obtaining four heads in six tosses of an evenly balanced coin.
2. Compute the probability of obtaining a six in a single toss of a fairly balanced die (plural: dice).
The Normal Curve
This concept will be discussed in greater detail in Chapter Three of this module. However, it is appropriate to introduce it here as one of the most frequently used probability models in statistics. We shall see later that the normal curve is a logical concept described by means of a precise mathematical formula. It is used frequently in statistics because many real-world variables, like weight, height, length, and so on, which occur naturally in populations of interest to statisticians, are distributed in a manner that approximates the shape of this curve. The normal curve is actually a representation of how frequently any real-life phenomenon occurs. It records graphically the frequency with which an observation can be expected to occur within a specific range of all the characteristics available either in a sample or in any given population.
Frequency distribution
For example, the heights of maize plants after three weeks of growth in a normal season with good rains can be recorded in terms of the number of plants that have grown to specific heights in meters or centimeters, as in the diagram below. If we count all the maize plants which achieve any specified height, and divide this count by the number of all the maize plants involved in our experiment - say 100 - we obtain the probability of finding maize in that category in that field.
Activity
1. Use Figure 1-1 and count all the maize plants which have attained a height of 0.30 meters after three weeks of growth in our hypothetical field. What is the number of these plants? This number is the frequency of occurrence of maize with this characteristic, or this quality of being 0.30 meters high.
2. What is the ratio of this number to the total number of plants available in our sample? This is the proportion of the sample with this characteristic relative to all the maize considered in our experiment.
3. Count how many plants are 0.30 meters and below in this array of data. Divide this number by the total number of maize plants in our sample. This result gives you the ratio, or the proportion, of all the maize which is 0.30 meters and below in the sample of our interest.
4. Plot a graph of the numbers of maize plants in each category of data over the whole range of measurement qualities in our sample. This should give you a curve of the distribution of all the characteristics of maize plants having different qualities in our sample. This graph is what is called a frequency distribution, which has the quality of our interest on the horizontal axis and the number of plants having each of these qualities on the vertical axis.
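Since Figure 1-1 is not reproduced here, the following is a hedged sketch of the same counting exercise using invented heights; substitute the counts you read off the figure.

```python
# Hedged sketch of the activity: frequencies and proportions from a sample.
# The heights (in meters) below are invented; Figure 1-1 has the real counts.
from collections import Counter

heights = [0.25, 0.30, 0.30, 0.35, 0.30, 0.40, 0.20, 0.30, 0.35, 0.25]

freq_030 = heights.count(0.30)                      # frequency of 0.30 m plants
prop_030 = freq_030 / len(heights)                  # proportion of the sample
prop_at_most_030 = sum(1 for h in heights if h <= 0.30) / len(heights)

frequency_distribution = Counter(heights)           # counts for every height

print(freq_030, prop_030, prop_at_most_030)
print(sorted(frequency_distribution.items()))
```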
The normal curve as a frequency distribution
We shall see later that the normal curve is a frequency distribution of
characteristics, which are supposed to be the same, but which differ slightly in
magnitude. The greatest number of the characteristics of all the observations is clustered
at the center of the distribution; and the smaller numbers are clustered towards the ends
of the distribution, just like our maize plants in this example illustrated in Figure 1 - 1.
Figure 1 - 1
The numbers of maize plants counted for each height in meters, from
0.05 meters to 0.70 meters; at 0.03 meter intervals.
(Source: George K. King’oriah, op cit., 2004)
Factorials
We shall need to understand the meaning of factorials, especially when we will be
dealing with permutations and combinations. A factorial is any number n which is
multiplied by all other numbers less than it. The factorial is denoted by an exclamation
mark behind any number which is required to be manipulated this way. Thus we can have
24
such numbers as 8!, 20! , 3!, and any other number that we wish; and we can denote
this statement as n! The general expression for a factorial is :n!  n  n  1  n  2  .....  2  1.
In this case “ n ” is the number whose factorial we are looking for, and “ ! ” is the
denomination for factorial as we have seen above. As an example, we take
4! to be :
4!  4  3  3  2  1 . Do not bother to take factorials of large numbers because
this kind of instruction makes the answer explode so fast that most hand calculators
cannot handle numbers above 12! This makes it tricky to manipulate factorials. However,
if we remember that we can cancel our the expressions in any quotient then our life is
very much simplified; like in the following example :8!
8  7  6  5  4  3  2  1

5!
5  4  3  2  1
Here we can cancel the 5!  5  4  3  3  2  1 above and below within this
expression to be left with
8!
 8  7  6  336 . Not a big deal! Therefore, do all
5!
the necessary cancellation and then multiply what is left of the factorial quotient.
Addition and subtraction is rarely required in our course.
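A minimal sketch (not part of the module) of the same factorial arithmetic, using Python's standard library:

```python
# Minimal sketch of factorials and of cancelling a factorial quotient.
import math

print(math.factorial(4))                         # 4! = 4 x 3 x 2 x 1 = 24

# 8! / 5!: after cancelling the common 5!, only 8 x 7 x 6 remains.
print(math.factorial(8) // math.factorial(5))    # 336
print(8 * 7 * 6)                                 # 336, the same answer
```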
Graphic Presentation of Data
After any field survey, data will need to be arranged and analyzed. This is because there are too many data entries for the observer to make any sense of the field collection. Graphic presentation of data can be resorted to so that the investigator can make sense of the data. This is a section of the broad area of statistics which can be called descriptive statistics. Most of us are familiar with data in newspapers, periodicals, government reports, etc., which is illustrated by means of colorful bar-graphs, pie charts, wind-roses, proportional balls, etc. For a closer look at the wide use of this technique learners can have a look at Arthur Robinson, Randall Sale, and Joel Morrison’s Elements of Cartography, which is cited at the end of this chapter, or any other text. Figure 1-2 is an example of two cases where graphic presentation of data has been found necessary. There are many others illustrated in the cited textbooks (King’oriah, 2004) and in other materials. Frequency distributions, cumulative frequency distributions, histograms, ogives, frequency polygons (and others) are good examples. These are easy to learn and draw, as long as you make sure you have a good textbook at your disposal. Readers are requested to read and learn from these examples, and then participate in the activity prescribed below.
Activity
Given the following data, comprising the height in millimeters of 105 maize
plants after two weeks of growth :-
129 148 139 141 150 148 138 141 140 146 153 141
148 138 145 141 141 142 141 141 143 140 138 138
145 141 142 131 142 141 140 143 144 145 134 139
148 137 146 121 148 136 141 140 147 146 144 142
136 137 140 143 148 140 136 146 143 143 145 142
138 148 143 144 139 141 143 137 144 144 146 143
158 149 136 148 134 138 145 144 139 138 143 141
145 141 139 140 140 142 133 139 149 139 142 145
132 146 140 140 132 145 145 142 149
Figure 1 - 2: An Example of a Histogram and a Pie Chart. The upper
part of the diagram is a histogram, and the lower one is a Pie Chart.
(Source: King’oriah 2004, and World Bank 1980.)
• Draw a table of the distribution of the different classes of heights of maize plants in millimeters after two weeks of growth.
• Identify the distribution of heights of these classes by grouping the data into 10 classes of equal ranges and equal class intervals. In this regard, you must draw a diagram of appropriately crossed tallies to indicate how you obtained the groups of your choice.
• Show the upper and the lower class limits of each data interval.
• Identify the class marks of each data group or data interval.
• Draw a histogram of the same data, a frequency polygon and an ogive.
All this work must show your clear understanding of the terms in italics. Learners are expected to consult the given textbooks (King’oriah, 2004) and to submit the work within the work period specified by your instructors. A computational sketch of the grouping exercise is given below.
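The following is a hedged sketch, in Python, of how such a grouping into 10 equal classes might be tallied; the short list of heights is illustrative, and the full 105 observations given above can be substituted for it.

```python
# Hedged sketch: group heights into 10 classes of equal width and tally them.
# Substitute the 105 observations from the activity for this short list.
heights = [129, 148, 139, 141, 150, 121, 158, 136, 145, 140, 133, 142]

num_classes = 10
lo, hi = min(heights), max(heights)
width = (hi - lo) / num_classes

for k in range(num_classes):
    lower = lo + k * width
    upper = lower + width
    if k < num_classes - 1:
        count = sum(1 for h in heights if lower <= h < upper)
    else:                       # let the last class include the maximum value
        count = sum(1 for h in heights if lower <= h <= upper)
    class_mark = (lower + upper) / 2        # midpoint of the class interval
    print(f"{lower:6.1f} - {upper:6.1f}   mark {class_mark:6.1f}   tally {count}")
```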
Measures of Central Tendency and Dispersion
Representative measures
The frequency distribution and its graphical relatives can provide considerable
insights about the nature of the distribution of any set of data, as you have discovered in
the above activity. One can quickly tell about the form and shape of any distribution. It is
also possible to represent this information through results obtained from computation of
numerical measures like averages and other quantities which measure the spread or the
clustering of data and observations. The first kind of these measures is called
representative measures. They are designed to represent all the observations in either a
sample or a population by attempting to locate the center (or the middle) value of any
given distribution. Another way of thinking about these representative measures is that
they attempt to measure or locate the central tendency of the characteristics of a given
range of observations. In that regard, they are called measures of central tendency. We
consider each one of these measures through the presentation given hereunder, beginning
with the arithmetic mean.
The Arithmetic mean
In everyday colloquial language this measure is known as the average. It is a very important measure in statistics because, arguably, the whole discipline of statistics is based on the manipulation of the mean in one way or another, as we shall see in this module. Statisticians represent this measure using the Greek letter “$\mu$” for the mean of the population, and the Roman capital letter “X” with a bar on top for the mean of a sample. The latter is usually called “X-bar” and is denoted as “$\bar{X}$”. Note carefully that the symbol is in capital-letter form, because the small “$x$” means something else: it actually means an error, or a deviation from a trend line or surface, as we shall see later when we consider the standard deviation and regression/correlation and related materials. Therefore learners must show the capital-letter form of this symbol in all their presentations, particularly in examinations, to avoid being ambiguous or being misunderstood by the examiners.
Nearly everybody knows how to compute the average. However, in the summation notation that we have just learned, when you compute the average (or mean) this is what you do:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} \qquad \text{for the population mean.}$$

Notice that we are using the capital “$N$” to indicate that we are considering the universe, or everything that we have to consider. The analogous notation for the sample mean is:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \qquad \text{for the sample mean.}$$

Here we are using the lower-case “$n$” to indicate that we are using only data sampled from the larger population whose number is “$N$”. In statistics we observe strict symbolic discipline, especially when presenting data in longhand writing, so that we may not be mistaken.
Let us try a simple example comprising five observations belonging to a sample, which is presented in the following manner:

$$X_1 = 10, \quad X_2 = 15, \quad X_3 = 6, \quad X_4 = 12, \quad X_5 = 11.$$

Now let us look for the sample mean $\bar{X}$. Using our symbolism we see that:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\sum_{i=1}^{5} X_i}{5} = \frac{10 + 15 + 6 + 12 + 11}{5} = \frac{54}{5} = 10.8$$

You can do the same with a large sample of more than 30 observations, because this is where we begin assuming the universe, or a large collection of data, which we can classify as the population. For more general characteristics of the mean, see King’oriah (2004), or any other standard textbook on statistics, especially the ones given in the references at the end of this chapter.
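As a quick, hedged check of the worked example, here is the same computation in Python:

```python
# Check of the worked example: mean of the five sample observations.
observations = [10, 15, 6, 12, 11]

sample_mean = sum(observations) / len(observations)
print(sample_mean)    # 10.8
```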
The median
The median is an important measure in statistics, especially when the data is such that the mean cannot conveniently represent the given population. In some cases the end observations in the data array used to investigate the central tendency are so large that they tend to pull the value of the mean either way. Imagine a statistician who is looking for the mean rent of all dwellings in a certain neighborhood of your town or city. No doubt some houses in the locality are very expensive, and others are very cheap. The mean would be pulled upwards by the expensive houses, and would be a poor representative of the central tendency in this case. The median therefore comes to our rescue, because it identifies only the middle observation of the data array which is given. In so doing, the median is an indication of the rough position of the mean under ordinary circumstances, when the data is not being pulled upwards by expensive units. Now, given the same data as we had for the computation of the mean, we may wish to determine the median.
$$X_1 = 10, \quad X_2 = 15, \quad X_3 = 6, \quad X_4 = 12, \quad X_5 = 11.$$

The median is determined most easily if we arrange the data in an orderly manner, from the smallest observation to the largest, as follows:

$$6, \quad 10, \quad 11, \quad 12, \quad 15$$

In this case the median of the data, which we denote as “Md”, is the middle value of the array, which is clearly 11:

$$Md = 11.$$
You can try many more examples on your own, especially data with a large observation at one end which is not balanced by an equally large observation at the opposite end. Unlike the mean, the median is not easily manipulated mathematically. However, it is a useful measure for indicating the middle point of the range of any data. Generally, the mean gives more information because of the quality of the observations used in its computation: when the mean is computed, the entire range of data is included. Therefore the mean is a more representative measure than the median.
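A minimal sketch of the same ordering exercise in code (for an odd number of observations, as in the example):

```python
# Minimal sketch: the median of an odd-length array is its middle ordered value.
observations = [10, 15, 6, 12, 11]

ordered = sorted(observations)          # [6, 10, 11, 12, 15]
md = ordered[len(ordered) // 2]         # middle value; Md = 11
print(ordered, md)
```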
The Mode
The interpretation of the statistical mode is analogous to that given to the same word in the fashion industry. Any person dressed in the current style is said to be trendy, or “in mode”, meaning “fashionable”, or wearing what everybody would like to wear according to current fashion trends. The mode is the easiest of the measures of central tendency, because all that needs to be done to obtain it is to observe which observation repeats itself more times than all the other observations. Given the following array of simple data:

$$X_1 = 5, \quad X_2 = 6, \quad X_3 = 5, \quad X_4 = 7, \quad X_5 = 4,$$

the mode of this data is clearly 5, because that number is repeated twice. In view of this, we may expect that it is possible for some data to have no mode, in the case where we do not have any repeating observation. At the other extreme, data could be multi-modal, where more than two observations repeat themselves several times. Bi-modal data is that which has two figures repeating themselves. Therefore all we need to do is to observe the nature of the data array after we have arranged it from the smallest to the largest observation.
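A minimal sketch of picking out the mode (or modes, for multi-modal data) in code:

```python
# Minimal sketch: the mode is the most frequently repeated observation.
from collections import Counter

observations = [5, 6, 5, 7, 4]

counts = Counter(observations)
highest = max(counts.values())
modes = [value for value, c in counts.items() if c == highest]
print(modes)    # [5]; more than one entry would indicate multi-modal data
```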
The Proportion
In some areas of scientific observation it becomes necessary to deal with qualities rather than quantities. Imagine a biological experiment where color is involved. Clearly, you cannot measure color in numerical terms. Therefore all we can say is that a certain proportion of our data is of a certain color. The percentage and the ratio mean the same thing; they are simply arithmetic transformations of the same measure. Mathematically, the proportion is of the same family as the mean, and in statistics it is treated as a kind of mean. The proportion is the ratio of the number of observations with the desired characteristic to the total number of observations. For the population, the proportion is represented by means of the Greek letter “$\pi$”, and for the sample a lower-case or “small” “$p$” will do. In that case, to calculate the proportion we have the following procedure for all sample data:

$$p = \frac{\text{Number of observations in the category}}{\text{Sample size } n},$$

and for the population data we have:

$$\pi = \frac{\text{Number of observations in the category}}{\text{Population size } N}.$$

Learners must read more about these measures in the prescribed textbooks to be able to answer the set questions at the end of this chapter.
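A small, hedged illustration of the sample proportion $p$, using invented color data:

```python
# Illustrative sketch of the sample proportion p (the colour data is invented).
colours = ["red", "white", "red", "red", "white", "red"]

p_red = colours.count("red") / len(colours)
print(p_red)    # about 0.67, i.e. two-thirds of the sample is red
```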
Measures of Variability or Dispersion
The concept of variability is very important in statistics. For example, in
production management an area of major concern is the variability of the quality of the
product being produced. In biostatistics, one may wish to know the variability of a crucial
variable like the diameter of the stems of certain plants in the investigator’s interest. It is
in these instances and many more that we are interested in investigating how dispersed the data under our control is. Our academic and research colleagues, and even the lay observer, are usually interested in data dispersion in order to investigate aspects of uniformity as well. We begin with the simplest measure, the range.
The range
The range can be described as the difference in magnitude between the largest
value and the smallest value in any data array. It measures the total spread in the set of
data. It is computed by taking the largest observation and subtracting the value of the
smallest observation. Depending on our data analysis requirements, there are several
types of ranges. One may divide the data into quartiles. This means determining the end
of the lowest quarter, the half mark, the end of the third quarter and the beginning of the
fourth quarter. Each portion of the data is a quartile. The first quartile runs roughly from 0% to 25%, the second quartile from 26% to 50%, the third quartile from 51% to 75%, and the final quartile of the data from 76% to 100%. When data is arranged this way we may be interested in the inter-quartile range, which describes the data between 26% and 75%, in an attempt to determine how dispersed or how centralized the data is. Learners should know how to compute this range. Generally we subtract the first quartile from the third quartile to obtain the following general formula for the inter-quartile range :-

Inter-quartile range = Q3 - Q1
This measure considers the spread within the middle 50% of the data (or the mid-spread)
and therefore is not influenced by extreme values. Using an ordered array of data the
inter-quartile range is computed in the following manner. Given the following hypothetical data, which is ordered as in the following array :-

-6.1, -2.8, -1.2, -0.7, 4.3, 4.5, 5.9, 6.5, 7.6, 8.3, 9.6, 9.8, 12.9, 13.1, 18.5

to find the inter-quartile range we find the mark of the first quartile and the third quartile. In this case, the first quartile mark is -0.7, and the third quartile mark is 9.8. These can be designated as :-

Q1 = -0.7  and  Q3 = 9.8

Therefore, the inter-quartile range is computed as follows :-

Inter-quartile range = 9.8 - (-0.7) = 10.5
This interval, usually called the middle fifty (%) examines the nature of the
middle data and induces various conclusions about the spread of the data. If this data
were representing some rate of return from a bond, the larger the spread the greater the
risk; because the bond can fluctuate wildly during the trading period and cause losses of
money. Small inter-quartile ranges measure consistency. This implies the consistent
maintenance of the value of the bond, and therefore makes the bond more attractive. This
means that there are no extreme outliers influencing the value of the bond from either the lowest side or the highest side. In modern computer-assisted packages the inter-quartile range is always included as a measure to show the strength of the mean as a measure of central tendency. A small inter-quartile range is always an indicator that data is strongly centralized and closely clustered around the population mean.
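The inter-quartile range of the hypothetical array above can be verified with a few lines of code. This sketch of ours uses the simple textbook positions (n + 1)/4 and 3(n + 1)/4 for the quartile marks; statistical packages may interpolate slightly differently.

data = [-6.1, -2.8, -1.2, -0.7, 4.3, 4.5, 5.9, 6.5, 7.6, 8.3, 9.6, 9.8, 12.9, 13.1, 18.5]
ordered = sorted(data)
n = len(ordered)
q1 = ordered[(n + 1) // 4 - 1]        # 4th ordered value = -0.7
q3 = ordered[3 * (n + 1) // 4 - 1]    # 12th ordered value = 9.8
print(q3 - q1)                        # inter-quartile range = 10.5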
The Variance and Standard Deviation
Mean Absolute Deviation
The range does not tell us very much about the characteristics of observations
lying within the interval which contains the smallest and the largest observation.
However, given the characteristics of the observations at the exact center of the data
distribution, and the range defining the extent of both ends, an observer should be more
enlightened with respect to the nature of the subject data. We have already seen the use of
inter-quartile range in measuring the total spread in the set of data around the mean value
of the data. More information is required regarding how varied the observations are compared to the mean.
Historically, statisticians tried to look at the deviations of the data and to find the mean average deviation from the mean. As soon as they did this they found that the total deviation above the mean is equal to the total deviation below the mean. That is, the negative deviations equal the positive ones, so that any quest for an average deviation results in a zero value unless we deal with absolute figures. This led to the use of absolute values (which are neutralized in such a manner that the negative and the positive signs are not considered during addition), and hence to a measure which is called Mean
Absolute Deviation (MAD). Although this measure will tell us something similar to the
information obtained from the use of inter-quartile range, it is not very useful for
advanced statistical analysis. It is also a weak measure, and one cannot deduce very much
from its use. See King’oriah (2004). Consider the following data on pineapple sizes from
a 6 hectare plot.
TABLE 1 - 1: AN ARRAY OF PINEAPPLE SIZES PICKED FROM DIFFERENT PARTS OF A
6-HECTARE PLOT (Source: King'oriah, 2004)

Pineapple    Pineapple        Mean            Deviation        Absolute Deviation
Number       Diameter (cm)    Diameter X̄      (Xi - X̄)         |Xi - X̄|
1            13                                 0.14              0.14
2            12                                -0.86              0.86
3            15                                 2.14              2.14
4            11                                -1.86              1.86
5            14                                 1.14              1.14
6             9                                -3.86              3.86
7            16                                 3.14              3.14
Total        90               12.86 cm        Not meaningful     13.14

Σ |Xi - X̄| (i = 1 to 7) = 13.14
In this table we are able to obtain the sum of the absolute deviations, which is 13.14. The mean absolute deviation is therefore :-

MAD = [ Σ |Xi - X̄| (i = 1 to 7) ] / 7 = 13.14 / 7 ≈ 1.88
Big values of MAD imply that the data is widely dispersed about the mean value, and small values of MAD mean that the variability of the data about the mean value is small. The farmer of these pineapples could perhaps compare the MAD of his sample pineapples with that of other farmers to gauge the consistency of the method of pineapple breeding that is practiced on his farm. A large MAD implies great variation in quality within a farm, whereas a small MAD implies consistency in the quality and size of the pineapples from that farm.
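The MAD of the pineapple data can be re-computed with a short sketch of ours; the figures agree with Table 1 - 1 (apart from the rounding of the mean):

diameters = [13, 12, 15, 11, 14, 9, 16]               # pineapple diameters in cm
mean = sum(diameters) / len(diameters)                # 90 / 7 = 12.857...
abs_deviations = [abs(x - mean) for x in diameters]
mad = sum(abs_deviations) / len(diameters)
print(round(sum(abs_deviations), 2), round(mad, 2))   # 13.14 and 1.88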
Mean Squared Deviation (Variance) “σ²”
As we have seen, dealing with absolute deviations is not easy in mathematics. However, if we square the deviations we eliminate the signs, and specifically the negative signs. The sum of squared deviations, when divided by the number of observations, gives the Mean Squared Deviation, or the Variance - denoted as “σ²”.
When expressed this way, the symbolism denotes the population variance.
However, the sample variance is expressed as
“s²”. This is not surprising, since “σ” is the Greek form of s. Statisticians chose the Greek version of “s” for the population, just as they chose the Greek letter μ (the Greek version of M) for the population mean; the Roman “s²” therefore took the denomination of the sample. We can now use our example to demonstrate the computation of the mean squared deviation, using the same pineapple width data but in a modified Table 1 - 2. The algorithm for the computation of the variance is :-
Variance (s²) = [ Σ (Xi - X̄)² (i = 1 to 7) ] / 7 = 34.8572 / 7 ≈ 4.98
In this connection, the farmer can begin telling his friends that the variance of his pineapples is a certain tangible figure, like the one we found using the steps outlined in Table 1 - 2, that is :-

Variance (s²) ≈ 4.98.
TABLE 1 - 2: STEPS IN THE COMPUTATION OF VARIANCE
(Source: King'oriah 2004)

Pineapple        Mean            Deviation        Squared Deviation
Diameter (cm)    Diameter X̄      (Xi - X̄)         (Xi - X̄)²
13                                 0.14              0.0196
12                                -0.86              0.7396
15                                 2.14              4.5796
11                                -1.86              3.4596
14                                 1.14              1.2996
 9                                -3.86             14.8996
16                                 3.14              9.8596
TOTALS 90        12.86 cm        Not meaningful     34.8572
But then he would be talking of measures given in terms of Centimeters Squared! This is
the main weakness of using variance, because no one would like to present his figures in
squared values. Picture somebody saying “goats Squared”, “Meters Squared”,
“Kilograms squared”, etc.
The Standard Deviation
Since the use of the variance arose from the need to shed off the signs (especially the negative signs) from the deviations, statisticians found themselves in a very lucky position here, because merely taking the square root of the variance gives a result without any negative signs. The original signs which existed when looking for deviations no longer recur in this case. The resulting measure is called the Standard Deviation, and it has no negative sign attached to it. Its computation algorithm is represented in the following manner :-
Standard Deviation (s) = √{ [ Σ (Xi - X̄)² (i = 1 to 7) ] / 7 } = √( 34.8572 / 7 ) = √4.98 ≈ 2.23 cm.
This time the farmer can report to his compatriots a measure that is easily understandable.
Large standard deviations mean that data is more dispersed about its mean than data with small standard deviations. This time the measure is in actual units.
We have now computed the sample standard deviation. We need to know that the
computation of the population standard deviation is done in the same manner. The
representative symbol for the population standard deviation is the Greek letter “σ”, because it is the square root representation of the symbol “σ²”. We shall deal with the
computation of the sample standard deviation later, in greater and more rigorous detail;
especially when we shall be considering the degrees of freedom. For now this is sufficient
information to show the derivation of these important measures. The symbolism denoting
the population standard deviation is :-
Standard Deviation (σ) = √[ Σ (Xi - μ)² / N ], where the sum is taken over all N members of the population.
Obviously, the method of computation of the population standard deviation is identical to
the steps we have shown above. In the modern digital age the computer program gives
the standard deviation as a matter of course. The learner should therefore know that the
machine uses the same method as we have used above.
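The same steps can be reproduced in a few lines of Python (our illustration, not the machine's internal routine); the denominator here is the full number of observations, exactly as in the formula above:

import math

diameters = [13, 12, 15, 11, 14, 9, 16]
n = len(diameters)
mean = sum(diameters) / n
squared_deviations = [(x - mean) ** 2 for x in diameters]
variance = sum(squared_deviations) / n         # mean squared deviation
std_dev = math.sqrt(variance)                  # standard deviation, back in centimetres
print(round(variance, 2), round(std_dev, 2))   # approximately 4.98 and 2.23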
Degrees of Freedom
When using the sample standard deviation as compared to the whole population,
the sample standard deviation formula is not as straight forward as we have discussed
above. Statisticians always make the denominator of the sample standard deviation
smaller, by adjusting the usual formula (which we have derived above) by one degree of
freedom. If the population standard deviation is denoted by :-
σ = √[ Σ (Xi - μ)² / N ]
The sample standard deviation by convention is represented by :-
s = √[ Σ (Xi - X̄)² / (n - 1) ].
(This adjustment is also valid for the expression that is used to denote the variance,
before the radical is used over the formula to obtain the standard deviation.) The smaller denominator in the sample standard deviation expression is said to be adjusted by one degree of freedom. The sample denominator is made smaller than the population denominator by one (1.0) because a small sample tends to understate the spread of the whole population; dividing by (n - 1) inflates the result slightly and compensates for the few data involved in the computation of the sample standard deviation.
It is asserted that once the sample mean has been computed, one is free to choose values for all the observations except the last one: the final deviation is fixed by the requirement that all the deviations from the mean must sum to zero. Mathematicians say that in this manipulation the investigator loses one degree of freedom. It is this adjustment that is applied to the sample data, because we cannot claim that the sample satisfies the requirement of the large number of observations which are available in a population - hence the nature of the formula. (See King'oriah 2004.) We shall be dealing
with degrees of freedom as we proceed on with our work of considering other statistical
measures.
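The effect of the one lost degree of freedom can be seen directly by computing the two versions side by side. This sketch of ours uses Python's statistics module, in which pstdev divides by N and stdev divides by (n - 1):

import statistics

diameters = [13, 12, 15, 11, 14, 9, 16]
population_sd = statistics.pstdev(diameters)          # denominator N
sample_sd = statistics.stdev(diameters)               # denominator n - 1
print(round(population_sd, 2), round(sample_sd, 2))   # 2.23 and 2.41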
EXERCISES
1.
Outline four meanings of the word Statistics, differentiating each meaning from all
the others mentioned in your discussion.
2.
Explain the meanings of the following terms: Science, Scientific Method, and
Ethics in Research, Natural Sciences, Social Sciences, and Behavioral Sciences.
3.
Explain how it is possible to mis-represent the real world situation using facts and
figures, charts and diagrams (cheating through the use of statistics). In this regard,
what is the place of ethical behavior in Statistics, and how can you avoid this
temptation?
4.
Briefly describe the following terms :
Population, Independent variable, running index, nominal data, ordinal data, interval data, and ratio data.
5.
Explain what you understand by the word sampling. Discuss various methods that
are available for carrying out sampling exercises.
6.
The following figures represent the number of trips made by the lecturers within
the Faculty of Agriculture in your local university to inspect and supervise field
experiments done by the students of Agriculture.
29  6  10  11  13  13  16  11  23  19  19  17  21  39  25  22  9  16  18  13  18  3  9
(a)
Construct a frequency distribution with five classes for this data. Give the
relative frequencies , and construct a histogram from the frequency
distribution.
(b)
Compute the mean, the median, the range, and the variance for this data.
7.
If there are 13,270 female chickens and 12,914 male chickens in a chicken-rearing location, what is the proportion of either type of chicken to the total number of chickens in this area?
Read the prescribed textbooks and explain why the proportion is treated as some
kind of mean in statistical analysis.
8.
Explain what you understand by the term Degrees of freedom, and how this
quantity is used in the computation of standard deviations.
CHAPTER TWO
BASIC PROBABILITY THEORY
Introduction
We have already touched on the definition of statistics, and found how the tool of
statistics is used to determine the chances of occurrence of phenomena. We need to
mention here that there are two aspects of the concept of probability. There is the idea of a priori probability. This term refers to the quantification and determination of the probability of occurrence of an event from the circumstances of the given facts, without going to the trouble of rigorous testing. For example, if we are asked for the probability of obtaining heads in the process of tossing an evenly balanced coin, we may straight away say “ 50% ”. This is what happens every day. “What are the chances that it
straight away say “ 50% ”. This is what happens every day. “What are the chances that it
will rain today?” Kamau may ask Kaberia. Kaberia may answer, “Fifty-fifty.” Then
Kamau would know that he has a 50% chance to expect a rain shower during the day.
However, this is not technically accurate. Ideally, the concept of probability
applies for many trials, so that the trials are repeated more and more, and as the number
of trials gets bigger and bigger, the ratio of successful outcomes to the total number of
trials approaches the probability of interest. In this connection, we need to understand
two things. Firstly, probability is expressed as a fraction of 1.0; and it defines the fraction
out of 1.0 that a favorable event will occur. Secondly, the concept of probability is a
limiting concept. It is the limiting ratio of the relative frequency of favorable outcomes to
the total number of trials, as this number of trials approaches infinity. Thus, whereas we
can say that the probability P that an event A will occur is expressed as :-

P(A) = r/n,  where r is the number of favorable outcomes, n is the number of trials and A is the favorable event,

the most accurate method of presenting this statement is :-

P(A) = lim (as n → ∞) of r/n.
This statement means that the probability of an event A happening, is the limit - as the
number of trials approach infinity - of the ratio
r/n . This means that any ratio of this
kind can be regarded as a probability of the occurrence of any event A only if we can
be sure that after many trials (whose number approaches infinity) the ratio is what we
have cited as the probability ratio
r/n . The implication of this is that a trained
statistician cannot afford to be careless with the term probability. He/she must ascertain
that the ratio he gives is a true representation of the actual probability as the number of
trials approaches infinity.
In addition, there can never be less than zero (negative) probability. There can never be fewer than zero chances that an event will occur, because the count of favorable outcomes can never be negative. The measure of probability has two extremes. The first
extreme is when the number of successes is equal to the number of trials. Every time one
tries he succeeds in obtaining a favorable event. In that case, the ratio- which we call the
relative frequency is equal to 1/1 = 1.0. The outcome of the events is the same no matter
how many trials are done. The second extreme is when the number of successes is never
there, no matter how many trials are done. This time the relative frequency is zero. The
probability measure therefore ranges between zero and 1.0. Events which are more likely to occur have high probabilities, above 0.5, and those which are less likely to occur have low probabilities - between zero and 0.5. Obviously, Kaberia’s statement above is the
“fifty/fifty” level of probability of occurrence, which is 0.5.
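The limiting behaviour of the ratio r/n can be illustrated with a small simulation (our own illustration, purely for intuition): as the number of coin tosses grows, the relative frequency of heads settles near 0.5.

import random

random.seed(1)                                   # fixed seed so the illustration is reproducible
for n in (10, 100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)                          # the ratio r/n drifts towards 0.5 as n grows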
Some definitions
Any two events are said to be mutually exclusive if they cannot occur together
during the same trial or in the same experiment. For example, there are only two possible
outcomes in a coin toss - “heads or tails”. The nature of the experiment of flipping a coin
is such that a head and tail cannot occur together in any single toss of a coin. A head and
a tail are therefore mutually exclusive in this experiment.
Events are described as being collectively exhaustive if their total occurrence
comprises all possibilities in a situation or an experiment. In a deck of cards, for example,
all cards are either printed in a red suit or in a black suit. To describe a card drawn from a well-shuffled deck of cards as “ red suit - black suit - spade ” is collectively exhaustive, because one is certain that one of these events will definitely occur, no matter how one picks the cards: every card in the deck fits into at least one of the categories described by the statement (a spade, for instance, is also a black-suited card).
Any two events are said to be independent when the chances of one event
occurring are not influenced by the happening of the other event. When two people with
no influence on one another toss a coin the outcome of one coin toss is not affected by the
outcome of the second toss of the coin by another person.
When we consider any one experiment, either an event will occur or it will not
occur. The occurrence of an event is considered to be a success and the non-occurrence of
an event is considered to be a failure in statistical terms. The probability of success is
said to be a complement outcome to the probability of failure. Likewise, the probability of
failure is said to be a complement of the probability of success. The event and its
complement are collectively exhaustive. The probability of all of them is all that there is
of all this kind of event, and therefore the total equals 1.0.
Conditional probability is defined as the probability that any event will occur,
given that another event has occurred. For example, in a deck of 52 cards, there are 13
spades. If we consider the probability of obtaining a spade after shuffling the cards it is
no doubt 13/52 = 0.25. However, if we remove and set aside the spade after we have
obtained it, we have a different result, because the probability of obtaining a spade on
condition that another spade has been drawn is 12/51. Thus we can say :-
P(A) = P(spade) = 13/52 = 0.25.

Then, in symbolic terms, we can state the second condition, which we call the conditional probability - conditional on the happening of the first event - as follows :-

P(B|A) = P(spade | a spade has already been drawn) = 12/51 = 0.2353.

The vertical slash symbol “|”
indicates “on condition that”. Learners can therefore
read the last expression conveniently and compute conditional probabilities. If in any
difficulty, refer to the recommended texts (King’oriah, 2004).
Simple Rules of Probability
Ordinarily, the researcher does not compute probabilities himself. He usually
makes use of various probability tables and charts which have been ready-made by
statisticians over time. The basic mathematical rules of probabilities which are described
hereunder are just meant to give some insight into how probabilities were conceptualized
and developed; in order to give rise to the development of various kinds of tables and
distributions; which we shall consider later in this module, and which are given in the
recommended textbooks. The order of discussing these rules hereunder is not the kind
that we may expect in ordinary mathematics - beginning with addition, then subtraction
onwards, ending with other operative functions in mathematics. Our discussion order
begins with the multiplication rule onwards; because this is what is considered the easiest
method of learning these rules. Then all the rules are summarized in the expected order at
the end of the section.
The Multiplication rule
If A and B are independent events, the probability of joint occurrence of
these two events is equal to the product of the probability of the occurrence of each one
of them. Since flipping two coins are two independent events, the probability of obtaining
head on the first coin and then on the second coin is the product of the probability of
obtaining each of these events. Thus we can symbolically represent this phenomenon by
means of the following statements :-

P(H1 and H2) = P(H1) × P(H2) = 0.5 × 0.5 = 0.25
This rule can be extended to any number of events, as long as they are all independent of one another. Thus the probability of the joint occurrence of five heads in five coin tosses can be calculated in the following manner :-

P(H1 and H2 and H3 and H4 and H5) = P(H1) × P(H2) × P(H3) × P(H4) × P(H5)
                                  = 0.5 × 0.5 × 0.5 × 0.5 × 0.5 = 0.03125
Notice that the effect of multiplication reduces the size of the fraction. This is true in
mathematics and also true conceptually, because of the difficulty in having such a lucky
coincidence occurring in real life. Each event is difficult to achieve, and the succession of
all other events are equally difficult. Therefore, chances that one could be so lucky
become slimmer and slimmer as the multiplicity of the required events increases.
When the outcome of any event A affects the outcome of some other event B the
probability of joint occurrence of A and B depends on the conditional probability of the
occurrence of B, given the fact that A has already occurred. This means that :-

P(A and B) = P(A) × P(B|A)
For example, in a deck of cards one may consider the probability of drawing a spade on
two consecutive draws without replacing the first spade. This means that :-
P spade on two consecutive draws
 P  first draw  P second draw without replacing the first spade
Using our actual figures we have :P  first draw  
13
 0 . 25 :
52


P Second draw first draw 
Call this event A
12
 0 . 2353
51

Call this event B

Therefore P  A and B  P A  P B A  0. 25  0. 2353  0. 0058825
Here we see that the chances of the two events happening together have been
considerably reduced. This is because there are possibilities that we may miss each event,
or all of them completely.
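The same two-spade calculation can be checked exactly with fractions; the sketch below is ours and is not part of the prescribed texts.

from fractions import Fraction

p_first = Fraction(13, 52)                 # P(A): a spade on the first draw
p_second_given_first = Fraction(12, 51)    # P(B | A): a spade again, the first one not replaced
p_both = p_first * p_second_given_first    # P(A and B) = P(A) x P(B | A)
print(p_both, float(p_both))               # 1/17, approximately 0.0588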
Addition Rule
In situations where one is faced with alternative events, each one of which is a
success, the chances of success increase because the investigator is satisfied by any of the
successes. Therefore the number of ways in which he could be successful is the net sum
of all the ways in which each of the favorable outcomes may occur. The symbolism for
this process is as follows :-

P(A or B) = P(A) + P(B)
The equation is additive, meaning that success is recorded when A happens, and equal
success is recorded when B also happens. Addition rule of computing probabilities is
therefore applicable to alternative events of this kind.
When events are not mutually exclusive some areas intersect, and there is a chance
of both events occurring together. This is best illustrated using circles on a plane, which either stand apart or overlap, depending on whether there is interaction between the circles. Figure 1 - 3 is an illustration of these circles. This visual technique of showing interaction or non-interaction using circles uses Venn Diagrams, named after their inventor, John Venn, who lived between 1834 and 1923.
Figure 1 - 3: Venn Diagrams
In the upper part of the diagram, the two circles do not touch or overlap. This
means there is no interaction between the two phenomena represented by the two circles.
In the lower part of Figure 1 - 3, the circles overlap to create an area covered by both A
and B. This means that there is some interaction between the phenomena represented by
circle A and B; over the shaded area. In this case the addition rule applies; but then there
is the subtraction rule (the negative form of addition), so that we can be enabled to cut
off the effect of the interaction on the shaded area in order to obtain the probability of A
or B , without any interaction.
Consider a deck of well shuffled cards. There are four “ As ” in the deck. These
comprise A of spades, A of clubs, A of hearts and A of diamonds. Suppose we were
interested in finding the probability of drawing either an A or a heart in a single draw. Because the A of hearts belongs to both groups, we are faced with an additive situation in the following manner :-

P(A) = Probability of obtaining an A.
P(B) = Probability of obtaining a heart.
P(A and B) = Probability of obtaining the A of hearts, where there is interaction between the cards bearing the letter A and those having a heart face-mark.
There are four As in a deck of 52 cards, and thirteen hearts. Therefore :-

P(A) = 4/52,    P(B) = 13/52.

Then there is only one card, the A of hearts, that forms the area of interaction. In this regard the probability of the area of interaction is computed in the following manner :-

P(A and B) = 1/52
Thereafter, we subtract out our interaction space in this example, so that the overlap is not counted twice, and obtain the probability of getting event A or event B in the following manner :-
P(A or B) = P(A) + P(B) - P(A and B)
In numerical terms, the whole of this exercise becomes :-

P(A or B) = 4/52 + 13/52 - 1/52 = 0.0769 + 0.25 - 0.0192 = 0.3077.
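The A-or-heart calculation can likewise be verified exactly with a short sketch of ours:

from fractions import Fraction

p_a = Fraction(4, 52)          # P(A): an A
p_b = Fraction(13, 52)         # P(B): a heart
p_a_and_b = Fraction(1, 52)    # P(A and B): the A of hearts, which belongs to both groups
print(float(p_a + p_b - p_a_and_b))   # approximately 0.3077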
Now if we are interested in summarizing the rules that we have learned, and arranging them in the way arithmetic operation signs manifest themselves - ranging from addition, through subtraction, to multiplication, etc. - we find that we have four successive rules of probability for the time being.
1.    Addition rule for mutually exclusive events :-

P(A or B) = P(A) + P(B)

2.    Addition rule for events which have a chance of occurring together at least once, and therefore are not mutually exclusive :-

P(A or B) = P(A) + P(B) - P(A and B)

3.    Multiplication rule for dependent events, where the happening of one event influences the chances of the happening of subsequent events :-

P(A and B) = P(A) × P(B|A)

4.    Multiplication rule for independent events, where the happening of one event has no influence on the happening of all the subsequent events :-

P(A and B) = P(A) × P(B)
For more exhaustive elaboration of these rules of probability, learners are advised to
consult the prescribed textbooks. The rationale for the application of each of these rules
must be noted; together with the many examples illustrating the use of each of the rules.
It is only after doing this that the learners can successfully attempt the problems at the
end of the chapter (and any other set problems) for evaluation by instructors.
Counting Techniques
There is a reverse way of looking at probabilities. Instead of asking what probability there is that an isolated event may occur, one is interested in the number of alternative ways a set of events may turn out. Of course, if the total number n of alternative ways that an event can turn out is known, then the probability that one particular way turns out is 1/n.
Consider the twenty-six letters of the alphabet. In this regard, one may be interested in the alternative ways of arranging them. There are 26!, which is about 4 × 10^26, ways of arranging these letters. The order which we have memorized since we were small is just one of these. The probability that such an order is selected can be said to be 1/26!. Such problems and many more are the concern of the counting techniques of statistics.
One of the methods of approaching similar problems to the simpler types of the
example given is by means of decision trees. This involves dividing the problem into
many events (or stages) and drawing the various interconnections between them. Suppose, for example, that one car assembly line produces cars of three engine capacities: 1100cc, 1500cc and 2000cc; that for each engine capacity there are three color configurations - yellow, blue and cream; and that there are three body-build alternatives - Saloon, Pick-Up and Station Wagon. One may wonder what the probability is of obtaining one particular car, which is 1500cc, yellow in color and a saloon. This is a typical simple problem which can be solved by means of a decision tree, as shown in Figure 1 - 4.
Figure 1 - 4 : A decision Tree
Physically one can count the ultimate ends of the decision tree and find that there are 27
alternatives available. We may conclude that the probability of obtaining one particular path of alternatives, like 1500cc, then yellow, then saloon, is exactly 1/27, because there are these 27 equally likely alternatives.
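Instead of drawing the tree, the same 27 branches can be enumerated by the computer. The sketch below is an illustration of ours; it lists every engine-colour-body combination:

from itertools import product

engines = ["1100cc", "1500cc", "2000cc"]
colours = ["yellow", "blue", "cream"]
bodies = ["Saloon", "Pick-Up", "Station Wagon"]
cars = list(product(engines, colours, bodies))   # every branch of the decision tree
print(len(cars), 1 / len(cars))                  # 27 branches, each with probability 1/27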
Decision trees are convenient only for evaluating simple alternatives like the one given. The more stages of alternatives there are, the more impossible it becomes to keep drawing such family-tree diagrams, because they become prohibitively complicated. This is where alternative counting techniques are necessary. Using these counting techniques and appropriate computer programs is how we are able to obtain all the possible arrangements of the 26 letters of the alphabet. These are the methods that we briefly allude to in the discussion that follows hereunder. Learners are advised to read more within the prescribed textbooks.
Permutations
This is one of these techniques which is available and can save us from the pain of
drawing decision trees always. It attempts to answer similar problems which arise from the desire
to know how many arrangements of n-objects, r taken at a time, are possible . In permutations
the order of arrangement is important. Therefore we have the following formula for the algorithm
used to compute permutations :-
Prn 
n!
.
n  r !
The formula reads : The permutation of n objects, r taken at a time is equal to the ratio of n!
to the difference (n - r)! In that case,
n
is the total number of objects. The symbol
r
represents equal groups of objects that ate handled each time to effect the required arrangement.
P is the permutation or another name for numerous alternative arrangements.
Example
Find the number of all possible arrangements of five objects, taking three at a time, where
objects must be arranged in a sequential manner.
Solution
Here a specific arrangement is prescribed. This means that we can use only a permutation
computational algorithm to solve this problem. Therefore, we use the formula :-

P^n_r = n! / (n - r)!

and insert the figures which we are given in the problem. Accordingly, we have the following arrangement which includes the given figures :-

P^5_3 = 5! / (5 - 3)! = (5 × 4 × 3 × 2 × 1) / 2! = (5 × 4 × 3 × 2 × 1) / (2 × 1) = 5 × 4 × 3 = 60.

This means you can arrange five objects, taking three at a time, in 60 different ways.
Learners must try as many problems in this as possible to familiarize themselves in the
use of permutations. The probability of obtaining any of these possible arrangements is
1/60 = 0.016667.
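The permutation above can be confirmed directly in Python (version 3.8 or later is assumed for math.perm); the same figure is obtained by listing every ordered arrangement:

import math
from itertools import permutations

n, r = 5, 3
print(math.factorial(n) // math.factorial(n - r))   # 60, from n!/(n - r)!
print(math.perm(n, r))                              # 60, the built-in permutation count
print(len(list(permutations(range(n), r))))         # 60, by listing every ordered arrangement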
Combinations
We use combinations when the order of arrangements of objects is not necessary. In that
case, the formula for this type of computation is as follows :-
C^n_r = n! / (r!(n - r)!)

If we examine the formula, we note that it is similar to the one for computing permutations, but there is an r! in front of the bracketed expression in the denominator. The formula reads, “The combination of n objects, taking r objects at a time, is the ratio of n! to the product of r! and the difference (n - r)!”
Example
Find the number of all possible selections of five objects, taking two at a time, where the objects may be arranged in no specific order.
Solution
Here no specific arrangement is prescribed. This means that we can use a combination computational algorithm to solve this problem. Therefore, we use the formula C^n_r = n! / (r!(n - r)!). If we now insert the given figures into this formula we have :-

C^5_2 = 5! / (2!(5 - 2)!) = (5 × 4 × 3 × 2 × 1) / ((2 × 1)(3 × 2 × 1)) = (5 × 4) / 2 = 10
This means that there are ten ways of selecting two objects at a time from five objects when the order of selection does not matter. The probability of obtaining any one of these possible combinations is therefore 1/10 = 0.1.
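The corresponding check for combinations (again assuming Python 3.8 or later for math.comb) is ours and reads as follows:

import math
from itertools import combinations

n, r = 5, 2
print(math.comb(n, r))                        # 10, from n!/(r!(n - r)!)
print(len(list(combinations(range(n), r))))   # 10, by listing every unordered selection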
The Binomial Distribution
If a medical team wishes to know whether or not their new drug provides their
patients some relief from a certain disease, they need a computational technique to be able to
determine exactly the probability that this drug will work. If the probability distribution
of this kind of computation is known, then it would be much easier to make many
decisions in life which revolve around the two alternative options of success or failure.
The Binomial Distribution assists us to compute these probabilities. Before we consider
this probability distribution we need to consider and understand the concept of random
variables.
A random variable is a numerical quantity whose value is unknown to begin with, but whose value will arise, by chance, from among the many alternative options available. It is a variable whose value is determined by chance. It is a theoretical concept arising out of knowing that there could be many possible outcomes in a process. In a coin-tossing activity, we know that Heads could be one of the outcomes before the coin is tossed. In the same manner, Tails is another outcome that could happen. The outcome that eventually appears (for example, the number of heads obtained) is therefore a random variable.
The probability distribution of a random variable is an array of the probabilities that each of all the possible values of the random variable will occur. The random variable becomes an actual, observed value as a result of the actual occurrence of one of its possible outcomes. Consider tossing a coin five times. The number of heads that will come out (for example, three) is a random variable, because it has not actually happened. When the experiment actually takes place, the number of heads observed becomes an actual figure.
Random Variables, Probability Distribution and Binomial Distribution
Looking at the “ whether or not ” situation we note that each experiment, or each
phenomenon, has its probability distribution determined by its own circumstances. There
is no reason, for example, why 50% of all the patients who are subjected to a medical
experiment will respond to some drug of interest. The doctors in this case are faced with
two possibilities, involving success or failure of the drug. This is the same as saying, “Let
us see whether or not the drug will work.” This is a situation which makes this
experiment a binomial experiment. This is the kind of experiment which has two likely
outcomes, success or failure. There are still more conditions which make any
experimental process become a binomial process. These are the so-called Bernoulli
Conditions, named after Jacob Bernoulli, whose comprehensive account of the nature of this hit-or-miss process was published posthumously in 1713. For each Binomial experiment involving the
probability of success or failure, the Bernoulli conditions are as follows :-
1.
The experimenter must determine a fixed number of trials within which this
experiment must take place.
For example, in a series of coin tosses, where we are trying to see whether or not
we shall get Heads must be fixed. The number of trials n must be definitely
known.
2.
Each trial must be capable of only two possible outcomes - either success or
failure.
3.
Any one outcome must be independent of all the other trials before and after the
individual trial.
4.
The expected probability of success (π) must be constant from trial to trial. The probability of failure (1 - π) must also be constant from trial to trial.
5.
The experiment must be the result of a simple random sampling, giving each trial
an equal chance of either failure or success each time.
When trial outcomes are results of a Bernoulli Process, the number of successes becomes
a random variable when examined from a priori point of view - before the experiment has
been tried. When the experiment matures, and the results are known, then the variables
which were initially conceptualized as random variables become actual variables which
reflect the nature of actual outcomes.
Example
An evenly balanced coin is tossed five times. Find the probability of obtaining
exactly two heads.
Solution
In this problem we wish to learn systematically the method of handling binomial
experiments. We begin by examining the given facts using the Bernoulli
conditions which we have just learned.
Bernoulli Conditions for the experiment
We find that condition ( 1 ) is satisfied by the definite number of trials which has
been specified in the question. Condition ( 2 ) is also satisfied, because each trial
has exactly two outcomes, success or failure, heads or tails. Each coin toss is
independent from any other, previous or subsequent, toss, and this satisfies condition number ( 3 ). The nature of coin tossing experiments is such that each
toss is independent of any other toss, and the expected probability of success or
failure is constant from trial to trial - thus condition ( 4 ) is satisfied. The fifth
condition ( 5 ) is satisfied by the fact that the system of coin tossing is repeated
and random, and the results of the coin toss come out of a random “experimental”
process.
Once we are satisfied that the experiment has all the Bernoulli conditions correct, then we
must use the Binomial formula to compute the probability like the one we are looking for
in our example. We shall take the formula empirically and only satisfy our curiosity be
learning that this is what was developed by Bernoulli for solving this kind of problems.
The formula is as follows :-
P(R = r) = C^n_r × π^r × (1 - π)^(n-r) = [ n! / (r!(n - r)!) ] × π^r × (1 - π)^(n-r)
For the beginner, this formula looks formidable - the puzzle being how to memorize and
use it within an examination environment or in daily biometric life. On second thought
we see that it is really very simple. Let us take the first and the middle expression :-
P(R = r) = C^n_r × π^r × (1 - π)^(n-r)
We find that we have already learned the combination formula and that all the designated
variables can be interpreted as follows:-
1.    P is the obvious sign indicating that we need to look for the probability in this experimental situation.

2.    The bold R is the designation for a random variable, whose value is unknown now but which must be interpreted during an experimental process.

3.    The lower case r is the actual outcome of the random variable after experimentation.

4.    C^n_r is the combination formula which we have come across already when learning about combinations and permutations. This is the only material you need to memorize in this case, but remember it is :-

C^n_r = n! / (r!(n - r)!).

Also remember it is the permutation formula with an extra r! in the denominator to neutralize the need for the specific orders which are necessary in permutations. The permutation formula is P^n_r = n! / (n - r)!. This is easy, is it not?

5.    The variable π is the population, or the universal, or the expected probability of success, and (1 - π) is the complement of this, which is the probability of failure.

6.    The lower case n which appears in this expression as a superscript is the fixed number of trials. The lower case r which appears as a superscript (an exponent) on the right hand side of the equation is the actual number of successes, which is identical to what has been described in ( 3 ) above.
Study these six facts very carefully before leaving this section because they will help you
have some working knowledge of Binomial experiments which satisfy Bernoulli
Conditions. Note that Statistics as a subject is plagued with symbol difficulties because
the same formula can appear in different textbooks with slightly different looking symbol
configuration. While we recommend what we are using within this module because it is
universally used in most statistical textbooks in print, we need to indicate some other
types which you may come across during your studies.
(a)    The first form goes like this :-

P(x successes, n - x failures) = (n choose x) × π^x × (1 - π)^(n-x)

(b)    The other alternative form, still stating the same facts, looks like this :-

b(x; n, π) = (n choose x) × π^x × (1 - π)^(n-x)

This one emphasizes the fact that you are dealing with a binomial experiment, hence the “b” on the left hand side of the equals sign. All these carry instructions identical to P(R = r) = C^n_r × π^r × (1 - π)^(n-r).

(c)    Another one you might come across is :-

P(K) = (N choose K) × p^K × q^(n-k)
Usually when these expressions are presented, the author makes sure he/she has
defined the variables very carefully, and all the symbols have been explained. Pay
meticulous attention to the use of symbols in the textbook you will be using in this case
and in all the other instances of symbol use in Statistics. The results of using these
symbols will be identical.
Having understood all this, it is now time to work out our example of five coin
tosses which we started with. This is now a simple process, because it merely involves
inserting various given values in appropriate places and using your calculator and
BODMAS rule that we learned in our junior school times to obtain the solution.
P(R = r) = C^n_r × π^r × (1 - π)^(n-r)

         = [ 5! / (2!(5 - 2)!) ] × 0.5^2 × 0.5^(5-2)

         = [ 5! / (2! × 3!) ] × 0.5^2 × 0.5^3

         = [ (5 × 4 × 3!) / (2! × 3!) ] × 0.25 × 0.125

         = 10 × 0.25 × 0.125

         = 0.31250
Activity
Study the use of this method of working meticulously; using the interpretation we
have given to the formula in the foregoing discussion; before leaving this part of the
discussion. The material in pages 83 to 95 of the Fundamentals of Applied Statistics
(King’oriah, 2004, Jomo Kenyatta Foundation, Nairobi) will be very useful and easy for
the learner to supplement the information given in this module. A few easy, real-life-like
examples of using the Binomial formula are given. I do not mind even if you use a half a
day for the study of this one formula before going on. Then try to solve the problems at
the end of Chapter Four (King’oriah, 2004) Pages 93 to 95.
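As a further check on the arithmetic above, the binomial formula for exactly two heads in five tosses of a fair coin can be evaluated in one line of Python (our sketch):

from math import comb

n, r, pi = 5, 2, 0.5
print(comb(n, r) * pi**r * (1 - pi)**(n - r))   # 0.3125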
EXERCISES
1.
Explain the meaning of the following concepts used in Probability Theory :-
(a)
Mutually Exclusive Events
(b)
Independent Events
(c)
Collectively Exhaustive Events
(d)
A priori examination of an experiment
(e)
Union of Events
(f)
Intersection of Events
(g)
Success and failure in probability
(h)
Complementary Events
(i)
Conditional Probability
2.
Use the binomial formula to compute the probability of obtaining tails in the
process of tossing an evenly balanced coin eight times.
3.
(i)
Explain in detail the meaning of probability of an event.
( ii )
In a single toss of an honest die, calculate:
(a)
probability of getting a 4
(b)
probability of not getting a 4
(c)
probability of getting a e and a 4
(d)
probability of getting a 2 and a 5
( iii ) Explain in detail the meaning of the probability of an event.
4.
A club has 8 members.
(a)
How many different committees of 3 members each can be
formed from the club, with the realization that two committees
are different even when only one member is different?
(This means without concern for order of arrangement.)
(b)
How many committees of three members each can be formed
from the club if each committee is to have a president, a
treasurer and a secretary?
In each of the above two cases, give adequate reasons for your answer.
CHAPTER THREE
THE NORMAL CURVE AS A PROBABILITY DISTRIBUTION
Introduction
Any graph of any probability distribution is usually constructed in such a way as
to have all possible outcomes or characteristics on the horizontal axis; and the frequency
of occurrence of these characteristics on the vertical axis. The Normal curve is no
exception to this rule. In fact we can regard this distribution literally as the mother of all
distributions in Statistics. We shall introduce the normal curve systematically, using a
histogram which has been constructed out of the probabilities of the outcomes expected
from an experiment of tossing an evenly balanced coin five times. Here we are moving
slowly from the “known to the unknown”. We have just completed learning the Binomial
Distribution, and we shall use this to accomplish our present task. Study the Table 3 - 1.
TABLE 3 - 1: THE OUTCOME OF TOSSING A COIN FIVE TIMES

Possible Number     Probability that the actual Number of Heads ( r ) equals the Possible
of Heads ( R )      Number of Heads R, denoted as P(R = r) = C^n_r × π^r × (1 - π)^(n-r)

0                   [ 5! / (0!(5 - 0)!) ] × 0.5^0 × 0.5^5  =  0.03125
1                   [ 5! / (1!(5 - 1)!) ] × 0.5^1 × 0.5^4  =  0.15625
2                   [ 5! / (2!(5 - 2)!) ] × 0.5^2 × 0.5^3  =  0.31250
3                   [ 5! / (3!(5 - 3)!) ] × 0.5^3 × 0.5^2  =  0.31250
4                   [ 5! / (4!(5 - 4)!) ] × 0.5^4 × 0.5^1  =  0.15625
5                   [ 5! / (5!(5 - 5)!) ] × 0.5^5 × 0.5^0  =  0.03125

TOTAL PROBABILITY                                          =  1.0000
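The whole of Table 3 - 1 can be regenerated, and the total probability verified, with the short sketch below (ours, for checking purposes only):

from math import comb

n, pi = 5, 0.5
probabilities = {r: comb(n, r) * pi**r * (1 - pi)**(n - r) for r in range(n + 1)}
for r, p in probabilities.items():
    print(r, round(p, 5))            # 0.03125, 0.15625, 0.3125, 0.3125, 0.15625, 0.03125
print(sum(probabilities.values()))   # 1.0 - the outcomes are collectively exhaustive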
Table 3 - 1 is the result of our computation. We begin by asking ourselves what
the probability is; of obtaining zero heads, one head, two heads, three heads, four heads
and five heads in an experiment where we are tossing an evenly balanced coin five times.
For each type of toss, we go ahead and use the binomial formula and compute the
probability of obtaining heads, just like we used in the last chapter. Accordingly, from
our computation activities, we come up with the information listed in Table 3 - 1. You are
advised to work out all the figures in each of the rows of this table, and to verify that the
probability values given at the end of each row are correct. Unless you verify this, you
will find it difficult to have an intuitive understanding of the normal curve. The next step
is for us to draw histograms of the sizes of these results in each row, so that we can
compare the outcome.
Figure 3 - 1: A Histogram for Binomial Probabilities that the
Actual Number of Heads ( r ) equals the Possible Number of
Heads ( R ) in Five Coin Tosses.
The reader is reminded about a similar diagram in Chapter One (on Page 19) about the
height of three-week old maize plants. Both these diagrams have one characteristic in
common. They have one mode, and have histograms and “bars” which are highest in the
middle. In both diagrams, if the frequency were increased so that the characteristic on the
horizontal axis is selected as finely as possible, instead of discrete observations, we have
very fine gradations of the quality (along the axis measuring quality - the horizontal axis).
If we draw frequency distributions of infinite observations for both cases (in Chapter 1
and Chapter 3), the result will be a smooth graph, which is highest in the middle, and then
flattens out to be lowest at both ends.
The second characteristic of both graphs is that the most frequent observations in
both diagrams occur either in the middle, or near the middle, within the region of the
most typical characteristic. This is the trend of thought which was taken by the early
statisticians who contributed to the discovery of Normal curve, especially Abraham de
Moivre (1667 - 1754) and Carl Friedrich Gauss (1777 - 1855).
The diagram in Chapter One (Figure 1 - 1, page 19) enables us to count physically
how many observations are recorded in one category, or in many categories. Once we do
this we can divide with the total count of all crosses to obtain the ratio of the subject
observations to the total number of observations. We are even able to compute the area
under the curve formed by the tops of the series of crosses, which area is expressed in
terms of the count of these crosses under the general “roof-line” of the whole group. This
count-and-ratio approach approximates the logic of the normal curve.
In Figure 3 - 1 above, the early pioneers of Statistics were able to develop the
frequency distribution which they called the Normal Distribution, by turning the
observations along the horizontal axis into very many infinite category successions, and
then plotting (mathematically at least) the frequency over each category of characteristic.
When they did this mathematically, they obtained a distribution which is bell-shaped and
with one peak over the most typical observation, the mean (μ). In Figure 1 - 1, the most
typical characteristic is 0.3 of a meter. In Figure 3 - 1, the most typical probability seems
to be 0.31250. We can rightfully argue that for Figure 3 - 1, this is the modal (or the most typical) probability (under the circumstances of the experiment, where π = 0.5, n = 5, and [ R = r ] varies between [ R = 1 head ] and [ R = 5 heads ] of the successful events).
We shall not delve in the mathematical computations which were involved in the
derivation of the normal curve, but we need to note that the curve is a probability
distribution with an exact mathematical equation. Using this equation, a probability
distribution is possible, because we can accurately plot an inverted bell-shaped curve; and
most importantly, we can compute the area under this curve. Any such value of the area
under the curve is the probability of finding an observation having the value within the
designated range of the subject characteristics.
Activity
1.
Go to Figure 1 - 1 and count all the maize which is 0.25 meters and below. This is represented by all the crosses in this characteristic range. You will find the crosses to be 31. In that regard, we can say that the probability of finding maize which is 0.25 meters and below is :-

P(maize 0.25 meters and below) = 31 Actual observations / 100 Total observations = 31/100 = 0.31

2.
Count all the plants which are 0.5 meters high. Compute the percentage of all the
plants which are 0.5 meters high. Remember that there are 100 plants in total.
What is the proportion of these plants to the total number of plants? What is the
probability that one could randomly find a plant which is 0.5 meters high?
3.
Count all the plants which are 0.35 meters and above. Compute the probability of
finding some plant within this region by chance, if we consider that the total
number of plants in this sample is 100 plants. What is the probability of finding a
plant with a height below 0.35 meters?
4.
Make your own table like Table 3 - 1. In your case, compute all the probabilities
of finding zero Heads, 1 Head, 2 Heads, ..... up to 8 heads.

5.
Using Table 3 - 1, find the total probability of obtaining 3 heads and below.
Explain why this kind of computation is possible.
In all the above activities we have been actually computing the area under the
curve, which can be described as a frequency distribution that is formed by observations
which result from the characteristics listed along the horizontal axis. The proportion or
the probability of an observation or a set of observations under the curve is actually some
kind of area statement. Please read the recommended texts and satisfy yourself that you
are dealing with areas.
The Normal Distribution
The area under the bell-shaped Normal curve
The founding fathers of statistics were able to use calculus and compute areas
under curves of different types. Using the Binomial distribution it was possible to
conceptualize very fine quality gradations along the horizontal axis, and to compute the
frequency for each of the closely spaced characteristics, to form the normal curve. The
general equation which they invented for describing the normal curve - which we do not
have to prove is as follows :-



In this equation

f  Xd X 


1

2 
2
 e

 x  
2
2 2

dX.
is the usual geometric constant, defining the number of times the
diameter of the circle goes into the circumference, which is 3.1416...times. The “ e ” is
the base of natural logarithm, usually known as Euler’s constant, whose value is
2.7183.... The symbol μ is the universal mean of the population under consideration, and σ is the population standard deviation.
The values of π and e can be obtained from our scientific calculators at the
touch of a button, because they are very useful in all advanced mathematical work. The
population mean, represented by the symbol μ, can either be computed from any number of
observations above 30 ( the more observations the better ), or be a known figure from the
observation of the nature of any data for a long time. The population standard deviation σ is likewise obtained from the raw population in the same manner as we discussed earlier. Except for the values of π and e, which exist to define the curvature of the curve and the continuity of the same curve respectively, the normal curve can be said to be completely determined by the values of the population mean (μ) and the population standard deviation (σ). Given these values, it is possible to determine the position of
each point on the bell-shaped normal curve - or the normal frequency distribution.
Consider the example of the maize plants example given in Figure 1 - 1. In this example,
it was possible to tell the height (Number) of each of the columns of crosses for each
category on the horizontal axis. The highest cross on each column of crosses determines
the position on the frequency curve at that point; and at that level of the characteristic
along the horizontal axis.
The more we increase the total number of the sample and the more we keep on
drawing histograms representing the number of counts for each characteristic, the
smoother the graph of the distribution approaches the smooth shape of the normal curve,
like in the two diagrams shown in Figure 3 - 2.
To demonstrate that we are dealing with a real phenomenon and not merely a
mathematical abstraction, we will now show that it is possible to compute the height of
the normal curve at any point using the above formula, and given the two parameters, the
population mean μ and the population standard deviation σ. We shall strictly stick to
evaluating one point on the normal curve. Remember that a point is the smallest part of a
curve - even a straight line. In this case we can say that a curve like a normal curve is a
succession of points plotted using some definite equation, like we did in elementary
schools, when we were learning how to draw graphs.
This means that if we can obtain one point using the formula, then we can get a
succession of points comprising the normal curve using the same formula :-
f X 
1
2 
2
We now begin this very interesting exercise :-
65

e 
 x  
2
2 2

Figure 3 - 2: The more the observations along the horizontal axis, and the closer their
values along the same axis, the more their relative frequencies fit along the
smooth curve of the normal curve
Example
A variable X has a population mean of μ = 3 and a standard deviation σ = 1.6. Compute the height of the normal curve at X = 2, using the equation for one point on the normal curve :-

f(X) = [ 1 / √(2πσ²) ] × e^[ -(X - μ)² / (2σ²) ].
Solution
Identifying the height of the curve using this formula means that we plot a point
on the curve using the given data. If we insert the given values and those of the
constants into the equation and solve the equation using these values, we shall get
the required height. We shall first begin by evaluating the exponent, which lies
above the Euler’s constant “ e ” on the given equation. We also need to remind
ourselves that “ f ( X ) ” on our equation means “ y ”, which in mathematics is any
value on the y-axis. This is the value we described as the frequency of occurrence
of the “point” characteristic defined along the horizontal axis. It is like asking
how many maize plants are 0.2 meters high on Figure 1 - 1. If we count these
we find that they are eleven. This now is the height of the jagged curve defined by
the top of the crosses in the diagram at the point X = 0.2. In this case, we are asking the same question for the value of X = 2. Now we know what we are looking for.

Step One :    Evaluate the exponent -\frac{(X - \mu)^2}{2\sigma^2}.

We do this by using the values given in the question, inserting them in the appropriate places, and obtaining the value within the square brackets.

-\frac{(X - \mu)^2}{2\sigma^2} = -\frac{(2 - 3)^2}{2(1.6)^2} = -\frac{1}{2 \times 2.56} = -\frac{1}{5.12} = -0.19531250

Step Two :    Evaluate the fraction 1/(σ√2π), using the given value of σ = 1.6 and the known constant π.

\frac{1}{\sigma\sqrt{2\pi}} = \frac{1}{1.6\sqrt{6.283185}} = 0.24933893

Step Three :    Evaluate Euler's number e raised to the power of the exponent calculated in Step One above.

e^{-0.19531250} = 2.7183^{-0.19531250} = 0.82257649
Step Four :
Multiply the results of Step Three and those of Step Two.
0.82257649 × 0.24933893 ≈ 0.20510
Amazing! We can therefore conclude that the height of the normal curve defined by the parameters X = 2, μ = 3, and σ = 1.6 is approximately 0.2051. The height is expressed in whatever units are being used in the subject experiment. We can conclude
that using the two population parameters: the population mean  and the population
standard deviation , it is possible to compute a succession of all the points along the
trend of the normal curve using our equation for finding one value of “ y ” or “ f ( x )”
along this curve. The continuous surface of the normal curve is made of all these heights
joined together. Mathematically, we say it is the locus of all these points.
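For learners who wish to verify this arithmetic by machine, the short sketch below repeats the four steps in Python. Python is not one of the packages listed in the curriculum, and the function name normal_height is our own illustrative invention; the sketch simply evaluates the same equation for one point on the normal curve.

import math

def normal_height(x, mu, sigma):
    # f(x) = [1 / (sigma * sqrt(2*pi))] * e^(-(x - mu)^2 / (2*sigma^2))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)      # Step One
    fraction = 1 / (sigma * math.sqrt(2 * math.pi))     # Step Two
    return fraction * math.exp(exponent)                # Steps Three and Four

print(normal_height(2, 3, 1.6))   # prints approximately 0.2051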
Similarly, if we are given the two important parameters, the population mean ( μ ) and the population standard deviation ( σ ), we can compute any area of any slice under the normal curve. The expression for the total area under the normal curve is :-

\int_{-\infty}^{\infty} f(X)\, dX = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\, dX
And, using the same logic, the equation defining the area subtended by any two values a and b, which are located on the horizontal axis (i.e. any characteristic values a and b), could be obtained using calculus by re-defining the end-limits of the general equation, as stated in the following expression :-

\int_{a}^{b} f(X)\, dX = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\, dX
Those of you who understand the branch of mathematics called Integral Calculus
know that this is true. In addition, if we can compute the areas of slices under the normal
curve, those areas will be identical to the probabilities that we can find any value within
the slice defined by the characteristic values a and b along the horizontal axis.
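Those who would like to see the calculus idea in action without doing any integration by hand can approximate such an area numerically. The following Python sketch is our own illustration (the names normal_density and area_between are invented for this example); it adds up many thin strips under the curve between the end-limits a and b.

import math

def normal_density(x, mu, sigma):
    # Height of the normal curve at x
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def area_between(a, b, mu, sigma, strips=10000):
    # Approximate the integral of the density from a to b using thin rectangles
    width = (b - a) / strips
    total = 0.0
    for i in range(strips):
        midpoint = a + (i + 0.5) * width
        total += normal_density(midpoint, mu, sigma) * width
    return total

# Area within one standard deviation of the mean, for mu = 3 and sigma = 1.6
print(area_between(3 - 1.6, 3 + 1.6, 3, 1.6))   # roughly 0.68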
Luckily, we do not have to use the formula every time for this purpose, because there
exist tables which have been developed by the founding fathers of Statistics to assist us in
computing these areas; if we learn the method of using these tables - as we are about to
learn within the discussion which follows below. Just now, we need to study the
characteristics of the normal curve.
Characteristics of the Normal Distribution
Any probability distribution of any random variable gives a probability for each
possible value of that random variable. In the case of the normal distribution, the random
variable is the characteristic of interest; which is usually plotted along the horizontal axis.
We already know that along the vertical axis lies the frequency of each characteristic.
This is why we involved ourselves in the exercise of counting the maize plants in Figure
1 - 1. We also need to remember that the frequency of any normal population is greatest
around the mean characteristic ( μ ) of the population. This is why we involved ourselves
in the computation of the binomial experiment of finding the probabilities of success after
tossing a coin five times, as we have done using Table 3 - 1. We also did similar things
using Figure 1 - 1.
When the normal distribution is considered merely in terms of the characteristic
of interest and their frequency of occurrence, it could be called a frequency distribution.
For the normal curve to be a probability distribution, we have to think in terms of the
characteristic of interest, and the probability that such a characteristic may turn out to be
the real characteristic after investigation. This is where the concept of random variable
comes in. Before any investigation is done, there is an a priori conception of all possible ranges of any random variable of interest. In this case, if we are thinking of the maize
example, all possible heights of maize plants after a certain period of growth may be
important. The characteristic heights would then form the horizontal axis; and the
frequency of their occurrence, the values along the vertical axis. In that regard, we may
compute the probability of interest in a similar manner as we have discussed above, either
using the Binomial Distribution, or using Integral Calculus. When such probability
distribution is plotted, a curve of the probability distribution is the result. The probability
distribution is actually an arithmetic transformation of actual frequencies; which also can
be plotted along a curve. We call the actual frequencies raw scores, and the observations
along the continuous probability distribution, probability frequency values. The resulting
curve is of course a frequency distribution.
Having defined the normal curve as a probability distribution, we need to state the
characteristics of the normal curve. We know that the peak of the normal curve lies above
the mean characteristic in any distribution, and therefore we begin by stating the first
characteristic :-

(a)	The normal curve is uni-modal in appearance. It has a single peak above the mean characteristic value μ. When we use sample values to estimate the normal curve, the parameter μ is replaced by the statistic X̄. The highest point on the normal curve defined using sample observations is therefore above the sample mean X̄.

(b)	The expression

f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

is for a bell-shaped curve which falls steeply on both sides of the mean value ( μ, or X̄ ), and then flattens out towards both ends as it approaches the horizontal axis. Statistically we say that the curve has an upper tail and a lower tail, as in Figure 3 - 3 below.
Figure 3 - 3: The Shape of the Normal Curve
(c)	The normal curve depends only on two parameters: the population mean ( μ, or X̄ ), and the population standard deviation ( σ, or s for the sample standard deviation ).
(d)	The total area, in terms of all there is of the observations, is either the population or the sample of interest. If we consider the curve as a frequency distribution, the area of that curve represents the total probability of finding members of that population or that sample having all the possible range of the characteristic of interest. Since we are considering all there is of the known characteristics under the normal curve, the total probability is actually 1.0. We can state this using the equation for the definition of random variables :-

\sum_{i=1} P\left(X = x_i\right) = 1.0

If we have a finite number of frequencies and observations to consider, we can translate this equation to define the number between 1 and n observations :-

\sum_{i=1}^{n} P\left(X = x_i\right) = 1.0
(e)
The shape of the normal curve is completely determined by the population or the
sample means and the population and the sample standard deviation. Each curve
has the same configuration as all the others, but in fact differs from all the others
depending on the circumstances of the experiment and the characteristics of the
population of interest. However, if we assume that the total probability of all the
existing observations is 1.0, then we have what is called the Standard Normal
Curve. Such a curve is standardized in such a manner as to leave it with a mean of
zero and a standard deviation of 1.0.
Activity
The process of standardization can be very easily modeled by requesting a class
of 30 students (or more) to measure their heights accurately. Then request them to
compute their mean height. After this request them to stand up. Some students
will be taller and others will be shorter than the mean height. You may now
request all the members of that class who have exactly the mean height to sit
down. For the remaining students, record exactly how much taller or shorter they are than the mean height. Then look for the standard deviation of the class
heights.
Results
1.
When all the students with their heights exactly equal to the mean height are
either requested to sit down or to leave the group, this is like saying that the
measurement of all the deviations shall begin from the mean value. Thus the mean
value becomes your zero value.
2.
Those students who are taller than the mean will record a positive difference above the mean. Record these for the taller students by deducting the mean height from their actual heights. The students whose height is smaller than the mean will record a negative difference from the mean. Do all the subtractions and record these with their negative values, clearly indicated using the negative signs. For both groups, ignore the mean value, and record only the deviations from the mean with their appropriate signs indicated in front of their respective values. This is like making the mean value zero.
3.
Examine the value of the standard deviation. Using the standard deviation as a
unit of measurement, find out by how much shorter or taller than the mean each of the remaining members of the class is - after all, the standard deviation is a
deviation like all the others, only it is Standard, which means it is the typical, or
it is the expected normal deviation from the mean. This comparison can be done
by dividing all deviations from the mean (whether large or small) by this standard
deviation, which is after all the regular standard of measurement.
4.
The result of this will give the number of standard deviations which separate each
height-observation from the mean value ( μ ), which we have discounted and assumed to be zero. Also, these deviations from the mean, large and small, will be negative or positive, depending on which side of zero they happen to be located. The number of standard deviations for each case, taken in absolute value, could lie anywhere between zero and about 4 standard deviations, and may take decimal values (for reasons we shall discuss soon).
5.
This number of standard deviations from the mean for each observation (covering each student who was not excluded when we asked those with the mean value to step aside) is the so-called Normal Deviate value, usually designated as the "Z-value". Any Z-value measures the number of standard deviations at which any observation in any sample or any normal population stands away from the mean value. This value could take fractional values; it could be as large as one, two, or even between three and four standard deviations from the mean. Our work of manipulating the normal curve will rest heavily on understanding this characteristic of the Z-value. We shall do this presently.
6.
The area under the normal curve covering any interval on both sides of the
population mean depends solely upon the distance which separates the end-points
of this interval in terms of the number of standard deviations, which from now we
shall call Z-values, or the number of normal deviates. Any normal curve created
out of deviations from the mean population values is called a standardized
normal curve. Using the Z-values we can compute the area under the normal
curve between any two end-limits. This saves us from the use of esoteric
mathematics of integral calculus, which is also right; and can bring identical
results if accurately used. In terms of standard deviations, the proportion of the
normal curve between any two equal intervals on both sides of the mean is in
accordance with Table 3 - 2. This table actually reflects the very nature of the normal curve which we have been explaining above. We must keep reiterating that we are concerned with the number of members of any population having any particular measurable characteristic. Also, we reiterate that in any normal population which is nearly homogeneous in character, the magnitude of the characteristic of interest is nearly identical, and does not differ considerably from the mean. This is why most of the population is found near the mean, whether we consider it in terms of the magnitude of interest or in terms of the distance, in standard deviations, away from the mean.
7.
The number of standard deviations of any observation from the population or the sample mean can be calculated by dividing the deviation of that observation above or below the mean by the population or the sample standard deviation.
Table 3 - 2: AREA UNDER THE NORMAL CURVE SPANNING BOTH SIDES OF THE MEAN, MEASURED AS A PERCENTAGE, AS A PROPORTION, AND IN TERMS OF STANDARD DEVIATIONS

Number on each side of the mean        Population % having        Proportion of the        Probability of finding a
                                       this range of              total area under         member of the population
                                       characteristic             the normal curve         within this bracket

One Standard Deviation
( μ - σ, to μ + σ )                        68.0%                      0.68                     0.68

Two Standard Deviations
( μ - 2σ, to μ + 2σ )                      95.5%                      0.955                    0.955

Three Standard Deviations
( μ - 3σ, to μ + 3σ )                      99.7%                      0.997                    0.997

Four Standard Deviations
( μ - 4σ, to μ + 4σ )                      99.994%                    0.99994                  0.99994
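The entries in Table 3 - 2 can be checked with one standard mathematical function. As an illustrative aside (not part of the packages prescribed in the curriculum), the proportion of a normal population lying within k standard deviations of the mean equals erf(k/√2), and Python's math library provides erf:

import math

for k in (1, 2, 3, 4):
    proportion = math.erf(k / math.sqrt(2))   # area between (mu - k*sigma) and (mu + k*sigma)
    print(k, round(proportion, 5))
# 1 -> about 0.68269,  2 -> about 0.9545,  3 -> about 0.9973,  4 -> about 0.99994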
Using Standard Normal Tables
This discussion now brings us to a very interesting situation, where we can look
for any area under the normal curve subtended by any two end-limits along the
characteristic axis ( X-axis ) without any use of complicated integral calculus - which is
the tool mathematicians use for calculating areas under any curve. Amazing! But we must
thank our mathematical fathers, especially Carl Friedrich Gauss (1777 - 1855), for the painstaking efforts of developing the theory behind the so-called Standard Normal Tables. His work on the Standard Normal Tables and all the associated theories and paradigms led the standard normal curve to be classified by mathematicians as a Gaussian Distribution.
We can now comfortably learn how to use the Standard Normal Tables and the
fact that a normal curve is a probability distribution to solve a few easy problems. For
additional work on this exercise you are advised to read King’oriah (2004, Jomo
Kenyatta Foundation, Nairobi) Chapter Five.
Example
A certain nut has an expected population mean weight μ of 50 grams, and a population standard deviation σ of 10 grams. How many standard deviations " Z " away from the population mean μ is a nut which you have randomly picked from the field, and which weighs 65 grams?
Solution
This is a good example of looking for the normal deviate Z. We must obtain the difference between the population mean value and the sample observation which we have taken from the field, in terms of actual raw weight. The formula for this activity is :-

Z = \frac{X_i - \mu}{\sigma}

where X_i is any value of the random variable X (any actual value of an individual observation), μ is the mean of the population under consideration, and σ is the standard deviation of the population under consideration. This formula is analyzed in a simple manner using our nut example. The difference between the actual observation and the population mean is :-

65 gm. - 50 gm. = 15 gm.

This difference must be translated into the number of standard deviations so that we can calculate the Z-value :-

Z = \frac{65 \text{ gm.} - 50 \text{ gm.}}{10 \text{ gm.}} = \frac{15 \text{ gm.}}{10 \text{ gm.}} = 1.5 Standard Deviations.
Here we are using the standard deviation as a measure of how far the actual field
observation lies away from the mean; and in this case the nut we picked lies 1.5
standard deviations away from the mean value μ.
Example
Suppose in the above example we find another nut which weighs 34 grams. Calculate the Z-value of this difference between X_i and μ.
Solution
Using the same units which we have been given in the earlier example, we have :-

Z = \frac{X_i - \mu}{\sigma} = \frac{34 - 50}{10} = \frac{-16}{10} = -1.6
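Both nut calculations can be reproduced with a two-line helper. This is only an illustrative sketch in Python; the name z_value is our own invention.

def z_value(x, mu, sigma):
    # Normal deviate: how many standard deviations x lies from the mean
    return (x - mu) / sigma

print(z_value(65, 50, 10))   # 1.5, the 65-gram nut
print(z_value(34, 50, 10))   # -1.6, the 34-gram nut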
The Standard Normal Tables
Look at page 487 in your textbook (King'oriah, 2004), or any book with the standard normal tables entitled Areas Under the Normal Curve. In the first case of our
nut-example, we obtained 1.5 standard deviations from the mean. Look down the left
hand column of the table in front of you. You will find that beside a value of Z = 1.5
down the Z-column is an entry “ 4332 ”.
This means that the area under the normal curve with the end-limits bounded by
the mean  and the actual observation “65 grams” is 0.4332 out of 1.0000 of the
total area of the normal curve. And since the value is positive, the slice of the
normal curve lies on the right hand side of the mean as shown in Figure 3 - 3.
In the second case where Z = – 1.6, the actual observation lies on the left hand side of the mean μ, as indicated by the negative sign in front of the Z-value. The reading on the table against Z = – 1.6 is " 4452 ". This means that an observation of 34 grams and the mean μ = 50 grams as end-limits (along the characteristic values X-axis) subtend an area which is 0.4452 out of the total area of 1.0000. This area is shown on the left hand
side of the mean in Figure 3 - 3.
Figure 3 - 3
Showing the Number of Standard Deviations for the weight
of a Nut, Above and Below Mean
Having computed the areas of the slices on both sides of the mean which are
subtended by the given end-limits (in terms of the proportions of 1.0000), the next
question we need to ask ourselves is what the probability is, that we can find an
observation within both areas subtended by both intervals of the given end-limits. Both
proportions (out of 1.0000) which we have already computed - namely 0.4332, and 0.4452, respectively, are actually records of the probability that one can find some
observation within the areas subtended by the respective end-limits.
The probability that one can find a nut weighing between 34 grams and 65 grams
on both sides of the population mean  is the sum of both probabilities which
we have already computed. It is obtained using the equation below.
Total Probability = 0.4332 + 0.4452 = 0.8784
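The same total probability can also be obtained without the printed tables by using the cumulative standard normal curve. The following Python sketch is our own hedged illustration (the helper name phi is invented; it is built from the error function):

import math

def phi(z):
    # Cumulative standard normal probability up to z
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 50, 10
probability = phi((65 - mu) / sigma) - phi((34 - mu) / sigma)   # P(34 < X < 65)
print(round(probability, 4))   # about 0.8784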
Example
What proportion of the normal curve lies between X_i = 143 and μ = 168, when the standard deviation of this population is σ = 12?
Solution
Step One:	Compute the normal deviate Z over this interval.

Z = \frac{X_i - \mu}{\sigma} = \frac{143 - 168}{12} = \frac{-25}{12} = -2.08
Like all calculations of this kind, we ignore the negative sign in our answer,
because this merely tells us that observation 143 is smaller than observation 168.
Step Two:
When Z = 2.08 we look up the proportion in the standard normal
tables on page 487 of your textbook. (King’oriah, 2004)
The row 2.0 of the Z-values down the left hand side column and the column headed 0.08 along the top row converge, within the body of the table, at a reading of " 4812 ".
This means that 0.4812 of the normal curve lies between the observations
given within the question of our example - namely between X_i = 143 and μ = 168.
The real life implication of this can be better appreciated using percentages. It
means that 48.12% of all the members of the population have the size of the characteristic
of interest that lies between 143 and 168. These characteristics could be measured in any
units. Suppose the example refers to the weight of two-month old calves of a certain
breed of cattle. Then 48.12% of these calves must weigh between 143 kilograms and 168
kilograms.
Example
Using the parameters and statistics given in the above example you are requested
to compute the percentage of the two-month old calves which will be weighing
between 143 kilograms and 175 kilograms.
Step One :
We know that between 143 kg. and 168 kg. are found the weights of 48.12% of
all the calves in this population.
Step Two:
Compute the normal deviate between 168 kilograms and 175 kilograms.
Z = \frac{X_i - \mu}{\sigma} = \frac{175 - 168}{12} = \frac{7}{12} = 0.58333

Step Three:
Look for the area under the normal curve between the mean of 168
kilograms and 175 kilograms. We shall now symbolize the area which we shall
find this way, and also all those that we found earlier, using the same designation
as “ Az ”. This symbol means “ the area defined by the value of the given, or the
value of the computed Z-figure” using the last expression.
Look down the left hand column " Z " of the Standard Normal Table on page 487, and find where the Z-value reads " 0.5 ". Then go across the top-most row of that Z-table to a column-heading reading 0.08333 (if you can find it). You will find that the nearest figure to this along the top-most row is 0.08 (or perhaps 0.09). For our convenience, let us adopt the column labeled with the approximate figure of " 08 " along the top-most row of the Z-table. Where the row labeled " 0.5 " intersects the column labeled " 08 " lies the area-value which corresponds to Z = 0.5800 (instead of our computed value of Z = 0.58333). After all, 0.5800 is a good rounded figure. Therefore the area subtended by this number of standard deviations Z = 0.5800 from the mean ( μ ) occurs where the row value of " 0.5 " down the left hand Z-column intersects the column along the top-row labeled " 08 ".
This intersection defines an area (the “ Az ” ) under the normal curve of 0.2190.
Using our percentage value interpretation, we find that 21.90% of the calves must
weigh between 168 and 175 Kilograms.
Figure 3 - 4
Showing the Proportion of the Calf Population Which
Weighs between 168 Kilograms and 175 Kilograms
The “ t ” Distribution and Sample Data
Sampling Distribution of Sample Means
It is not every time that we are lucky to deal with entire populations of the
universe of data. Very often we find that we are limited by time and expense from
dealing with the whole population. We therefore resort to sampling (See Chapter Four) in
order to achieve our investigations within the time available; and often, within the level
of expense that we can withstand. In this case, we do not exactly deal with a normal
distribution. We deal with its very close “cousin”: called the Sampling Distribution of
Sample Means.
This distribution has very similar qualities to the normal distribution because after all - all data is obtained from one population or another. However, we know that the
characteristic of this data comes from only one sample out of the very many which could
be drawn from the main population. In this case, we have a double-transformation of our
sample data. Firstly, the data is governed by the quality of all individuals in the entire
population, and secondly the sample is governed by its own internal quality as a sample,
as compared to any other sample which could be randomly selected from the entire
“universe” or population.
Mathematicians have struggled with this phenomenon for a long time; to try and
find how sample data can be used for accurate estimation of the qualities of the parent
population, without risking the inaccuracies which could be caused by the fact that there
is a great chance for the sample not to display the complete truth about the characteristics
of the parent population. This can happen because of some slight sampling errors, or
because of the random characteristic of the sample collected - which could differ from all
the other samples in some way, and also from the main population.
After some prolonged study of the problem, the concept of the Expected Value of
the Sample Mean - whatever its nature, was found to be the same as the population mean.
However, suppose there are some fine differences? This is why and how the
mathematicians came up with the Central Limit Theorem. This theorem clarifies the fact
that all samples are estimates of the value of the entire population mean; but each sample
may not be the exact estimate of that population's mean. Therefore we assume that sampling is repeated randomly many times, and the mean of all the sample means (with their very fine differences) is then sought. This mean is expected to be identical to the parent population mean.
Obviously, this collection of sample means, with its fine differences, after many
repeated samples will form a distribution in its own right. This is what mathematicians
and statisticians call the Sampling distribution of Sample means. Since every sample is a
very close estimator of the main population mean, such a sampling distribution is
expected to be very closely nested about the main population mean ( μ ); because the
frequencies (of all samples which are expected to have sample-means very nearly the size
of the parent population-mean) will be very high. In other words, these samples will be
very many, and their count will be clustered about the main population mean. Here we
must remember the counting exercise we did at Figure 1 - 1 ( page 19 ). Then, let us
project our thoughts to the counting of the numbers of sample means around the population mean ( μ ), perhaps using a number of crosses like we used in that diagram.
If we were to generalize and smoothen the Sampling distribution of Sample means
from discrete observations (or counts of data) like the “crosses” found in Figure 1 - 1, the
result would be like the steeper curve in Figure 3 - 5. We would therefore obtain a very
steep "cousin" of the normal curve nested about the population mean ( μ ) of the normal
distribution which is formed by the “raw” observations of the parent population. The
ordinary normal distribution comprises the flatter curve in that diagram.
Figure 3 - 5: Theoretical Relationship between the Sampling Distribution
of Sample means and a normal curve derived from Raw scores
The sampling distribution of sample means also has its own kind of standard
deviation, with its own peculiar mathematical characteristics, which are also affected by
its origins. This is what is asserted by the Central Limit Theorem statement. Under these
circumstances, the standard deviation of sample means is called the standard error of
sample means. In order to discuss the Central Limit Theorem effectively, it is important
to state the theorem, and thereafter explain its implications.
The Central Limit Theorem states that :-
If repeated random samples of any size n are drawn from any population whose mean is μ and variance σ², as the number of samples becomes large, the sampling distribution of sample means approaches normality, with its mean equal to the parent-population mean ( μ ) and its variance equal to \frac{\sigma^2}{n}.
Firstly, we need to note that we are using the population variance of the sampling distribution of sample means for the final formula: \left(\frac{\sigma^2}{n}\right). This is because the variance is the measure which is used mostly by mathematicians for the analysis of statistical theorems. This does not matter, as long as we remember that the standard error of sample means can be sought using the square root of the variance, expressed as

\sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}}

An ordinary standard deviation is a measure of the dispersion of an ordinary population of raw scores, and the standard error of sample means is a measure of the standard deviation of the distribution of sample means.
Secondly the peculiarities of the sampling distribution imply that we use it in the
analysis in a slightly different way than that of the ordinary normal distribution; even if
the two distributions have a close relationship. This means we use a statistic which is a
close relative of the normal distribution, called the “ t ” distribution. This statistic is
sometimes called the Student’s t Distribution (after its inventor, W.S. Gosset, who
wrote about this statistic under the pen-name “ Student ”, because his employer, Guinness
Breweries, had forbidden all its staff from publishing in journals during the period Gosset
was writing - 1908.)
The statistic first uses an estimate of the standard deviation through the usual formula, which reflects the fact that one degree of freedom has been lost during such an estimate. This sample standard deviation is represented by S and computed using the usual formula :-

S = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}}
where S means the estimated standard deviation of the sample, and n is the number of
observations. The rest of the standard error formula can be compared to the usual formula
for the computation of standard deviations using raw scores. Remember, the usual
expression for the raw-score standard deviation of a normal distribution is

\sigma = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n}} ,

and compare this with our new expression for the sample standard deviation which we have just stated. Once the sample standard deviation has been computed, then the standard error is computed using the result of that computation in the following manner :-
S_{\bar{X}} = \frac{\sqrt{\dfrac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}}}{\sqrt{n}} = \frac{S}{\sqrt{n}}

Sometimes this expression for the standard error is denoted by \sigma_{\bar{X}} = \frac{S}{\sqrt{n}}, where σ_X̄ means the standard error of sample means. It reflects the fact that we are actually dealing with a very large population of sample means. In all our work we shall use S_{\bar{X}} = \frac{S}{\sqrt{n}} as our designation, to mean the standard error of sample means. Once this standard error has been computed, then the expression for the t-distribution can be computed using the following expression :-

t = \frac{\bar{X} - \mu}{S_{\bar{X}}} = \frac{\bar{X} - \mu}{S/\sqrt{n}}
This is the expression which has been used to make all the t-tables. These tables
are available at the end of your textbook ( King’oriah, 2004, page 498). The distribution
looks like the usual normal curve. The interpretation of the results is similar. The
distribution is bell-shaped and symmetrical about the parent population mean ( μ ). The scores recorded in this distribution are composed of the differences between the sample mean X̄ and the value of the true population mean μ, which difference is then divided by S_X̄ each time. The number of standard errors which any member of this population stands from any sample mean can be obtained, and be used to compare individual sample means with the population mean ( μ ), or among themselves. The distribution can also be
used to compute probabilities, as we shall see many times in the discussion within this
module. The cardinal assumption which we make for this kind of statistical measure is
that the underlying distribution is normal. Unless this is the case, the t-distribution is not
appropriate for statistical estimation of the values of the normal curve, or anything else.
We are now ready to use a small example to show how to compute the standard error of
mean.
Example
Four Lorries are tested to estimate the average fuel consumption per lorry by the
manager of a fleet of vehicles of this kind. The known mean consumption rate per
every ten kilometers is 12 liters of diesel fuel. Estimate the standard error of the
mean using the individual consumption figures given on the small table below.
Lorry Number      1       2       3       4
Consumption       12.1    11.8    12.4    11.7
Solution
We first of all compute the variance of the observations in the usual manner. Observe the computation in Table 3 - 3 carefully, together with the following expressions, to make sure that the variance, the standard deviation and, finally, the standard error of the mean have each been accounted for.

1.	S = \sqrt{\frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}} = \sqrt{\frac{0.30}{4 - 1}} = \sqrt{0.10} = 0.316

2.	S_{\bar{X}} = \frac{\sqrt{\dfrac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n - 1}}}{\sqrt{n}} = \frac{0.316}{\sqrt{4}} = \frac{0.316}{2} = 0.158
TABLE 3 - 3 : FIGURES USED IN COMPUTING THE ESTIMATED SAMPLE STANDARD DEVIATION IN PREPARATION FOR THE COMPUTATION OF THE STANDARD ERROR OF SAMPLE MEANS

X_i        X̄         ( X_i - X̄ )              ( X_i - X̄ )²
12.1      12.0       12.1 - 12.0 =  0.1         0.01
11.8      12.0       11.8 - 12.0 = -0.2         0.04
12.4      12.0       12.4 - 12.0 =  0.4         0.16
11.7      12.0       11.7 - 12.0 = -0.3         0.09
TOTAL                                           0.30

3.
The expression number ( 2 ) above is a systematic instruction of how to compute
the standard error of sample means from the sample standard deviation; which is
also computed in the first expression ( 1 ). Note how we are losing one degree of
freedom as we estimate the sample standard deviation. In the second expression
( 2 ) above, we must be careful not to subtract one degree of freedom a second time when forming the denominator " √4 ". That would bring errors caused by double-counting.
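The whole of Table 3 - 3, and both expressions (1) and (2), can be reproduced in a few lines. The sketch below is illustrative only, written in Python rather than in the packages named in the curriculum:

import math

consumption = [12.1, 11.8, 12.4, 11.7]                   # litres per ten kilometres
n = len(consumption)
mean = sum(consumption) / n                              # 12.0

sum_sq_dev = sum((x - mean) ** 2 for x in consumption)   # 0.30, as in Table 3 - 3
s = math.sqrt(sum_sq_dev / (n - 1))                      # sample standard deviation, about 0.316
standard_error = s / math.sqrt(n)                        # standard error of sample means, about 0.158

print(round(mean, 3), round(s, 3), round(standard_error, 3))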
After our understanding of what the standard error of sample means is, we now
need to use it to estimate areas under the t-distribution in the same way as we used the
Standard Normal Tables. We reiterate that the use of the t-distribution tables is similar to
the way we used the Standard Normal Tables. We make similar deductions as those we
make using the standard normal tables. The only difference is that the special tables have
been designed to take care of the peculiarities of the t-distribution, as given in any of your
textbooks (like King’oriah, 2004, page 498). We shall, however, defer t-table exercises a
little (until the following chapter) so that we can consider another very important statistic
which we shall use frequently in our statistical work. This is the standard error of the
proportion.
The Proportion and the Normal Distribution
In statistics the population proportion is regarded as some measure of central
tendency of population characteristics - like the population mean. The fact that 0.5 of all
the people in a certain population drink alcoholic beverages means that there is a central
tendency that any person you meet in the streets of the cities of that community drinks
alcoholic beverages. Thus the proportion is a qualitative measure of the characteristics of
the population, just like any observation (or any sample mean) is a quantitative measure
of the characteristics of any population. Remember that early in this chapter we said that
the normal curve of any population is estimated via the binomial distribution. The
proportion of success for any sample or population is a binomial variable. The lack of
success or failure is also a binomial variable. Both satisfy Bernoulli conditions.
Therefore the distribution of the proportion is very closely related to the normal
distribution. In fact, they are one and the same thing, only that for the raw scores (or
ordinary observations) we use quantitative measures (within the interval or ratio scales of
measurement), while for the proportion observations we use qualitative measures (within the nominal scales of measurement) comprising success ( P, or π ) or failure ( 1 - P, or 1 - π ).
The universal (or the population) measure of the proportion is denoted by the
Greek letter π, while the sample proportion is denoted using the Roman letter " P ",
sometimes in the lower case, and at other times in the upper case.
This means that given the sample proportion “ Ps ” , we can use this to estimate
the position of the sample proportion quality ( Ps ) under the normal distribution. This
can be done in the same manner as we can use any observation or any population mean of
the normal distribution or any sample-mean within the t-distribution.
Like the sample mean ( X̄ ), the expected value of the sample proportion " P " is almost equal to the population proportion π; because, after all, the sample whose proportion P has been computed belongs to the main population. Therefore, the Expected Value of any sample proportion is the population proportion: E ( Ps ) = π. The parameter known as the Standard Error of the Proportion (whose mathematical nature we shall not have time to discuss in this elementary and applied course) is denoted symbolically as

\sigma_P = \sqrt{\frac{\pi\left(1 - \pi\right)}{N}}

An estimate of this standard error using the sample proportion is denoted as

\sigma_{P_s} = \sqrt{\frac{P\left(1 - P\right)}{n}}

Here the various symbols which we have used in both expressions for the standard errors of the proportion are translated in the following manner :-
σ_P      =    Standard error of the population proportion
σ_Ps     =    Standard error of the sample proportion
P        =    Sample proportion for successful events or trials or characteristic
π        =    The population proportion parameter for successful trials
1 - π    =    The population proportion parameter for unsuccessful trials
1 - P    =    The sample proportion parameter for unsuccessful trials
N, n     =    The number of observations in the population, and in the sample, respectively.
The associated normal deviate " Z " for the proportion, which is used with the standard normal curve to compute the position of samples which have specified characteristics in any population having a population proportion π, is :-

Z = \frac{P_s - \pi}{\sigma_{P_s}} = \frac{P_s - \pi}{\sqrt{\dfrac{P\left(1 - P\right)}{n}}}
The behavior of this normal deviate is identical to that of the normal deviate computed
using raw observations (or scores), which we considered earlier. The Standard Normal
Curve is used to estimate the positions of sample proportions, in the same manner as we
used the same curve to estimate the position of individual members of the population
under the normal curve. Refer to pages 69 to 74 in this document.
We now need to compute the standard error of the proportion using the
information which we have just learned. After this we shall compute the normal deviate
“ Z ” using the sample-proportion and the population-proportion. Then using the number
of the standard errors which we shall calculate, we shall derive the required probability:
as illustrated by Figure 3 - 6.
Figure 3 - 6 : The shaded area is above P = 0.7.
Example
An orange farmer has been informed by the orange tree-breeders that 0.4 of all oranges harvested from a certain type of orange tree within Tigania East will have some green color patches on their orange skins. This is what has been found after a long period of orange tree breeding in that part of the country.
What is the probability that more than 0.7 out of ten (10) randomly selected oranges from all the trees within that area will have green patches mixed with orange patches on their skins?
Solution
1.
In this example the long observation of orange skin color indicates that whatever was obtained after a long period of breeding is a population probability of success π. This is what will be used to compute the standard error of the proportion. This means that π = 0.4. Then 1 - π = 0.6. The underlying assumption is that sampling has been done from a normal population. The latest sampling experience reveals a sample proportion P = 0.7. The number of the oranges in this sample ( N ) is ten oranges.
2.
The standard error of the proportion is :-

\sigma_P = \sqrt{\frac{\pi\left(1 - \pi\right)}{N}} = \sqrt{\frac{0.4\left(1 - 0.4\right)}{10}} = 0.1549

3.
The normal deviate is computed using the following method :-
Z = \frac{P - \pi}{\sigma_P} = \frac{0.7 - 0.4}{0.1549} = 1.937

4.	This means that in terms of its characteristics, the current sample is " 1.937 standard deviations " away from the mean (typical) characteristic or quality, on the higher side. Remember that the typical characteristic or quality in this case is the population proportion of success; which we saw was π = 0.4.
What remains now is to use the computed normal deviate to obtain the required probability. We now introduce a simple expression for giving us this kind of instruction. This type of expression will be used extensively in the following chapters. We will use it here by way of introduction.
5.

P\left(P_s \geq 0.7\right) = P\left[\frac{P_s - \pi}{\sigma_P} \geq \frac{0.7 - 0.4}{\sigma_P}\right] = P\left(Z \geq 1.937\right)
This expression reads, "The probability that the sample proportion Ps will be equal to, or be greater than, 0.7 { expressed as P ( Ps ≥ 0.7 ) } is the same as the probability that the number of standard deviations away from the mean ( π ) will exceed the number of standard deviations defined by Z = 1.937." This number of standard deviations ( Z = 1.937 ) has been computed using the large expression in square brackets at number (5) above - the same way as we had done previously.
6.
To obtain the probability that more than 0.7 out of ten (10) randomly selected
oranges from all the trees within that area will have green patches mixed with
orange patches on their orange-fruit skins, we need to obtain the probability that
0.7 out of 10 oranges will have this characteristic, and then to subtract the figure
which we shall get from the total probability on that half of the normal curve,
which is 0.5000.
To do this we look at the standard normal tables and find the area subtended by
Z = 1.937 standard deviations above the mean. We shall, from now henceforth,
describe this kind of area with the expression Az. In this regard Az = A1.937.
Looking at the Standard Normal Tables on page 487 (King’oriah, 2004), we go
down the left hand “Z” column until we find a value of 1.9. Then we find where it
intersects with the figures of the column labeled .03, because .037 is not available.
The figure on the intersection of the “1.9 row” and “0.03 column” of the Standard
Normal Table, is 4732. Then we conclude that :-
Az = A1.937 = 0.4732
7.
Remember this is on the upper half of the standard normal frequency distribution.
This upper half comprises 0.5000 of the total distribution, or 0.5 out of all the
total probabilities available, that is 1.0. This means that on this end of the
Standard Normal curve, the probability that more than 0.7 of the sample of ten
(10) randomly selected oranges from all the trees within this area will have green
patches mixed with orange patches on their orange-fruit-skins, will be the difference between the total area of this half of the curve (which is 0.5000) and the value of Az = A1.937 = 0.4732.
8.
Az = A1.937 = 0.4732

P ( Ps ≥ 0.7 ) = 0.5000 - 0.4732 = 0.0268

This means that, for any ten randomly selected oranges, the probability that more than 0.7 of them will have green patches mixed with orange patches on their orange-fruit-skins is only about 2.7%.
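The orange example can be checked end to end with the cumulative normal curve instead of the printed table. The following Python sketch is our own illustration (the helper name phi is invented); the tiny difference from 0.0268 arises only because the table reading rounds Z to two decimal places.

import math

def phi(z):
    # Cumulative standard normal probability up to z
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

pi_pop = 0.4     # population proportion of success
n = 10           # oranges in the sample

standard_error = math.sqrt(pi_pop * (1 - pi_pop) / n)   # about 0.1549
z = (0.7 - pi_pop) / standard_error                     # about 1.937
print(round(1 - phi(z), 4))                             # about 0.0264, close to 0.0268 from the tables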
Using the same kind of logic we can do one of these examples using the standard normal
curve and the normal deviate Z, just to demonstrate the similarities. Observe all the steps
very closely and compare with the example we have just
finished.
Example
The Institute of Primate Research has established that a certain type of tropical monkey found in Mount Kenya Forest has a mean life span of μ = 24 years and a standard deviation of σ = 6 years. Find the probability that a sample of monkeys of this type caught in this forest will have a mean life span of more than 25 years. One hundred monkeys were caught by the researchers.
(Source: Hypothetical data, as in King’oriah 2004, page 135.)
Solution
1.	μ = 24 years, σ = 6 years, N = 100 monkeys.

P\left(\bar{X} > 25\right) = P\left[\frac{\bar{X} - \mu}{\sigma / \sqrt{N}} > \frac{25 - 24}{6 / \sqrt{100}}\right] = P\left[Z > \frac{25 - 24}{6/10}\right] = P\left[Z > \frac{1}{0.6}\right] = P\left(Z > 1.6667\right)
2.
The area defined by 1.6667 standard errors of X̄, above the population mean μ, can now be defined as Az = A1.6667 = 0.4525. This area lies in the upper portion of
the normal distribution.
3.
The area above Az = A1.6667 = 0.4525 can only be the difference between 0.5000 of the distribution in the upper portion, less A1.6667 = 0.4525. This area is equal to :-

0.5000 - 0.4525 = 0.0475

This means that the probability that the mean life span of such a sample of one hundred monkeys exceeds 25 years is only 0.0475, or 4.75%.
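Again the answer can be verified with the cumulative normal curve; the Python sketch below is illustrative only (the helper phi is an invented name), and the small difference from 0.0475 is due to rounding Z to 1.67 when reading the table.

import math

def phi(z):
    # Cumulative standard normal probability up to z
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 24, 6, 100
standard_error = sigma / math.sqrt(n)     # 6 / 10 = 0.6
z = (25 - mu) / standard_error            # about 1.6667
print(round(1 - phi(z), 4))               # about 0.0478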
Activity
1.
Look at the textbook (King’oriah 2004, page 167) for a computation of this kind,
which involves the standard error of sampling distribution of sample means and
the use of the t-distribution. While you are doing so, you are advised to read
Chapters Five, Six and Seven of the Textbook in preparation for the coming
Chapter Four of these guidelines.
2.
Attempt as many problems as you can at the end of Chapters Five and Six in the
same Textbook. Tutors may set as many of these problems as possible for
marking and grade awards.
EXERCISES
1.
(a)
Explain why the proportion is regarded as a kind of mean.
(b)
In each of the following situations, a random sample of n parts
is selected from a production process in which the proportion of
defective units is π. Calculate the standard error of the sample proportion, and the
normal deviate that corresponds to the stated possible value of the sample
proportion.
          π         p         n
( i )     0.1       0.16      100
( ii )    0.2       0.215     1,600
( iii )   0.5       0.42      25
( iv )    0.01      0.013     9,900

2.
In Sarah Mwenje's coffee farm, the yield per coffee bush is normally distributed, with a mean yield of 20 kg. of ripe cherries per bush, and a standard deviation of 5 kilos.
(a)
What is the probability the average weight per bush in a random sample of n = 4
bushes will be less than 15 kg.?
(b)
Assuming the yields per bush are normally distributed, is your answer in (a)
above meaningful? Explain why.
3.
(a)
Explain with reasons why the expected value of a sample mean
from any population is the parameter mean of that population.
(b)
State the Central Limit Theorem.
(c)
Why do you think the Central Limit Theorem works for any
population with any kind of distribution and population mean?
(d)
How does the standard error of mean assist in model building
for all populations?
CHAPTER FOUR
STATISTICAL INFERENCE USING THE MEAN AND PROPORTION
Population and samples
It is often the task of a scientist, whether social, behavioral or natural, to examine the
nature of the distribution of some variable character in a large population. This entails the
determination of values of central tendency and dispersion - usually the arithmetic mean,
standard deviation, and the proportion. Other features of the distribution such as
skewness, its peakiness (the so-called Kurtosis), and a measure of its departure from
some expected distribution may also be required.
The term “Population” is generally used in a wide but nevertheless strict sense. It
means the aggregate number of objects or events - not necessarily people - which vary in
respect of some variable of interest. For example, one may talk about the population of
all the school children of a certain age group in a given area. In Biostatistics, the
botanical population of some particular plant is often the subject of investigation. In
industrial sciences the population of all defective goods on a production line could be a
subject of interest, and so on.
In practice, the examination of a whole population is often either impossible or
impracticable. When this is so, we commonly examine a limited number of individual
cases which are part of the population. This means that we examine a sample of the
population. The various distribution constants of the sample can then be determined; and
on this basis the constants of its parent population can be estimated. Our knowledge of
the sample constants can be mathematically precise. On the other hand, we can never
know with certainty the constants of the parent population; we can only know what they
probably are. Whenever we make an estimate of the population characteristic from a
sample, we are faced with a question of precision of the estimate. Obviously, we aim at
making our estimates as precise as possible. We shall presently see that the precision of
an estimate can be stated in terms of the probability of the true value being within such-and-such a distance of the estimated parameter.
There are one or two useful terms peculiar to the study of sampling. It will be
convenient for the reader to become familiar with them at the outset. Various constants,
such as the mean, the standard deviation, etc., which characterize a population are called
the population parameters. Parameters are the true population measures. They cannot normally be known with certainty. The various measures, such as the mean and the standard deviation, which can be computed from samples, can be known with precision. The measures resulting from the sample computations are
called Sample Statistics. Thus sample statistics are estimates of population parameters.
The precision of these estimates constitutes the so-called reliability of the statistics, and
we shall see later that there are techniques which enable us to infer population parameters
with high degrees of accuracy.
The Process of Sampling
Before concerning ourselves with the reliability of the significance of statistics, it
is necessary to have clearly in our minds the essential facts about the process of sampling.
In general, the larger the sample size the greater the degree of accuracy in the prediction
of the related population parameters. As the sample becomes progressively larger, the
sheer mass of numbers reduces the variation in the various sample statistics, so that the
sample is more and more able to represent the population from which it was drawn.
However, it does not mean that the samples must always be large. Even small samples
can do. What matters is that the sample, to the best of our knowledge, is representative of
the population from which it was taken. To achieve this, certain conditions must be
satisfied in selecting the sample. If this is done, then it is possible to reduce the size of the
sample without sacrificing too much the degree of accuracy which is expected to be
attained from using the larger sample.
Our chief purpose of taking samples is to achieve a practical representation of the
members of the parent populations, so that we can conveniently observe the
characteristics of that population. For example there are situations in the testing of
materials for quality or manufacturing components to determine strength, when the items
under consideration are tested to destruction. Obviously, sampling is the only possible
procedure here. In order for the test to be economical, it is important to estimate how
many test pieces are to be selected from each batch, and how the selection of the pieces is
to be made.
Our definition of Populations implies that they are not always large. But very
often they contain thousands, or even millions of items. This is particularly true of
investigations concerning characteristics or attitudes of individuals. In cases of such large
populations, sampling is the only practical method of collecting data for any
investigation. Even in cases where it would be possible from a financial point of view,
the measure of a characteristic of the total population is really not necessary.
Appropriately selected small samples are capable of providing materials from which
accurate generalizations of the whole population can be successfully made. In the interest
of efficiency and economy, investigators in the various data fields of the social,
behavioral, and natural sciences invariably resort to sampling procedures and study their
subject populations by using sample statistics.
Selecting a sample
What is necessary to select a good sample is to ensure that it is truly representative of the larger population. The essential condition which must be satisfied is
that the individual items must be selected in a random manner. The validity of sampling
statistics rests firmly on the assumption that this randomizing has been done, and without
it, the conclusions which may be reached using unrepresentative samples may be
meaningless. To say that the items must be selected in a random manner means that
chance must be allowed to operate freely; and that every individual in the subject
population must have been given an equal chance of being selected. Under these
conditions and under no other, if a sufficiently large number of items is collected, then the
sample will be a miniature cross-representation of the population from which it is drawn.
It must be remembered that, at best, sample statistics give only estimates of
population parameters, from which conclusions must be made with varying degrees of
assurance. The more accurately the sample represents the true characteristics of the
population from which it is drawn, the higher the degree of assurance that is obtainable. It will be appreciated that, despite this limitation, without sample statistics it would be impossible to achieve any generalized conclusions which can be of either scientific or practical value.
Techniques of Sampling
In general there are two major techniques which are used in compiling samples.
One technique is called Simple Random sampling, and the other one Stratified Random
Sampling. Among these there are various modifications of the major genre which the
learner is advised to search and peruse in relevant texts of research methodology, some of
which are listed at the end of this chapter. Simple Random Sampling refers to the
situation indicated earlier, in which the choice of items is so controlled that each member
of the population has an equal chance of being selected. The word Random does not
imply carelessness in selection. Neither does it imply hit or miss selection. It indicates
that the procedure of selection is so contrived as to ensure the chance-nature of selection.
For example if in any population the names of individuals are arranged in alphabetical
order, and a one percent sample is required, the first name may be selected by
sticking a pin somewhere in the list and then taking every one hundredth name following
that.
Such a selection would lack any form of bias. Another method commonly used is
to make use of the Random Number Tables [King’oriah, (2004), pages 484 - 487]. All
individuals may be numbered in sequence, and the selection may be made by following
in any systematic way the random numbers - which themselves have been recorded in a
sequence using some form of lottery procedure by the famous Rand Corporation.
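In a computerised setting, the role of the random number tables can be played by a pseudo-random number generator. The short Python sketch below is a hedged illustration of drawing a simple random sample of 25 individuals from a numbered list of 500; the numbers and the fixed seed are invented purely for the example.

import random

population = list(range(1, 501))         # 500 individuals, numbered in sequence

random.seed(2004)                        # fixed seed only so the illustration is repeatable
sample = random.sample(population, 25)   # every individual has an equal chance of selection
print(sorted(sample))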
The process of obtaining a random sample is not always that simple. Consider for
example, the problem of obtaining views of housewives on a particular product in a
market research project in any large city. The city may first be divided into numbers of
districts, and then each district into numbered blocks of approximately equal size. A
selection can now be made of districts by drawing lots, and interviewers allocated to each
selected district could choose housewives from blocks selected in a similarly random manner within that district. In this way, a simple random sample of the market attitude towards the product may be obtained.
To secure a true random sample for a statistical study, great care must be
exercised in the selection process. The investigator must be constantly aware of the
possibility of bias creeping in. Another common technique which can be used to improve
the accuracy of sampling results and to prevent bias, and assure a more representative
sample is called Stratified Random Sampling. In essence, this means making use of
known characteristics of the parent population as a guide in the selection. A good
example of this can be found in opinion polling. Suppose an investigation is undertaken
to assess the public attitude to a proposed major reform in the private sector of the
education system. It is probable that the political parties may tend to have diverse views
on how to go about the education reform. It is also probable that people in different
economic and social groupings such as the professional societies, business interests, the
religious groups, skilled and non-skilled artisans, etc., would tend to react to the proposed
education reforms systematically as groups. There might even be differences of opinion
in general between men and women; and more probable still, between other divisions of
the population like urban and rural people; between also regional and educational
groupings.
Obviously, in any given case all possible variables are not necessarily important.
The whole population is studied to ascertain what proportions fall into each category, into
individual political blocks, men or women, town and country, and so on. Steps are then
taken to ensure that any sample would have proportional representations from all the
important sub-groups, the selection of items in each sub-group being carried out, of
course, in the manner of simple random sampling. Clearly, the stratification made of the
population will depend on the type and purpose of the investigation, but where used, it
will in appreciably improve the accuracy of the sampling results and help to avoid the
possibility of bias. Essentially it constitutes a good systematic control of experimental
conditions. A Stratified Random Sample is always likely to be more representative of a
total population than a purely random one.
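The idea of proportional representation in a stratified random sample can also be sketched in code. The strata, their shares and the sample size below are invented for illustration; within each stratum the selection is still simple random sampling.

import random

members = {"urban": list(range(1, 301)),        # 300 urban individuals (30 % of the population)
           "rural": list(range(301, 1001))}     # 700 rural individuals (70 % of the population)
shares = {"urban": 0.30, "rural": 0.70}
total_sample = 50

random.seed(7)
stratified_sample = {}
for stratum, share in shares.items():
    k = round(total_sample * share)                       # proportional allocation: 15 urban, 35 rural
    stratified_sample[stratum] = random.sample(members[stratum], k)

for stratum, chosen in stratified_sample.items():
    print(stratum, len(chosen))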
Paradoxically purposeful sampling may be used to produce a sample which
represents the population adequately in some one respect or another. For example, if the
sample is to be (of necessity) very small, and there is a good scatter among the parent
population, a purely random sample may by chance yield a mean measure of the variable
which is clearly vastly different from the population mean. Provided that we are
concerned with the mean only, we may in fact get nearer the truth if we select for our
small sample only those individuals who seem on inspection to be close to what appears
to be the population average. Where the required random characteristic is lacking in
making the selection, a biased sample results; and such a sample must contain a
systematic error. If certain items or individuals have greater chance of being selected the
sample is not a true representation of the parent population.
Assuming that the procedures are scientifically satisfactory, we wish to see how
and when the conclusions based on observational and experimental data are statistically,
from a mathematical standpoint, warrantable or otherwise. Having emphasized that
unbiased sampling is a prerequisite of an adequate statistical treatment, we must begin to
discuss the treatment itself.
Hypothesis testing
Introduction
One of the greatest corner-stones of the Scientific Method of Investigation is the
clear statement of the problem through the formulation of hypotheses. This is usually
followed by a clear statement of the nature of the research which will be involved in
hypothesis testing and a clear decision rule. The process makes all the other researchers
in the field of academic enquiry to know that the hypothesis has been proven and the
envisaged problem has been solved. It is considered unethical in the research circles to
state one’s hypotheses, research methodology and decision rule after one has already
seen the how the data looks like, or the possible trend of events.
Hypotheses and the associated research methodology are therefore formed before
any sampling or any data manipulation is done. Thereafter, field data are used to test the
validity of each such hypothesis. Therefore a hypothesis is a theoretical proposition
which has some remote possibility of being tested statistically or indirectly. It is some
statement of some future event which could be either unknown or known vaguely at the
time of prediction; but which is set in such a manner that it can either be accepted or
rejected after appropriate testing. Such testing can either be done statistically, or using
other tools of data analysis and organization of facts. In this chapter we are interested in
situations where quantitative or qualitative data has been gathered from field observations
and then statistical methods are used for the hypothesis testing of the same data.
The Null and Alternative Hypotheses
Hypotheses meant to be tested statistically are usually formulated in a negative
manner. It is expected that, in stating such hypotheses, one should give every allowance
to the possibility that the desired event will not happen, so that, should the desired event
take place despite such a conservative approach, one can be in a position to confirm that
the event did not occur as a matter of chance. It is like a legal process in which a person
has been accused of a criminal offence in a court of law: the accused is presumed innocent
until proved guilty beyond all reasonable doubt. The same strict ethical behavior and code
of ethics is applied to scientific research.
The negative statement of the suspected truth which is going to be investigated
through data collection and data manipulation is called a Null Hypothesis. For example if
one suspects that there is a difference between two cultivars of millet, the research
hypothesis statement must take the following two possibilities :-
(a) Null Hypothesis (Ho): There is no difference between the two groups of millet
cultivars; if any difference exists, it is due to mere chance.
(b) Alternative Hypothesis (HA): There is a marked and statistically significant
difference between the two groups of millet cultivars.
After this, statistical tools are used to test the validity of the data, and to see which
side of the two statements is supported by field investigations. Accepting the null
hypothesis means rejecting the alternative hypothesis: and vice-versa.
Steps in Hypothesis testing
1. Formulate the null hypothesis after familiarization with the actual facts, and after
realizing that there is a suspected research problem.
2. Formulate the alternative hypothesis, which is always the complement and the
direct opposite of the null hypothesis.
3. Formulate the decision rule such that, if the facts conform to this rule, the null
hypothesis will be accepted. Otherwise the null hypothesis will be rejected, and its
complement, the alternative hypothesis, will be accepted. This decision rule must
contain clear criteria of success or failure regarding either of the two hypotheses.
These three steps are performed before the researcher goes to collect the data in
the field, or before manipulating any data if the source of data is from documents. Other
researchers must know clearly what the present researcher is up to, and that the current
researcher is not going to “cook” success of his experiment in the field or from
documents.
4. Collect and manipulate data in accordance with the chosen statistical or
probability model, e.g., the Normal Curve, the t-distribution or any other. This is
where the method of measurement is applied rigorously to test whether the data
support or reject the null hypothesis - thus accepting or rejecting the alternative
hypothesis.
5. Examine the results of data manipulation, and see whether the decision rule has
been "obeyed". If so, accept the null hypothesis; and if not, reject the same, and
accept the alternative hypothesis.
Let us now do a small hypothetical example, so that we can see how Hypothesis Testing
is done.
Example
Investigate whether there is any difference in the weights of male and female
goats using the special type of breed found in Buuri area of Meru District.
Solution
The process of hypothesis testing would proceed in the following manner:-
Step One:
Null Hypothesis (Ho): There is no difference in the weight of male and
female goats which are found in Buuri area of North Imenti District.
Step Two:
Alternative Hypothesis (HA): There is a marked and a statistically
significant difference between the weights of male and female goats of the type
found in Buuri area of North Imenti District.
Step Three:
Decision Rule: The null hypothesis will be tested at what the statisticians
call a confidence level. In symbolic terms this level is designated with a bold
letter C. If you are writing in longhand, like many of us will be doing, it pays to
cross your capital letter C, so that the reader of your manuscript
will know you are talking about the bold capital C. In our case, let us adopt a 95%
confidence level. C = 0.95. This confidence level means that when the Normal
distribution model is used for testing these hypotheses, one would consider
similarity if both weights of female and male goats will fall within the area of the
normal curve subtended by 1.96 standard deviations on both sides of the mean.
Figure 4 - 1: The probability of Similarities and differences
on both sides of the mean
Each side of this curve which is within 1.96 standard deviations from the mean
(μ) would be a proportion of 0.4750 out of 1.0000 (the total available
proportion). Both sides of (μ), namely −1.96σ to the left and +1.96σ to
the right, would together include:
0.4750 × 2 = 0.9500 out of the total area under the normal curve, which is 1.0000.
Also, remember that with respect to the Normal Distribution, the word
“proportion” is synonymous and identical to that other very important word that
we have now become used to: “probability”. Now examine Figure 4 - 1 carefully
for these details.
The diagram illustrates the fact that we are testing the characteristics of our
population at a 0.9500 probability level of confidence. Meaning: 95% confidence
level. The complement of this level of confidence is what the statisticians call
Significance Level and is denoted by the Greek letter α. This is obviously:
1.0000 − 0.9500 = 0.0500 = 5% significance level, or α.
Therefore, if we are testing the results of the characteristics of our population (or
any sample) at 95% confidence level, we are at the same time testing the results of
the population (or sample) at 5% significance level. In research methodology and
statistical analysis these two terms are used interchangeably. Figure 4 - 1 is an
illustration of these levels of statistical significance.
Step Four : The representative measure : Take either the male or the female
goats as a representative measure. Weigh all of the goats of your sex of
choice, which are available to you, whether as a sample or a population (in
our case where we are using the Standard Normal Curve, we are assuming
a population of all goats in Buuri). We can then assume that the mean
weight of the female goats shall represent the population mean (μ), which
is a parameter against which we shall test the mean weight statistic of all
the male goats. If within our desired confidence level (or significance
level) there is no statistically significant difference between the weight of
the male goats from that of the female goats, we shall accept the Null
Hypothesis (Ho): that there is no difference in the weight of male and
female goats which are found in Buuri area of North Imenti District.
Step Five :
The Rejection Regions : At the end-tails of Figure 4 - 1, you will notice that
there is a region of dissimilarity on either side, which is valued at α/2. This
means that whenever an alpha level ( or significance level) is stated in a problem
of this kind, which involves a test of similarity, the alpha or significance level is
divided into two. It has to be distributed onto the upper tail and the lower tail,
where the populations of samples which do not belong to the population of
interest are expected to be located. Any observation beyond 1.96 standard
deviations from the universal (population) mean, on both sides of the mean, is either
too great (too heavy in our example) or too small (too light in our example). Any
observation falling in these regions is not a member of
our population. Our population is clustered about the mean on both sides. This
means that when we cut off the five percent which does not belong to the
population comprising female goats, we are actually distributing the five percent
of dissimilarities, as well as similarities to both sides of the population mean.
Therefore, at the 5% significance level we have to divide the 5% into two, distributing
2.5% to both the upper-end tail and the lower-end tail regions of dissimilarity.
Therefore, on each tail-end we expect to have 2 ½% of the population which
does not belong to the main body of similarity. Any mean weight or observation
of the male goats which falls within these rejection regions does not belong to the
main body of the female goats, in terms of their body-weight measurements.
Specifically, when we compute the mean weight of the male goats, and if the
mean value in terms of the standard deviations happens to fall within the rejection
region on either sides of the normal curve, we must conclude that the weights of
the male goats are not the same as the weight of the female goats.
The same logic of analysis applies whether we are using the Standard Error of the
Sampling distribution of the proportion or the sampling distribution of the sample means.
Using other statistics, which we shall learn, we will find that any observation which falls
within the rejection regions (defined by alpha levels of whatever kind) will be rejected as
not belonging to the main body of the population whose observations lie clustered close to
the parameters of interest. In this case the parameter of the population is the mean (μ).
Now let us attempt a more realistic example using the same kind of logic, so that we can
see how hypothesis testing is done using statistical analysis and confidence intervals.
Example
In considering very many scores for Statistics Examinations within a certain
university, the mean score for a long time is found to be 60% (when
considered as a raw observation or a count; and not a proportion). This year, a
class of 25 students sat for a similar examination. Their average was 52%, with a
standard deviation of 12%. Are the results of this year's examination typical at the
5% alpha level?
Solution
Step One:
Null Hypothesis (Ho): μ = 60%. There is no difference between the scores of this
year and those of all the other years.
Alternative Hypothesis (HA): μ ≠ 60%. There is a statistically significant
difference between the scores of this year's class and those of all other years' classes.
Step Two:
Decision Level: We test this null hypothesis at 95% confidence level, or at
Five percent alpha level (significance level). The sample size n = 25.
This is a small sample, less than 30 students. The statistical model which
we shall use is the t-statistic. Accordingly we must adjust the sample by
one degree of freedom. This means that n - 1 = 24, after adjusting for
one degree of freedom.
Now is the proper time for us to learn how to use the t-statistical table at page 498
of your textbook (King'oriah 2004). Turn to that page, and look for the top two rows.
The first row has figures ranging from .10 to .0005. The other row has figures ranging
from .20 to .001. The topmost row is labeled "Level of significance for one-tailed test",
and the lower row is labeled "Level of significance for two-tailed test".
Step Three: The critical value of t: We now consider our sample, and find that it is safe
to use a one-tailed test, because the observation lies below the universal mean (μ). On
the t-table at page 498, we use the column labeled ".05" along the top row, which you
can see is called "Level of significance for one-tailed test". The left-most column
records the degrees of freedom. We manipulated our sample size n = 25, and adjusted for
the loss of one degree of freedom, to get n − 1 = 24. Now we must look for the
intersection of 24 degrees of freedom (down the left-most column of the table) and the
list of numbers located down the column labeled ".05" within the body of this table. At
this intersection we can see the expected value of t (designated "t_α" and pronounced
"tee alpha"), valued at t_α = 1.711, which we now extract from the tables. This means
that the rejection area begins 1.711 Standard Errors of Sample Means below the mean
(μ = 60%). See Figure 4 - 2. Because the sample mean lies below μ, the rejection
region (the "−α area") must lie on the left hand side of the mean.
Step Four: The Standard Error: In this case also, we have not yet computed the
standard error of the mean. This is correct, because this kind of standard error is
usually computed after stating the decision rule very clearly, and not before. We
have been given the raw standard deviation of 12%, and have assumed that this
figure has been computed from the raw scores using the formula for computing
the standard deviation:
S = √[ Σ_{i=1}^{n} (X_i − X̄)² / (n − 1) ] = 12%.
This standard deviation is given in this example as 12% for the purposes of saving
time, and because we do not have the raw scores of all the classes which preceded
this year's class.
Using this figure of 12%, we now compute the standard error of the mean ( S_X̄ ) so
that we can make use of the t-distribution table (on page 498 of your textbook):
S_X̄ = S / √n = 12 / √25 = 12 / 5 = 2.40
Step Five: The computed (observed) value of t from the given facts: Now we
compute the actual t-statistic for this year's sample using the formula:
t = ( X̄ − μ ) / S_X̄
Figure 4 - 2: The distance of X̄ = 52% from μ = 60% in terms of Standard Errors
We repeat here that the percentages used in this formula to compute the actual or
the observed "t" are used as raw scores, and not as a proportion, in order to
avoid any confusion with the computation of the standard error of the proportion
which we considered before in Chapter Three of this module. Now, let us proceed
with the computation of the actual value of "t":
t = ( X̄ − μ ) / S_X̄ = ( 52% − 60% ) / 2.40 = −8 / 2.40 = −3.333
In terms of actual standard errors, this is how far the current sample mean X̄ = 52%
is away from the universal (population) mean (μ = 60%). It is −3.333 Standard
Errors away. We demonstrate this fact by means of Figure 4 - 2.
Accordingly, we conclude that this year’s class mean of 52% is far below
the usual average score of all the previous classes. Therefore we reject the null
Hypothesis that there is no difference between the scores of this year and those of
all the other years. We accept the alternative hypothesis that there is a statistically
significant difference between the scores of this year's class and those of all other
years' classes. In that regard, the null hypothesis has been rejected at the 5% alpha
level.
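For readers who want to check this arithmetic on a computer, the following is a minimal sketch, assuming Python with the scipy library is available (an assumption; it is not part of the prescribed materials). Only the summary figures quoted above (X̄ = 52%, S = 12%, n = 25, μ = 60%, α = 0.05) are used.

from math import sqrt
from scipy import stats

mu0, xbar, s, n, alpha = 60.0, 52.0, 12.0, 25, 0.05

se = s / sqrt(n)                       # standard error of the mean = 2.40
t_obs = (xbar - mu0) / se              # observed t = -3.333
t_crit = stats.t.ppf(alpha, df=n - 1)  # lower-tail critical value, about -1.711

print(round(t_obs, 3), round(t_crit, 3))
# Reject H0 if the observed t falls in the lower rejection region.
print("reject H0" if t_obs < t_crit else "accept H0")

Running this reproduces the decision reached above: the observed t of −3.333 lies well beyond the critical value, so the null hypothesis is rejected.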
Confidence Intervals
It is now time to consider learning a closely related concept in statistical
investigation - the concept of Confidence Interval . In setting up this interval on both
sides of the mean (μ), the onus of deciding the risk to be taken lies solely with the
statistician. There is no hard and fast rule about this. Sometimes one may use more than
one confidence level (say 95% and 99%) in one experiment to test the sensitivity of
the model. The risk of making an error after deciding on the appropriate confidence level
also is borne solely by the statistician. The probability of making an error after setting up
this confidence interval is called the alpha level - the one we have discussed above. The
confidence interval “ CI ” is the probability that the observed statistic belongs to the
population from which the parameter has been observed. It takes the form of the actual
number of standard deviations (or standard errors) which delineate the typical data as
end-limits, and the rejection region for all that data which does not belong to the subject
population.
The technique is not different from what we have just discussed in the preceding
section. The only difference is that of the approach. In this case we are building a
probability model for investigating the chances of finding a statistic within the level of
similarity which is typical to some known parameter. The expression for the confidence
interval looks like this:
CI = P( X̄ ± Z·σ_X̄ )
This statement says that the Confidence Interval (CI) is the probability "P" that the
population mean (μ) will be found within a specified number of standard errors on both
sides of the sample mean, where σ_X̄ is the actual standard error computed from the raw
scores and Z is the number of standard errors used for the purposes of standardization.
This same statement can be arranged in another, simpler fashion:
CI = P( X̄ − Z·σ_X̄ ≤ μ ≤ X̄ + Z·σ_X̄ )
The meaning of this expression is: "Within the probability P, on both sides of the
sample mean X̄, within a distance defined by a specific number of actual standard errors
Z·σ_X̄, we expect to find the mean of the population (μ)." Note that the population mean
(μ) lies in the center of this expression. This means that our expression allows our
population mean (μ) to slide randomly between the end limits of the interval set athwart
the sample mean X̄, as long as (μ) does not go outside this interval we have set using
our specified significance level. We include the concept of the significance level in the
model by specifying the alpha level, and stating it within the confidence interval
notation as follows:
CI = P( X̄ − Z_α·σ_X̄ ≤ μ ≤ X̄ + Z_α·σ_X̄ )
In this case "Z_α" is the number of standard errors corresponding to the alpha level set
by the statistician, read from the tables. The value "σ_X̄" denotes the actual standard
error which has been computed using the given data. Remember that we have hitherto
stated that, using the raw data, the standard error of the mean is computed using the
formula σ_X̄ = σ / √n. When we multiply this actual value by the expected number of
standard errors obtained from our Standard Normal Table (Z_α), we obtain the actual
standardized figure (in numbers or observational scores) which is so many standard
errors (Z_α · σ_X̄) away from the statistic X̄. Let us now do a simple example of
interval building to facilitate our understanding of this important concept.
Example
From your field experience, you have found that the mean weight of 9-year old
children in a certain district situated in central Kenya is 45 Kilograms. You came
to this conclusion after weighing 100 children, which you randomly sampled from
all over the district. The standard deviation of all your weights from this large
sample is 15 Kilograms. Where would you expect the true mean of this population
mean to lie at 95% confidence level?
Solution
1. In this example the level of confidence C = 0.95 or 95%. Under the Standard
Normal Table, the area subtended on each side of the mean is therefore 0.4750.
This is the value we have hitherto called A_Z.
2. Therefore A_Z = 0.4750. Now let us compute the actual standard error (in
Kilograms) of the sample means obtained from our field data:
σ_X̄ = σ / √n = 15 kg / √100 = 1.5 kg.
3. Next, we build the confidence interval by inserting our field data in the formula:
CI = P( X̄ − Z_α·σ_X̄ ≤ μ ≤ X̄ + Z_α·σ_X̄ )
CI = P( 45 kg − 1.96 × 1.5 kg ≤ μ ≤ 45 kg + 1.96 × 1.5 kg )
CI = P( 45 kg − 2.94 ≤ μ ≤ 45 kg + 2.94 )
CI = P( 42.06 ≤ μ ≤ 47.94 )
4. The true mean weight of this population (μ) is likely to be found between 42.06
Kilograms and 47.94 Kilograms. Any observation outside the interval with these
end-limits does not belong to the population of these children you found and
weighed during your field research.
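The end-limits of this interval can be verified with a short script. This is a minimal sketch, assuming Python with the scipy library is available; it simply repeats the arithmetic of the example.

from math import sqrt
from scipy import stats

xbar, sigma, n, conf = 45.0, 15.0, 100, 0.95

se = sigma / sqrt(n)                   # standard error of the mean = 1.5 kg
z = stats.norm.ppf(0.5 + conf / 2)     # about 1.96 standard errors for 95% confidence
low, high = xbar - z * se, xbar + z * se
print(round(low, 2), round(high, 2))   # approximately 42.06 and 47.94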
Type I and Type II errors
The interval set on both sides of the sample mean may lie so far out that it may
not include the population mean. This happens if the sample used in deriving the mean is
so unusual (a-typical) that its mean lies in the critical (rejection) regions of the sampling
distribution of sample means. All precautions must be taken to set up a good research
design and to use accurate sampling techniques to avoid this eventuality. This is done by
setting up an appropriate interval in all our experiments. If the controlling interval around
the subject sample mean is too narrow there is a chance of rejecting a hypothesis through
the choice of this kind of interval when in fact the hypothesis is true. This is called
making the Type I error. On the other hand a very wide interval leads the researcher to
the risk of making a Type II error. This is when the interval is so wide that he ends up
accepting into the fold of the population defined by his standard normal curve model any
members who may actually not belong to the subject population.
The researcher is usually in a dilemma. If he chooses a narrow interval he
increases the risk of excluding the parameter from the interval, although the
parameter could actually be included within a properly set interval. This amounts to
rejecting the null hypothesis even if it is true, and therefore committing the Type I
error. On the other hand, if he chooses a wide interval, he increases the risk of
including a parameter from outside, although the parameter could actually be excluded
by a properly set interval. This amounts to accepting the null hypothesis even if it is
false, and therefore committing the Type II error. We can therefore define our two types
of error in the following manner:-
Rejecting a correct hypothesis through the choice of a narrow confidence interval or
setting up large alpha (rejection) regions amounts to making a Type I error.
Accepting a false hypothesis through the choice of a wide confidence interval or setting
up very small alpha (rejection) regions amounts to making a Type II error.
We must therefore balance the setting of our confidence intervals carefully
to guard against either of these two errors.
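One way to appreciate the Type I error is through a small simulation: when the null hypothesis is actually true, a test at α = 0.05 should wrongly reject it in roughly 5% of repeated samples. The sketch below assumes Python with the numpy library; the population figures used are arbitrary illustrations, not data from this module.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, z_crit, trials = 60.0, 12.0, 25, 1.96, 10_000

rejections = 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, n)               # the null hypothesis is true here
    z = (sample.mean() - mu) / (sigma / np.sqrt(n))
    if abs(z) > z_crit:                             # two-tailed test at alpha = 0.05
        rejections += 1

print(rejections / trials)   # close to 0.05: the Type I error rate equals alpha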
Confidence Interval using the proportion
Like the population mean, the population proportion may be estimated using confidence
intervals. These may be built either around the parameter proportion "π", or around the
statistic, the sample proportion "Ps". The logic involved in this estimation is identical
to what we have discussed in the previous section - after all, the proportion is some
kind of a mean. The expression which is used in this estimation is analogous to the one
we have just used; only the designation of the parameters and the statistics in it is
different, to reflect the fact that we are now dealing with the binomial variable, the
proportion. This expression is as follows:
CI = P( Ps − Z_α·σ_Ps ≤ π ≤ Ps + Z_α·σ_Ps )
Study this equation well and you will find that:-
Ps = The sample proportion.
σ_Ps = The sample standard error of the proportion.
Z_α = The number of standard errors from the proportion parameter π which has been
determined using the confidence level set by ourselves in order to build the model for
estimating our population proportion parameter.
π = The universal or population proportion whose position within the Normal Curve
Model we are estimating.
The equation reads: "The Confidence Interval CI is the probability P that the parameter
population proportion π will lie within the area subtended by Z_α·σ_Ps standard errors
(of raw scores) of the proportion on both sides of the sample proportion Ps, which we
are using as our comparison standard."
We now use a simple example to estimate the population parameter π.
Example
Nauranga Sokhi Singh is a sawmill operator within the Mount Kenya Forest. He
wishes to calculate at 99% confidence level, the true proportion of undersize
timber passing out of his sawmill’s mechanical plane. He obtains a random
sample of 500 strip boards, measures them and finds that 0.25 of these have been
cut undersize by this machine.
Mr. Singh is alarmed, and is interested in knowing whether his sawmill will
continue working at this shocking degree of error. Provide some statistical advice
to your client, Mr. Singh. (Source: Adapted from King'oriah, 2004, pages 169 to
170.)
Solution
The method of approaching this problem is the same as the one we have learned
above.
Step One: C = 0.99 (or 99%) is given, especially because accuracy in obtaining timber
strips from this machine is crucial.
Step Two: Using the sample proportion Ps = 0.25 we can compute the standard error of
the sample proportion:
σ_Ps = √[ Ps (1 − Ps) / n ] = √[ 0.25 (1 − 0.25) / 500 ] = 0.0194
Step Three: Obtain the area which is prescribed by the confidence level we have set.
C = 0.99 (or 99%) is given, and therefore α = 0.01. Consequently, we know from the
Standard Normal Tables that 0.9900 of the area (that is, 0.4950 × 2) is subtended by
2.57 standard errors on both sides of the population parameter π.
Step Four: Build the confidence interval:
CI = P( Ps − Z_α·σ_Ps ≤ π ≤ Ps + Z_α·σ_Ps )
CI = P( 0.25 − 2.57 × 0.0194 ≤ π ≤ 0.25 + 2.57 × 0.0194 )
CI = P( 0.25 − 0.050 ≤ π ≤ 0.25 + 0.050 )
CI = P( 0.20 ≤ π ≤ 0.30 )
You tell Mr. Singh that the machine will continue to churn out between 0.20 and
0.30 undersize strip boards out of all the timber he will be planing, because your
computation tells you that the population parameter "π" of all the timber strips planed
by this machine lies (and could slide) between 20% and 30% of all the timber which it
will be used to process. Advise Mr. Singh to have his machine either overhauled or
replaced.
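Mr. Singh's interval can also be checked with a few lines of code. This is a minimal sketch assuming Python with the scipy library; note that the tables round Z_α to 2.57 while the software gives 2.576, so the end-limits agree to about two decimal places.

from math import sqrt
from scipy import stats

p_s, n, conf = 0.25, 500, 0.99

se = sqrt(p_s * (1 - p_s) / n)          # standard error of the proportion, about 0.0194
z = stats.norm.ppf(0.5 + conf / 2)      # about 2.576 standard errors for 99% confidence
low, high = p_s - z * se, p_s + z * se
print(round(low, 3), round(high, 3))    # approximately 0.200 and 0.300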
EXERCISES
1. Explain the meaning of the following terms:-
(a) The null hypothesis and the alternative hypothesis.
(b) The standard error of the mean.
(c) The standard error of the proportion.
(d) The normal deviate, Z.
2. (a) Explain with reasons why the expected value of a sample mean from any
population is the parameter mean of that population.
(b) State the Central Limit Theorem.
(c) Why do you think the Central Limit Theorem works for any population with any
kind of distribution and population mean?
(d) How does the standard error of the mean assist in model building for all populations?
3. (a) Explain the difference between hypothesis testing and statistical estimation.
(b) Distinguish between the null and alternative hypotheses.
(c) What is meant by a one-sided (single-tail) test, and a two-sided (double-tail) test?
(d) Explain what is meant by a decision rule in statistical analysis.
4. (a) State the Central Limit Theorem.
(b) Why do you think the Central Limit Theorem works for any population with any
kind of distribution and population mean?
(c) Why is the expected value of a sample mean from any population approximately
equal to the population mean of that population?
5. A random variable has a normal distribution with X̄ = 102.4 and a standard
deviation σ = 3.6. What is the probability that this random variable will take on the
following values:-
(a) 107.8
(b) Greater than 99.7
(c) Between 106.9 and 110.5
(d) Between 96.1 and 104.2
6. A company fitting exhaust pipes to custom-made cars announces that you will
receive a discount if it takes longer than 30 minutes to replace the silencer of your
car. Experience has shown that the time taken to replace a silencer is approximately
normally distributed with a mean of 25 minutes and a standard deviation of 2.5 minutes.
(a) Explain with reasons the kind of probability model you will use to calculate the
distribution of the time it takes to replace silencers in custom-made cars.
(b) What proportion of customers receive a discount?
(c) What is the proportion of the silencers which take between 22 and 26 minutes to
replace?
CHAPTER FIVE
THE CHI-SQUARE STATISTIC
Introduction
In some research situations one may be confronted with qualitative variables
where the Normal Distribution is meaningless because of the small sizes of samples
involved, or because the quantities being measured cannot be quantified in exact terms.
Qualities such as marital status, the color of the cultivar skins, sex of the animals being
observed, etc., are the relevant variables; instead of exact numbers which we have used to
measure variables all our lives. Chi-square is a distribution-free statistic which works at
lower levels of measurement like the nominal and ordinal levels.
The Chi Square Distribution
The chi-square Distribution is used extensively in testing the independence of any
two distributions. The main clue in understanding this distribution lies in
understanding its generalized uses, and then applying it to specific experimental
situations. The question of interest is whether or not the observed proportions of a
qualitative attribute in a specific experimental situation are identical to those of an
expected distribution. For this comparison to be done, the subject observed frequencies
are compared to those of similar situations - which are called the expected frequencies.
The differences are then noted and manipulated statistically in order to test whether such
differences are significant. The expression for the Chi-Square statistic is:
χ² = Σ_{c=1}^{k} (O_i − E_i)² / E_i
Where:-
1. The letter χ is the Greek letter Chi. The expression "χ²" is "Chi-Squared",
which is the parameter used in statistical testing and analysis of qualitative samples.
2. O_i = The observed frequency of the characteristic of interest.
3. E_i = The expected frequency of the characteristic of interest.
4. k = The number of paired groups in each class comprising the observed
frequencies O_i and the expected frequencies E_i.
5. c = An individual observation of the paired groups.
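The instructions of this formula can be written out directly as a short function. This is a minimal sketch in plain Python; the observed and expected frequencies shown in the call are hypothetical placeholders, used only to demonstrate the computation.

def chi_square(observed, expected):
    """Compute chi-square as the sum over classes of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequencies, just to show the call:
print(chi_square([18, 22], [20, 20]))   # (4/20) + (4/20) = 0.4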
Except for the fact that this distribution deals with lower levels of measurement, it is
some form of variance. To prove this, note that the algorithm for the computation of
variances has the same configuration as the formula we have just considered. Compare
both formulas as outlined below:
χ² = Σ_{c=1}^{k} (O_i − E_i)² / E_i    and    σ² = Σ_{i=1}^{n} (X_i − μ)² / n
Despite this fact, the assumptions we make for the chi-square statistic are not as rigorous
as those of the variance. For example, we cannot assume that the data are obtained from a
normal distribution. This is why we have to use a special type of mathematical distribution
for the analysis of that data.
The Chi-Square distribution is such that if the observed values differ very little
from the expected values, then the chi-square value is very small. On the other hand, when
the differences are large, then the chi-square values are also very large. This means that the
mode of this distribution tends to be located among the smallest values. This mode is
not like that of the normal curve. The Chi-Square curve tends to be highest between two
and 15 observations. For small numbers of observations the curve is positively skewed,
but for large numbers of observations, in excess of twenty, the chi-square can be assumed
to approach the shape of the Normal Curve. Beyond that point we can assume the normality
of the data, and the Chi-Square hardly differs from the Normal Distribution. Accordingly,
we may use the Standard Normal curve for model building, data manipulation and analysis.
For our purposes, the rejection (alpha) area lies to the right of the distribution
because of its characteristic skew. However, theoretically it is possible to have a rejection
area situated to the left of the steepest area of the distribution. However, this is not of any
value to our current discussion.
Figure 5 - 1: Various values for the Chi-Square Distribution with Increasing Degrees of
Freedom. (Source: George K. King'oriah, Fundamentals of Applied Statistics. 2004,
Jomo Kenyatta Foundation, Nairobi.)
Figure 5 - 1 outlines the characteristics of the Chi-Square distribution at various but
increasing degrees of freedom. The distribution tables for the Chi-Square statistic, which
are used in a manner analogous to the Standard Normal Tables for building confidence
intervals and hypothesis testing, are available on page 499 of your textbook (King'oriah,
2004). The tail areas are given at the head of each column, and the entries within the body
of the table are the corresponding Chi-Square values at specific levels of confidence and
degrees of freedom.
For our purposes, the decision rule for the Chi-Square test has an upper rejection
region. The location of the observed or calculated Chi-Square is compared with the
location of the expected Chi-Square at the degrees of freedom which are determined by
our sample size. If the calculated Chi-Square is bigger than the one found in the tables at
the prescribed degrees of freedom, the Null Hypothesis for similarity is rejected and the
Alternative Hypothesis for differences of the variables is accepted at the prescribed level
of confidence. A calculated Chi-Square which happens to be less than the appropriate
critical value falls within the main body of the distribution, and the Null Hypothesis for
similarity of the variables being compared is accepted, while the alternative hypothesis
for dissimilarity is rejected. The commonest use of this statistic is with a set of two
variables at a time, although mathematically the distribution has many more uses than
this. For a change, we can begin by using the distribution with one variable, where we
lose one degree of freedom because we are operating in one dimension. A simple
experiment using a six-sided gambling die (plural: dice), which is suspected of being
loaded, is used. Let us proceed to use this example and see what happens with respect to
the confidence interval using the Chi-Square Distribution.
Example
A fair die has an equal probability of showing any one of the six faces on top
when it is tossed. Mr. Laibuni the gambler tosses the die 120 times as he records
the number of dots on the top face each time. Each of the faces is expected to
show up 20 times during the 120 times that the die is tossed. However, Mr.
Laibuni finds the results listed in Table 5 - 1.
Does Mr. Laibuni have any reason to believe that the die is fair? Advise Mr.
Laibuni at 95% Confidence Level. (Source: George K. King'oriah,
Fundamentals of Applied Statistics. 2004, Jomo Kenyatta Foundation, Nairobi.)
Solution
1. Frame the null and the alternative hypotheses:
Ho: The die is not loaded to produce biased results.
Ho: t1 = t2 = .... = t6. The expected numbers of face turn-ups (t_i) are equal for all
faces of the die.
HA: The die is loaded to produce biased results.
HA: t1 ≠ t2 ≠ .... ≠ t6. The expected numbers of face turn-ups (t_i) are not equal for
all faces of the die.
TABLE 5 - 1: THE NUMBER OF DOTS FACING UP IN 120 TOSSES OF A GAMBLING DIE

Number of dots     Number of times each face turns up on top (out of 120 tosses)
1                  12 times
2                  14 times
3                  31 times
4                  29 times
5                  20 times
6                  14 times

2. Formulate the decision rule. We are given the 95% confidence level. Therefore,
the alpha level is 0.05.
3. To use a Chi-Square statistic table requires that you have the appropriate degrees of
freedom. There are six possibilities in this experiment. The nature of the experiment is
such that we are counting within one dimension, from the first face to the sixth face.
Therefore n = 6, and when we lose one degree of freedom we find that we are left with
n − 1 = 5 degrees of freedom with which to use the Chi-Square Distribution table on
page 499 of your textbook.
4. Using the Chi-Square table on page 499 we must obtain the critical Chi-Square
value at the 5% (or 0.05) significance level. This level is available along the topmost
row, in the fourth column from the right. We must find our critical-level Chi-Square
where the values found within this column coincide with 5 degrees of freedom, which
are available down the left-most column of this table. The rejection region in this case
must therefore begin at the Chi-Square value χ² = 11.070. This is illustrated in
Figure 5 - 2.
Our decision rule is to reject the null hypothesis that the die is not loaded if the
observed or calculated Chi-Square exceeds the Chi-Square for the critical level, which
is χ² = 11.070. We now set up a table for computing our statistic.
TABLE 5 - 2: STEPS IN THE COMPUTATION OF THE CHI-SQUARE USING A GAMBLER'S DIE

Face of Die   Observed Frequency O_i   Expected Frequency E_i   (O_i − E_i)   (O_i − E_i)²   (O_i − E_i)² / E_i
1             12                       20                       −8            64             3.20
2             14                       20                       −6            36             1.80
3             31                       20                       11            121            6.05
4             29                       20                       9             81             4.05
5             20                       20                       0             0              0.00
6             14                       20                       −6            36             1.80
TOTALS        120                      120                      0                            16.90
The calculated Chi-Square turns out to be 16.90, which is much larger than the
value defining the critical point of the Chi-Square which separates the rejection region
from the acceptance region. We reject the null hypothesis and accept the alternative
hypothesis that the die is not fair; it is loaded. If our computation had yielded a
calculated Chi-Square below 11.070, we would have accepted the null hypothesis that
the die is not loaded.
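Mr. Laibuni's computation can be reproduced with a few lines of code. This is a minimal sketch, assuming Python with the scipy library; it uses the observed frequencies of Table 5 - 1 and the expected count of 20 per face.

from scipy import stats

observed = [12, 14, 31, 29, 20, 14]          # Table 5 - 1
expected = [20] * 6                          # a fair die: 120 tosses / 6 faces

chi2_calc, p_value = stats.chisquare(observed, expected)
critical = stats.chi2.ppf(0.95, df=5)        # about 11.070

print(round(chi2_calc, 2), round(critical, 3), round(p_value, 4))
# chi2_calc is about 16.90 > 11.070, so the hypothesis of a fair die is rejected.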
This example is one of the simplest uses of the Chi-Square statistic: to test the
Goodness of Fit of one distribution onto another. The same logic can be applied to
compare the observed values to the expected values in any situation. A slightly modified
method is required which enables the computation of qualities of any distribution. In any
case, whatever device we use, we end up stretching out any two variables being compared
so that the difference between the actual values and the expected values can be discerned.
Once this has been done, the differences, the squared differences and ultimately the
Chi-Square are obtained as we have done above.
Figure 5 - 2: The Chi-Square Distribution at Five Degrees of Freedom
and 95% Confidence level, showing the calculated χ²c = 16.90
Contingency tables
Contingency tables are used as a convenient means of displaying the interaction
of two or more variables on one another. Using these tables, the quantities of each
variable can be clearly seen, and the total effect of variable interaction is clearly
discernible. In addition the computation of the nature of this interaction and other kinds
of mathematical manipulations are possible. This kind of table is useful in Chi-Square
analysis because, occasionally, the null hypothesis being tested concerns the interaction of
variables between two factors. In most cases we are testing the independence of one
variable factor from another one - that is, that one quality is not affected by the presence of
another quality, although both qualities are found under the same interactional
environment. We now try to do another example where we are testing whether one factor
is affecting another one under experimental circumstances at a lower level of
measurement (nominal level).
Example
It is generally accepted that, due to underdevelopment in some rural areas of
Kenya, there are primary schools which do not have coursework textbooks, and
they perform poorly in the final school examination (K.C.P.E.). Samson M'Mutiga, an
educational researcher based in Uringu Division, has noticed that the subject
Mathematics is performed poorly in all the schools within the division. He
suspects that this is due to the fact that some children cannot afford to buy the
prescribed textbooks for Mathematics.
After collecting data about the children's performance in Mathematics from
four schools in Uringu Division, and after recording them on a contingency table,
he uses a Chi-Square statistic to test the null hypothesis that the pass rate in
K.C.P.E. Mathematics within the schools in Uringu Division is not affected by
the possession of the prescribed textbooks; against the alternative hypothesis that
the possession of the set books has a statistically significant effect on the pass-rate
in the K.C.P.E. subject Mathematics within the schools of this division, at 95%
confidence level. The contingency table which he used for analysis is the same as
Table 5 - 3. Demonstrate how he conducted the Chi-Square test of his null
hypothesis at 95% confidence level. (Source: King'oriah, 2004, Pages 426 to 432)
TABLE 5 - 3: A CONTINGENCY TABLE FOR COMPARING TWO QUALITATIVE VARIABLES

POSSESSION STATUS            NAME OF PRIMARY SCHOOL
(HAVING OR NOT HAVING)       Uringu    Kunene    Lubunu    Amwari
With Textbook                13        17        16        13
Without Textbook             17        3         14        7
TOTALS                       30        20        30        20
Table 5 - 3 contains Samson M'Mutiga's field data. Now he wishes to use Chi-Square
techniques of analysis to investigate whether having the required set book influences the
pass rate in Mathematics within the schools of this division. He goes ahead and computes
the degrees of freedom associated with his samples, which reflect the fact that the
two variables (the textbook-possession status of the pupils and the primary school of their
origin) are interacting under one environment.
Degrees of freedom
In any contingency table the total number of cells can be found by multiplying the
number of rows with the number of columns. The number of cells in a contingency table
contains the total number of observations in any experiment. However, we must
remember that these observations are in two dimensions, and in each dimension we lose
one degree of freedom. In that connection also, interaction between the two variables can
be reflected mathematically through cross multiplication of their qualities. This is also
true with the degrees of freedom. Consequently, in the computation of the degrees of
freedom, we first of all adjust for one degree of freedom in every dimension, and then
cross-multiply the result. Therefore, in this case, the degrees of freedom must be the
number of rows minus one times the number of columns minus one. The Chi-Square
Table will be used just like we have done in the previous example, after we have made
sure that appropriate adjustments are made to the table, and correct degrees of freedom
are applied to our expected Chi-square statistic.
The given table has two rows (r) and four columns (c). Therefore, the degrees of
freedom for the variable represented within the rows are (r − 1), and those of the variable
represented within the columns of the contingency table are (c − 1), because we are losing
one degree of freedom in each direction. To reflect the interaction of the two variables, the
two answers regarding degrees of freedom for the rows and for the columns will have to
be cross-multiplied. Therefore, for this particular investigation the degrees of freedom to
be used in the Chi-Square test will be:
Degrees of freedom = (r − 1)(c − 1)
And the Chi-Square which we intend to compute will be designated as:
Expected Chi-Square = χ²_e [0.05, (r − 1)(c − 1)]
This expression for the Expected Chi-Square "χ²_e" means that we are looking
for the critical Chi-Square, below which there is no significant difference between the
two variables at the five percent significance level (95% confidence level), at
(r − 1)(c − 1) degrees of freedom. This is how the whole composite expression of the
Chi-Square is interpreted: "χ²_e at 0.05, and at (r − 1)(c − 1) degrees of freedom".
Now we need to put our expression in figures:
Degrees of freedom = (row − 1) × (column − 1) = (r − 1)(c − 1) d.f.
The Chi-Square is expressed at the appropriate significance level and at the appropriate
degrees of freedom:
χ²_e [0.05, (r − 1)(c − 1)] = χ² [0.05, (2 − 1)(4 − 1)]
Observed Frequency and Expected Frequencies
Identifying the cells in a contingency table
Now we are ready to manipulate our data within our table to see how we can apply
the Chi-Square test to test the significance of the interaction of the two variables: the
school he visited and the possession status (with regard to owning the relevant set-book).
To accomplish this task, we shall construct a table which will allow us to manipulate the
instructions of the Chi-Square formula:
χ² = Σ_{c=1}^{k} (O_i − E_i)² / E_i
In this new table we assign a number to each cell according to the row and the column
which intersect ( read : “interact” ) at the position of each cell. The row designation
comes first, and the column designation comes after it, and both are separated by a
comma. For example, the intersection of the second row and the third column will be
represented by “ Cell ( 2, 3 ) ” . Let us now label the table of the interaction of the two
nominal variables which we started. as illustrated in table 5 - 2 .
In this table we shall use the formula which we shall learn just now to record the
expected observation and to differentiate this from the actual observation from the field.
The observations from the field will be represented by the “free” numbers (without
brackets) within Table 5 - 4, and the expected observations by other numbers, we shall
learn to compute presently.
Calculating Expected Values for Each cell
To do this we shall begin by drawing another table similar to Table 5 - 5, in which
we shall indicate the totals of the observations for each column and for each row. Within
this table, we demonstrate how to represent expected values of the Chi-Square for each
cell. We use rounded brackets to enclose the cell numbers in this kind of table, and the
square brackets to enclose the magnitude of the expected observations. The use of any
kind of brackets for either does not really make any difference. What is needed is
consistency. If you use round brackets for cell numbers for example, be consistent
throughout the table, and use square brackets for expected observations. If you use square
brackets for cell numbers, also be consistent and use them throughout for this purpose,
reserving the round brackets for the expected values. The important thing is that you
should not mix them for either category. Use one kind of bracket for one specific
category throughout. Therefore, in Table 5 - 5 rounded brackets are on top for each cell,
followed by the actual observation without brackets and lastly followed by the expected
values in square brackets. Now we embark to answer the question of how the expected
values are calculated for each cell.
TABLE 5 - 4: ILLUSTRATING HOW TO LABEL INTERACTION CELLS IN ANY CONTINGENCY TABLE

POSSESSION STATUS            NAME OF PRIMARY SCHOOL
(HAVING OR NOT HAVING)       Uringu        Kunene        Lubunu        Amwari
With Textbook                Cell (1, 1)   Cell (1, 2)   Cell (1, 3)   Cell (1, 4)
                             13            17            16            13
Without Textbook             Cell (2, 1)   Cell (2, 2)   Cell (2, 3)   Cell (2, 4)
                             17            3             14            7
TOTALS                       30            20            30            20
TABLE 5 - 5: ILLUSTRATING HOW TO INSERT THE EXPECTED VALUES AND OBSERVED VALUES IN ANY CONTINGENCY TABLE

POSSESSION STATUS            NAME OF PRIMARY SCHOOL
(HAVING OR NOT HAVING)       Uringu     Kunene     Lubunu     Amwari     TOTALS
With Textbook                (1, 1)     (1, 2)     (1, 3)     (1, 4)
                             13         17         16         13         59
                             [17.7]     [11.8]     [17.7]     [11.8]
Without Textbook             (2, 1)     (2, 2)     (2, 3)     (2, 4)
                             17         3          14         7          41
                             [12.3]     [8.2]      [12.3]     [8.2]
TOTALS                       30         20         30         20         Grand Total 100
The expected value for each cell is computed by dividing the row total at the end
of the row containing that cell by the grand total. This tells us the proportion of the grand
total which is contributed by the observations within that row. This proportion is then
multiplied by the column total of the column containing the cell of focus. For example,
for cell (1, 1) the expected value of [17.7] has been obtained by dividing the row total
value of "59" by the grand total of "100", and then multiplying the resulting proportion
by the column total of "30". The answer is 17.7 students. Of course we cannot get a
fraction of a human being, but we need this hypothetical figure for the computation of
the expected Chi-Square values.
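The rule "row total divided by the grand total, then multiplied by the column total" can be applied to every cell at once. The following is a minimal sketch assuming Python with the numpy library; it reproduces the bracketed expected values of Table 5 - 5.

import numpy as np

observed = np.array([[13, 17, 16, 13],
                     [17,  3, 14,  7]])      # Table 5 - 3

row_totals = observed.sum(axis=1)            # [59, 41]
col_totals = observed.sum(axis=0)            # [30, 20, 30, 20]
grand_total = observed.sum()                 # 100

expected = np.outer(row_totals, col_totals) / grand_total
print(expected)    # first row: 17.7, 11.8, 17.7, 11.8; second row: 12.3, 8.2, 12.3, 8.2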
Stating the Hypothesis
Obviously, before analyzing the data in any way we must have stated the
hypotheses. In this case we go ahead and state the hypotheses which Samson M'Mutiga
used at the beginning of his investigation. This follows exactly the rules which we
discussed in Chapter Four, and in the example involving the gambler above. The two
important hypotheses, the null and the alternative, go as follows:
Ho: There is no significant difference in the pass-rate of the subject Mathematics
between those students who are in possession of a Mathematics set textbook and those
who are not.
HA: There is some statistically significant difference in the pass-rate of the subject
Mathematics between those students who are in possession of a Mathematics set
textbook and those who are not.
Once this is done, we actually set out to calculate the expected values for
each cell and to compare these values with the actual observed values from the field
using the Chi-Square statistic. To do this we have to tabulate all the expected values
against the observed values, and use the same techniques as we used for Mr. Laibuni's
gambling experiment. This will give us the computed Chi-Square statistic which
compares the data collected from the field with the expected values computed using
Table 5 - 5. The columnar tabulation and summary of the expected data against the
observed data is given in Table 5 - 6. Notice in this table how the instructions of the
Chi-Square expression are followed column by column, until finally the total computed
Chi-Square is obtained in the lowest right-hand cell.
TABLE 5 - 6: ILLUSTRATING HOW TO TABULATE THE OBSERVED AND EXPECTED VALUES IN ANY CHI-SQUARE EXPERIMENT

Cell Number C_ij   O_i    E_i    (O_i − E_i)   (O_i − E_i)²   (O_i − E_i)² / E_i
1, 1               13     17.7   −4.7          22.09          1.248
1, 2               17     11.8   5.2           27.04          2.292
1, 3               16     17.7   −1.7          2.89           0.163
1, 4               13     11.8   1.2           1.44           0.122
2, 1               17     12.3   4.7           22.09          1.796
2, 2               3      8.2    −5.2          27.04          3.298
2, 3               14     12.3   1.7           2.89           0.235
2, 4               7      8.2    −1.2          1.44           0.176
TOTALS             100    100    0.000                        9.330
Decision Rule
We must remember that earlier on we obtained the critical value from the Chi-Square
table on page 499 (King'oriah, 2004) using χ² [0.05, (2 − 1)(4 − 1)], that is, at three
degrees of freedom, because d.f. = (2 − 1) × (4 − 1) = 3. This value is the expected
Chi-Square value (critical value) below which the null hypothesis of no difference is
accepted, and above which the null hypothesis of no difference is rejected, as we accept
the alternative hypothesis of a statistically significant difference. This critical χ² value at
5% significance level and 3 degrees of freedom is χ² = 7.815.
Our computation in Table 5 - 6 records a total calculated χ²-value of 9.330. We
therefore reject the null hypothesis and accept the alternative hypothesis.
We support Samson M'Mutiga's conclusion that the performance of the students in the
subject Mathematics within the K.C.P.E. examination depends on the possession of the
set-books.
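The whole contingency-table test can also be checked in one call. This is a minimal sketch assuming Python with the scipy library; it should reproduce the calculated χ² of about 9.330 at 3 degrees of freedom.

from scipy.stats import chi2, chi2_contingency

table = [[13, 17, 16, 13],
         [17,  3, 14,  7]]                  # Table 5 - 3

stat, p_value, dof, expected = chi2_contingency(table)
critical = chi2.ppf(0.95, dof)              # dof = (2 - 1) * (4 - 1) = 3, critical about 7.815

print(round(stat, 3), dof, round(critical, 3))
# stat is about 9.330 > 7.815, so the null hypothesis of independence is rejected.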
Figure 5 - 3: The Chi-Square Distribution at Three Degrees of Freedom
and 95% Confidence level, showing the calculated χ²c = 9.330
Readers are requested to read the chapter on Chi-Square Tests meticulously and
note the example on pages 432 to
439 (King'oriah, 2004) where a student of the
University of Nairobi (A. O. Otieno, 1988) uses a Chi-Square statistic to discover that
actually the atmospheric pollution caused by the industrial activities of the Webuye Paper
Mills is causing most small buildings exposed to this pollution to collapse within Webuye
Town.
EXERCISES
1. Test the null hypothesis that there is no difference in the preference for the type of
bathing facilities within a residential unit among different sizes of families surveyed in
the City of Kisumu, at 5% significance level.

Family Structure              Shower and bath tub     Bath-tub or shower
Three Children and less       10                      30
Four Children and above       30                      30
2. Test the null hypothesis that salary and education level are statistically independent,
at 95% confidence level.

Education Level                    Monthly salary in thousands of shillings
                                   0 to 4.99     5 to 9.99     10 to 14.99
High School or less                10            10            14
High School and some college       8             8             42
University and postgraduate        0             0             10

3. A marketing firm is deciding whether food additive B is a better tasting food
than food additive A. A sample of ten individuals rates the taste on a scale of 1
to 10; the results of the focus groups are listed in the table below. Test the null
hypothesis, at 5% significance level, that food additive B is no better tasting than
food additive A.
FOOD ADDITIVE TASTE COMPARISON

Individual ID Number     Additive A Rating     Additive B Rating
1                        5.5                   6
2                        7                     8
3                        9                     9
4                        3                     6
5                        6                     8
6                        6                     6
7                        8                     4
8                        6.5                   8
9                        7                     8
10                       6                     9
CHAPTER SIX
ANALYSIS OF VARIANCE
Introduction
After some fulfilling discussion on hypothesis testing and contingency tables in
the preceding chapters, we are now ready to discuss another technique which uses
contingency tables and hypothesis testing of homogeneity or differences of samples. This
statistic is called Analysis of Variance.
The aim of Analysis of Variance is to find out whether several groups of
observations have been drawn from the same population. If this is so, the
logic is as in the Chi-Square statistic: the hypothesis of homogeneity would be
accepted and that of the difference of samples would be rejected. The only difference
between Analysis of Variance (ANOVA) and the Chi-Square Statistic is the level of
measurement. Chi-Square statistic operates at the nominal level of measurement where
we cannot assume the normality of data; while analysis of variance operates at the ratio
and interval scales where normality of the populations can be assumed, and the accuracy
of measurements can be guaranteed to be highly precise. Later we shall see how ANOVA
can be used to test the significance of linear and non-linear relationships in Regression
and Correlation because of its ability to be a powerful test at high levels of measurement.
There are many situations in real life where this kind of analysis is
required. One of these, which we shall consider here, is whether, when we subject
different plots to different fertilizer regimes, we obtain the same yield. Straight away the
biostatistician's imagination is kindled into imagining all the situations in his career
which require this kind of statistic, which we shall not have time to consider in these
notes. Our interest here is to learn how to compute the test statistic. Let us now state our
fertilizer problem more clearly, and use it to learn how to use Analysis of Variance.
One Way Analysis of Variance
Example
A maize research scientist has three types of fertilizer treatment of different
chemical composition used by farmers in his area of operation. He would like to
test whether these fertilizer regimes have a significant effect on maize yield using
three experimental plots and a duration of five seasons.
Given the data in the contingency Table 6 - 1 below, test the null
hypothesis that there is no difference between maize yield which can be caused by
using the three fertilizer regimes; against the alternative hypothesis that there is a
statistically significant difference arising from the use of the three different
fertilizer regimes; at 95% confidence level.
TABLE 6 - 1: DIFFERING MAIZE YIELDS IN THREE DIFFERENT HYPOTHETICAL PLOTS
SUBJECTED TO THREE DIFFERENT FERTILIZER REGIMES

                    FERTILIZER TREATMENT
SEASONS             Type A Fertilizer        Type B Fertilizer        Type C Fertilizer
                    Yield per hectare        Yield per hectare        Yield per hectare
                    (bags)                   (bags)                   (bags)
1                   75                       81                       78
2                   77                       89                       80
3                   85                       92                       84
4                   83                       86                       83
5                   76                       83                       78
TOTALS: T_j         396                      431                      403        T.. = 1230
MEANS: X̄_j          79.2                     86.2                     80.6       X̄.. = 82
Solution
In this situation we follow the usual hypothesis testing techniques, with all the ethical matters being taken care of. In particular, we follow the steps of hypothesis testing outlined in Chapter Four.
Step One : Formulate the null and the alternative hypotheses.
Ho:	There is no difference in the mean maize yield among the different fields which have been subjected to different fertilizer regimes, at five percent significance level (95% confidence level).

    H_0: \mu_1 = \mu_2 = \dots = \mu_n

In this case n refers to the three columns representing the three different fertilizer regimes. The symbol \mu is the mean of the population from which the three samples are supposed to have been drawn. The equation says that, although we might expect the three samples to come from different populations, their means are all equal to one and the same population mean. [Note that here we are dealing with the means of normal populations, hence the use of the parameter \mu.]

HA:	There is a statistically significant difference in the mean maize yield among the different fields which have been subjected to different fertilizer regimes, at five percent significance level (95% confidence level).

    H_A: \mu_1 \neq \mu_2 \neq \dots \neq \mu_n   (the treatment means are not all equal)

Here we assert that the populations from which our samples have been drawn are different, causing the differences in the maize yield which we detect using our statistical techniques.
Understanding the double-summation notation
1.	We treat each cell as an area of interaction between any one row i and the corresponding column j, just as we did for the Chi-Square statistic. This way we have cells X_ij = cells (1, 1), (1, 2), ... , (5, 3) in this contingency table. (Compare this with the notation of a one-dimensional observation X_i which we discussed in Chapter 1.) The sum of these interactions along one dimension, for example down the rows of one column, is written

        \sum_{i=1}^{r} X_{ij}

	For example, sum all the cells (1, 1), (2, 1), ... , (r, 1) from the first to the r-th cell down the 1st column. The number of the column is the second subscript in the designation "ij". Note carefully how this cell designation works for one column before going on.

2.	When you are instructed to sum all observations down the columns first, and then, after obtaining the column totals, to sum those column totals to obtain a grand total, the instruction is written:

        \sum_{j=1}^{k} \sum_{i=1}^{r} X_{ij}

	Let us now go slowly and interpret this instruction. Look at the right-most sum, \sum_{i=1}^{r} X_{ij}. This tells you to sum all the rows from the first row (i = 1) up to the r-th row indicated on top of the summation sign. Then come back to the left-most summation, \sum_{j=1}^{k}. This left-most designation instructs us to sum all the column totals from the first column (j = 1) across to the k-th column indicated at the top of its summation sign.

	In a double summation you deal with one summation expression at a time, starting with the right-most summation expression and ending with the manipulation of the left-most summation expression. So the double summation above means: "Sum the observations down each column, from the first row to the r-th row; then sum these column-by-column totals from the first column (j = 1) across to the k-th column indicated on top of the respective summation sign." Learners should understand this notation thoroughly before proceeding, and should supplement this information with what is available in the textbook (King'oriah 2004, pages 237 to 238).
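As a quick illustration of this double-summation instruction, the following short Python sketch (our own illustration, using the numpy library and the yield figures of Table 6 - 1; the variable names are ours) sums each column first and then sums the column totals:

import numpy as np

# Rows are the five seasons; columns are the Type A, B and C fertilizers (Table 6 - 1).
X = np.array([[75, 81, 78],
              [77, 89, 80],
              [85, 92, 84],
              [83, 86, 83],
              [76, 83, 78]])

column_totals = X.sum(axis=0)        # the inner sum: down the rows of each column (T j.)
grand_total = column_totals.sum()    # the outer sum: across the column totals (T ..)
column_means = X.mean(axis=0)        # X-bar j. for each treatment
grand_mean = X.mean()                # X-bar, the grand mean

print(column_totals)   # [396 431 403]
print(grand_total)     # 1230
print(column_means)    # [79.2 86.2 80.6]
print(grand_mean)      # 82.0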
Step Two : Summations and Double-Summations
(Look at Table 6 - 1 very carefully.)

Sum all observations down the rows first, and then across the columns, to obtain the column totals "T j." and the grand total "T ..". (These sums are read "Tee Jay Dot" and "Tee Double-dot" respectively.) This gives us the grand total T .. = 1230. When we divide the grand total by the total number of observations in the three plots, we obtain the grand mean X-bar = 82. Individual column means "X-bar j." are obtained by dividing the column sums by the number of observations in each column. In symbolic summary, these sums and their respective means are:

    \sum_{i=1}^{r} X_{i1} = T_{1.} = 396        \bar{X}_{1.} = 79.2
    \sum_{i=1}^{r} X_{i2} = T_{2.} = 431        \bar{X}_{2.} = 86.2
    \sum_{i=1}^{r} X_{i3} = T_{3.} = 403        \bar{X}_{3.} = 80.6
    \sum_{j=1}^{k} \sum_{i=1}^{r} X_{ij} = T_{..} = 1230        \bar{X} = 82

Study these expressions carefully, because understanding them will help you to understand all future statistical work involving double summation. This is what has to be done first to open the way for data manipulation in one-way analysis of variance.
Step Three :
Confidence Levels and Degrees of Freedom
Decide on the confidence level upon which to test your Hypotheses, and then
calculate the relevant degrees of freedom.
Confidence level: C = 0.95.  Significance level: α = 1 - C = 1 - 0.95 = 0.05.

The degrees of freedom are stated in the following manner:

    F_α[(c - 1), c(r - 1)]

This expression means: "The statistic F [in the F-statistical tables] is defined by the significance level (α) and by (c - 1) and c(r - 1) degrees of freedom." If we know the confidence level and the degrees of freedom (as we are about to), we can easily find the critical value of F, which we designate as F_α.
As with the Chi-Square statistic, the degrees of freedom take account of the interaction between row and column observations within the ANOVA contingency table.

The column degrees of freedom ("V1 d.f.") account for the loss of one degree of freedom because the columns run along one dimension. Adjusting by one d.f., we obtain c - 1 d.f. (number of columns minus one). These account for the degrees of freedom among the three samples. (The use of the term among will be important presently. Please note that it denotes the interaction among the columns, the three sample characteristics, which we also designate as Treatments.)

The degrees of freedom contributed by the whole population of observations within the three plots take into account the fact that in each column we lose one degree of freedom along the rows: (r - 1). These "row-adjusted" degrees of freedom are then multiplied by the existing number of columns c, which is three (3), accounting for the three fertilizer regimes or Treatments. This gives us the expression c(r - 1) d.f. This second kind of degrees of freedom is called the "within degrees of freedom", denoted by the symbol "V2 d.f.". It takes into account the sampling errors, and all the other random errors due to chance, which are made when collecting data in the field (data which is influenced by the environment - climate, weather, soil types, etc.) from the three plots. The summary of the above discussion can be stated as follows:

Confidence level: C = 0.95.  Significance level: α = 1 - C = 0.05.
Degrees of freedom:  V1 d.f. = c - 1 = 2;   V2 d.f. = c(r - 1) = 3(5 - 1) = 12 d.f.
Step Four : Using the ANOVA tables .
Using the values of the degrees of freedom which we have computed in Step Three, we turn to the ANOVA tables on page 490 of your textbook (King'oriah, 2004). The column degrees of freedom (V1 d.f.) are found in the columns arranged across the table and designated by the numbers on the top-most row, with headings "1, 2, 3, ... , 9". We look for the column labeled "2" along this top-most row.

We slide down this column until we reach the row which begins at V2 d.f. = 12. (The V2 d.f. are read down the left-most column of that table.) The number at the intersection of the column and row of interest reads 3.8853. We now conclude that the critical F-value we are looking for, which accounts for the entire interactional environment among the three plots (which we shall call three "Treatments"), is:

    F_α[(c - 1), c(r - 1)] = F_0.05[(3 - 1), 3(5 - 1)] = F_0.05[2, 12] = 3.8853
Readers should make sure they understand all the computations which we have
accomplished so far. Unless we are sure we have understood these, we are likely to have
difficulties with the discussion which follows from now on.
Figure 6 - 1: The Position of the Calculated value of the F-Statistic
called Fc , as compared to that of the Critical value of F .
If you have not understood the computations, and the associated logic, please supplement
the above discussion with what is available in the textbook (King’oriah, 2004) before
going on.
Step Five : Computation of the sums of squares
Variation within and among treatments
Having obtained the critical value of F, which we have designated as F_α, we now have a statistical probability model for testing the similarity or the difference of the data from the three treatments. (See Figure 6 - 1. Our critical value of F is clearly visible in that diagram.) All we need now is to compute the F-value which is the result of our activities in the field as we observe the three treatments over five seasons.

The F-statistic (ANOVA: Analysis of Variance), like the Chi-Square statistic, compares observed values with expected values. Go back again to Table 6 - 1. You will find that we computed the grand mean of all the observations from the three fields. The value of this grand mean is 82. If we view our data globally, this is the expected value of all the observations in all the cells X_ij in Table 6 - 1. We shall now make use of a technique which allows us to compare this value, X-bar = 82, with all the observations from the field. The difference between this grand mean and every observation from the three fields makes up the so-called total variation in all the data we have collected.
(a)
The Total Sum of squares
In our example we compute this total squared variation by finding the sum of squared differences between every observation we have collected from the three treatments and the grand mean. The mathematical designation of this important value is SS, namely the total Sum of Squares of all the observations we have recorded about the grand mean. This means we find the difference between the grand mean and every observation, each time squaring this difference to obtain a squared deviation. If you remember what we did in Chapter One when we computed variances, you will see that we are on our way to obtaining a variance of some kind for all the data we have recorded in our experiment. Now let us proceed. The total sum of squares is:

    SS = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \bar{X})^2
Amazing! This expression tells us first (within the brackets on the right) to take each observation recorded in each cell, X_ij, and subtract the grand mean (X-bar = 82) from it. The superscript (exponent) on the bracket tells us to square each of the differences obtained this way. The summation sign immediately to the left of the bracket tells us that all this must happen down the treatment rows (observations) of each of the three treatments (columns), from the first row (i = 1) to the r-th row (in our case the fifth row in each column). This is the meaning of the inner summation, \sum_{i=1}^{r} (X_{ij} - \bar{X})^2. Once you have finished all that for each of the three treatments, sum the results across the three treatments. This is the use of the left-most summation sign, \sum_{j=1}^{c}. Therefore the complicated-looking expression

    SS = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \bar{X})^2

is just an instruction telling us to do something down the columns, and then across the bottom of the columns! This is nothing special. We now compute the sum of all the squared deviations from the grand mean in our entire experiment using this formula. When we do this we obtain the following figures:
    SS = (75 - 82)^2 + (77 - 82)^2 + (85 - 82)^2 + (83 - 82)^2 + (76 - 82)^2
       + (81 - 82)^2 + (89 - 82)^2 + (92 - 82)^2 + (86 - 82)^2 + (83 - 82)^2
       + (78 - 82)^2 + (80 - 82)^2 + (84 - 82)^2 + (83 - 82)^2 + (78 - 82)^2

    SS = 49 + 25 + 9 + 1 + 36 + 1 + 49 + 100 + 16 + 1 + 16 + 4 + 4 + 1 + 16 = 328
If we wish to obey the formula strictly and deal with one column at a time, we obtain the sum of squared differences for each column. This means we first add the five squared differences for the first column,

    49 + 25 + 9 + 1 + 36 = 120

then the five squared differences for the second column,

    1 + 49 + 100 + 16 + 1 = 167

and lastly the five squared differences for the third column,

    16 + 4 + 4 + 1 + 16 = 41

What we get after these systematic additions are the column totals. Adding all these column totals of the squared differences, we obtain

    120 + 167 + 41 = 328.

This agrees with the SS figure that we just obtained above.
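The same SS figure can be checked in a couple of lines of Python (a sketch of ours using numpy):

import numpy as np

X = np.array([[75, 81, 78],
              [77, 89, 80],
              [85, 92, 84],
              [83, 86, 83],
              [76, 83, 78]], dtype=float)

grand_mean = X.mean()                      # 82.0
SS = ((X - grand_mean) ** 2).sum()         # sum of squared deviations from the grand mean
print(SS)                                  # 328.0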
(b)
The Variation among Treatments
It is easiest to compute this variation by first computing the Sum of Squares due to Treatments (SST). Once you have found this figure, the variation due to random errors in the entire population can be computed. However, we defer that task until later and proceed with the calculation of SST. Given the manner in which we handle double summations, the expression for the Sum of Squares due to Treatments (SST) can be symbolically expressed as:

    SST = r \sum_{j=1}^{c} (\bar{X}_{j.} - \bar{X})^2
From the experience we have gained in (a) above, this is a much easier formula to interpret. Again study the treatment means given in Table 6 - 1. You will find the symbol X-bar j., pronounced "X-bar Jay-Dot", within the brackets of the expression. This is the individual treatment mean for the yield observations under each of the three fertilizer regimes in our field experiment, tabulated in Table 6 - 1.

Since this mean is the typical value in each treatment, we assume that, were it not for random errors, each value within each treatment would have taken the size of the treatment mean X-bar j. . Therefore, according to our assumption, each treatment should have five (5) observations which are the size of the mean, disregarding the random deviations.
Accordingly, we look for the difference between each treatment mean \bar{X}_{j.} and the grand mean, and square it to obtain (\bar{X}_{j.} - \bar{X})^2. Adding the three results across the columns (c) carries out the instruction \sum_{j=1}^{c} (\bar{X}_{j.} - \bar{X})^2. However, this accounts for only one set of squared differences between the three treatment means and the grand mean, namely (\bar{X}_{1.} - \bar{X})^2 + (\bar{X}_{2.} - \bar{X})^2 + (\bar{X}_{3.} - \bar{X})^2. We then remember that each of the three treatments has five observations (rows, denoted r). Therefore what we have just obtained is multiplied by the number of rows in each of the treatments - five (5) rows each:

    SST = 5[ (\bar{X}_{1.} - \bar{X})^2 + (\bar{X}_{2.} - \bar{X})^2 + (\bar{X}_{3.} - \bar{X})^2 ]

This expression is the same as SST = r \sum_{j=1}^{c} (\bar{X}_{j.} - \bar{X})^2 ; ensure that you understand that it actually is the same. Using the formula, we insert the actual values and compute our SST in the following manner:

    SST = 5[ (79.2 - 82)^2 + (86.2 - 82)^2 + (80.6 - 82)^2 ]

When you compute this you obtain the answer:

    SST = r \sum_{j=1}^{c} (\bar{X}_{j.} - \bar{X})^2 = 137.2
(c)
Variation within Treatments
This is the variation due to random error in the entire experiment, usually expressed as the sum of squared differences between each observation we have recorded in the experiment (over all three treatments) and its own treatment mean. The mathematical expression for this is:

    SSE = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \bar{X}_{j.})^2
This is called the variation within treatments because it records the variation between each treatment mean and the individual observations in each of the three samples. Although we can follow this formula and obtain the SSE directly, we can obtain the same figure the short way, by noting that whatever has not been explained by the treatments, SST = r \sum_{j=1}^{c} (\bar{X}_{j.} - \bar{X})^2, out of the total variation of the observations, SS = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \bar{X})^2, is explainable by nothing else except the random error within each sample, SSE. Therefore we conclude that SS - SST = SSE. In that regard:

    SS - SST = SSE
    328 - 137.2 = 190.8

You can verify that this calculation is correct by working through the full formula SSE = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \bar{X}_{j.})^2. This means you will have to make a table similar to the one in the textbook (King'oriah, 2004, page 234).
The summary of Step Five is as follows:

    SS = 328,    SST = 137.2,    SSE = 190.8

This result is very important for the next crucial step in one-way Analysis of Variance. This involves actually looking for the variances to analyze, using the Analysis of Variance technique and the ANOVA probability model.
Step Six :
Computation of mean Squared Values ( or Variances)
For any variance to work there must be two things: the sum of squares of one
form or another, and the total number of observations in each case adjusted by the
appropriate degrees of freedom. We have already accomplished this. We may now wish
to compute the respective variances.
(a)	Mean square due to treatment, MST. In our case, this is the variance caused by the differences in fertilizer regimes. It is sometimes known as the variance among samples or treatments, or the variance explained by the Treatments. The corresponding sum of squares due to treatment is SST = 137.2.
(b)
The Mean Square due to random error MSE. In our case, this is the variance
caused by random errors within each of the treatments and for all the three
treatments. This is sometimes known as the variance within the samples or
treatments, or variance explained by Chance error within the treatments. It is
also called the variance explained by chance (random) error.
These are the two variances which will be needed for the analysis. If they are equal, this indicates that the treatments have no effect on the overall population. On the other hand, if they are not equal, then the treatment has some significant effect on the values of the means and their variances. This is the logic which was followed by the discoverer of this test statistic, Sir Ronald Fisher (1890 - 1962). This explains why the test statistic is called the "F-test", after this illustrious statistician.
The divisors to be used in computing these variances come from the degrees of freedom which we have had since we learned how to enter the F-table (see Step Three above). We now bring them forward for immediate use:

    (i)  V1 d.f. = c - 1 = 2
    (ii) V2 d.f. = c(r - 1) = 3(5 - 1) = 12 d.f.

We interpret these degrees of freedom as the treatment d.f. = V1 d.f. = 2 and the error d.f. = V2 d.f. = 12.
Accordingly, the mean square due to treatment is computed as:

    MST = SST / (treatment d.f.) = 137.2 / 2 = 68.6          (this is "σ²_T")

The mean square explained by chance (random) error [sometimes called the unexplained variance] is computed as:

    MSE = SSE / (error d.f.) = 190.8 / 12 = 15.9             (this is "σ²_E")
Step Seven
Fisher's ratio, ordinarily called the F-statistic, is the ratio between the treatment variance and the error variance. This is expressed as:

    F_calculated = σ²_T / σ²_E    (written "F_c" for short)
The nature of the test is that if the ratio is equal to 1.0, then σ²_T is equal to σ²_E. This means that the variance due to treatment is equal to the random (chance, error) variance. In that case the treatment has no effect on the entire population, since the variance due to treatment could have occurred by chance; after all, it is equal to the random variance. The mode of this statistic is therefore found at F = 1.0, and the distribution is positively skewed toward the right. The farther the calculated F lies from the steepest part of the curve (around 1.0), the more statistically different it is from 1.0. The critical value delineates the value of F beyond which the observed ratio cannot be judged to be equal to 1.0 at the prescribed confidence level and the V1 and V2 degrees of freedom. It looks like Figure 6 - 1.
The calculated value of F, designated F_c, is:

    F_c = σ²_T / σ²_E = 68.6 / 15.9 = 4.31

Compare this figure with the expected (critical) value found from the F-tables at the appropriate degrees of freedom and confidence level, which is 3.8853. Our conclusion is that any value of F above 3.8853 does not belong to the population whose modal F-value is 1.0. Any distribution giving this kind of F will therefore be rejected.
Accordingly, we reject the null hypothesis that there is no difference between the yields caused by the different fertilizer regimes 1, 2 and 3; and accept the alternative hypothesis that there is a statistically significant difference between the three treatments. Therefore, application of fertilizers affects the maize yields among the three fields. The investigator would choose the fertilizer regime with the highest mean yield over the five seasons. From Table 6 - 1, this is the Type B regime, with a mean yield of 86.2 bags per hectare.
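The whole one-way calculation can be verified quickly by computer. The sketch below (our own illustration, using numpy and scipy rather than the F-tables) reproduces SS, SST, SSE, MST, MSE and the calculated F for the data of Table 6 - 1:

import numpy as np
from scipy import stats

X = np.array([[75, 81, 78],
              [77, 89, 80],
              [85, 92, 84],
              [83, 86, 83],
              [76, 83, 78]], dtype=float)
r, c = X.shape                                           # 5 seasons, 3 treatments

grand_mean = X.mean()
treatment_means = X.mean(axis=0)

SS = ((X - grand_mean) ** 2).sum()                       # 328.0
SST = r * ((treatment_means - grand_mean) ** 2).sum()    # 137.2
SSE = SS - SST                                           # 190.8

MST = SST / (c - 1)                                      # 68.6
MSE = SSE / (c * (r - 1))                                # 15.9
F_c = MST / MSE                                          # about 4.31
print(F_c)

# scipy's one-way ANOVA gives the same F-ratio (and its p-value) directly:
print(stats.f_oneway(X[:, 0], X[:, 1], X[:, 2]))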
These steps comprise all there is in carrying out the One-Way Analysis of Variance for equal-sized samples. For unequal sizes, for example in situations where some observations are missing, see your textbook (King'oriah, 2004, pages 242 to 250). The explanation there is simple and straightforward now that we have mastered the necessary symbolism; the logic is identical and the statistics used are the same. We now turn to a slightly more complicated form of Analysis of Variance: the Two-Way Analysis of Variance.
Two-Way Analysis of Variance
Introduction
It is a fact of life that the treatments are not the only causes of variation in any
group of observations. We shall see later, especially when we shall be dealing with a
closely related type of analysis, that there are many causes of variation which affect
samples and related groups of observations. In the fertilizer example, we have allowed
the treatments to be the only causes of variation. By implication, we have held constant all the other variables which could have influenced the yield from the three fields. In that case we find that a large part of the variation still remains unexplained. Consider, for example, the relatively small SST = 137.2 compared with the SSE of 190.8. This means that the sum of squared deviations due to random error in this population needs to be disaggregated some more, so that another variable, other than the fertilizer treatment regime, may be examined as possibly affecting yields in this area.
Example
In Kenya, and anywhere else in the world, there are good seasons and bad ones.
Crop yield is best in good seasons with copious rainfall and all other factors that affect
the growth of plants. During bad years crops do not do that well, and the yields are low.
Suppose this investigator wished to test whether the seasons also have some effect on the
maize yield. The following table would be relevant.
Statistically, we say that we have introduced a Blocking variable in our analysis.
This means that while we are investigating the main variable, Fertilizer treatment, we
shall be interested in the effects of a second variable in our experiment, this time the
seasons are the Blocking variable. In this regard, we find that the time we invested in
learning various analytical symbols and notation in the last section will pay dividends
here, because we can quickly analyze the summation expressions to obtain our answers in
a faster manner than if we were to learn the symbols afresh each time. The only thing we
need to familiarize ourselves with is how to deal with the blocking variable. Even this is
not any big deal, because the logic is identical. We take off and do our analysis straight
away.
Step One :
Arrange all the given data in a contingency table as shown in Table 6 - 2.
Then formulate the null and the alternative hypotheses for the main variable and
for the blocking variable. This is because we shall end up with two F-calculated
results, each proving a different thing, but in an interacting environment between all the variables involved.
TABLE 6 - 2 : DIFFERING MAIZE YIELDS IN THREE DIFFERENT HYPOTHETICAL PLOTS SUBJECTED TO
THREE DIFFERENT FERTILIZER REGIMES DURING DIFFERENT SEASONS

SEASONS            Type A Fertilizer   Type B Fertilizer   Type C Fertilizer   ROW TOTALS   ROW MEANS
                   (bags per hectare)  (bags per hectare)  (bags per hectare)
1                  75                  81                  78                  234          78
2                  77                  89                  80                  246          82
3                  85                  92                  84                  261          87
4                  83                  86                  83                  252          84
5                  76                  83                  78                  237          79
TOTALS (T j.)      396                 431                 403                 T .. = 1230
MEANS (X-bar j.)   79.2                86.2                80.6                X-bar = 82
In this arrangement, test the effects of the treatments first, and then test the effects of the blocking variable, the seasons.
Ho:	There is no difference in the mean maize yield among the different fields which have been subjected to different fertilizer regimes, at five percent significance level (95% confidence level).

    H_0: \mu_1 = \mu_2 = \dots = \mu_n

HA:	There is a statistically significant difference in the mean maize yield among the different fields which have been subjected to different fertilizer regimes, at five percent significance level (95% confidence level).

    H_A: the treatment means are not all equal

Ho|B:	Crop yield is not affected by any of the five seasons, at 5% significance level (95% confidence level).

    H_0|B: \mu_1 = \mu_2 = \dots = \mu_5   (the five seasonal means are all equal)

HA|B:	The five seasons have a statistically significant effect on the mean maize yield, at five percent significance level (95% confidence level).

    H_A|B: the seasonal means are not all equal
This means that we have another set of hypotheses for the blocking variable. We seem to
have decided on the Confidence level of our test statistic within the first step as well.
Step Two :
Determine the degrees of Freedom.
Two types of degrees of freedom are to be determined.

- The treatment degrees of freedom are as before, except that this time we have to take into account the existence of a third dimension, the blocking variable, which has now been introduced. The treatment dimension is represented by the columns, as usual, where one d.f. is lost.

- The blocking (blocks) dimension, reflecting the influence of the seasons, is represented by the rows, where another d.f. is lost.

- In addition, the row and column degrees of freedom must be cross-multiplied so that all the error is accounted for. This is where the difference between a two-way and a one-way ANOVA lies. The cross-multiplication is done using the adjusted (c - 1) and (r - 1) degrees of freedom on the columns and on the rows.
Consequently, the degrees of freedom for the two-way ANOVA are determined as follows:

    Treatment d.f. (V1) = c - 1 = 3 - 1 = 2 d.f.
    Error d.f. (V2, in two dimensions, on the treatment and on the blocking dimension)
               = (c - 1)(r - 1) = (3 - 1)(5 - 1) = 2 × 4 = 8 d.f.
Therefore, our expected (critical) value of the treatment F, designated F_T, is determined from the tables at the appropriate alpha level (0.05), entering with the above degrees of freedom into the same table as we used for the one-way analysis of variance (King'oriah, 2004, page 490).

We go ahead straight away and determine the critical value of F due to treatment, after taking care of the dimensional influence of the blocking variable through the cross-multiplication of the row and column degrees of freedom, (c - 1)(r - 1). This value of F due to treatment is designated as:

    F_T = F_0.05[(c - 1), (c - 1)(r - 1)] = F_0.05[(3 - 1), (3 - 1)(5 - 1)] = F_0.05[2, 8]

Remember that all the small numbers and letters are not mathematical operations, but indicators or labels which show the kind of F we are talking about. Accordingly, our F due to treatment, F_T, is finally and correctly described as F_0.05[2, 8]: the V1 d.f. are 2 and the V2 d.f. are 8.
The significance level is 0.05. Looking on page 490 of your textbook, we find that this time:

    F_T = F_0.05[2, 8] = 4.4590

The same approach is adopted for the other kind of F which is required in the two-way ANOVA. This time we mechanically regard the rows (the blocking variable) as some kind of treatment "columns". We make the two adjustments to obtain the blocking F:

    F_B = F_0.05[(r - 1), (r - 1)(c - 1)] = F_0.05[(5 - 1), (5 - 1)(3 - 1)] = F_0.05[4, 8]

Study this F configuration very carefully and compare it with that of F_T:

    F_T = F_0.05[(c - 1), (c - 1)(r - 1)] = F_0.05[(3 - 1), (3 - 1)(5 - 1)] = F_0.05[2, 8]

The significance level is (as before) 0.05. Looking on page 490 of your textbook (King'oriah 2004), we find that this time the value is F_0.05[4, 8] = 3.8378. Therefore, we summarize the statement of F due to the blocking variable as:

    F_B = F_0.05[4, 8] = 3.8378
Step Three :
From here the journey is downhill! We calculate the total sum of squares SS just like we did before:

    SS = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \bar{X})^2

Then we obtain the SST:

    SST = r \sum_{j=1}^{c} (\bar{X}_{j.} - \bar{X})^2

Thereafter, the SSB (the sum of squares due to the blocking variable):

    SSB = c \sum_{i=1}^{r} (\bar{X}_{.i} - \bar{X})^2
Comparing the SST and SSB formulae, observe how the rows and columns change places. Note how \bar{X}_{.i} represents the row means in Table 6 - 2, and \bar{X}_{j.} represents the column means. The computations take place in accordance with the instruction of each formula: SSB is computed in the same manner as SST, only that this time the row means are used instead of the column means. This is the importance of differentiating row means from column means by different symbolic designations; compare \bar{X}_{.i} ("X-bar dot i") with the column symbol \bar{X}_{j.} ("X-bar j dot"). The following is now a summary of the sums of squares computations:

    SS  = \sum_{j=1}^{c} \sum_{i=1}^{r} (X_{ij} - \bar{X})^2 = 328        (the same value as before)
    SST = r \sum_{j=1}^{c} (\bar{X}_{j.} - \bar{X})^2 = 137.2             (the same value as before)
    SSB = c \sum_{i=1}^{r} (\bar{X}_{.i} - \bar{X})^2                     (the only statistic we have not computed before)

All the others are the same as those we computed when we dealt with the one-way Analysis of Variance. We now go ahead, compute SSB, and use it in our analysis.
    SSB = 3[ (78 - 82)^2 + (82 - 82)^2 + (87 - 82)^2 + (84 - 82)^2 + (79 - 82)^2 ]
        = 3[ 16 + 0 + 25 + 4 + 9 ]
        = 3 × 54
        = 162
Having obtained SSB, the sum of squares due to random error (SSE) which remains after introducing the new blocking variable can easily be obtained through subtraction. Note that the error sum of squares is now reduced from the 190.8 which we obtained in the one-way ANOVA:

    SSE = SS - SST - SSB = 328 - 137.2 - 162 = 28.8

Step Four :
Obtain the mean squared deviations for the treatment and blocking variables. These are calculated in the same manner as for the one-way ANOVA: divide the respective sums of squares by their corresponding degrees of freedom to obtain the mean squares which will be used in computing the calculated F (for the treatment and for the blocking variable).
    MST = SST / (treatment d.f.) = 137.2 / 2 = 68.6                        ("σ²_T")
    MSB = SSB / (blocking d.f.)  = 162 / (5 - 1) = 162 / 4 = 40.5          ("σ²_B")
    MSE = SSE / (error d.f.)     = 28.8 / [(5 - 1)(3 - 1)] = 28.8 / 8 = 3.6   ("σ²_E")
Step Five : Finish off by computing the calculated values of the two F-statistics, the treatment F_T and the blocking F_B:

    F_T = MST / MSE = σ²_T / σ²_E = 68.6 / 3.6 ≈ 19.06
    F_B = MSB / MSE = σ²_B / σ²_E = 40.5 / 3.6 = 11.25
Compare these with the corresponding critical values obtained from the F-table at the appropriate degrees of freedom, which were found in Step Two above:

    F_T (critical) = 4.4590        F_T (calculated) ≈ 19.06
    F_B (critical) = 3.8378        F_B (calculated) = 11.25
In both cases the calculated F exceeds its critical value. We therefore reject the null hypotheses that neither the fertilizer regime nor the seasons have significant effects on crop yield on the three plots, and accept the alternative hypotheses that both the fertilizer regimes and the seasonal variations have a statistically significant effect on maize crop yield on our three fields at the 95% confidence level.
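As with the one-way case, the two-way arithmetic can be checked with a few lines of numpy. The sketch below (ours, not from the textbook) reproduces SST, SSB, SSE and the two F-ratios:

import numpy as np

X = np.array([[75, 81, 78],
              [77, 89, 80],
              [85, 92, 84],
              [83, 86, 83],
              [76, 83, 78]], dtype=float)
r, c = X.shape                                           # 5 seasons (blocks), 3 fertilizer treatments

grand_mean = X.mean()
treatment_means = X.mean(axis=0)                         # column means
block_means = X.mean(axis=1)                             # row (season) means

SS = ((X - grand_mean) ** 2).sum()                       # 328.0
SST = r * ((treatment_means - grand_mean) ** 2).sum()    # 137.2
SSB = c * ((block_means - grand_mean) ** 2).sum()        # 162.0
SSE = SS - SST - SSB                                     # 28.8

MST = SST / (c - 1)                                      # 68.6
MSB = SSB / (r - 1)                                      # 40.5
MSE = SSE / ((c - 1) * (r - 1))                          # 3.6

print(MST / MSE, MSB / MSE)                              # about 19.06 and 11.25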
EXERCISES
1.
Explain in detail all the steps involved in the performance of an analysis of
variance test for several unequal samples.
2.
Outline the steps you would take to compute F statistics for two-way analysis of
variance.
3.
Test the null hypothesis that the crop yield in bags per hectare between the year 2000 and 2003 is not affected by the plot on which it was grown, nor by the year of observation, at the 5% alpha level, using two-way Analysis of Variance.
Year    Plot one    Plot two    Plot three
2000    87          78          90
2001    79          79          84
2002    83          81          91
2003    85          83          89
CHAPTER SEVEN
LINEAR REGRESSION AND CORRELATION
Introduction
By this time our reader is confident in statistical data analysis, even if at a rudimentary level. We now go a step farther and try to understand how to test the statistical significance of data at the same time as testing the relationship between any two variables of our investigation. Knowing how variables are related to one another, we will then look for methods of understanding the strength of their relationship. In the following discussion we shall also learn how to predict the value of one variable given the trend of the other variable. The two twin statistics which assist us in all these tasks are called Regression and Correlation.
If one variable changes and influences another, the influencing variable is called the independent variable. The variable being influenced is the dependent variable, because its size and behaviour depend on the independent variable. The independent variable is usually called an exogenous variable, because its magnitude is decided outside the model, by factors which are out of the control of the investigator; the dependent variable is an endogenous variable: its value depends on the vicissitudes of the experiment at hand and on the model under the control of the researcher.
In referring to the relationship between the dependent variable and the
independent variable, we always say that the dependent variable is a function of the
independent variable. The dependent variable is always denoted by the capital letter Y,
and the independent variable by the capital letter X. (Remember not to confuse these ones
with lower case letters because the lower case letters mean other things, as we shall see
later.) Therefore, in symbolic terms we write :-
Y is a function of X
Y = f(X)
Meaning, The values of Y depend on the values of X.
Whenever Y changes with each change in X we say that there is a functional relationship
between Y and X.
Assumptions of the Linear Regression/Correlation model
1.
There must be two populations, and each of these must contain members of one
variable at a time, varying from the smallest member to the largest member. One
population comprises the independent variable and the other the dependent
variable.
2.
The observed values at each level or each value of the independent variable are one selection out of many which could have been observed and obtained. We say that each observation of the independent variable is stochastic - meaning probabilistic, and could occur by chance. This fact does not affect the model very much because, after all, the independent variable is exogenous.
3.
Since the independent variable is stochastic, the dependent variable is also stochastic. This fact is of great interest to observers and analysts, and forms the basis of all analysis using these two statistics. The stochastic nature of the dependent variable lies within the model or the experiment, because it is the subject matter of the investigations and analyses of researchers under any specific circumstances.
4.
The relationship being investigated between two variables is assumed to be linear.
This assumption will be relaxed later on when we shall be dealing with non-linear
regression correlation in the succeeding chapters.
5.
Each value of the dependent variable resulting from the influence of the independent variable is random: one of the very many near-equal values which could have resulted from the effect of the same level (or value) of the independent variable.
6.
Both populations are stochastic, and also normal. In that connection, they are
regarded as bi-variate normal.
Regression Equation of the Linear Form
The name Regression was invented by Sir Francis Galton (1822 - 1911), who,
when studying the natural build of men observed that the heights of fathers are related to
those of their sons. Taking the heights of the fathers as the independent variable , he
observed that the heights of their sons tend to follow the trends of the heights of the
fathers. He observed that the heights of the sons regressed about the heights of the
fathers. Soon it came to mean that any dependent variable regressed with the independent
variable.
In this discussion we are interested in knowing how the values of Y regress with
the values of X. This is what we have called the functional relationship between the
values of Y and those of X. The explicit regression equation which we shall be studying
is Y = a + b X.
In this equation, “ a ” marks the intercept, or the beginning of things, where the
dependent variable might have been found by the independent variable before
investigation. This is not exactly the case, but we state it this way for the purposes of
understanding. The value “ b ” is called the regression coefficient. When evaluated, b
records the rate of change of the dependent variable with the changing values of the
independent variable. The nature of this rate of change is that when the functional
relationship is plotted on a graph, “ b ” is the magnitude of the slope of this Regression
Line.
Most of us have plotted graphs of variables in an attempt to investigate their relationship. If the independent variable is positioned along the horizontal axis and the dependent variable along the vertical axis, the stochastic nature of the dependent variable makes the plotted observations scatter on the graph. This scattering of the plotted values is the so-called scatter diagram, or in short the Scattergram. Closely related variables show scatter diagrams whose points of interaction tend to regress in one direction, either positive or negative. Unrelated points of interaction do not show any trend at all. See Figure 7 - 1.
The linear Least squares Line
Since the scatter diagram is the plot of the actual values of Y which have been observed to exist for every value of X, the locus of the conditional means of Y can be approximated by eye through the scatter diagram. In our earlier classes the teachers might have told us to observe the dots or crosses on the scatter diagram and try to fit the curve by eye. However, this is unsatisfactory: it is not accurate enough. Nowadays there are accurate mathematical methods and computer packages for plotting this line with great estimation accuracy, and for giving various measures for ensuring that the estimate is accurate. The statistical algorithm we are about to learn helps us understand these computations and assess their accuracy and efficacy.
Figure 7 - 1 : Scatter Diagrams can take any of these forms
In a scatter diagram, the least squares line lies exactly in the centre of all the dots or crosses which happen to be regressing in any specific direction (see Figure 7 - 1). The distances between this line and all the dots in the scattergram which lie above it balance the distances of those which lie below it, so the line lies exactly in the middle. This is why it is called the conditional mean of Y. The dots on the scatter diagram are the observed values of Y, while the points along the line are the mean values of Y defined by the corresponding values of X. The differences between the higher points and the conditional mean line are called the positive deviations, and those between the lower points and the conditional mean line are called the negative deviations. Now let us use a simple example to concretize what we have just said.
Example
Alexander ole Mbatian is a maize farmer in the Maela area of Narok. He records the maize yield in debes (tins, equivalents of English bushels) per hectare for various amounts of a certain type of fertilizer, which he used in kilograms per hectare, for each of the ten years from 1991 to 2000. The values in Table 7 - 1 are plotted on the scatter diagram which appears as Figure 7 - 2. It looks as though the relationship between the number of debes produced and the amount of fertilizer applied on his farm is approximately linear, and the points look like they fall on a straight line. Plot the scatter diagram with the amount of fertilizer (in kilograms per hectare) as the independent variable, and the maize yield per hectare as the dependent variable.
Solution
Now we need a step-by-step method of computing the various coefficients which are used in estimating the position of the regression line. For this estimate we first of all need to estimate the slope of the regression line using this expression:

    b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
TABLE 7 - 1: DIFFERENT QUANTITIES OF MAIZE PRODUCED FOR VARYING
AMOUNTS OF FERTILIZER PER HECTARE WHICH IS USED ON THE PLOTS

Year    n     X     Y
1991    1     6     40
1992    2     10    44
1993    3     12    46
1994    4     14    48
1995    5     16    52
1996    6     18    58
1997    7     22    60
1998    8     24    68
1999    9     26    74
2000    10    32    80
Where:
    X_i = each observation of the variable X; in this case each value of the fertilizer in kilograms used per hectare.
    \bar{X} = the mean value of X.
    Y_i = each observation of the variable Y; in this case each value of the maize yield in debes per hectare.
    \bar{Y} = the mean value of Y.
    \sum = the usual summation sign \sum_{i=1}^{n}, shown in the abbreviated form.
The regression/correlation statistic involves learning how to evaluate the b-coefficient using the equation

    b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}

and how to compute the various results which can be obtained from this evaluation.
Figure 7 - 1: The Scatter Diagram of Maize produced with Fertilizer Used
The steps which we shall discuss involve the analysis of the various parts of this equation, and performing the instructions in the equation to obtain the value of the coefficient b. Once this coefficient has been obtained, the other coefficient in the regression equation, a, is easily obtained because it can be expressed as a = \bar{Y} - b\bar{X}. In addition, other values which will help us in our analysis will be sought and learned. Table 7 - 2 is the tool we shall use to evaluate the equation for the coefficient b.
For the estimation of the b-coefficient we use Table 7 - 2 to assist us in the analysis. We must now learn to show the deviations in lower-case representative symbols, such that

    (X_i - \bar{X}) = x_i        and        (Y_i - \bar{Y}) = y_i

The numerator of the b expression is therefore \sum x_i y_i, and the denominator of the same expression is \sum x_i^2. These are the values of X, Y and XY in deviation form, and the summation sign is of course an instruction to add all the values involved. Accordingly, the equation for the b-coefficient is:

    b = \frac{\sum x_i y_i}{\sum x_i^2}

Use Table 7 - 2 and fill in the values to calculate b.
TABLE 7 - 2 : CALCULATIONS TO ESTIMATE THE REGRESSION EQUATION FOR THE MAIZE
PRODUCED (DEBES) WITH AMOUNTS OF FERTILIZER USED

n      X (fertilizer, kg)   Y (yield, debes)   x_i = X_i - X-bar   y_i = Y_i - Y-bar   x_i²    x_i y_i
1      6                    40                 -12                 -17                 144     204
2      10                   44                 -8                  -13                 64      104
3      12                   46                 -6                  -11                 36      66
4      14                   48                 -4                  -9                  16      36
5      16                   52                 -2                  -5                  4       10
6      18                   58                 0                   1                   0       0
7      22                   60                 4                   3                   16      12
8      24                   68                 6                   11                  36      66
9      26                   74                 8                   17                  64      136
10     32                   80                 14                  23                  196     322
TOTAL  180                  570                0                   0                   576     956
Means: X-bar = 18, Y-bar = 57
Solution (Continued).
Using the values in Table 7 - 2, the solution for the b coefficient is sought in the following manner:

    b = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{956}{576} = 1.66

This is the slope of the regression line. Then the value of a, which statisticians also call b_0 (because it is theoretically the value taken by the regression line at the initial condition, where X is zero), is calculated as:

    a = \bar{Y} - b\bar{X} = 57 - 1.66(18) = 57 - 29.88 = 27.12

This is the Y-intercept.
The estimated regression equation is therefore:

    \hat{Y}_i = 27.12 + 1.66 X_i

The meaning of this equation is that if we are given any value of fertilizer application X_i by Ole Mbatian, we can estimate for him how much maize he can expect (in debes per hectare) at that level of fertilizer application. Assume that he chooses to apply 18 kilograms of fertilizer per hectare. The maize yield he expects during a normal season (everything else, like rainfall, soil conditions and other climatic variables, remaining constant), estimated using the regression equation, will be:

    \hat{Y}_i = 27.12 + 1.66(18) = 57 debes per hectare.

The symbols for the calculated values of Y, as opposed to the observed values of Y, vary from textbook to textbook. In your textbook (King'oriah 2004) we use "Y_c" or Y_calculated. Here we are using "\hat{Y}_i", pronounced "Y-hat". Other books use "Y_e", meaning "Y estimated", and so on; it does not make any difference.
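A minimal Python sketch of the same deviation arithmetic (our own illustration, using numpy) reproduces the slope and intercept for Ole Mbatian's data:

import numpy as np

X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)   # fertilizer, kg per hectare
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)  # yield, debes per hectare

x = X - X.mean()                       # deviations of X from its mean
y = Y - Y.mean()                       # deviations of Y from its mean

b = (x * y).sum() / (x ** 2).sum()     # 956 / 576, about 1.66
a = Y.mean() - b * X.mean()            # about 27.1
print(b, a)

print(a + b * 18)                      # predicted yield for 18 kg of fertilizer: 57 debes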
Tests of Significance for Regression Coefficients
Once the regression equation has been obtained, we need to test whether the equation constants a and b are significant. This means we need to know whether they could have occurred by chance. In order to do this we need to find the variance of both parameters and their standard errors of estimate. The variance for a, also called the variance of b_0, can be estimated by extending Table 7 - 2, setting the estimated values of Y against the actual values of Y to find the deviation of each actual value from the estimated value which lies on the regression line. All the deviations are then squared, and the sum of squared errors of all the deviations is calculated. Once this is done, the variance of each of the two constants and its standard error of estimate can easily be computed. Let us now carry out this exercise to demonstrate the process. In Table 7 - 3 the estimated values of Y are in the column labelled \hat{Y}_i. The deviations of the observed values of Y from the estimated values are in the column labelled e_i, and their squared values are found in the column labelled e_i^2. These, and a few others in this table, are the calculations required to compute the standard errors for estimating the constants b_0 = a and b_1 = b.
The estimated values \hat{Y}_i are found in Table 7 - 3, in the third column from the right. These have been computed using the equation \hat{Y}_i = 27.12 + 1.66 X_i: simply substitute the observed values of X_i into the equation and solve it to find the estimated values \hat{Y}_i. The other deviation figures in this table will be used later in our analysis. The variance of the intercept is found using the equation which follows Table 7 - 3.
TABLE 7 - 3 : COMPUTED VALUES OF Y AND THE ASSOCIATED DEVIATIONS (ERRORS)

n      X     X_i²    Y     x_i²   y_i    y_i²   Y-hat_i   e_i     e_i²
1      6     36      40    144    -17    289    37.08     2.92    8.5264
2      10    100     44    64     -13    169    43.72     0.28    0.0784
3      12    144     46    36     -11    121    47.04     -1.04   1.0816
4      14    196     48    16     -9     81     50.36     -2.36   5.5696
5      16    256     52    4      -5     25     53.68     -1.68   2.8224
6      18    324     58    0      1      1      57.00     1.00    1.0000
7      22    484     60    16     3      9      63.64     -3.64   13.2496
8      24    576     68    36     11     121    66.96     1.04    1.0816
9      26    676     74    64     17     289    70.28     3.72    13.8384
10     32    1024    80    196    23     529    80.24     -0.24   0.0576
TOTAL  180   3816    570   576    0      1634                     47.3056
    s_{b_0}^2 = \frac{\sum e_i^2 \cdot \sum X_i^2}{(n - k) \cdot n \cdot \sum x_i^2}

In this equation, n = the number of observations and k = the degrees of freedom lost because of the two estimated constants (here k = 2), so that n - k is the error degrees of freedom. The other values can be found in Table 7 - 3. Let us now use the equation:
    s_{b_0}^2 = \frac{\sum e_i^2 \cdot \sum X_i^2}{(n - k) \cdot n \cdot \sum x_i^2} = \frac{47.3056 \times 3816}{(10 - 2) \times 10 \times 576} \approx 3.92

    s_{b_1}^2 = \frac{\sum e_i^2}{(n - k) \sum x_i^2} = \frac{47.3056}{(10 - 2) \times 576} \approx 0.01

Having found the variances of these constants, their standard errors are simply the square roots of these figures:

    s_{b_0} = \sqrt{3.92} = 1.98        s_{b_1} = \sqrt{0.01} = 0.10
Let us now test how many standard errors each of the two constants lies away from zero. This means we compute each constant's t-value and compare it with the critical t at the 5% alpha level. If the t-values resulting from these constants exceed the expected critical value of t, then we conclude that each of them is significant. The calculated t-values for these parameters are:

    t_0 = \frac{b_0 - 0}{s_{b_0}} = \frac{27.12 - 0}{1.98} = 13.7

    t_1 = \frac{b_1 - 0}{s_{b_1}} = \frac{1.66}{0.10} = 16.6

Since both exceed t = 2.306 (the critical value with 8 degrees of freedom at the 5% level of significance), we conclude that both the intercept and the slope are significantly different from zero at the 5% level.
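These standard errors and t-ratios can be checked with a short Python sketch (ours, using numpy; the small differences from the figures above come only from rounding):

import numpy as np

X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)
n = len(X)

x = X - X.mean()
b = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
a = Y.mean() - b * X.mean()

e = Y - (a + b * X)                                  # residuals e_i
SSE = (e ** 2).sum()                                 # about 47.31

var_b0 = SSE * (X ** 2).sum() / ((n - 2) * n * (x ** 2).sum())
var_b1 = SSE / ((n - 2) * (x ** 2).sum())

t0 = a / np.sqrt(var_b0)                             # about 13.7
t1 = b / np.sqrt(var_b1)                             # about 16.4 (the 16.6 above uses the rounded 0.10)
print(t0, t1)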
The Coefficient of Determination and the Correlation Coefficient
Using this maize-fertilizer example, some measure of the strength of the relationship can be derived from the data in Table 7 - 3. This measure of the strength of relationship is known as the Coefficient of Determination, from the fact that it determines how closely one observed data series is related to the other data available in the independent variable. We begin by computing the coefficient of non-determination, which is the ratio of the sum of squared errors between the predicted values \hat{Y}_i and the observed values Y, to the sum of squared errors between the actual observed values of Y and the mean \bar{Y}:

    Sum of squared errors between \hat{Y}_i and Y  =  \sum e_i^2  =  47.3056
    Sum of squared errors between Y and \bar{Y}    =  \sum y_i^2  =  1634

    Coefficient of non-determination = \frac{\sum e_i^2}{\sum y_i^2} = \frac{47.31}{1634} = 0.0290
This coefficient of non-determination is the proportion, or the probability, of the variation in Y which is not explained by changes in the independent variable X. The Coefficient of Determination is the complement of this coefficient of non-determination: it is the proportion, or the probability, of the variation between X and Y which is explained by the changes in the independent variable X. Therefore the Coefficient of Determination, R², is calculated using the following technique:

    R^2 = 1 - \frac{\sum e_i^2}{\sum y_i^2} = 1 - \frac{47.31}{1634} = 1 - 0.0290 = 0.9710
This is a very strong relationship. About 97.1% of the changes in the maize yield (in debes per hectare) on Mr. Mbatian's farm is explained by the quantities of fertilizer per hectare applied on his farm. It also means that the regression equation which we have defined as \hat{Y}_i = 27.12 + 1.66 X_i explains about 97.1% of the variation in output. The remaining 3% or thereabouts (approximately 2.9%) is explained by other environmental factors on his farm which have not been captured in the model.
Figure 7 - 2 :
The estimated regression line
In any analysis of this kind, the strength of the relationship between X and Y is measured by the size of the Coefficient of Determination. This coefficient varies between zero (no relationship at all) and 1.0000 (a perfect relationship). The example from Mr. Mbatian's farm is that of a near-perfect relationship, which shows in the observed data clinging very closely to the regression line that we have constructed, as shown in Figure 7 - 2.
The other value which is used very frequently for theoretical work in Statistics is
the Correlation Coefficient. This is sometimes called Pearson’s Product moment of
Correlation, after its discoverer, Prof. Karl Pearson (1857 - 1936). He also invented the
Chi-Square Statistic and many other analytical techniques while he was working at the
Galton Laboratory of the University of London. From the computation above you will
guess that he is obviously the inventor of all these measures which we have just
considered.
The Correlation Coefficient is the square root of the Coefficient of Determination. We shall use both measures extensively in Biostatistics from now on. Let us now compute the Correlation Coefficient:

    r = \sqrt{R^2} = \sqrt{1 - \frac{\sum e_i^2}{\sum y_i^2}} = \sqrt{1 - 0.0290} = \sqrt{0.9710} = 0.9854
The measure is useful in determining the nature of the slope of the regression line. A
negative relationship has a negatively sloping regression line and a negative Correlation
Coefficient. A positive relationship has a positive Correlation Coefficient. In our case the
measure is positive. This means that the more of this kind of fertilizer per hectare which
is applied on Mr. Mbatian’s farm the more maize yield in terms of debes per hectare that
he realizes at the end of each season.
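A short sketch (ours, using numpy) confirms the coefficient of determination and the correlation coefficient for this example:

import numpy as np

X = np.array([6, 10, 12, 14, 16, 18, 22, 24, 26, 32], dtype=float)
Y = np.array([40, 44, 46, 48, 52, 58, 60, 68, 74, 80], dtype=float)

x = X - X.mean()
b = (x * (Y - Y.mean())).sum() / (x ** 2).sum()
a = Y.mean() - b * X.mean()

e = Y - (a + b * X)                                          # residuals
R2 = 1 - (e ** 2).sum() / ((Y - Y.mean()) ** 2).sum()        # coefficient of determination
r = np.sqrt(R2)                                              # correlation coefficient

print(R2, r)                       # about 0.971 and 0.985
print(np.corrcoef(X, Y)[0, 1])     # the same r, obtained directly from numpy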
Computation Example
Having discussed the theory involved in the computation of various coefficients
of regression and correlation we need to try an example to illustrate the techniques of
computing the relevant coefficients quickly and efficiently.
Example
The following data had been obtained for the time required by a drug quality
control department to inspect outgoing drug tablets for various percentages of
those tablets found defective.
Percent defective:            17   9    12   7    8    10   14   18   19   6
Inspection time in minutes:   48   50   43   36   45   49   55   63   55   36
(a)	Find the estimated regression line \hat{Y}_i = a + b X_i.
(b)	Determine the deviation about this line for each of the ten observations, and the sum of the squared deviations.
(c)	Test the null hypothesis that a change in inspection time has no significant effect on the percentage of drug tablets found defective, using analysis of variance.
(d)	Use any other test statistic to test the significance of the correlation coefficient.
Solution
1.	The relevant data and preliminary computations are arranged in Table 7 - 4.

2.	The following simple formulae assist in the solution of this kind of problem. We already know that the deviations of X and Y from their means are defined in the following manner:

        (X_i - \bar{X}) = x        (Y_i - \bar{Y}) = y

3.	The shortcut computations will make use of these deviation formulas to compute various figures which ultimately lead to the definition of the regression equation \hat{Y}_i = a + b X_i.
(a)	To find the sum of squared deviations of all the observations of X from the mean value \bar{X}, and of Y from \bar{Y}, we use the following shortcut expressions:

        \sum x^2 = \sum X_i^2 - \frac{(\sum X)^2}{n}        \sum y^2 = \sum Y_i^2 - \frac{(\sum Y)^2}{n}

(b)	The sum of the cross-products of the deviations of X and Y is found using the expression:

        \sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n}
TABLE 7 - 4 : PERCENT OF TABLETS FOUND DEFECTIVE FOR INSPECTION TIME IN MINUTES

Observation   Time in minutes (X)   Percent found defective (Y)   X²       Y²      XY
1             48                    17                            2304     289     816
2             50                    9                             2500     81      450
3             43                    12                            1849     144     516
4             36                    7                             1296     49      252
5             45                    8                             2025     64      360
6             49                    10                            2401     100     490
7             55                    14                            3025     196     770
8             63                    18                            3969     324     1134
9             55                    19                            3025     361     1045
10            36                    6                             1296     36      216
Total         480                   120                           23690    1644    6049
Means         X-bar = 48            Y-bar = 12
(c)	If we remember these three equations, we have at our disposal a very powerful tool for the fast computation of the regression coefficients. To find the slope coefficient b, we use the results of the expressions in (a) and (b) above:

        b = \frac{\sum xy}{\sum x^2}

	Then the a coefficient can be found easily through the equation:

        a = \bar{Y} - b\bar{X}
Then the correlation coefficient is found using the following expression:

        r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}

4.	The figures to be used in these computations are found in Table 7 - 4. Whenever you are faced with a problem of this nature, it is prudent to tabulate your data as in Table 7 - 4, and then to follow this with the computations using these simple formulas. We now demonstrate the immense power which is available in memorizing the simple formulas we have demonstrated in (a), (b) and (c) above:
    \sum x^2 = \sum X_i^2 - \frac{(\sum X)^2}{n} = 23690 - \frac{480^2}{10} = 650

    \sum y^2 = \sum Y_i^2 - \frac{(\sum Y)^2}{n} = 1644 - \frac{120^2}{10} = 204

    \sum xy = \sum XY - \frac{(\sum X)(\sum Y)}{n} = 6049 - \frac{480 \times 120}{10} = 289

    b = \frac{\sum xy}{\sum x^2} = \frac{289}{650} = 0.445

    a = \bar{Y} - b\bar{X} = 12 - 0.445(48) = -9.36

Accordingly, the regression equation is:

    \hat{Y}_i = a + b X_i = -9.36 + 0.445 X
5.	The values calculated from the regression equation are relevant in the further tabulation of figures which will assist us to derive further tests. They are recorded in the third column from the left of Table 7 - 5.
TABLE 7 - 5 : COMPUTATION OF DEVIATIONS AND THE SUM OF SQUARED DEVIATIONS

Time in minutes (X)   Percent found defective (Y)   Y-hat_i   D = Y - Y-hat_i   D²
48                    17                            12.00     5.00              25.0000
50                    9                             12.89     -3.89             15.1321
43                    12                            9.78      2.22              4.9284
36                    7                             6.66      0.34              0.1156
45                    8                             10.67     -2.67             7.1289
49                    10                            12.45     -2.45             6.0025
55                    14                            15.12     -1.12             1.2544
63                    18                            18.68     -0.68             0.4624
55                    19                            15.12     3.88              15.0544
36                    6                             6.66      -0.66             0.4356
Sum of squared deviations:  ΣD² = 75.5143
Using the data in Table 7 - 5, you can see how fast we have been able to compute the important measures which take a lot of time to compute under ordinary circumstances. Obviously, it is faster by computer, since there are proprietary packages which are designed for this kind of work. However, for learning and examination purposes, this method has a lot of appeal. One is able to move quickly, and to learn quickly at the same time.
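For learners who wish to check these figures by machine, here is a minimal sketch of the same shortcut computations written in Python (the variable names are illustrative and not part of any proprietary package):

```python
# Shortcut computation of b, a and r for the tablet-inspection data (Table 7 - 4).
X = [48, 50, 43, 36, 45, 49, 55, 63, 55, 36]   # inspection time in minutes
Y = [17, 9, 12, 7, 8, 10, 14, 18, 19, 6]       # percent found defective
n = len(X)

sum_x2 = sum(x**2 for x in X) - sum(X)**2 / n              # sum of squared x deviations, 650
sum_y2 = sum(y**2 for y in Y) - sum(Y)**2 / n              # 204
sum_xy = sum(x*y for x, y in zip(X, Y)) - sum(X)*sum(Y)/n  # 289

b = sum_xy / sum_x2                     # slope, about 0.445
a = sum(Y)/n - b * sum(X)/n             # intercept, about -9.34 (the text, rounding b to 0.445, gets -9.36)
r = sum_xy / (sum_x2 * sum_y2) ** 0.5   # correlation coefficient, about 0.79

print(b, a, r)
```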
6.
It is now very easy to compute the standard error of estimate, which helps us to build a probability model along the regression equation so that we can see how well our regression line fits as an estimator of the actual field situation. Given all the data in Table 7 - 5, we can use the expression below to find the standard error of estimate of the regression equation :-
Standard error of estimate = √( ΣD² / (n - 2) ) = √( 75.5143 / (10 - 2) ) = √9.4393 ≈ 3.07 .
If we state the hypothesis that there is no significant difference between the observed values and the calculated values of Y for each value of X, we can build a two-tail t-distribution model centered on the regression line. This is done by choosing the confidence level C = 0.95, and hence a 0.05 alpha level. Since we need the upper and lower tails on both sides of the regression equation, we divide the alpha level by two to obtain 0.025 on either side.
For the 0.05 alpha level, we obtain the appropriate value from the usual t-tables on page 498 of your textbook (King'oriah, 2004). Remember this table in your textbook has two alternatives, the two-tail and the single-tail alternative. Using the columns indicated by the second row of the table (the two-tail model), we find that our t-probability model on both sides of the regression equation is built by the critical value

t( 0.05, 10 - 2 ) = t( 0.05, 8 ) = 2.306 .
We now have a probability model which states that for any observed value of Y to
belong to the population of all those values estimated by the regression line, it should not
lie more than 2.306 standard errors of estimate on either side of the regression line.
The t-value from the table can be useful if it is possible to compute the t-position
for every observation. This is done by asking ourselves how many standard errors of
estimate each observation lies away from the regression line. We therefore need a
formula for computing the actual number of standard errors for each observation. The
observed values and the calculated values are available on Table 7 - 5.
The expression for computing the individual t-value for each observation is

ti = Di / ( standard error of estimate ) = ( Yi - Ŷi ) / ( standard error of estimate ) .

Using this expression all the observations can be located and their distance on either side of the regression line can be calculated. This is done in Table 7 - 7.
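The same per-observation t-values can be produced in a few lines of code. A minimal sketch, assuming the regression equation Y = -9.36 + 0.445X obtained above (all names are illustrative):

```python
# Standard error of estimate and t-position of each observation (Table 7 - 7).
X = [48, 50, 43, 36, 45, 49, 55, 63, 55, 36]
Y = [17, 9, 12, 7, 8, 10, 14, 18, 19, 6]
a, b, n = -9.36, 0.445, len(X)

Y_hat = [a + b * x for x in X]              # calculated values of Y
D = [y - yh for y, yh in zip(Y, Y_hat)]     # deviations D = Y - Y_hat
sum_D2 = sum(d**2 for d in D)               # about 75.51
s_yx = (sum_D2 / (n - 2)) ** 0.5            # standard error of estimate, about 3.07

t_values = [d / s_yx for d in D]            # none should exceed 2.306 in absolute value
print(s_yx, t_values)
```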
TABLE 7 - 7 : COMPUTATION OF DEVIATIONS, THE SUM OF SQUARED DEVIATIONS AND
T-VALUES FOR EACH OBSERVATION

Time in minutes (X)   Percent Found Defective (Y)     Ŷi       D = Y - Ŷi       D²          ti
        48                       17                 12.00        5.00        25.0000      1.6270
        50                        9                 12.89      - 3.89        15.1321    - 1.2530
        43                       12                  9.78        2.22         4.9284      0.7220
        36                        7                  6.66        0.34         0.1156      0.1106
        45                        8                 10.67      - 2.67         7.1289    - 0.8689
        49                       10                 12.45      - 2.45         6.0025    - 0.7973
        55                       14                 15.12      - 1.12         1.2544    - 0.3645
        63                       18                 18.68      - 0.68         0.4624    - 0.2213
        55                       19                 15.12        3.88        15.0544      1.2626
        36                        6                  6.66      - 0.66         0.4356    - 0.2148

Sum of squared deviations   ΣD² = 75.5143
From the t-values in the right-most column of Table 7 - 7, we find that there is not a single observation which lies more than the critical value t( 0.05, 10 - 2 ) = t( 0.05, 8 ) = 2.306 standard errors away from the regression line.
This tells us that there is a very good relationship between X and Y. The relationship outlined by the regression equation did not come about by chance, and the regression equation is a very good predictor of what is actually happening in the field. This also means that the parameters of the regression line which we have computed represent the actual situation regarding the changes in Y which are caused by the changes in X. We can confidently say, at the 95% confidence level, that the actual percentage of defective tablets found on the production line depends on the inspection time in minutes. We may want to instruct our quality control staff to be more vigilant with the inspection so that our drug product may have as few defective tablets as is humanly and technically possible.
Analysis of Variance for Regression and Correlation
Statisticians are not content with only finding the initial values of statistical computations. They are always keen to make doubly sure that what they report is not due to mere chance. Another tool which they employ for the purpose of data verification is what we learned in Chapter Six. This is what we shall call the F-test in our discussion.
In regression analysis we are also interested in changes within the dependent
variable which are caused by each change in the independent variable. Actually the string
of values of the independent variable is analogous to Treatment which we learned in
Analysis of Variance. Each position or observation of the independent variable is a
treatment, and we are interested to know the impact of each one of them on the
magnitude of the value of the dependent variable each time.
Analysis of variance for the Regression/Correlation operates at the highest level
of measurement (the ratio level) while the other statistic which we considered in Chapter
Six operates at all the other lower levels of measurement.
Use of analysis of variance in the regression correlation analysis tests the null
hypothesis at whatever confidence level that there is no linear relationship between the
independent variable and what we fancy to be the dependent variable. The null
hypothesis is that the variation in the dependent variable happened by chance, and is not
due to the effects of the independent variable. The alternative hypothesis is that what has been discovered in the initial stages of regression/correlation analysis has not happened by chance; the relationship is statistically significant. Therefore, to use the F-test we assume :-
1.
A normal distribution for the values of Y for each changing value of X. Any observed value of Y is just one of the many which could have been observed. This means that the values of Y are stochastic about the regression line.
2.
All the values of the independent variable X are stochastic as well, and therefore
the distribution is bi-variate normal.
( a ) The null hypothesis is that there is no relationship between X and Y.
( b ) Also there is no change in Y resulting from any change in X.
3.
In symbolic terms, the null and the alternative hypotheses of the
regression/correlation analysis could be stated in the following manner to reflect
all the assumptions we have made :-
(i)
H0 : μY1 = μY2 = ..... = μYn   ( No change is recorded in variable Y as a result of the changing levels of the variable X. )
( ii )
HA : μY1 ≠ μY2 ≠ ..... ≠ μYn   ( There is some statistically significant change recorded in variable Y as a result of the changing levels of the variable X. )
4.
( a )
The total error in F-Tests comprises the explained variation and the
unexplained variation. This comprises the sum of squared differences between
every observed value of the dependent variable and the mean of the whole string
of observations of the dependent variable.
SS = TOTAL ERROR = EXPLAINED ERROR + UNEXPLAINED ERROR
(b)
The error caused by each observation which we regard as an individual
treatment is what is regarded as the explained error.
SST = EXPLAINED ERROR = VARIATION IN Y CAUSED BY
EACH VALUE OF X.
(c)
The residual error is the unexplained variation due to random
circumstances.
SSE = TOTAL ERROR - EXPLAINED ERROR
SSE =
SS
-
SST
In our example SS is the sum of squared differences recorded in Table 7 - 7 as ΣD² = 75.5143. The proportion of this which is explained error (a probability out of 1.0) can be computed by multiplying this raw figure by the coefficient of determination :-

SST = r² ΣD²
The value of r² is easily obtainable from the values we computed earlier (page 164). Mathematically, it is expressed as

r² = ( Σxy )² / ( Σx² · Σy² ) .

Using our data, Σxy = 289, and

r² = ( 289 )² / ( 650 × 204 ) = 0.630 .
This is the coefficient of determination, which indicates what probability or proportion of the total variation is explained by the X-variable, or the treatment.
Therefore, the explained error is

SST = r² ΣD² = 0.630 × 75.5143 = 47.574009 .
The error due to chance is

SSE = SS - SST = 75.5143 - 47.574009 = 27.940291 .
5.
Degrees of Freedom
Total degrees of freedom: due to the total variation of Y caused by the whole
environment.
SS d f. = n - 1 = 10 - 1 = 9 d f.
SS d f. = 9 d f.
Unexplained degrees of freedom are lost due to the investigation of parameters
in two dimensions. Here we lose two degrees of freedom :-
SSE d f. = n - 2
= 10 - 2 = 8 d f.
SSE d f. = 8 d f.
Treatment degrees of freedom is the difference between the total degrees of
freedom and the unexplained degrees of freedom.
SST d f. = SS d f. - SSE d f.
= (10 - 1) - (10 - 2) = 9 - 8 = 1.0
SST d f. = 1.0 d f.
6.
This means that we have all the important ingredients for computing the calculated F-statistic and for obtaining the critical values from the F-tables, as we have done before. Observe the following data and pay meticulous attention to the accompanying discussion, because at the end of all this we shall come to an important summary. The summary of all that we have obtained so far can be recorded in the table which is suitable for all types of analysis of variance - called the ANOVA table.
TABLE 7 - 8 : ANOVA TABLE FOR A BI-VARIATE REGRESSION ANALYSIS

SOURCE OF VARIATION      SUM OF SQUARES         DEGREES OF FREEDOM        VARIANCES (MEAN SQUARES)           CALCULATED Fc

Total                    SS = ΣD² = 75.5143     N - 1 = 10 - 1 = 9 d.f.

Explained by Treatment   SST = 47.574009        SST d.f. = 1 d.f.         MST = 47.574009 / 1 = 47.57        Fc = MST / MSE = 47.57 / 3.493 = 13.62

Unexplained              SSE = 27.940291        SSE d.f. = 10 - 2         MSE = 27.940291 / 8 = 3.493
                                                = 8 d.f.
Study the summary table carefully. You will find that the computations included in its matrix are the systematic steps in computing the F-statistic.
The critical value of F is obtained from page 490 at the 5% significance level and [ 1, n - 2 ] degrees of freedom :-

F( 0.05, 1, n - 2 ) = F( 0.05, 1, 8 ) = 5.3172 .
7.
Compare this F( 0.05, 1, 8 ) = 5.3172 to the calculated F-value, Fc = 13.62. You will find that we are justified in rejecting the null hypothesis that the changes in the values of X do not have any effect on the changes in the values of Y at the 5% significance level. In our example this means that the more watchful the quality control staff are, the more defective tablets they can detect, at the 5% significance level.
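The whole ANOVA table can also be assembled programmatically. A minimal Python sketch, assuming the figures computed above (ΣD² = 75.5143 and r² = 0.630); the critical value is looked up with SciPy instead of the printed tables:

```python
from scipy import stats

sum_D2, r2, n, k = 75.5143, 0.630, 10, 1   # total sum of squares, coefficient of determination

SST = r2 * sum_D2                # explained by treatment, about 47.57
SSE = sum_D2 - SST               # unexplained (error), about 27.94
MST = SST / k                    # mean square for treatment (1 d.f.)
MSE = SSE / (n - 2)              # mean square for error (8 d.f.)
F_calc = MST / MSE               # about 13.6

F_crit = stats.f.ppf(0.95, k, n - 2)    # about 5.32
print(F_calc, F_crit, F_calc > F_crit)  # True: reject the null hypothesis
```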
Activity
Do all the necessary peripheral reading on this subject and attempt as many
examples in your Textbook as possible. Try to offer interpretations of the
calculation results like we have in this chapter. It is only with constant exercise that one can master these techniques properly, be able to apply them with confidence, and be able to interpret most data which comes from proprietary computer packages.
EXERCISES
1.
For a long time scholars have postulated that the predominance of agricultural
labor force in any country is an indication of the dependence of that country on
primary modes of production, and as a consequence, the per-capita income of
each country that has a preponderance of agricultural labor force should be low. A
low level of agricultural labor force in any country would then indicate high per-capita income levels for that country.
Using the data given below and the linear regression/correlation model determine
what percentage of per-capita income is determined by agricultural labor force.
Agricultural labor force (Millions):    9   10    8    7   10    4    5    5    6    8    7    4    9    5    8
Per capita income (US $ 00):            6    8    8    7    7   12    9    8    9   10   10   11    9   10   11

2.
An urban sociologist practicing within Nairobi has done a middle-income survey, comprising a sample of 15 households, with a view to determining whether the level of education of any middle-income head of household within this City determines the annual income of their families. The following is the result of his findings :-

Education level:                    7   12    8   12   14    9   18   14    8   12   17   10   16   10   13
Annual income ( K.Shs 100,000 ):   18   32   28   24   22   32   36   26   26   28   28   32   30   20   18
(i)
Compute your bi-variate correlation coefficient and the coefficient of
determination.
( ii )
Use the shortcut formulae for regression/correlation analysis to compute the equation of the regression line of income ( Y ) on education level ( X ).
( iii ) By what means can the sociologist confirm that there is indeed a relationship?
(Here you should describe any one of the methods of testing the significance of
your statistic.)
3.
A survey of 12 couples is done on the number of children they have ( Y ) as compared to the number of children they had previously stated they would have liked to have ( X ).
(a)
Find the regression equation on this phenomenon, computing all the appropriate
regression coefficients.
(b)
What is the correlation coefficient, the coefficient of determination, the coefficient of non-determination and that of alienation with regard to this experiment? What is your interpretation of all these?
Couple:   1   2   3   4   5   6   7   8   9   10   11   12
Y:        4   3   0   2   4   3   0   4   3    1    3    1
X:        3   3   0   2   2   3   0   3   2    1    3    2
CHAPTER EIGHT
PARTIAL REGRESSION, MULTIPLE LINEAR REGRESSION
AND CORRELATION
Introduction
There are situations where one variable alone is not enough to explain all the variation in the dependent variable. In this case we are interested in telling how much of the total variation in the dependent variable can be explained by each of the variables which we suspect have some effect on the variability within the dependent variable. We now introduce a situation where more than one variable is influencing the dependent variable. In that case we shall make one major assumption: that all the independent variables are truly independent and none affects another independent variable. This means that there is no multi-collinearity among the independent variables.
In this case also the dependent variable is assumed to be stochastic and to be
determined by all the independent variables in the model. In that case, a simple regression
model which represents the interaction of only two variables, Yi = a + bXi, will not be adequate. In the general case, the model which is applicable in the multi-variate situation is :-

Yi = a + b1X1 + b2X2 + ..... + bkXk ,

where the Xk are the independent variables, and the bk are the changes in the dependent variable Y with respect to each of the Xk independent variables.
Partial Correlation
To be able to understand the concept of multiple regression and correlation we
need to understand the concept of partial correlation. This comes about when the analyst
examines how much one independent variable is affecting the dependent variable while
all the effects of other variables are held constant. This control is achieved by adjusting
the values of the dependent variable to account for the disturbing effects of all the other
independent variables. For convenience all variables are identified by labeling them with
Arabic number subscripts, with the dependent variable being the first variable, and all the
others following from X2 , X3 , .... , X k . This means that the relationship between X1
and any other variable, say variable number 3, is designated as r13 , etc.
Partial correlations are designated in orders. These depend on how many variables
are controlled in the investigation. The zero order means that no variables are controlled.
The first order means the effects of one variable are controlled while we investigate the
effects of another two variables. The first order partial correlation therefore involves
three variables. The order of control goes up in that manner.
The designation of the correlation coefficient reflects this type of control also.
The general equation for the first order partial correlation coefficient is :-

rij.k = ( rij - rik rkj ) / ( √(1 - rik²) · √(1 - rkj²) )
Take the example of variables 1 and 2 holding the effects of variable 3. This equation in
real terms becomes :-
r123 
r12  r13 r23 
1  r13
2
1  r23
2
In that case we can interpret the partial correlation coefficient as :-

rij.k = ( the explained variation between i and j, having taken away what i and j each explain through k ) ÷ ( the total variation between i and j, having accounted for the effects of i on k and of j on k ).
Computing Partial Correlation Coefficients
To accomplish our interesting exercise we begin by computing the zero-order correlation coefficients between the dependent variable and all the independent variables
involved in the relationship investigation. Once we have done this, the rest of the exercise
is easy because it is merely a question of filling in the correlation coefficients in the
equation :
ri jk 
  
ri j  ri k rk j
1  ri k
2
1  rk j
2
.
Let us use the climatic data example given in your textbook, pages 367 - 373 (King'oriah, 2004).
Example
After observing the weather for a long time in the Mid-Western United States we
develop a strong feeling that a cold January and a cold February are always
followed by a cold March. We feel that this is true because of the following
reasons :-
1.
If a strong high pressure cell develops in Western Canada, this is when it is very
cold in Indiana and Illinois throughout January. This high pressure cell remains
strong throughout the winter and spring. Conversely, warm Januarys are
experienced when this high pressure cell is weak over Western Canada.
2.
If the ground becomes chilled and frozen in January, the cold ground lowers the
temperatures of all the subsequent air masses adjacent to it (immediately above
it). This causes intense cold to subsist throughout February and March.
Test the hypothesis that a cold January ( X2 ) and a cold February ( X3 ) are always followed by a cold March ( X1 ), using the climatic data gathered between 1950 and 1964 in Table 8 - 1. ( Source : Prof. J.C. Hook, Indiana State University, 1979. )
Computational techniques
This kind of model can be tackled using a few simple steps which aim to
accomplish the instructions of the main equation. The rest of the task involves
interpreting the results of the computation. Finally there are significance tests to be done
TABLE 8 - 1 : MEAN TEMPERATURES IN DEGREES FAHRENHEIT * OF THE TOWN OF WINAMAC, INDIANA

Year      January (X2)    February (X3)    March (X1)
1950          33.5             26.8            34.1
1951          26.6             28.9            35.6
1952          28.5             31.9            35.4
1953          30.1             32.7            38.4
1954          27.2             37.1            34.1
1955          24.1             28.3            37.4
1956          26.5             29.0            37.6
1957          18.9             33.6            37.1
1958          26.4             20.7            35.3
*** The observer died and a new one came in 1962
1962          19.4             25.2            35.1
1963          13.3             18.1            41.1
1964          29.0             27.4            36.9

TOTALS  ΣXi        303.5             340.7            438.1
        ΣXi²      8019.39           9980.91         16038.55
        Σxi²    343.3691667       307.8691667       44.2491667

Standard deviations:   σ2 = 5.349214636    σ3 = 5.065151912    σ1 = 1.9202684

Σ x1 x2 = - 70.18916666        Σ x3 x1 = - 44.93916666

Correlation coefficients :
r12 = - 0.5694254725 ≈ - 0.569     r13 = - 0.3850253892 ≈ - 0.385     r23 = 0.3932802 ≈ 0.393

* Changing the data to degrees Celsius would have the same effect, because this would involve a simple arithmetic transformation of the data.
( Source : Prof. J.C. Hook, Indiana State University, 1979. )
in order to tell whether or not the relationship came by chance at specified significance
levels.
We must caution that these days there are computer packages which aid in the
computation of statistics of this kind, and which give exact results, saving considerable
amounts of labor. However, this does not mean the end of learning, because what matters
is not the accomplishment of the computation of the statistic. What matters is the
interpretation of the results. One needs to understand very clearly how and why any
statistical algorithm works; in order to guarantee correct interpretation of the data which
may come out of the “mouth” of the computer. It is the view of this author that no matter
how much we advance technologically, we shall not stop investigating why our
environment around us works - particularly in areas which affect so many of us - the area
of research and interpretation of research data. Therefore, a small amount of patience in
learning such a statistic as this one pays a lot of dividends.
We must mention that there is literature which aims at teaching the computer
application techniques, and which assist in the solution of problems like this one. To
understand these techniques, one also needs to understand how each of these statistics
works, whether or not the computer is available to assist in menial tasks.
Step One
We begin by drawing and completing a table like Table 8 - 1. We must do all the
“housekeeping” tasks of completing all the cells of the table. These tasks involve
nothing new. We need to compute all the Zero-order correlation coefficients and
standard deviations in the exact manner as we learned in the first chapters of this
module. Learners should make sure that they compute these measures step by
step. Teachers must make sure that the learners understand how to solve the zeroorder measures by setting appropriate exercises to that effect.
The zero-order correlation coefficients show relatively weak relationships between the independent variables and the dependent variable ( X1 ), the March temperatures :-

r12 = - 0.5694254725 ≈ - 0.569 ,     r23 = 0.3932802 ≈ 0.393 ,     r13 = r31 = - 0.3850253892 ≈ - 0.385
The question being asked in each of the three cases is whether any month which is the
independent variable has any effect on the March temperature, the dependent variable.
The strength of these relationships is tested by means of the size of the coefficients of
determination computed out of the three correlation coefficients. Square these
coefficients each time to obtain their corresponding coefficients of determination. (Verify
all these measures yourself.)
Step Two
The second step involves the computation of the partial correlation coefficients using the zero-order measures found in Step One. These are tricky areas, but not complicated as long as you follow the stipulations of the formula meticulously, inserting the relevant values in their correct positions. Compute the correlation between January and March controlling for the February effects. This is designated as " r12.3 " :-

r12.3 = ( r12 - r13 r23 ) / ( √(1 - r13²) · √(1 - r23²) )
      = ( - 0.569 - ( - 0.385 )( 0.393 ) ) / ( √0.851775 × √0.845551 )
      = - 0.417695 / 0.848664 = - 0.4921833 ≈ - 0.492
Comparing the result with the zero-order correlation coefficient we find a less
negative relationship. This could imply that we are getting somewhere, but does
not help us very much. Now, compute the relationship between March and
February while controlling for the effects of January, which we have already
investigated. This will give us the individual impact of the February temperatures,
so that we can determine the direction of this effect, and also its magnitude.
r132 
r13  r12 r23 
1  r12
2
1  r23
2
 0 .161383
0 . 676239 0 .845551

 0 . 385   0 .5690 . 393

1   0 .569

 0 . 2134213
1  0 . 393
2

2
 0 . 213
Again, we have a less negative relationship between the variation of February temperatures and March temperatures. Any further tests of significance are not really necessary at this point. We shall perform these finally when we complete the computation technique.
The model is not yet complete. However, we needed to compute these measures at the first level as a way of obtaining the final equation and the final model where all the appropriate variables are controlled. We now proceed to examine the multiple simultaneous interactions of all the variables which can be used to explain the changes in the dependent variable - the March temperatures.
This second step shows that we have not yet concluded or resolved our problem. The following step will examine whether the multiple simultaneous interaction of all the variables will explain the changes in the temperatures of the dependent variable - the month of March.
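The two first-order partial correlations in Step Two can be checked with a few lines of code. A minimal Python sketch, assuming the zero-order coefficients from Table 8 - 1 (the function name is illustrative):

```python
# First-order partial correlation r_ij.k from the zero-order correlations.
def partial_r(r_ij, r_ik, r_kj):
    return (r_ij - r_ik * r_kj) / ((1 - r_ik**2) ** 0.5 * (1 - r_kj**2) ** 0.5)

r12, r13, r23 = -0.569, -0.385, 0.393   # January-March, February-March, January-February

r12_3 = partial_r(r12, r13, r23)   # about -0.492
r13_2 = partial_r(r13, r12, r23)   # about -0.213
print(r12_3, r13_2)
```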
Step Three
Examine the nature and the sizes of the standard deviations of the individual variables, especially the dependent variable. Use the standard deviation of all the observed values of the dependent variable X1 and adjust this standard deviation by multiplying it with the coefficient of alienation between the dependent variable and the independent variable. This adjustment is done to see how much of the variation in the dependent variable remains unexplained by the correlation coefficients which we have computed. The formula for doing this kind of adjustment is :-

SY.X = σY √( 1 - r²YX )
Here we must remember that the coefficient of alienation √( 1 - r² ) tells us what part of the variation of the dependent variable X1 cannot be explained by the variations in any of the independent variables. Beginning with the dependent variable X1 itself, its standard deviation indicates all the variations in X1 arising from the entire environment in which it operates - including both independent variables. Its standard deviation therefore represents the total variation in X1.
If we use the expression which we have just learned, " SY.X = σY √( 1 - r²YX ) ", and multiply the standard deviation by the coefficient of alienation - that variation which is not explained by the effects of each of the independent variables - we are able to isolate the standard error of estimating the dependent variable X1 given the effects of each of these variables.
Individual standard deviations of all the variables involved are the zero-order standard deviations. The standard errors of estimate between any two variables are the first-order standard errors, and so on. To be able to find the total effect of all the independent variables on the predictability of the dependent variable, a method is required where the total effect of all the independent variables is calculated. In a three-variable model like this one, the first-order standard error of estimating X1 given the effects of X2 is computed using the formula :-

S1.2 = σ1 √( 1 - r12² )

We may now use this to compute the standard error of estimating variable X1 given the effects of X2. Since σ1 = 1.920 and r12 = - 0.569, therefore r12² = 0.323761, and

S1.2 = σ1 √( 1 - r12² ) = 1.920 √( 1 - 0.323761 ) = 1.578888 ≈ 1.579 .
The effect of the action of the second variable X3 can be included by multiplying what we have just computed, " S1.2 = 1.579 ", by the partial coefficient of alienation between this second variable X3 and the dependent variable, while holding the effects of the first independent variable constant; because, after all, we have just taken those effects into account, and to include them again would amount to double-counting. The result of this process is the multiple standard error of estimate, which is found in the following manner :-

S1.23 = σ1 √( 1 - r12² ) √( 1 - r²13.2 )

We must note the position of the dots in this formula. The formula represents the standard error of variable X1 given the effects of X2 and X3.
It means that after accounting for the variables X2 and X3, there is still some net variation which remains unexplained. Variable X2 is accounted for by means of the expression S1.2 = σ1 √( 1 - r12² ). What remains unexplained by variable X2 within the variable X1 is further explained by variable X3, after taking into account that variable X2 has done its " job " beforehand. This is the meaning of the dot expression in the √( 1 - r²13.2 ) part of the expression

S1.23 = σ1 √( 1 - r12² ) √( 1 - r²13.2 ) .
This standard error of estimate is the key to the computation of the coefficient of multiple determination. It comprises the crude standard deviation of X1 (the dependent variable), which is denoted as " σ1 = 1.920 ". Then this value is corrected twice. First we examine the effects of X2, which are represented by S1.2 = σ1 √( 1 - r12² ), and then finally we examine the effect of X3 given that the variable X2 has been allowed to operate. This latter effect is represented by

S1.23 = σ1 √( 1 - r12² ) √( 1 - r²13.2 ) .
S12  23 using the figures which we have calculated
We now set out to compute
above. These are summarized as follows :-
 1  1920
.
,
r
2
12
 1  1920
.
,
  0 . 323761 .
r
2
132
r
12
  0.569 .
  0 . 2134213
 0 . 0455486
2
S1 23  1. 920 1  0 . 323761 1  0 . 0455486
 1. 920 0. 676239
0. 9544213
 1. 920  0.8223375  0 . 9769602  1.5425108
 1 . 543
Compare this result with the following results :-

σ1 = 1.920  .......... zero-order standard error (the crude standard deviation of X1)
S1.2 = 1.579 .......... first-order standard error of estimating X1 given the effects of X2
S1.23 = 1.543 .......... second-order standard error of estimating X1 given the joint simultaneous effects of X2 and X3.
We can see that each time we include the partial effects of an additional variable, the standard error becomes smaller and smaller. This means that, given the joint simultaneous action of X2 and X3, we reduce the band around the least squares line within which we may expect to find the regressing values of X1. The estimation of X1 becomes more and more accurate each time.
Step Four
The Coefficient of Multiple Determination
This coefficient is computed using the variation of X1 which cannot be explained by X2 and X3 acting together on X1. This variation is deducted from the total variation of X1, and the formula for all this is :-

R²1.23 = 1 - S²1.23 / σ²1
Intuitively, this expression reads: the coefficient of multiple determination is the difference between all the variation which can possibly be explained (1.0) and the ratio of the net variation unexplained by X2 and X3 to the total variation within X1. In the same expression :-

R²1.23 = the coefficient of multiple determination of the variation in X1 given the effects of X2 and X3 ;
S²1.23 = the net variation in X1 which cannot be explained by the variation in X2 and X3 ; and
σ²1 = the total variance in X1, both explained and unexplained. As usual, it is the square of the crude standard deviation of X1, or the variance of X1.
Step Five: Coefficient of Multiple non-Determination and Determination
Consequently :-

S²1.23 / σ²1 = ( variance within X1 unexplained by X2 and X3 ) / ( total variation within X1 )

This expression is equal to the coefficient of multiple non-determination. It defines the fraction of the variation in X1 which cannot be explained by the joint simultaneous action of X2 and X3. The difference between the total variation, or the total determination which is possible (1.0000), and this coefficient is the coefficient of multiple determination. This is given by the formula

R²1.23 = 1 - S²1.23 / σ²1
Using the data at our disposal we have already computed S1.23 = 1.543, and therefore the squared standard error of the variable X1 given the effects of both X2 and X3 can be evaluated as :-

S²1.23 = ( 1.5425108 )² = 2.3793397

We also know from the computations in Table 8 - 1 that :-

σ²1 = ( 1.9202684 )² = 3.6874303

Consequently :-

R²1.23 = 1 - S²1.23 / σ²1 = 1 - 2.3793397 / 3.6874303 = 0.3547431 ≈ 0.355
This means that the variation in March temperatures ( X1 ) which can be jointly explained by January temperatures ( X2 ) and February temperatures ( X3 ) is about 35%.
This is quite a high ratio using only two variables to explain the variation of a third variable. Other variables in the environment which affect March temperatures ( X1 ) need to be sought and included in the model to see whether we can account for a higher fraction of the variation in X1. For the time being we may be satisfied that this is all we can manage. However, before we leave this trend of argument we need to note that :-

r²12 = 0.324 explains about 32.4% of the variation in X1 ;
r²31 = 0.148 explains about 14.8% of the variation in X1 ;
R²1.23 = 0.355 explains about 35.5% of the variation in X1.

This indicates that a multiple linear regression model has a lot to offer in explaining the variations in March temperatures. It is expected that the introduction of other variables - such as the nature of the ground, the elevation and slope, the nature of the land use, and the strength of the prevailing winds with the accompanying wind-chill factors - could account for even more of the variation in March temperatures in Winamac, Indiana.
The multiple correlation coefficient is obviously the square root of the coefficient of multiple determination :-

R²1.23 = 0.3547431 ;     R1.23 = √0.3547431 = 0.5956031 ≈ 0.596
There is no sign attached to this coefficient, because it comprises a multiple relationship
within some variables which may have positive and others which may have negative
correlation. These effects cancel in a multi-variate situation.
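Steps Three and Four can be reproduced numerically in a few lines. A minimal Python sketch, assuming the figures from Table 8 - 1 and Step Two above (σ1 = 1.920, r12 = -0.569, r13.2 = -0.2134):

```python
sigma1 = 1.920          # crude standard deviation of X1 (March temperatures)
r12 = -0.569            # zero-order correlation, January with March
r13_2 = -0.2134213      # partial correlation, February with March controlling for January

S1_2  = sigma1 * (1 - r12**2) ** 0.5      # first-order standard error, about 1.579
S1_23 = S1_2 * (1 - r13_2**2) ** 0.5      # multiple standard error of estimate, about 1.543

R2_1_23 = 1 - S1_23**2 / sigma1**2        # coefficient of multiple determination, about 0.355
R_1_23  = R2_1_23 ** 0.5                  # multiple correlation coefficient, about 0.596
print(S1_2, S1_23, R2_1_23, R_1_23)
```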
Step Five: Looking for the Regression Coefficients
The process of computing the partial slope coefficients ( Bs ) begins with the computation of the simple slopes between X1 and each of the other variables. Note that we are using capital letters instead of lower-case letters ( bs ). This is to differentiate the coefficients resulting from the joint activity of the combined independent variables from those of only one independent variable.
Generally, the formula for the multiple B given the effects of the two variables on the dependent variable is determined in a similar manner to that of the multiple coefficient of determination and correlation which we have computed above. The formula goes as hereunder :-

Bij.k = ( bij - bik bkj ) / ( bjk bkj )

Specifically :-

B12.3 = ( b12 - b13 b32 ) / ( b23 b32 )
This is one of the few cases where we shall not verify the formula, because of the advanced mathematics involved. However, we can see that, to obtain the pure effect on any B while controlling for the effects of all the other variables, a similar operation is performed on the zero-order bs of the other variables as was performed on the zero-order r when computing partial correlation coefficients.
We begin by computing b12, as follows :-

b12 = Σ x1 x2 / Σ x2²

Readers are expected to look for the relevant values to fill into this equation from Table 8 - 1. In so doing, solve for Σ x1 x2 and Σ x2² in the following discussion as an exercise. Once this is done you will find that the following series of computations is possible :-

Σ x1 x2 = - 70.18917 ,   Σ x2² = 343.36917 ;   this gives us   b12 = - 70.18917 / 343.36917 = - 0.2044131

Similarly :-

Σ x1 x3 = - 44.93917 ,   Σ x3² = 307.86917 ;   this gives us   b13 = - 44.93917 / 307.86917 = - 0.1459684

Also :-

Σ x2 x3 = 127.86917 ;   this gives us   b23 = 127.86917 / 343.36917 = 0.3723956

Then :-

Σ x3 x2 = 127.86917 ;   this gives us   b32 = 127.86917 / 307.86917 = 0.4154335
Once these have been done, the computed values of b are inserted in their
appropriate places within the equation
for the computation of partial B
coefficients.
Step six: Calculating the Partial Regression Coefficients and the regression equation
This step is accomplished through the use of the equation which we discussed before :-

Bij.k = ( bij - bik bkj ) / ( bjk bkj )

In that regard, the partial B between the first and the second variable, while controlling for the effects of the third one, is :-

B12.3 = ( b12 - b13 b32 ) / ( b23 b32 ) = ( - 0.2044131 - ( - 0.1459684 )( 0.4154335 ) ) / ( ( 0.3723956 )( 0.4154335 ) )

Calculating the final fraction we get B12.3 = - 0.9295533.

B13.2 = ( b13 - b12 b23 ) / ( b32 b23 ) = ( - 0.1459684 - ( - 0.2044131 )( 0.3723956 ) ) / ( ( 0.4154335 )( 0.3723956 ) )

Calculating the final fraction we get B13.2 = - 0.4515833.
The calculation of the partial slope coefficients opens the door to stating the intercept coefficient, which goes like this :-

a1.23 = X̄1 - B12.3 X̄2 - B13.2 X̄3

Obviously, to compute the multiple a we have to compute the respective means. Using the data from Table 8 - 1, we find that X̄1 = 36.50833, X̄2 = 25.29167, and X̄3 = 28.391667. (Please verify these figures.) Substitute these figures in their correct positions within the equation a1.23 = X̄1 - B12.3 X̄2 - B13.2 X̄3, and the value for a1.23 is finally computed to be a1.23 = 72.84.
Our regression equation (which you must verify through your own computation) therefore becomes :-

X1c = a1.23 + B12.3 X2 + B13.2 X3

which in actual terms of our example becomes :-

X1c = 72.84 + ( - 0.9296 )X2 + ( - 0.4516 )X3 = 72.84 - 0.9296 X2 - 0.4516 X3

This means that, given any pair of X2 and X3 figures for each year between 1950 and 1964, we are able to compute the predicted (or calculated) values of the March ( X1 ) temperatures.
Significance Tests
Tests similar to those of the bi-variate regression/correlation are available for the multi-variate case. They test whether the changes in the dependent variable are necessarily influenced by the changes in the two independent variables. To be able to carry out these tests, assumptions analogous to those of the bi-variate case are made for this kind of statistic; otherwise the tests would be invalid.
1.
There must be normal distributions for each of the variables involved, including
the dependent variable. This means that the distribution of all the variables is multi-variate normal.
2.
Apart from the relationship between them and the dependent variable, all the
independent variables are not related to one another.
3.
Each observation of each of the independent variables is one of the many that
could have been possible.
When all three conditions hold, we conclude that the model obeys the condition of homoscedasticity. The null hypothesis being tested is that all the means of the dependent variable are equal and remain so, no matter what the values of the independent variables may be. This means that the model is amenable to the application of analysis of variance and other significance tests.
Analysis of Variance
Once the estimated values of the regression equation for X1 are available, they are compared to the observed values as we did in the bi-variate case. In that regard, we must compute the necessary measures for the computation of the F-statistic.

ΣD² = the sum of squared differences between the observed values of the dependent variable and the computed values of the same variable.

R²1.23 ΣD² = the coefficient of multiple determination times ΣD², recording the explained variation in X1 caused by simultaneous changes in all the variables involved in the model (in this case X2 and X3). This is the same as the sum of squared deviations due to treatment, SST.

( 1 - R²1.23 ) ΣD² = the total unexplained variation, even when we have accounted for the effects of the two independent variables. This is the same as SSE.

Total degrees of freedom are the same as the number of paired cases less one degree of freedom: SS d.f. = N - 1. The capital N reflects the multi-variate nature of the investigation, as opposed to the lower-case n which represents the bi-variate case.
2.
Recall the ANOVA relationship which is expressed as :-
SS - SSE = SST
This means that the grand sum of squares (SS) less the sum of squares due to error
(SSE) is equal to the sum of squares due to Treatment. In this regard, you can also
treat the degrees of freedom this way.
SS d f. - SSE d f.
= SST d f
This is also true for the multi-variate case where :-
SS d f . = N - 1 ,
SST d f. = k where k is the number of independent variables . These are the
degrees of freedom explained by treatment.
SS d.f. - SSE d.f. = k, or SST d.f.
This means that ( N - 1 ) - SSE d.f. = k, the degrees of freedom explained by treatment.
Consequently, SSE d.f. = ( N - 1 ) - k .
With this amount of data available to us we can construct the ANOVA table to summarize our findings. In this table we can at the same time calculate the F-value (F calculated), which is compared to the F-value obtainable from the tables at the 95% confidence level. This critical value of F is designated as :-

F( α, k, ( N - k - 1 ) )

In our case it happens to be F( 0.05, 2, 33 ). When we actually compute the F using our formula we find

F = R² ( N - k - 1 ) / ( ( 1 - R² ) k ) = ( 0.355 × 33 ) / ( 0.645 × 2 ) = 11.715 / 1.29 = 9.08 .

We note that the calculated value of F is larger than the expected value. Therefore we reject the null hypothesis that the two variables have no effect on the dependent variable X1, and accept the alternative hypothesis that the two independent variables have a significant impact on the dependent variable X1.
TABLE 8 - 2 : ANOVA TABLE FOR A MULTI-VARIATE LINEAR RELATIONSHIP TEST

SOURCE OF VARIATION      SUM OF SQUARES            DEGREES OF FREEDOM     MEAN SQUARE                             Fc

Total                    SS = ΣD²                  N - 1

Explained by Treatment   SST = R² ΣD²              k                      MST = R² ΣD² / k                        F = ( R² ΣD² / k ) ÷ ( ( 1 - R² ) ΣD² / ( N - k - 1 ) )
                                                                                                                  = R² ( N - k - 1 ) / ( ( 1 - R² ) k )
Unexplained              SSE = ( 1 - R² ) ΣD²      N - k - 1              MSE = ( 1 - R² ) ΣD² / ( N - k - 1 )

Using this table, one does not need to go through all the pains we have undergone above, because we can use the formula F = R² ( N - k - 1 ) / ( ( 1 - R² ) k ) right away to compute the calculated value of F. This is compared with the expected value of F, which is sought from the table as F( α, k, ( N - k - 1 ) ).
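The calculated F and its critical value can also be obtained directly from this shortcut. A minimal Python sketch, assuming R² = 0.355 and the degrees of freedom used in the text above (k = 2 for treatment and 33 for error):

```python
from scipy import stats

R2, k, df_error = 0.355, 2, 33      # degrees of freedom as used in the text above

F_calc = (R2 * df_error) / ((1 - R2) * k)    # about 9.08
F_crit = stats.f.ppf(0.95, k, df_error)      # about 3.28
print(F_calc, F_crit, F_calc > F_crit)       # True: reject the null hypothesis
```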
EXERCISES
1.
A crop breeder would like to know what the relationship is between crop yield X1 (the dependent variable) and both the amount of fertilizer used on the farm, X2, and the amount of insecticide used on the farm, X3, taking into account his activities between 1991 and 2000.
Year:   1991   1992   1993   1994   1995   1996   1997   1998   1999   2000
X1:       40     44     46     48     52     58     60     68     74     80
X2:        6     10     12     14     16     18     22     24     26     32
X3:        4      4      5      7      9     12     14     20     21     24
Using any of the techniques of Multiple regression and Correlation and those of partial
regression and correlation that you have learned above :-
(a)
Calculate the multiple regression coefficients for the given data, and test their
significance
(b)
Compute the coefficient of multiple determination, multiple non- determination
and the multiple correlation coefficient
(c)
Using the F-test, do you think that the relationship between the two independent
variables and the dependent variable came by chance ?
2.
The accompanying data was collected on maize in a study of phosphate response, base saturation, and silica relationships in acid soils. Percentage response is measured as the difference between the yield on plots receiving phosphates and the yield on plots not receiving phosphates, divided by the yield on plots receiving no phosphates and multiplied by 100. Therefore, there is a correlation between Y and X1. The variable X2, labeled BEC, is base exchange capacity. Consider this as a regression problem with two independent variables X1 and X2.
Response to Phosphates (%) (Y)    Yields in Kilograms per Ha. (X1)    Saturation of BEC (%) (X2)    pH of the soil (X3)
            88                                844                                 67                       5.75
            80                               1678                                 57                       6.05
            42                               1573                                 39                       5.45
            37                               3025                                 54                       5.70
            37                                653                                 46                       5.55
            20                               1991                                 62                       5.00
            20                               2187                                 69                       6.40
            18                               1262                                 74                       6.10
            18                               4624                                 69                       6.05
             4                               5249                                 76                       6.15
             2                               4258                                 80                       5.55
             2                               2943                                 79                       6.40
            -2                               5092                                 82                       6.55
            -7                               4096                                 85                       6.50
(a)
Write the least squares equation, and estimate the regression coefficients.
(b)
Test the significance of the regression using analysis of variance F-tests .
(c)
Construct a 95% confidence interval on the multiple regression coefficients, and on the intercept.
CHAPTER NINE
NON-LINEAR CORRELATION AND REGRESSION
Introduction
Linear models have their own limitations. The relationship between any two variables may not necessarily be linear. If, at first sight of the scatter diagram, the dots or crosses seem to be scattered in some manner that is not clearly linear, this may imply not the lack of a relationship, but the fact that the relationship between the dependent variable and the independent variable is non-linear. In this connection, therefore, we need to discuss a technique which is able to analyze this kind of data and bring out the results of a non-linear relationship, and of regression along a non-linear trend. This chapter is an extension of the regression/correlation techniques to answer some of these difficulties.
Logarithmic Transformations and Curve fitting
Consider the following hypothetical data in Table 9 - 1, which relates the
proportion of working population in service industries and the Gross Domestic Product
per Capita in twelve countries. We need to investigate the nature of correlation between
the number of employees in service industries and per capita income of 12 countries.
Firstly, we must arrange data in form of a table and do all the elementary calculations.
Initial examination of our statistics gives us a correlation coefficient of r = 0.85 and r² = 0.73. These are initial indications of the strength of the relationship between the two variables. This means that we need to test whether the relationship is indeed linear, because when the data is plotted on a scatter diagram (Figure 9 - 1) the dots show evidence of lying on a curve. Therefore, we need to fit a curve to our data. The easiest technique of fitting a curve to this data is to use logarithmic methods. After obtaining a logarithmic equation, the data is translated back into its actual anti-logarithmic form and a fresh curve is plotted to indicate the real nature of the data. The advantage of the logarithmic methods is that the least squares methods which we have hitherto discussed can be used without any modification. After transforming the data semi-logarithmically, and using the logarithms of the dependent variable, we obtain a straight-line graph, given in Figure 9 - 3, which fits the data better than Figure 9 - 2. The latter is a scatter diagram drawn for the untransformed data on both axes.
TABLE 9 - 1 : PROPORTION OF WORKING POPULATION IN SERVICE INDUSTRIES
(Source: King'oriah, 2004, p.333)

Country    GDP per Capita           Number of Employees              X²          Y²          XY
Number     (Hundreds of Dollars)    (per Thousand) in Service
               (X)                  Industries (Y)
  1             2.0                      12.0                        4.00       144.00       24.00
  2             1.2                       8.0                        1.44        64.00        9.60
  3            14.8                      76.4                      219.04      5836.96     1130.72
  4             8.3                      17.0                       68.89       289.00      141.10
  5             8.4                      21.3                       70.56       453.69      178.92
  6             3.0                      10.0                        9.00       100.00       30.00
  7             4.8                      12.5                       23.04       156.25       60.00
  8            15.6                      97.3                      243.36      9467.29     1517.88
  9            16.1                      88.0                      259.21      7744.00     1416.80
 10            11.5                      25.0                      132.25       625.00      287.50
 11            14.2                      38.6                      201.64      1489.96      548.12
 12            14.0                      47.3                      196.00      2237.29      662.20
Totals        113.9                     453.4                     1428.43     28607.44     6006.84

Σx² = 347.3292 ,   Σy² = 11,476.477 ,   Σxy = 1703.3183 ,   r = 0.8531411 ,   r² = 0.7278497
Figure 9 - 1 : Scatter diagram of employment data
Figure 9 - 2 : Straight-line curve on employment data
Figure 9 - 3: Results of logarithmic curve fitting. The data has been transformed semi-logarithmically and transformed back. The results of the regression equation have been used to fit the curve to the data.

We can try to do this exercise using the data in Table 13 - 2. The relevant equation here is

log Y = 0.2881 + 1.8233 log X

This is the equation whose results are plotted on Figure 9 - 2. Figure 9 - 3 is the result of re-transforming the Y-values into their usual real-number form (looking for the anti-logarithms of the results), and then using the results of this to fit a curve to the scatter diagram. In real-number form, if you remember that logarithms describe numbers in terms of their powers of 10, the actual regression equation which realistically fits this data is :-

Y = 1.941 X^1.8233
Double-logarithmic transformation is where the figures on both the X and the Y variables
are transformed into their logarithms. Their regression equation is computed using the
transformed figures. Thereafter the actual regression equation is computed by looking for
the anti-logarithms of all the data for the dependent variable and the independent
variable. Learners may try to do this as an exercise. Let us now tabulate the data which
was used for semi-logarithmic curve fitting at Table 9 - 2.
The equations for fitting the regression lines involve the computation of the regression coefficients in the usual manner - the way we have learned earlier. Any of the methods available in practice, including the use of computers, can be used to obtain the relevant results. Hereunder is the shortcut equation for the calculation of the b-coefficient using semi-logarithmic methods and the tabulated data in Table 9 - 2; it is identical in form to the ones we have considered before :-

log b = [ N Σ( X log Y ) - ( ΣX )( Σ log Y ) ] / [ N ΣX² - ( ΣX )² ]
TABLE 9 - 2 : ANALYSIS OF THE DATA WHICH WAS USED FOR SEMI-LOGARITHMIC
CURVE FITTING (Source: King'oriah, 2004, p.337)

Country     (X)      (Y)      Log Y     (Log Y)²    X (Log Y)      X²
Number
  1          2.0     12.0     1.0792     1.1647       2.1584       4.00
  2          1.2      8.0     0.9031     0.8156       1.0837       1.44
  3         14.8     76.4     1.8831     3.5461      27.8699     219.04
  4          8.3     17.0     1.2304     1.5139      10.2123      68.89
  5          8.4     21.3     1.3284     1.7646      11.1586      70.56
  6          3.0     10.0     1.0000     1.0000       3.0000       9.00
  7          4.8     12.5     1.0969     1.2032       5.2651      23.04
  8         15.6     97.3     1.9881     3.9535      31.0144     243.36
  9         16.1     88.0     1.9445     3.7811      31.3065     259.21
 10         11.5     25.0     1.3979     1.9541      16.0759     132.25
 11         14.2     38.6     1.5866     2.5173      22.5297     201.64
 12         14.0     47.3     1.6749     2.8053      23.4486     196.00

Totals     113.9    453.4    17.1131    26.0184     185.1231    1428.43
Like the slope coefficient " b " above, the intercept, or a-coefficient, can also be computed using the data in Table 9 - 2 and an equation which is a mathematical identity of the one we encountered earlier :-

a = [ Σ log Y - log b ( ΣX ) ] / N

Compare this expression with a = Ȳ - bX̄. When the data is fitted into both equations, the following real-value expressions are the result :-

log b = [ N Σ( X log Y ) - ( ΣX )( Σ log Y ) ] / [ N ΣX² - ( ΣX )² ]
      = [ 12( 185.1231 ) - ( 113.9 )( 17.1131 ) ] / [ 12( 1428.43 ) - ( 113.9 )² ]
      = 272.2971 / 4167.95 = 0.0653

We also insert data into the expression for the constant :-

a = [ Σ log Y - log b ( ΣX ) ] / N = [ 17.1131 - 0.0653( 113.9 ) ] / 12 = 0.8063

The regression equation, in its logarithmic form, is therefore :-

log Y = 0.8063 + 0.0653 X

We transform this equation into its real-number form to obtain the actual model which defines the relationship in our data :-

Y = 6.401 ( 1.162 )^X

This is the equation which has been used to fit the least squares line onto the scatter diagram at Figure 9 - 3.
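The semi-logarithmic fit can be reproduced with a short script. A minimal Python sketch, assuming base-10 logarithms as in Table 9 - 2 (names are illustrative):

```python
import math

# GDP per capita (X) and employees in service industries (Y), Table 9 - 1.
X = [2.0, 1.2, 14.8, 8.3, 8.4, 3.0, 4.8, 15.6, 16.1, 11.5, 14.2, 14.0]
Y = [12.0, 8.0, 76.4, 17.0, 21.3, 10.0, 12.5, 97.3, 88.0, 25.0, 38.6, 47.3]
n = len(X)

logY = [math.log10(y) for y in Y]
log_b = ((n * sum(x * ly for x, ly in zip(X, logY)) - sum(X) * sum(logY))
         / (n * sum(x**2 for x in X) - sum(X)**2))       # about 0.0653
log_a = (sum(logY) - log_b * sum(X)) / n                  # about 0.806

a, b = 10**log_a, 10**log_b   # real-number form Y = a * b**X, roughly 6.40 * 1.162**X
print(log_b, log_a, a, b)
```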
Like we said above, it is possible to transform all data recorded for both variables
X and Y logarithmically. Table 9 - 3 is a record of all the data which has been transformed
logarithmically involving both variables. This time we are investigating the relationship
between hypothetical data recording the relationship between all people employed in
Coffee industry and the foreign exchange earning of a coffee growing country. We expect
that if coffee is the main export good for the country it is grown using large numbers of
human resources, and that the larger the number, the more the coffee produced, and the
more the foreign exchange earning in this country.
This time the relevant equation for the slope coefficient is as follows :-

b = [ N Σ( log X log Y ) - ( Σ log X )( Σ log Y ) ] / [ N Σ( log X )² - ( Σ log X )² ]

and for the intercept it is

log a = [ Σ log Y - b ( Σ log X ) ] / N

Now we fit the real data into the shortcut equations :-

b = [ 12( 20.2555 ) - ( 10.6117 )( 22.8053 ) ] / [ 12( 9.4326 ) - ( 10.6117 )² ]

Using your calculator you can verify that b = 1.8233.
The equation of the intercept is log a = [ Σ log Y - b ( Σ log X ) ] / N , which we fill with data in the following manner :-

log a = [ 22.8053 - 1.8233( 10.6117 ) ] / 12
TABLE 9 - 3 : ANALYSIS OF THE DATA WHICH WAS USED FOR DOUBLE-LOGARITHMIC
CURVE FITTING (Source: King'oriah, 2004, p.341)

Employment in      Foreign Exchange
Coffee (X)         Earning (Y)          Log X     (Log X)²     Log Y     (Log Y)²     (Log X)(Log Y)
(000,000 People)   ($ 000,000)
    5.8                48.8             0.7634     0.5828      1.6884     2.8507         1.2889
    6.3                58.2             0.7993     0.6389      1.7649     3.1149         1.4107
    6.5                59.9             0.8129     0.6608      1.7774     3.1592         1.4448
    6.8                62.7             0.8325     0.6931      1.7973     3.2303         1.4963
    7.6                72.3             0.8808     0.7758      1.8591     3.4563         1.6375
    8.0                82.1             0.9031     0.8156      1.9143     3.6645         1.7288
    8.0                82.5             0.9031     0.8156      1.9165     3.6730         1.7308
    8.5                93.5             0.9294     0.8638      1.9708     3.8841         1.8317
    8.7                99.1             0.9395     0.8827      1.9961     3.9844         1.8753
    8.6               100.0             0.9345     0.8733      2.0000     4.0000         1.8690
    9.0               114.6             0.9542     0.9105      2.0591     4.2399         1.9648
    9.1               115.2             0.9590     0.9197      2.0614     4.2494         1.9769

Totals                                 10.6117     9.4326     22.8053    43.5067        20.2555
Using your calculator you can verify that log a = 0.2881.
The double-logarithmic regression equation is log Y = log a + b log X. Fitted with the actual logarithmic data, the equation reads :-

log Y = 0.2881 + 1.8233 log X

The scatter diagram for the double-logarithmic data is given in Figure 9 - 4. When we find the anti-logarithms of the data and draw the least-squares curve we find, first of all, that the actual regression equation is

Y = 1.941 X^1.8233

and the actual least squares curve is as at Figure 9 - 5.
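A similar sketch can be used for the double-logarithmic fit, assuming base-10 logarithms of both variables as in Table 9 - 3:

```python
import math

X = [5.8, 6.3, 6.5, 6.8, 7.6, 8.0, 8.0, 8.5, 8.7, 8.6, 9.0, 9.1]                  # employment in coffee
Y = [48.8, 58.2, 59.9, 62.7, 72.3, 82.1, 82.5, 93.5, 99.1, 100.0, 114.6, 115.2]   # forex earnings
n = len(X)

lx = [math.log10(x) for x in X]
ly = [math.log10(y) for y in Y]

b = ((n * sum(u * v for u, v in zip(lx, ly)) - sum(lx) * sum(ly))
     / (n * sum(u**2 for u in lx) - sum(lx)**2))      # about 1.82
log_a = (sum(ly) - b * sum(lx)) / n                    # about 0.288

print(b, 10**log_a)   # the fitted power curve is roughly Y = 1.94 * X**1.82
```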
Non-Linear Regression and Correlation
Polynomial Model Building
A polynomial is a function consisting of successive powers of the independent variable, as in the following expression :-

Y = a + bX + cX² + dX³ + .... + kX^n

This is an expression of an n-th degree polynomial. The first degree polynomial is the familiar equation Y = a + bX. The degree of the equation is determined by the highest exponent, or power, of the polynomial.
The nature of the least-squares line is determined by the polynomial exponents on the independent variable. Each degree of the polynomial determines the undulation of the least-squares curve, and the polynomial equation is one of the mathematical equations which can be used to describe the wave pattern of a curve. In general, a polynomial of degree n generates a least-squares curve with up to n - 1 bends. Since higher-order polynomials can exhibit a positive slope at one point and a negative slope at another, they can represent a wide range of relationships between variables, unlike linear transformations, which do not have these important characteristics.
Example
Consider a person who wishes to investigate the relationship of land values to the distance from the town center, as given in your textbook (King'oriah, 2004, page 344).
Figure 9 - 4 : The Scatter Diagram for the Double-Logarithmic Data in Table 9 - 3.
Figure 9 - 5: Actual Least-squares Curve for Employees in Coffee Industry
and Foreign Exchange.
The following is actual data which was collected from the field by Prof. (Dr.) Evaristus M. Irandu early in the 1980s, as he was doing his master's degree in Geography at the University of Nairobi, and it is cited in King'oriah (2004, pages 344 - 358). Follow his trend of thought and find the relationship between distance and land values from the city centre of Mombasa by fitting a curve to the given data and calculating all the relevant statistics which are estimated by the data in Table 9 - 4.
Prof. Irandu's scatter diagram, after some extensive data analysis, is given in Figure 9 - 6. After the necessary tabulation he used the shortcut equation below for computing the correlation coefficient. This equation is identical to any you may have used before, and it has been known to bring forth accurate results without significant rounding-error problems.
r = [ n ΣXY - ( ΣX )( ΣY ) ] / √{ [ N ΣX² - ( ΣX )² ][ N ΣY² - ( ΣY )² ] }

When we obtain the relevant figures from Table 9 - 4 we have the following results :-

r = [ 15( 697.4 ) - 61.1( 341 ) ] / √{ [ 15( 325.97 ) - ( 61.1 )² ][ 15( 14,235 ) - ( 341 )² ] }
  = [ 10461 - 20835.1 ] / √{ [ 4889.55 - 3733.21 ][ 213525 - 116281 ] }
  = - 10374.1 / √{ ( 1156.34 )( 97244 ) } = - 0.9783094

r ≈ - 0.978 ;     r² = 0.957

The coefficient of determination " r² " shown in the last line revealed that as much as 95.7% of the change in land values as one moves away from the city center - from the Mwembe Tayari area, or even from Ambalal House - is explainable by the distance from this city center.
Again, using his data as laid out in Table 9 - 4 and the shortcut formula for the computation of the regression coefficient, we may compute the b-coefficient. The formula, as usual, is :-

b = [ N ΣXY - ( ΣX )( ΣY ) ] / [ N ΣX² - ( ΣX )² ]

The formula is identical to any that we have used before, and can be filled with actual data as follows :-

b = [ 15( 697.4 ) - 61.1( 341 ) ] / [ 15( 325.97 ) - ( 61.1 )² ] = [ 10461 - 20835.1 ] / [ 4889.55 - 3733.21 ]
Table 9 - 4 : Analysis of the variation of land values with distance
( Source : Adopted from E.M. Irandu, 1982 )

(X) Distance    (Y) Land Values/ha.       X²       Y²       XY
                   ( Sh. 20,000 )
   0.4                  59               0.16     3481      23.6
   0.8                  55               0.64     3025      44.0
   1.3                  54               1.69     2916      70.2
   2.1                  40               4.41     1600      84.0
   2.3                  41               5.29     1681      94.3
   3.1                  28               9.61      784      86.8
   3.0                  18               9.00      324      54.0
   4.2                  16              17.64      256      67.2
   4.7                   9              22.09       81      42.3
   5.3                   6              28.09       36      31.8
   6.1                   5              37.21       25      30.5
   6.2                   3              38.44        9      18.6
   7.2                   2              51.84        4      14.4
   6.9                   3              47.61        9      20.7
   7.5                   2              52.25        4      15.0

Totals   ΣX = 61.1     ΣY = 341     ΣX² = 325.97     ΣY² = 14,235     ΣXY = 697.4
Figure 9 - 5: Variation of land values with distance
( Source : Adopted from E.M. Irandu, 1982 )
b  =  −10374.1 / 1156.34  =  −8.9714963;     b  ≈  −8.97
This coefficient means that each additional kilometre travelled from the city centre is associated with a decrease in land value of about 8.97 units of Sh. 20,000, that is, roughly Sh. 179,400 per hectare - given the linear relationship assumption.
The intercept a is easily computed using the data in the table and the usual formula for its calculation:

a  =  [ ΣY − bΣX ] / N  =  [ 341 − ( −8.9714963 )( 61.1 ) ] / 15  =  59.277228
This intercept figure indicates that land values at the city centre itself (which centre is defined as the area spreading roughly from Ambalal House to Mwembe Tayari) in 1982 were Sh. 20,000 × 59.277228 per hectare, which is approximately Sh. 1,185,544 per hectare. This decline in land values is depicted in Figure 9 - 6.
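For readers who wish to verify the arithmetic with software, the following is a minimal sketch in Python (an assumption on our part - any statistical package or calculator would serve; only the standard library is used here). It plugs the column totals printed in Table 9 - 4 into the shortcut formulae above.

    import math

    # Column totals taken from Table 9 - 4.
    n = 15
    sum_x, sum_y = 61.1, 341.0
    sum_x2, sum_y2 = 325.97, 14235.0
    sum_xy = 697.4

    # Shortcut formulae used in the text.
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n

    print(round(r, 3), round(r ** 2, 3), round(b, 3), round(a, 3))
    # -0.978  0.957  -8.971  59.277

The printed figures agree with the hand computation of r, r², b and a given above.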
Testing for Non-Linear Relationship
The nature of the data in the scatter diagram made Prof. Irandu curious. He needed to test whether a non-linear relationship existed, whose regression equation (if calculated) could be used to define a better-fitting non-linear least-squares curve. His equation of choice was a second degree polynomial function:

Y = a + bX + cX²

To accomplish this, he used a set of Normal Equations, which are the equations of choice for estimating the coefficients of polynomial functions. In order to do this, he tabulated his data according to Table 9 - 5. This allowed him to fill in the relevant values within the Normal Equations and to compute the various statistics which define the equation of the polynomial.
These equations can be solved using simultaneous equation methods or matrix algebra.
(For a detailed solution of the equations see King’oriah, 2004, pages 350 - 354.)
The Normal equations for the second degree polynomial curve are :-
ΣY    =  Na    +  bΣX    +  cΣX²
ΣXY   =  aΣX   +  bΣX²   +  cΣX³
ΣX²Y  =  aΣX²  +  bΣX³   +  cΣX⁴
All we need to solve these equations is to substitute the respective values from the tables.
These values are available at the ends of the columns of Table 9 - 5. Try to see if you
could identify them and use simultaneous methods to solve the normal equations. Once
you have identified the values at the bottom of the table you will begin your solutions
with the following figures for each equation :-
Figure 9 - 6 : A straight-line Regression line fitted on data which
was analyzed in Table 9 - 4
Actualizing the normal equations we find that:

341.00   =  15a      +  61.1b      +  325.97c      .............................( i )
697.40   =  61.1a    +  325.97b    +  1966.721c    ........................( ii )
2262.24  =  325.97a  +  1966.721b  +  12,758.639c  .............( iii )
TABLE 9 - 5: DATA FOR MANIPULATING NORMAL EQUATIONS IN COEFFICIENT COMPUTATIONS OF THE LEAST-SQUARES POLYNOMIAL CURVE

  (X)    (Y)     X²       Y²       X³          X⁴          XY       X²Y
  0.4    59      0.16    3481      0.064       0.256       23.6      9.44
  0.8    55      0.64    3025      0.512       0.4096      44.0     35.20
  1.3    54      1.69    2916      2.197       2.8561      70.2     91.26
  2.1    40      4.41    1600      9.261      19.4481      84.0    176.40
  2.3    41      5.29    1681     12.167      27.9841      94.3    216.89
  3.1    28      9.61     784     29.791      92.3521      86.8    269.08
  3.0    18      9.00     324     27.000      81.0000      54.0    162.00
  4.2    16     17.64     256     74.088     311.1696      67.2    282.24
  4.7     9     22.09      81    103.823     487.9691      42.3    198.81
  5.3     6     28.09      36    148.877     789.0481      31.8    168.54
  6.1     5     37.21      25    226.981    1384.5841      30.5    186.05
  6.2     3     38.44       9    238.328    1477.6336      18.6    115.32
  7.2     2     51.84       4    373.248    2687.3856      14.4    103.68
  6.9     3     47.61       9    328.509    2266.7121      20.7    142.83
  7.5     2     52.25       4    391.875    2730.0625      15.0    104.50
 --------------------------------------------------------------------------
 61.1   341    325.97  14,235   1966.721   12,758.639     697.4   2262.24

Source: Adopted from E.M. Irandu, 1982. [Any errors in the interpretation of all this data in this text are mine, and not Prof. Irandu's.]
If you follow the argument in your textbook you will come to the least-squares curve and equation answer:

Yc = 70.37 − 16.5319X + 0.9066X²

The curve resulting from this equation minimizes the sum of squared deviations between the observed values of Y and those predicted by the curve for each value of X. Using this equation we can predict the value of Y for each value of X. The values, and the process of obtaining them, are tabulated in Table 9 - 6.
Table 9 - 6: Tabulation for the computation of the least-squares curve
Yc = 70.37 − 16.5319X + 0.9066X²
(In every row the intercept a = 70.370931.)

    X      −bX = −16.5319X     X²      cX² = 0.9066X²      Y        Yc          Yc** (Rounded)
   0.4        −6.61296         0.16       0.1450594        59      63.903030       63.9
   0.8       −13.22552         0.64       0.580278         55      57.725649       57.7
   1.3       −21.491462        1.69       1.5321905        54      50.411660       50.4
   1.4*      −23.144652        1.96       1.7769783         /      49.003259       49.0*
   1.7*      −28.10422         2.89       2.6201364         /      44.886847       44.9*
   2.1       −34.716699        4.41       3.9982013        40      39.652142       39.7
   2.3       −38.02337         5.29       4.7960283        41      37.143589       37.1
   3.1       −51.248871        9.61       8.7126336        28      27.834694       27.8
   3.0       −49.59570         9.00       8.1595944        18      28.934825       28.9
   4.2       −69.43398        17.64      15.992805         16      16.929756       16.9
   4.7       −77.69993        22.09      20.027271          9      12.698272       12.7
   5.3       −87.61907        28.09      25.467001          6       8.218862        8.2
   6.1      −100.84459        37.21      33.735390          5       3.261731        3.3
   6.2      −102.49778        38.44      34.850534          3       2.723685        2.7
   7.2      −119.02968        51.84      46.999264          2      −1.659485       −1.7
   6.9      −114.07011        47.61      43.164254          3      −0.534925       −0.5
   7.5      −123.98925        52.25      47.370979          2      −6.247340       −6.2
 --------------------------------------------------------------------------------------------
                                                           341                     340.9

( Source: Adopted from E.M. Irandu, 1982 )

*	Figures in these rows have been computed using the regression equation (they are not observed data points) to enable close plotting of the regression curve on the scatter diagram.

**	Values of Yc have been rounded to assist in approximating the position on ordinary graph paper. This is not necessary when a computer program is used for plotting the regression curve.
When the values of Yc are plotted against those of X, which are available in Table 9 - 6, an almost perfect fit of the least-squares curve is obtained, as in Figure 9 - 7. The computations in Table 9 - 6 are not always mandatory, because in most cases we use appropriate computer packages to arrive at the statistics needed for the estimation of the least-squares curve. The computation has been done here to ensure that we understand what is involved in this work. As an exercise the learner may wish to verify the figures and plot the graph in Figure 9 - 7.
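As noted above, such fitting is normally done with a computer package. Below is a minimal sketch using Python's numpy (an assumption - any statistics package would serve) which fits the second-degree polynomial directly to the raw observations of Table 9 - 4 and tabulates the fitted values, in the spirit of Table 9 - 6. Because the hand computation works from rounded column totals, the coefficients printed by the code may differ somewhat from those quoted above.

    import numpy as np

    # Raw observations from Table 9 - 4: distance (km) and land value
    # per hectare in units of Sh. 20,000.
    x = np.array([0.4, 0.8, 1.3, 2.1, 2.3, 3.1, 3.0, 4.2,
                  4.7, 5.3, 6.1, 6.2, 7.2, 6.9, 7.5])
    y = np.array([59, 55, 54, 40, 41, 28, 18, 16,
                  9, 6, 5, 3, 2, 3, 2], dtype=float)

    # Least-squares fit of Y = a + bX + cX^2.
    # np.polyfit returns the coefficients highest power first: [c, b, a].
    c, b, a = np.polyfit(x, y, deg=2)

    # Fitted (predicted) values Yc for every observed X,
    # corresponding to the Yc column of Table 9 - 6.
    y_fitted = a + b * x + c * x ** 2
    print(a, b, c)
    print(np.round(y_fitted, 1))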
Figure 9 - 7: The non-linear regression curve using a second-degree polynomial equation on land values in Mombasa ( Source: Adopted from E.M. Irandu, 1982 )
Significant Tests Using Non-Linear Regression/Correlation Methods
The coefficient of Non-Linear Determination
Like the coefficient of determination for the linear regression and correlation which we discussed before, this coefficient tells us the amount of variation which is explained by the changes in the independent variable. The easiest and most readily understandable method - among the several which are available - involves computing all the estimated values of Y using the regression equation which we have just obtained. The differences between these and the observed values of Y are found, the sum of squared differences is obtained, and the mean square (variance) due to the regression is computed. This variation represents the explained variation in the model - the variation explained by the independent variable. The Total Variation is the crude variance of the dependent variable. In principle, the ratio of the explained variance to the total variance is the coefficient of determination. In this kind of analysis the coefficient is denoted by:
r²(Y·X X²)  =  Explained Variation / Total Variation  =  σ²P / σ²A

In this connection:

σ²P = the variance of the predicted values of Y.
σ²A = the variance of the observed (actual) values of Y.
Table 9 - 7 shows the observed values of the dependent variable ( Y ) listed against those computed using the equation of the regression curve. Within the table, rounded figures have been used to simplify the computation.

The usual methods of calculating the variance of the predicted values of Y are employed. The computations below give the shortcuts for obtaining these variances, in the same manner as we obtained the regression statistics above. In that connection, the figures at the bottom of Table 9 - 7 are used in the shortcut formulae in the following manner:
σ²P  =  { [ ΣYP² − (ΣYP)² ] / N }  /  ( N − 1 )

     =  { [ 15087.11 − (340.9)² ] / 15 }  /  ( 15 − 1 )

     =  { [ 15087.11 − 116,212.81 ] / 15 }  /  14

σ²P  =  −6741.7133 / 14  =  −481.55095
TABLE 9 - 7: DATA AND COMPUTATIONS USED FOR OBTAINING THE TOTAL VARIANCE OF Y AND THE EXPLAINED VARIANCE OF Y

     (Y)
 Land Values/ha.       Yc         Y²          Yc²
 ( Sh. 20,000 )
      59              63.9       3481       4083.21
      55              57.7       3025       3329.29
      54              50.4       2916       2540.16
      40              39.7       1600       1576.09
      41              37.1       1681       1376.41
      28              28.9        784        835.21
      18              27.8        324        772.84
      16              16.9        256        285.61
       9              12.7         81        161.29
       6               8.2         36         67.24
       5               3.3         25         10.89
       3               2.7          9          7.29
       2              −0.5          4          0.25
       3              −1.7          9          2.89
       2              −6.2          4         38.44
 ----------------------------------------------------
     341             340.90     14,235     15087.11
You will note that there is a negative sign in the variance which we have obtained.
Variances normally do not have negative signs, but the value of this one has been
influenced by (among other things) the fact that we are dealing with a second degree
polynomial function. The sign should not worry us if the actual variance is also of the
same sign. This is what is happening in this case, as hereunder:-
σ²A  =  { [ ΣYA² − (ΣYA)² ] / N }  /  ( N − 1 )

     =  { [ 14,235 − (341)² ] / 15 }  /  ( 15 − 1 )

     =  { [ 14,235 − 116,281 ] / 15 }  /  14

σ²A  =  −6803.0667 / 14  =  −485.93333
The ratio of the predicted variance to the actual variance is the coefficient of non-linear determination:

r²(Y·X X²)  =  σ²P / σ²A  =  −481.55095 / −485.93333  =  0.9909815  ≈  0.991
The square root of this one is the correlation coefficient r = 0 . 996.
It is evident that, using the non-linear model, we can explain 99.1% of the variation in the dependent variable ( Y ) through the variation in the independent variable ( X ) and its square. Using the linear model we were able to explain only 95.7% of the variation in Y through the variation in X. We conclude that in this case the non-linear polynomial curve is the better model of how land values declined from the Mombasa city centre to the periphery at the time of the field investigation.
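The comparison of the two models can also be checked by computer. The sketch below (Python with numpy assumed) fits both the straight line and the second-degree polynomial to the raw observations of Table 9 - 4 and reports the coefficient of determination of each, using the equivalent "1 minus residual variation over total variation" form. Because the hand computation above worked from rounded predicted values and printed column totals, the figures produced by the code may differ somewhat from 0.957 and 0.991.

    import numpy as np

    x = np.array([0.4, 0.8, 1.3, 2.1, 2.3, 3.1, 3.0, 4.2,
                  4.7, 5.3, 6.1, 6.2, 7.2, 6.9, 7.5])
    y = np.array([59, 55, 54, 40, 41, 28, 18, 16,
                  9, 6, 5, 3, 2, 3, 2], dtype=float)

    def r_squared(y_obs, y_fit):
        # Share of the total variation in Y accounted for by the fitted curve.
        ss_res = np.sum((y_obs - y_fit) ** 2)
        ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    linear_fit = np.polyval(np.polyfit(x, y, 1), x)
    quadratic_fit = np.polyval(np.polyfit(x, y, 2), x)

    print(round(r_squared(y, linear_fit), 3))      # straight-line model
    print(round(r_squared(y, quadratic_fit), 3))   # second-degree polynomial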
EXERCISES
1.	Discuss the limitations of the bi-variate linear regression/correlation model and compare it with some of the available non-linear models that help to solve the problems associated with these limitations.

2.	Explain what you mean by:
	(a)	Arithmetic Transformation
	(b)	Semi-logarithmic Transformation
	(c)	Double-logarithmic Transformation
3.	The following data array records the changing pressure of oxygen at 25°C when filled into various volumes.

	Volume in Liters:          3.25   5.00   5.71   8.27   11.50   14.95   17.49   20.35   22.40
	Pressure in Atmospheres:   7.34   4.77   4.18   2.88    2.07    1.59    1.36    1.17    1.06
CHAPTER TEN
NON-PARAMETRIC STATISTICS
Need for Non-Parametric Statistics
The techniques which we have discussed so far, especially those involving continuous distributions, have stressed the underlying assumptions for which the techniques are valid. These techniques are meant for the investigation of parameters and for testing hypotheses concerning them. They are called parametric, and their main concern is with statistics whose distribution is normal.

A considerable amount of data is such that the underlying distribution is not easily specified. To handle such data we need distribution-free statistics, which do not depend on the distribution of the parent population. These are what are called Non-Parametric Statistics. If we do not specify the nature of the parent population, then we will not deal with parameters as we have hitherto done. This means that non-parametric statistics compare distributions rather than parameters; they may be sensitive to changes in location, in spread, or in both. We will not try to maintain a distinction between distribution-free and non-parametric statistics. Rather, we shall collect them under the same fold and call them non-parametric statistics. This kind of statistics is advantageous in a number of situations:

1.	When it is only possible to make weak assumptions about the nature of the distributions underlying the data.

2.	When it is difficult to categorize the data because of an inadequate scale of measurement.

3.	When it is only possible to rank the data, but not to measure it accurately, because of the weak scale of measurement underlying the experimental design and the data collection methods.

The only disadvantage of non-parametric statistics is that they are not accurate estimators of parameters, because the distribution of their parent population is unknown. They are therefore better avoided if there are alternative parametric statistics which can be used with greater effect.
Chi-Square and the Test for Goodness of Fit
We have already dealt with this kind of statistic in our foregoing discussion. That is not because it falls within the parametric category, but because of its immense importance and the frequency of its application in simple experiments. It was found prudent to expose the learners to this statistic early, so that they could compare it with Analysis of Variance, which falls immediately afterwards. No scale of measurement is required to define the categories, although some scale may exist and may be used. The probabilities may be determined by theory or may be estimated from the results of the analysis.

The Chi-Square test operates at low levels of measurement. Categories are purely nominal, and the data are not used to determine ratios and probabilities. We tested the null hypothesis that a normal distribution was involved; the normal parameters had to be estimated before the cell probabilities could be computed. The goodness of fit concerned how well the low-measurement data fitted the expected distribution, however crude. Remember how we then shifted gears to a higher level of measurement - Analysis of Variance, and then on to regression/correlation - where we used the normal distribution as the underlying assumption. We were then in parametric statistics proper, which included the normal distribution tests, F-tests and regression/correlation measures.
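The goodness-of-fit idea can be illustrated with a small, purely hypothetical example (not taken from the text). The sketch below assumes Python with the scipy package installed; the observed and expected counts are invented for illustration only.

    from scipy import stats

    # Hypothetical example: 60 rolls of a die, testing the null hypothesis
    # that all six faces are equally likely (expected count of 10 each).
    observed = [5, 8, 9, 8, 10, 20]
    expected = [10, 10, 10, 10, 10, 10]

    chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
    print(chi2, p_value)   # reject H0 at the 5% level if p_value < 0.05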
The Median Test for one sample
Recall when we said the median is the counterpart of the mean, especially when
the end-observations are expected to have heavy influence on the central tendency of
data. In that connection, the median is computed using the observation of the middle
category in any range of data. Using the Median test, we test the hypothesis that any set
of n-randomly drawn measurements came from some parent population with a specified
median. The scale of measurement must be at least interval because the median cannot be
determined using any other lower scale - like nominal or ordinal scales.
Given any problem where this test must be applied, the differences ( Di ) between each observation and the hypothesized median must be determined and ranked. If any difference is zero, it is disregarded. If ties occur, the average rank of the items involved in the tie is used for each tied item. Arrange the absolute differences in increasing order and rank them; then attach to each rank the sign of its difference - negative if the observation is below the median, and positive if it is above the median. Add all the rank values of the positively signed observations, and separately those of the negatively signed observations, and compare the two sums. The test statistic is based on the smaller of the two rank totals, regardless of whether it is the one for the positive or for the negative ranks. This smaller rank total is called R, and it is the one which is compared with Table A - 8 (page 502 of your textbook; King'oriah, 2004) to accept or reject the hypothesis.

The hypothesis is rejected if the observed value of R falls outside the tabulated limits for the specific n-value and alpha level.
Definition of the Test
Assumptions
Data consist of n measurements: X1, X2, X3, ..........., Xn.
Di denotes the difference between Xi and the hypothesized median XM. Then:

1.	Each Di must be a continuous random variable.

2.	The distribution of each Di must be symmetric.

3.	All the measurements Xi : i = 1, 2, ........, n must represent a random sample from the population distribution.

4.	The measurement scale must be at least interval (to enable the computation of the median, because the median cannot be computed conveniently at any lower scale).
Hypotheses
Ho:	The population median is XM
HA:	The population median is not XM
Test Statistic
Determine the differences Di = Xi − XM, i = 1, 2, ....., n. If any Di = 0, drop it from the set and decrease the number n by one. Rank the absolute values |Di|. If ties occur among the ranks, average the ranks of the items involved in the tie and use the average as the rank of each tied item. Each rank carries the sign of the difference corresponding to it. Let R+ be the total of the positive ranks, and let R− be the total of the negative ranks. The test statistic is the smaller of the two totals ( R+ or R− ). Designate this number with the symbol R.
Decision Rule
Reject the null hypothesis when R exceeds W1−α/2 or when R is less than Wα/2. These critical values, W1−α/2 and Wα/2, are given in Table A - 8 of your textbook (King'oriah, 2004, page 502). Otherwise, accept the hypothesis.
Example
The manager of a large motor firm dealing in the newest models of small-load agricultural pickups randomly selects ten of these petrol-powered pickups. He subjects them to petrol consumption mileage tests, and finds that their consumption in terms of kilometres per litre is as shown in Table 10 - 1 below. Test the null hypothesis, using the median test at the 0.10 significance level, that the median of the population petrol consumption rate is 30 kilometres per litre.
Solution
Eliminate D2 = 0. Then:

R+ = 3.5
R− = 41.5

Since R+ = 3.5 is smaller than R− = 41.5, R+ is the value of the test statistic. From Table A - 8 of your textbook (King'oriah, 2004, page 502), Wα/2 = W0.05 = 9 with n = 9, as we eliminate observation 2; and W1−α/2 = W0.95 = 36. Since R is less than 9, we reject the null hypothesis that the population median is 30 kilometres per litre.
TABLE 10 - 1: PETROL CONSUMPTION RANKINGS OF 10 PICKUPS

 Measurement    Median     Di       |Di|     Rank
    24.6          30      −5.4       5.4       7
    30.0          30       0.0       0.0       -
    28.2          30      −1.8       1.8       2
    27.4          30      −2.6       2.6       3.5
    26.8          30      −3.2       3.2       5
    23.9          30      −6.1       6.1       8
    22.2          30      −7.8       7.8       9
    26.4          30      −3.6       3.6       6
    32.6          30       2.6       2.6       3.5
    28.8          30      −1.2       1.2       1

 XM = 30
Notice the treatment of the tied ranks |D4| = |D9| = 2.6. These two absolute differences tie for the ranks of 3 and 4; thus each is assigned the average rank of 3.5. Also note that the first assumption of the test does not allow for ties: since each Di must be continuous, ties should not occur. If the Di's are not continuous we can use the test as an approximate test by dealing with the ties in the manner described above.

Two observations are important regarding this example. First, if we can assume that the distribution of kilometres per litre is symmetric, then the median test is the same as testing the hypothesis that the mean consumption is 30 kilometres per litre. Secondly, the data have been measured on the ratio scale, and the t-test given in Chapter Three is appropriate if we assume that the distribution of kilometres per litre is symmetric and normal, or if the sample average is approximately normal via the Central Limit Theorem.
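If a statistical package is available, the same test can be run in a few lines. The sketch below (assuming Python with scipy installed) applies scipy's one-sample Wilcoxon signed-rank routine to the deviations from the hypothesized median of 30; scipy drops the zero difference automatically, just as the hand procedure does, and reports a p-value in place of the tabulated critical values.

    from scipy import stats

    # Petrol consumption (km per litre) of the ten pickups, Table 10 - 1.
    consumption = [24.6, 30.0, 28.2, 27.4, 26.8, 23.9, 22.2, 26.4, 32.6, 28.8]
    hypothesized_median = 30.0

    deviations = [x - hypothesized_median for x in consumption]

    # Signed-rank test of H0: the population median equals 30.
    statistic, p_value = stats.wilcoxon(deviations)
    print(statistic, p_value)   # statistic = 3.5, the smaller rank sum R+

    # Reject H0 at the 10% level when p_value < 0.10.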
The Mann-Whitney Test for Two Independent Samples
The test is designed to determine whether two random samples have been drawn from the same population or from different populations. It is based on the fact that if the two independent samples have been drawn from the same population, the average ranks of their scores should be approximately equal. If the average rank of one sample is much bigger or much smaller than that of the second sample, this indicates that the two samples come from different populations.

Assumptions: The two samples are independent. They also consist of continuous random variables, even if the actual observations are discrete. The measurement scale is at least ordinal. The test is designed to test the null hypothesis that both samples have been drawn from the same population distribution against the alternative hypothesis that the two random samples come from different population distributions.
Example
A farmer is rearing one breed of exotic steers using two different methods of feeding. One group is fed using zero-grazing methods, and the other is released freely onto his grass paddocks for the same period as the first group. He would like to see whether the body weights resulting from the two feeding methods are identical.

After some time, he selects ten steers from each method of feeding and records the body weight of each. He would like to know whether the body weights of both types of steers are the same, and hence whether the two feeding methods are equally suitable for rearing his steers. The following table is a record of the body weights of ten steers of each type. Test the null hypothesis that the two feeding methods give the same results, at the 5% alpha level.
Solution
1.	Mix up the body-weight observations from both groups and order them from the smallest to the largest. Underline each body weight from the first group in order to retain its identity within the mixed group. A mean rank is assigned to each set of tied body weights; where this is the case, the next rank up is skipped, and the next highest observation is given the next-but-one rank. What now remains is the computation of the S-value for the observations from the first group. The underlined scores come from Group One, and those not underlined from Group Two. We now go ahead and compute the statistic S using Group One.

2.	The statistic S = Σ R( Xi ), the sum of the ranks assigned to the ten observations of the first group. The rank sum of either group may be used; here we take the rank sum of Group One as the test statistic T.
The null hypothesis is that the two samples have been drawn from the same distribution, tested at the 5% alpha level. It is rejected if the value of T obtained above is smaller than the critical value found in Table A - 9 of your textbook (page 503, King'oriah, 2004), or larger than the corresponding upper limit. We compute the T-value using the formula:

S  =  Σ R( Xi )  =  1.5 + 3 + 4 + 5.5 + 7 + 8 + 9.5 + 9.5 + 11 + 15  =  74

The critical value found within the array of the table is 24. The farmer should have rejected the hypothesis if the calculated T-value had been less than 24, or greater than

T  =  nm − W0.025  =  ( 10 )( 10 ) − 24  =  76

(This upper limit is not available from the table, but is computed from the formula as we have done here.)
TABLE 10 - 3: POOLED AND RANKED BODY WEIGHTS
(observations from the first group were underlined in the original; here they are marked *)

 Body weight    Rank
    50*           1.5
    50            1.5
    55*           3
    58*           4
    62*           5.5
    62            5.5
    65*           7
    67*           8
    70*           9.5
    70*           9.5
    75*          11
    78           12
    80           13
    81           14
    82*          15
    84           16
    88           17
    90           18
    91           19
    …            20
We find that the calculated value of T = 74 is neither smaller than the lower limit of 24 nor greater than the upper limit of 76; it lies between the two limits. We therefore accept the null hypothesis that the two groups are not different - they come from the same population. This is an indication that the two feeding methods are equally good.
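A software check of such a comparison is straightforward. The sketch below (Python with scipy assumed) runs the Mann-Whitney test on two small illustrative samples of body weights; the arrays are hypothetical stand-ins, since the full data table is not fully reproduced here, and scipy reports its U statistic and a p-value rather than the rank-sum T compared with Table A - 9.

    from scipy import stats

    # Hypothetical body weights (kg) for the two feeding methods;
    # substitute the actual field records when running the test.
    zero_grazing = [50, 55, 58, 62, 65, 67, 70, 70, 75, 82]
    open_paddock = [50, 62, 78, 80, 81, 84, 88, 90, 91, 93]

    u_stat, p_value = stats.mannwhitneyu(zero_grazing, open_paddock,
                                         alternative='two-sided')
    print(u_stat, p_value)   # accept H0 at the 5% level if p_value >= 0.05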
Wilcoxon Signed Rank Test for Two matched Samples
This statistic is very useful in the behavioral sciences. It enables the researcher to make ordinal judgments between the members of any matched pair. The statistic is designed to test the null hypothesis that the two population means are equal against the alternative hypothesis that they are different.
Assumptions
1.	The data consist of n matched pairs of observations from randomly selected samples X and Y, and the Di are the differences between the members of each pair ( Xi, Yi ).

2.	Each difference is a value of a continuous variable, and any observation of Di is a result of chance. This means that each Di is a random variable.

3.	The distribution of the differences is symmetric.

4.	The paired observations represent a random sample of pairs from a bi-variate distribution. Very many other pairs might as well have been chosen from the same population.

5.	The scale of measurement is at least ordinal.
Procedure
If Di is the difference between the members of any matched pair under the two different treatments, the Di are ranked in the order of their absolute values. After that, each rank is given the sign of its difference, to show whether the Di originated from the negative or from the positive side - depending on which observation of the pair was the bigger at each time.

If the treatments A and B are equivalent, and there is no difference between samples A and B, one would expect to find a good mix of ranks carrying a minus sign and ranks carrying a plus sign. In that case the two sums of signed ranks should be about equal if the null hypothesis of no difference between the samples is to be accepted.

If the sum of the positive ranks is very much different from the sum of the negative ranks, we should conclude that treatment A differs from treatment B, and therefore reject the hypothesis. The same rejection would follow if the sum of the negative ranks predominated, meaning that B is very much different from A.

After the ranking exercise one is required to sum all the ranks bearing each sign. If R+ is the sum of all the positively ranked values and R− the sum of all the negatively ranked values, one examines the smaller of the two sums, no matter whether it belongs to the negative or to the positive ranks.
6.	Reject the null hypothesis if the smaller rank-value sum ( R ) is less than Wα/2. The companion critical value is obtained using the formula:

	W1−α/2  =  n( n + 1 )/2 − Wα/2

	Wα/2 is found in the same manner as the one for the median test, and is available in Table A - 8 of your textbook (King'oriah, 2004, page 502). Now let us do the actual example.
Example
A professor of applied psychology selects 10 students randomly from a group of senior classmen to determine whether a speed-reading course has improved their reading speeds. Their reading speeds, in words per minute, are recorded as follows:
TABLE 10 - 4: RESULTS OF A STANDARD SPEED-READING TEST

 Student    Before    After
    1         210      235
    2         160      385
    3         310      660
    4         410      390
    5         130      110
    6         260      260
    7         330      420
    8         185      190
    9         100      140
   10         500      610
Assist the professor to test the null hypothesis that the two population-means, one
for before, and the other for after the course, are equal at 5% alpha level.
Solution
Take the differences between the two observations of each pair and rank them as illustrated in Table 10 - 5. Notice that the sixth difference is eliminated, because there is no difference between the two scores there. This reduces the number of paired scores used in our test to n = 9. There are two positive-ranked values, 2.5 + 2.5 = 5. All the other rank values add to 40.0. Notice that the signs of the ranks are derived from the signs of Di. In that case we have R+ = 5.0 and R− = 40.0.

This means that in our speed-reading example we shall use the test statistic involving the smaller of the two values, R = 5.0. If we now refer to the table we find that:

α = 0.05,    α/2 = 0.025,    1 − α/2 = 0.975,    n = 9
TABLE 10 - 5: RANKED DIFFERENCES OF THE RESULTS OF A STANDARDIZED SPEED-READING COURSE

 Student      Di       Rank of |Di|
    1        −25            4
    2       −225            8
    3       −350            9
    4        +20            2.5
    5        +20            2.5
    6          0            -
    7        −90            6
    8         −5            1
    9        −30            5
   10       −100            7
Examining the table we find that the expected R-value, Wα/2 = W0.025, is 6. Then we compute the companion value W1−α/2 = n( n + 1 )/2 − Wα/2. The computation goes on as follows:

W0.975  =  n( n + 1 )/2 − W0.025  =  9( 9 + 1 )/2 − 6  =  90/2 − 6  =  45 − 6  =  39
Conclusion
The computed value of R+ = 5.0 is smaller than the expected value at n = 9, which is 6.0. It also lies outside the interval demarcated by the two critical limits, 6 and 39. We therefore reject the null hypothesis that the speed-reading course is ineffective and accept the alternative hypothesis that the professor's speed-reading course is effective.
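The same conclusion can be checked with software. Below is a minimal sketch (assuming Python with scipy installed) that applies scipy's Wilcoxon signed-rank routine to the before-and-after readings of Table 10 - 4; scipy discards the zero difference of student 6 automatically and returns the smaller rank sum together with a p-value, which replaces the table lookup.

    from scipy import stats

    # Reading speeds (words per minute) before and after the course, Table 10 - 4.
    before = [210, 160, 310, 410, 130, 260, 330, 185, 100, 500]
    after  = [235, 385, 660, 390, 110, 260, 420, 190, 140, 610]

    # Paired (matched-sample) signed-rank test of H0: no change in reading speed.
    statistic, p_value = stats.wilcoxon(before, after)
    print(statistic, p_value)   # statistic = 5.0, the smaller rank sum

    # Reject H0 at the 5% level when p_value < 0.05.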
The Kruskal-Wallis Test for Several Independent samples.
This test is an extension of the Mann-Whitney test to situations where there are more than two populations. It is the non-parametric analogue of the parametric single-factor, completely randomized design analysis of variance. The data for the test must be in the following form:
TABLE 10 - 6: ARRANGEMENT OF DATA FOR THE KRUSKAL-WALLIS TEST FOR SEVERAL INDEPENDENT SAMPLES

 Sample 1     Sample 2     ..........     Sample k
   X11          X21         ..........      Xk1
   X12          X22         ..........      Xk2
    .            .                            .
    .            .                            .
    .            .                            .
   X1n1         X2n2        ..........      Xknk
The total number of observations is given by N = Σ(i = 1 to k) ni. The test depends on the ranks, and is similar to the Mann-Whitney test. Assign ranks from 1 to N to all N observations when they have been ordered from the smallest to the largest, disregarding from which of the k populations the observations came. Let Ri be the sum of the ranks assigned to the i-th sample, and Ri / ni the corresponding mean rank:

Ri  =  Σ(j = 1 to ni) R( Xij )

where R( Xij ) is the rank assigned to Xij. If the mean ranks are all about the same, this supports the null hypothesis that the k samples came from the same population; if they are not, it indicates that one or more populations are likely to be comprised of larger values than the rest. If ties occur, they are treated as in the Mann-Whitney test.
Definition of the Test

Assumptions

1.	The k random samples are mutually independent.

2.	All the random variables Xij are continuous.

3.	The measurement scale is at least ordinal.
Hypotheses
Ho: The k-population distributions are equal
HA: At least one population tends to yield larger observations than the rest.
The Test Statistic
T  =  [ 12 / ( N( N + 1 ) ) ]  Σ(i = 1 to k)  ( 1 / ni ) [ Ri − ni( N + 1 )/2 ]²

where:

ni = the i-th sample size;
N = Σ(i = 1 to k) ni;
Ri = Σ(j = 1 to ni) R( Xij ),  i = 1, 2, .........., k;  and
R( Xij ) = the rank assigned to observation Xij.
Table A - 10 at the back of your textbook gives the critical T-values at exact significance levels α for k = 3 and samples up to and including a sample size of five. If k > 3, and/or n > 5 for at least one sample, the χ² distribution with k − 1 degrees of freedom may be used to find the approximate critical T-value.

We shall now work out an example for which Table A - 10 is appropriate. The Chi-Square approximation to the critical T-value appears to be good even if k and the ni are only slightly larger than 3 and 5 respectively. For example, if k = 6 and we have an alpha level α = 0.05, the critical T-value is χ²(0.05, k − 1 = 5) = 11.1 (see the Chi-Square Table A - 6 of your textbook; King'oriah, 2004, page 499). We reject the null hypothesis if the T-value is greater than 11.1. If the populations differ, but only in location, then the Kruskal-Wallis test is equivalent to testing the equality of the k population means - which is identical to testing for significance using the Analysis of Variance.
Example
A manager wishes to study the production output of three machines A, B and C. The hourly output of each machine is measured for five randomly selected hours of operation. From the data given in Table 10 - 7, test the null hypothesis, at the 5% significance level, that the three population distributions are equal, using the Kruskal-Wallis Test.
Solution
The data are ordered as in Table 10 - 8 and ranked accordingly. The computations are carried out as hereunder to determine the T-value for the necessary comparison.

R1  =  2 + 4 + 8.5 + 10.5 + 13  =  38
R2  =  1 + 3 + 5 + 6.5  =  15.5
R3  =  6.5 + 8.5 + 10.5 + 12 + 14  =  51.5

And therefore:

T  =  [ 12 / ( 14 × 15 ) ] × { [ 38 − 5( 15 )/2 ]² / 5  +  [ 15.5 − 4( 15 )/2 ]² / 4  +  [ 51.5 − 5( 15 )/2 ]² / 5 }

   =  0.057 × ( 0.05 + 52.5625 + 39.2 )  =  5.233
TABLE 10 - 7: PRODUCTION OUTPUT OF THREE MACHINES

 Observation    Machine A    Machine B    Machine C
      1             25           18           26
      2             22           23           28
      3             31           21           24
      4             26            *           25
      5             20           24           32
From Table A - 10 in the appendix of your textbook (King'oriah, 2004, page 506), the critical T-value is 5.6429. Notice that this value corresponds to:

n1 = 5,   n2 = 5,   n3 = 4

but the order of the sample sizes does not affect the critical value. Since T = 5.233 is not greater than 5.6429, we accept the null hypothesis that the three population distributions are equal.
TABLE 10 - 8: ORDERED DATA FOR PRODUCTION OUTPUT OF THREE MACHINES

  Rank        A        B        C
   1                  18
   2         20
   3                  21
   4         22
   5                  23
   6.5                24       24
   8.5       25                25
  10.5       26                26
  12                           28
  13         31
  14                           32
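For completeness, here is a minimal software check (Python with scipy assumed). scipy's kruskal routine applies a correction for ties, so its H statistic may differ slightly from the hand-computed T of about 5.23, and it returns a p-value in place of the Table A - 10 critical value.

    from scipy import stats

    # Hourly outputs of the three machines, Table 10 - 7 (machine B has one
    # missing observation, so its sample size is four).
    machine_a = [25, 22, 31, 26, 20]
    machine_b = [18, 23, 21, 24]
    machine_c = [26, 28, 24, 25, 32]

    h_stat, p_value = stats.kruskal(machine_a, machine_b, machine_c)
    print(h_stat, p_value)   # accept H0 at the 5% level if p_value >= 0.05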
Rank Correlation Coefficient: Spearman’s Rho.
This is the non-parametric equivalent of the Pearson product-moment correlation coefficient which we discussed in Chapter Seven. It is a measure of association on at least an ordinal scale, so that it is possible to rank the observations under study in two ordered series. The rankings of the two scores are compared by observing the differences between the ranks of the first variable X and the second variable Y. These differences are squared and then added. Finally the results are manipulated in order to obtain a statistic which equals 1.0 if the two rankings are in perfect concordance, or −1.0 if they are in perfect disagreement. The derivation of this statistic involves applying the ordinary formula for the product-moment correlation to the ranks of the two variables instead of to the raw observations. Consequently, the resulting statistic is called "Spearman's Rho" and is designated with the symbol rs or the Greek letter ρ. The interpretation of the statistic is broadly analogous to that of the correlation coefficient, although in this case it is only the relationships of the underlying distributions that are relevant.
Definition of the test
Assumptions
1.	The n pairs ( Xi, Yi ) represent a random sample drawn from a bi-variate population distribution of continuous random variables X and Y.

2.	The measurement scale is at least ordinal.
Hypotheses
Ho:	The ( Xi, Yi ) are uncorrelated.
HA:	Either there is a tendency for larger values of X to be paired with larger values of Y, or there is a tendency for smaller values of X to be paired with larger values of Y.
The Test Statistic
3.	Let R( Xi ) be the rank of Xi and R( Yi ) be the rank of Yi. Ties are handled as usual - assign to each tied value the average of the ranks that would have been assigned had there been no ties. Spearman's Rho - the correlation measure and the test statistic - is given by:

	ρ  =  1  −  [ 6 Σ(i = 1 to n) ( R( Xi ) − R( Yi ) )² ]  /  [ n( n² − 1 ) ]
The Decision Rule

4.	Reject the null hypothesis if the computed ρ is greater than ρ(1 − α/2) or if it is less than ρ(α/2), where ρ(α/2) and ρ(1 − α/2) are given in Table A - 11 of your textbook. If there are no ties, Spearman's rho can also be calculated from the product-moment correlation coefficient:
r  =  Σ(i = 1 to n) ( Xi − X̄ )( Yi − Ȳ )  /  √{ [ Σ(i = 1 to n) ( Xi − X̄ )² ] [ Σ(i = 1 to n) ( Yi − Ȳ )² ] }

by replacing the Xi's and Yi's with their ranks.
Example
"Nyama Choma" (roast meat) is a big public health problem in this country. Men and women have been eating it for a long time, and are now succumbing to debilitating maladies associated with the over-consumption of proteins and fats together with alcohol. This causes the formation of hard calcites within the joints, impairing efficient bone manipulation during locomotion and causing gout and great hardship within families among breadwinners. In addition, the existence of high levels of lipids in the blood causes cholesterol problems, which ultimately cause heart disease, stroke and associated illnesses.

Men and women in this country are eating Nyama Choma in great quantities, because beef, goat meat and mutton are relatively cheap and readily available. In addition, the dishes accompanying the roast meat - like Kachumbari, Mukimo, Ughali, Mutura and others - are easy to prepare, very sweet and delicious.

In an attempt to attack this unfortunate habit of Nyama Choma abuse through public education, Dr. Asaph Mwalukware intends to study the meat-eating habits of men and women of this country, so that he can determine which group of people to target with his information dossiers about the debilitating effects of continuous eating of Nyama Choma combined with copious consumption of alcohol.

Dr. Mwalukware feels that in any home, husbands have the habit of eating more Nyama than their wives (because of hanging out with "buddies" in pubs and night-clubs until late in the evening, while the wives wait at home), but he cannot say for sure that this is the case. He would like to test the null hypothesis of no difference in roast-meat-eating habits, using ten randomly selected couples in Maua Municipality, Kenya.

Each couple is asked their opinion about habitual Nyama Choma eating, and to rate their feelings from "0" (strongly dislike and condemn Nyama in the strongest terms possible) to "100" (I love and adore the habit; Nyama is delicious! "Poa!" How can you do without it?). The ten pairs of ratings are shown in Table 10 - 9 (Source: hypothetical data, adapted from Pfafenberger and Patterson, page 680).
TABLE 10 - 9: PREFERENCE FOR EATING NYAMA-CHOMA AMONG TEN RANDOMLY SELECTED COUPLES IN MAUA MUNICIPALITY

 Couple    Husband ( Xi )    Wife ( Yi )
    1            90              70
    2           100              60
    3            75              60
    4            80              80
    5            60              75
    6            75              90
    7            85             100
    8            40              75
    9            95              85
   10            65              65
Solution
The ranks of all the Xi's and Yi's are given in Table 10 - 10, and the value of Rho is computed using the equation which follows:

ρ  =  1  −  [ 6 Σ(i = 1 to n) ( R( Xi ) − R( Yi ) )² ]  /  [ n( n² − 1 ) ]

   =  1  −  6[ ( 8 − 4 )² + ( 10 − 1.5 )² + ( 4.5 − 1.5 )² + .... + ( 3 − 3 )² ]  /  [ 10( 10² − 1 ) ]

   =  1  −  6( 161 ) / 990

   =  1  −  0.976  =  0.024
TABLE 10 - 10: RANKS OF THE NYAMA-CHOMA EATING PREFERENCES AMONG THE TEN RANDOMLY SELECTED COUPLES IN MAUA MUNICIPALITY

 Couple    Husband ( Xi )    Wife ( Yi )
    1            8                4
    2           10                1.5
    3            4.5              1.5
    4            6                7
    5            2                5.5
    6            4.5              9
    7            7               10
    8            1                5.5
    9            9                8
   10            3                3
This value is obviously insignificant, because it is very nearly zero, given that the Spearman Rank Correlation Coefficient ranges from zero to 1.0 (for a very strong positive relationship) and from zero to −1.0 (for a very strong negative relationship). We might as well conclude that there is no significant relationship between the two groups of ratings. However, since we are testing whether the ratings of husbands and those of wives are uncorrelated, we begin by identifying the mean of no correlation, which is 0.0000. We then look for a way of building a confidence interval at appropriate significance levels which may help us determine the rejection areas. That means we have to be able to compute the number of standard errors from zero within which the parameter rho may be found for there to be no correlation. We look at our t-tables for n − 2 = 10 − 2 = 8 degrees of freedom and a two-tail alpha level of 0.05 (0.025 in each tail). We find that:

t( α/2, n − 2 )  =  t( 0.025, 8 )  =  2.306

For there to be a significant relationship we must be able to use the Rank Correlation Coefficient to compute a t-value which is greater than 2.306 (or less than −2.306). Otherwise the relationship gravitates around 0.0000 correlation, and is therefore non-existent or insignificant.
This means that preference for eating Nyama Choma is not shared jointly by husband and wife; each person has an individual opinion about the importance of the meal. We therefore come to the conclusion that husbands and wives are likely to make independent judgments regarding the meal of their choice. Dr. Mwalukware's target members of the Maua population must therefore be selected on criteria other than gender. He may perhaps choose to investigate Miraa eaters and non-Miraa eaters, married and unmarried men or women, owners and non-owners of businesses, and others.
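A quick software cross-check of the coefficient is sketched below (assuming Python with scipy installed; scipy also supplies a p-value in place of the t-table comparison).

    from scipy import stats

    # Nyama Choma preference ratings of the ten couples, Table 10 - 9.
    husbands = [90, 100, 75, 80, 60, 75, 85, 40, 95, 65]
    wives    = [70, 60, 60, 80, 75, 90, 100, 75, 85, 65]

    rho, p_value = stats.spearmanr(husbands, wives)
    print(round(rho, 3), p_value)   # rho is close to the hand-computed 0.024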
EXERCISES
1.
A drug company is interested in determining how the chemical treatment for a specific form of cancer changes body temperature. Ten patients with the disease are selected at random from a set of patients under experimental control. Their temperatures are measured before and after taking the treatment. The data, given in degrees Fahrenheit, are listed below:
	Patient     1      2      3      4      5      6      7      8      9     10
	Before    98.2   98.4   98.0   99.0   98.6   97.5   98.4  100.0   99.2   98.6
	After     99.4  101.2   97.6   99.8   98.0   98.4   98.4  102.3  101.6   98.8
Test the null hypothesis that the two population means are equal at 1%
significance level, using the Wilcoxon Signed Rank Test.
2.	(a)	Using the Mann-Whitney test at the 0.05 alpha level and the data below, test the null hypothesis that the two samples come from the same population.

	(b)	Given the crop yield/fertilizer figures below, use the analysis of variance method to test the null hypothesis of the equality of the three populations, and compare your results with the Kruskal-Wallis test results. What further assumptions must you make in order to use the analysis of variance F-test?
	Observation      1      2      3      4      5      6      7      8
	Fertilizer A   80.5   76.4   93.2   90.6   84.7   81.2   78.5   82.0
	Fertilizer B   95.4   84.7   88.1   98.2  101.6   88.6   96.4   97.3

3.
A production manager suspects that the level of production among a specific class of workers in the firm is related to their hourly pay. The following data are collected on eight randomly selected workers.
	Worker                 1      2      3      4      5      6      7      8
	Hourly Pay ( X ) ($)  3.10   2.50   4.45   2.75   5.00   5.00   2.90   4.75
	Production ( Y )       50     20     62     30     75     60     42     60
	(a)	Calculate the Spearman's Rho for the data.

	(b)	Calculate the Pearson product-moment correlation coefficient for this data. Outline what assumptions are necessary to compute the Pearson r.

	(c)	Using the Spearman's Rho, test the null hypothesis that the values of X and Y are uncorrelated.
REFERENCES
Class Texts
King'oriah, George K. (2004), Fundamentals of Applied Statistics. Jomo Kenyatta Foundation, Nairobi.

Steel, Robert G.D. and James H. Torrie (1980), Principles and Procedures of Statistics: A Biometric Approach. McGraw-Hill Book Company, New York.

Additional References

Irandu, Evaristus Makunyi (1982), "The Road Network in Mombasa Municipal Area: A Spatial Analysis of its Effects on Land Values, Population Density and Travel Patterns". Unpublished M.A. Thesis, University of Nairobi.

Gibbons, Jean D. (1970), Non-Parametric Statistical Inference. McGraw-Hill Book Company, New York (N.Y.).

Keller, Gerrald, Brian Warrack and Henry Bartel (1994), Statistics for Management and Economics. Duxbury Press, Belmont (California).

Levine, David M., David Stephan, et al. (2006), Statistics for Managers. Prentice Hall of India, New Delhi.

Pfaffenberger, Roger C. (1977), Statistical Methods for Business and Economics. Richard D. Irwin, Homewood, Illinois (U.S.A.).

Salvatore, Dominick, and Derrick Reagle (2002), Statistics and Econometrics. McGraw-Hill Book Company, New York (N.Y.).

Siegel, Sidney C. (1956), Non-Parametric Statistics for the Behavioral Sciences. McGraw-Hill Book Company, New York.

Snedecor, George W. and William G. Cochran (1967), Statistical Methods. Iowa University Press, Ames, Iowa (U.S.A.).