Business Statistics:
Revealing Facts From Figures
URL for this site is:
http://ubmail.ubalt.edu/~harsham/Business-stat/opre504.htm
Interactive Online Version
Europe Mirror Site
I am always happy to help students who are not enrolled in my
courses with questions and problems. But unfortunately, I don't
have enough time to respond to everyone. Thank you for your
understanding.
Professor Hossein Arsham
MENU
Course Information (for students enrolled in my class)
Introduction
Towards Statistical Thinking For Decision Making Under Uncertainties
Probability for Statistical Inference
Topics in Business Statistics
Statistical Books List
Interesting and Useful Sites

Introduction
Towards Statistical Thinking For Decision Making Under Uncertainties
The Birth of Statistics
What is Business Statistics
Belief, Opinion, and Fact
Kinds of Lies: Lies, Damned Lies and Statistics
Probability for Statistical Inference
Different Schools of Thought in Inferential Statistics
Bayesian, Frequentist, and Classical Methods
Probability, Chance, Likelihood, and Odds
How to Assign Probabilities
General Laws of Probability
Mutually Exclusive versus Independent Events
Entropy Measure
Applications of and Conditions for Using Statistical Tables
Relationships Among Distributions and Unification of Statistical Tables
Normal Distribution
Binomial Distribution
Poisson Distribution
Exponential Distribution
Uniform Distribution
Student's t-Distributions
Topics in Business Statistics
Greek Letters Commonly Used in Statistics
Type of Data and Levels of Measurement
Sampling Methods
Number of Class Intervals in a Histogram
How to Construct a Box Plot
Outlier Removal
Statistical Summaries
Representative of a Sample: Measures of Central Tendency
Selecting Among the Mean, Median, and Mode
Quality of a Sample: Measures of Dispersion
Guess a Distribution to Fit Your Data: Skewness & Kurtosis
A Numerical Example & Discussions
What Is So Important About the Normal Distributions
What Is a Sampling Distribution
What Is Central Limit Theorem
What Is "Degrees of Freedom"
Parameters' Estimation and Quality of a 'Good' Estimate
Procedures for Statistical Decision Making
Statistics with Confidence and Determining Sample Size
Hypothesis Testing: Rejecting a Claim
The Classical Approach to the Test of Hypotheses
The Meaning and Interpretation of P-values (what the data say)
Blending the Classical and the P-value Based Approaches in Test of Hypotheses
Conditions Under Which Most Statistical Testings Apply
Homogeneous Population (Don't mix apples and oranges)
Test for Randomness: The Runs Test
Lilliefors Test for Normality
Statistical Tests for Equality of Populations Characteristics
Two-Population Independent Means (T-test)
Two Dependent Means (T-test for paired data sets)
More Than Two Independent Means (ANOVA)
More Than Two Dependent Means (ANOVA)
Power of a Test
Parametric vs. Non-Parametric vs. Distribution-free Tests
Chi-square Tests
Bonferroni Method
Goodness-of-fit Test for Discrete Random Variables
When We Should Pool Variance Estimates
Resampling Techniques: Jackknifing, and Bootstrapping
What is a Linear Least Squares Model
Pearson's and Spearman's Correlations
How to Compare Two Correlations Coefficients
Independence vs. Correlated
Correlation, and Level of Significance
Regression Analysis: Planning, Development, and Maintenance
Predicting Market Response
Warranties: Statistical Planning and Analysis
Factor Analysis
Interesting and Useful Sites (topical category)
Selected Reciprocal Web Sites
Review of Statistical Tools on the Internet
General References
Statistical Societies & Organizations
Statistics References
Statistics Resources
Statistical Data Analysis
Probability Resources
Data and Data Analysis
Computational Probability and Statistics Resources
Questionnaire Design, Surveys Sampling and Analysis
Statistical Software
Learning Statistics
Econometric and Forecasting
Selected Topics
Glossary Collections Sites
Statistical Tables
Introduction
This Web site is a course in statistics appreciation; that
is, it aims to give you a feel for the statistical way of
thinking. It is an introductory course in statistics
designed to provide you with the basic concepts and
methods of statistical analysis for processes and
products. Materials in this Web site are tailored to meet
your needs in business decision making, and they
promote thinking statistically. The cardinal objective of
this Web site is to increase the extent to which
statistical thinking is embedded in management
thinking for decision making under uncertainty. It is
already an accepted fact that "Statistical thinking will
one day be as necessary for efficient citizenship as the
ability to read and write." So, let's be ahead of our time.
To be competitive, businesses must design quality into
products and processes. Further, they must facilitate a
process of never-ending improvement at all stages of
manufacturing. A strategy employing statistical
methods, particularly statistically designed
experiments, produces processes that provide high
yield and products that seldom fail. Moreover, it
facilitates development of robust products that are
insensitive to changes in the environment and internal
component variation. Carefully planned statistical
studies remove hindrances to high quality and
productivity at every stage of production, saving time
and money. It is well recognized that quality must be
engineered into products as early as possible in the
design process. One must know how to use carefully
planned, cost-effective experiments to improve,
optimize and make robust products and processes.
Business Statistics is a science that assists you in
making business decisions under uncertainty based on
numerical and measurable scales. The decision-making
process must be based on data, not on personal
opinion or belief.
Know that data are only crude information and not
knowledge by themselves. The sequence from data to
knowledge is: from Data to Information, from
Information to Facts, and finally, from Facts to
Knowledge. Data becomes information when it
becomes relevant to your decision problem.
Information becomes fact when the data can support
it. Fact becomes knowledge when it is used in the
successful completion of decision process. The
following figure illustrates the statistical thinking
process based on data in constructing statistical
models for decision making under uncertainties.
Knowledge is more than knowing something technical.
Knowledge needs wisdom, and wisdom comes with age
and experience. Wisdom is about knowing how
something technical can be best used to meet the
needs of the decision-maker. Wisdom, for example,
creates statistical software that is useful, rather than
technically brilliant.
The Devil is in the Deviations: Variation is an
inevitability in life! Every process has variation. Every
measurement. Every sample! Managers need to
understand variation for two key reasons: first, so that
they can lead others to apply statistical thinking in
day-to-day activities, and second, so that they can apply
the concept to continuous improvement. This course
will provide you with hands-on experience in using
statistical thinking and techniques to make educated
decisions whenever you encounter variation in business
data. You will learn techniques to intelligently assess
and manage the risks inherent in decision making.
Therefore, remember that:
Just like the weather, if you cannot control
something, you should learn how to measure and
analyze it, in order to predict it effectively.
If you have taken statistics before and feel unable to
grasp the concepts, it may largely be due to having had
non-statistician instructors teach you statistics.
Their deficiencies lead students to develop phobias
about the sweet science of statistics. In this respect,
the following remark was made by Professor Herman
Chernoff, in Statistical Science, Vol. 11, No. 4, 335-350, 1996:
"Since everybody in the world thinks he can
teach statistics even though he does not
know any, I shall put myself in the position
of teaching biology even though I do not
know any"
Plugging numbers into formulas and crunching them
has no value by itself. You should continue to
put effort into the concepts and concentrate on
interpreting the results.
Even when you solve a small problem by hand, I
would like you to use the available computer software
and Web-based computation to do the dirty work for
you.
You must be able to read off the logic hidden in any
formula rather than memorizing it. For example, in
computing the variance, consider its formula. Instead
of memorizing it, you should start with some whys:
i. Why do we square the deviations from the mean?
Because if we add up all the deviations we always get
zero. So, to get around this problem, we square the
deviations. Why not raise them to the power of four (three
will not work, since the signs would remain)? Since squaring
does the trick, why should we make life more complicated
than it is? Notice also that squaring magnifies the deviations,
which works to our advantage in measuring the
quality of the data.
ii. Why is there a summation notation in the formula?
To add up the squared deviation of each data point and
compute the total sum of squared deviations.
iii. Why do we divide the sum of squares by n-1?
The amount of deviation should also reflect how large
the sample is, so we must bring in the sample size.
That is, in general, larger samples have a larger sum of
squared deviations from the mean. Okay, but why n-1 and
not n? The reason is that when you divide by n-1,
the sample variance is, on average, much closer to the
population variance than when you divide by n.
Note that for a large sample size n (say,
over 30) it really does not matter whether you divide
by n or by n-1; the results are almost the same, and
both are acceptable. The factor n-1 is the so-called "degrees
of freedom".
This was just an example to show you how to question
formulas rather than memorize them. In fact, when you
try to understand the formulas, you do not need to
remember them; they become part of your brain's
connectivity. Clear thinking is always more
important than the ability to do a lot of arithmetic.
When you look at a statistical formula, the formula
should talk to you, just as a musician hears the music
when looking at a piece of sheet music. How do you
become a statistician who is also a musician?
The objectives for this course are to learn statistical
thinking; to emphasize more data and concepts, less
theory and fewer recipes; and finally to foster active
learning using, e.g., the useful and interesting Web sites.
Some Topics in Business Statistics
Greek Letters Commonly Used as Statistical Notations
We use Greek letters in statistics and other scientific
areas to honor the ancient Greek philosophers who
invented science (such as Socrates, the inventor of
dialectic reasoning).
Greek Letters Commonly Used as Statistical Notations
Name:   alpha  beta  chi-square  delta  mu  nu  pi  rho  sigma  tau  theta
Symbol: α      β     χ²          δ      μ   ν   π   ρ    σ      τ    θ
Note: chi-square, χ², is not the square of any quantity
called "chi"; its name is simply Chi-square (read
"ki-square"). There is no "ki" in statistics. I'm glad that
you're overcoming the confusions that exist in
learning statistics.
The Birth of Statistics
The original idea of "statistics" was the collection of
information about and for the "State".
The birth of statistics occurred in the mid-17th century.
A commoner named John Graunt, a native of
London, began reviewing a weekly church publication,
issued by the local parish clerk, that listed the number
of births, christenings, and deaths in each parish.
These so-called Bills of Mortality also listed the causes
of death. Graunt, who was a shopkeeper, organized these
data in the form we now call descriptive statistics, which
was published as Natural and Political Observations Made
upon the Bills of Mortality. Shortly thereafter, he was
elected a member of the Royal Society. Thus, statistics
had to borrow some concepts from sociology, such as
the concept of "Population". It has been argued that
since statistics usually involves the study of human
behavior, it cannot claim the precision of the physical
sciences.
Probability has a much longer history. It originated in
the study of games of chance and gambling during the
sixteenth century, and probability theory was a branch of
mathematics studied by Blaise Pascal and Pierre de
Fermat in the seventeenth century. Today, in the
21st century, probabilistic models are used to
control the flow of traffic through a highway system, a
telephone interchange, or a computer processor; to find
the genetic makeup of individuals or populations; and in
quality control, insurance, investment, and other
sectors of business and industry.
New and ever-growing diverse fields of human
activity are using statistics; however, it seems that
the field itself remains obscure to the public. Professor
Bradley Efron expressed this fact nicely:
During the 20th Century statistical thinking and
methodology have become the scientific
framework for literally dozens of fields including
education, agriculture, economics, biology, and
medicine, and with increasing influence recently
on the hard sciences such as astronomy, geology,
and physics. In other words, we have grown from
a small obscure field into a big obscure field.
For the history of probability, and history of statistics,
visit History of Statistics Material. I also recommend
the following books.
Further Readings:
Daston L., Classical Probability in the Enlightenment,
Princeton University Press, 1988.
The book points out that early Enlightenment thinkers
could not face uncertainty; a mechanistic, deterministic
machine was the Enlightenment view of the world.
Gillies D., Philosophical Theories of Probability,
Routledge, 2000. Covers the classical, logical,
subjective, frequency, and propensity views.
Hacking I., The Emergence of Probability, Cambridge
University Press, London, 1975.
A philosophical study of early ideas about probability,
induction and statistical inference.
Peters W., Counting for Something: Statistical
Principles and Personalities, Springer, New York, 1987.
It teaches the principles of applied economic and social
statistics in a historical context. Featured topics include
public opinion polls, industrial quality control, factor
analysis, Bayesian methods, program evaluation, nonparametric and robust methods, and exploratory data
analysis.
Porter T., The Rise of Statistical Thinking, 1820-1900,
Princeton University Press, 1986.
The author states that statistics has become known in
the twentieth century as the mathematical tool for
analyzing experimental and observational data.
Enshrined by public policy as the only reliable basis for
judgments as the efficacy of medical procedures or the
safety of chemicals, and adopted by business for such
uses as industrial quality control, it is evidently among
the products of science whose influence on public and
private life has been most pervasive. Statistical
analysis has also come to be seen in many scientific
disciplines as indispensable for drawing reliable
conclusions from empirical results. This new field of
mathematics has found an extensive domain of
applications.
Stigler S., The History of Statistics: The Measurement
of Uncertainty Before 1900, U. of Chicago Press, 1990.
It covers the people, ideas, and events underlying the
birth and development of early statistics.
Tankard J., The Statistical Pioneers, Schenkman Books,
New York, 1984.
This work provides the detailed lives and times of
theorists whose work continues to shape much of
modern statistics.
What is Business Statistics?
In this diverse world of ours, no two things are exactly
the same. A statistician is interested in both
the differences and the similarities, i.e. both
patterns and departures.
The actuarial tables published by insurance companies
reflect their statistical analysis of the average life
expectancy of men and women at any given age. From
these numbers, the insurance companies then calculate
the appropriate premiums for a particular individual to
purchase a given amount of insurance.
Exploratory analysis of data makes use of numerical
and graphical techniques to study patterns and
departures from patterns. The widely used descriptive
statistical techniques are: Frequency Distribution
Histograms; Box & Whisker and Spread plots; Normal
plots; Cochrane (odds ratio) plots; Scattergrams and
Error Bar plots; Ladder, Agreement and Survival plots;
Residual, ROC and diagnostic plots; and Population
pyramid. Graphical modeling is a collection of powerful
and practical techniques for simplifying and describing
inter-relationships between many variables, based on
the remarkable correspondence between the statistical
concept of conditional independence and the graph-theoretic concept of separation.
The controversial 1995 "Million Man March" on
Washington demonstrated that the size of a rally can have
important political consequences. March organizers
steadfastly maintained that the official attendance
estimate offered by the U.S. Park Service (300,000)
was too low. Was it?
In examining distributions of data, you should be able
to detect important characteristics, such as shape,
location, variability, and unusual values. From careful
observations of patterns in data, you can generate
conjectures about relationships among variables. The
notion of how one variable may be associated with
another permeates almost all of statistics, from simple
comparisons of proportions through linear regression.
The difference between association and causation must
accompany this conceptual development.
Data must be collected according to a well-developed
plan if valid information on a conjecture is to be
obtained. The plan must identify important variables
related to the conjecture and specify how they are to
be measured. From the data collection plan, a
statistical model can be formulated from which
inferences can be drawn.
Statistical models are currently used in various fields of
business and science. However, the terminology differs
from field to field. For example, the fitting of models to
data is variously called calibration, history matching, and
data assimilation, all of which are synonymous with
parameter estimation.
That's why we need Business Statistics. Statistics arose
from the need to place knowledge on a systematic
evidence base. This required a study of the laws of
probability, the development of measures of data
properties and relationships, and so on.
The main objective of Business Statistics is to make
inference (prediction, making decisions) about certain
characteristics of a population based on information
contained in a random sample from the entire
population, as depicted below:
Business Statistics is the science of 'good' decision
making in the face of uncertainty and is used in many
disciplines such as financial analysis, econometrics,
auditing, production and operations including services
improvement, and marketing research. It provides
knowledge and skills to interpret and use statistical
techniques in a variety of business applications. A
typical Business Statistics course is intended for
business majors, and covers statistical study,
descriptive statistics (collection, description, analysis,
and summary of data), probability, and the binomial
and normal distributions, test of hypotheses and
confidence intervals, linear regression, and correlation.
The following discussion refers to the above chart.
Statistics is a science of making decisions with respect
to the characteristics of a group of persons or objects
on the basis of numerical information obtained from a
randomly selected sample of the group.
At the planning stage of a statistical investigation the
question of sample size (n) is critical. This course
provides a practical introduction to sample size
determination in the context of some commonly used
significance tests.
Population: A population is any entire collection of
people, animals, plants or things from which we may
collect data. It is the entire group we are interested in,
which we wish to describe or draw conclusions about.
In the above figure, the lifetimes of the light bulbs
manufactured by, say, GE constitute the population of interest.
Statistical Experiment
In order to make any generalization about a
population, a random sample from the entire
population, that is meant to be representative of the
population, is often studied. For each population there
are many possible samples. A sample statistic gives
information about a corresponding population
parameter. For example, the sample mean for a set of
data would give information about the overall
population mean μ.
It is important that the investigator carefully and
completely defines the population before collecting the
sample, including a description of the members to be
included.
Example: The population for a study of infant health
might be all children born in the U.S.A. in the 1980's.
The sample might be all babies born on 7th May in any
of the years.
An experiment is any process or study which results in
the collection of data, the outcome of which is
unknown. In statistics, the term is usually restricted to
situations in which the researcher has control over
some of the conditions under which the experiment
takes place.
Example: Before introducing a new drug treatment to
reduce high blood pressure, the manufacturer carries
out an experiment to compare the effectiveness of the
new drug with that of one currently prescribed. Newly
diagnosed subjects are recruited from a group of local
general practices. Half of them are chosen at random
to receive the new drug, the remainder receive the
present one. So, the researcher has control over the
type of subject recruited and the way in which they are
allocated to treatment.
Experimental (or Sampling) Unit: A unit is a person,
animal, plant or thing which is actually studied by a
researcher; the basic objects upon which the study or
experiment is carried out. For example, a person; a
monkey; a sample of soil; a pot of seedlings; a
postcode area; a doctor's practice.
Design of experiments is a key tool for increasing
the rate of acquiring new knowledge–knowledge that in
turn can be used to gain competitive advantage,
shorten the product development cycle, and produce
new products and processes which will meet and
exceed your customer's expectations.
The major task of statistics is to study the
characteristics of populations whether these
populations are people, objects, or collections of
information. For two major reasons, it is often
impossible to study an entire population:
The process would be too expensive or time
consuming.
The process would be destructive.
In either case, we would resort to looking at a sample
chosen from the population and trying to infer
information about the entire population by only
examining the smaller sample. Very often the numbers
which interest us most about the population are the
mean μ and standard deviation σ. Any number -- like
the mean or standard deviation -- which is calculated
from an entire population is called a Parameter. If the
very same numbers are derived only from the data of a
sample, then the resulting numbers are
called Statistics. Frequently, parameters are
represented by Greek letters and statistics by Latin
letters (as shown in the above Figure). The step
function in this figure is the Empirical Distribution
Function (EDF), also known as the Ogive, which is used
to graph cumulative frequency. An EDF is constructed
by placing a point corresponding to the middle point
of each class at a height equal to the cumulative
frequency of the class. The EDF represents the
distribution function F(x).
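As a small sketch of how such cumulative frequencies are obtained (the data values and class boundaries below are invented for illustration), one might compute the ogive points in Python as follows:

# Cumulative-frequency (ogive) points for grouped data; values and classes are illustrative
data = [12, 15, 17, 21, 22, 24, 27, 28, 31, 35, 38, 41]
edges = [10, 20, 30, 40, 50]                     # assumed class boundaries

cumulative = 0
for low, high in zip(edges[:-1], edges[1:]):
    count = sum(low <= x < high for x in data)   # frequency of the class [low, high)
    cumulative += count
    midpoint = (low + high) / 2
    print(f"class [{low}, {high}): midpoint = {midpoint}, cumulative frequency = {cumulative}")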
Parameter
A parameter is a value, usually unknown (and one that
therefore has to be estimated), used to represent a
certain population characteristic. For example, the
population mean is a parameter that is often used to
indicate the average value of a quantity.
Within a population, a parameter is a fixed value which
does not vary. Each sample drawn from the population
has its own value of any statistic that is used to
estimate this parameter. For example, the mean of the
data in a sample is used to give information about the
overall mean in the population from which that
sample was drawn.
Statistic: A statistic is a quantity that is calculated from
a sample of data. It is used to give information about
unknown values in the corresponding population. For
example, the average of the data in a sample is used
to give information about the overall average in the
population from which that sample was drawn.
It is possible to draw more than one sample from the
same population and the value of a statistic will in
general vary from sample to sample. For example, the
average value in a sample is a statistic. The average
values in more than one sample, drawn from the same
population, will not necessarily be equal.
Statistics are often assigned Roman letters (e.g., x̄ and s),
whereas the equivalent unknown values in the population
(parameters) are assigned Greek letters (e.g., μ and σ).
The word estimate means to esteem, that is giving a
value to something. A statistical estimate is an
indication of the value of an unknown quantity based
on observed data.
More formally, an estimate is the particular value of an
estimator that is obtained from a particular sample of
data and used to indicate the value of a parameter.
Example: Suppose the manager of a shop wanted to
know μ, the mean expenditure of customers in her
shop in the last year. She could calculate the average
expenditure of the hundreds (or perhaps thousands) of
customers who bought goods in her shop, that is, the
population mean μ. Instead, she could use an estimate
of this population mean by calculating the mean of a
representative sample of customers. If this value were
found to be $25, then $25 would be her estimate.
There are two broad subdivisions of statistics:
Descriptive statistics and Inferential statistics.
The principal descriptive quantity derived from sample
data is the mean (x̄), which is the arithmetic
average of the sample data. It serves as the most
reliable single measure of the value of a typical
member of the sample. If the sample contains a few
values that are so large or so small that they have an
exaggerated effect on the value of the mean, the
sample is more accurately represented by the median --
the value where half the sample values fall below and
half above.
The quantities most commonly used to measure the
dispersion of the values about their mean are the
variance s² and its square root, the standard deviation
s. The variance is calculated by determining the mean,
subtracting it from each of the sample values (yielding
the deviation of the samples), and then averaging the
squares of these deviations. The mean and standard
deviation of the sample are used as estimates of the
corresponding characteristics of the entire group from
which the sample was drawn. They do not, in general,
completely describe the distribution F(x) of values
within either the sample or the parent group; indeed,
different distributions may have the same mean and
standard deviation. They do, however, provide a
complete description of the Normal Distribution, in
which positive and negative deviations from the mean
are equally common and small deviations are much
more common than large ones. For a normally
distributed set of values, a graph showing the
dependence of the frequency of the deviations upon
their magnitudes is a bell-shaped curve. About 68
percent of the values will differ from the mean by less
than the standard deviation, and almost 100 percent
will differ by less than three times the standard
deviation.
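As a quick check of these descriptive quantities and of the 68-percent statement, here is a small Python sketch using only the standard library; the data are simulated from a roughly normal model, so the exact numbers will vary from run to run:

import random, statistics

random.seed(0)
values = [random.gauss(100, 15) for _ in range(10000)]   # simulated, roughly normal data

m = statistics.mean(values)            # sample mean
s = statistics.stdev(values)           # sample standard deviation (divides by n-1)
within_1sd = sum(abs(x - m) < s for x in values) / len(values)
within_3sd = sum(abs(x - m) < 3 * s for x in values) / len(values)

print("mean:", round(m, 2), " standard deviation:", round(s, 2))
print("share within 1 standard deviation :", round(within_1sd, 3))   # close to 0.68
print("share within 3 standard deviations:", round(within_3sd, 4))   # close to 1.00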
Statistical inference refers to extending your
knowledge obtained from a random sample from the
entire population to the whole population. This is
known in mathematics as Inductive Reasoning; that is,
knowledge of the whole from a particular part. Its main
application is in hypothesis testing about a given
population.
Inferential statistics is concerned with making
inferences from samples about the populations from
which they have been drawn. In other words, if we find
a difference between two samples, we would like to
know, is this a "real" difference (i.e., is it present in the
population) or just a "chance" difference (i.e. it could
just be the result of random sampling error). That's
what tests of statistical significance are all about.
Statistical inference guides the selection of appropriate
statistical models. Models and data interact in
statistical work. Models are used to draw conclusions
from data, while the data are allowed to criticize, and
even falsify the model through inferential and
diagnostic methods. Inference from data can be
thought of as the process of selecting a reasonable
model, including a statement in probability language of
how confident one can be about the selection.
Inferences made in statistics are of two types. The first
is estimation, which involves the determination, with a
possible error due to sampling, of the unknown value
of a population characteristic, such as the proportion
having a specific attribute or the average value μ of
some numerical measurement. To express the
accuracy of the estimates of population characteristics,
one must also compute the "standard errors" of the
estimates; these are margins that determine the
possible errors arising from the fact that the estimates
are based on random samples from the entire
population and not on a complete population census.
The second type of inference is hypothesis testing. It
involves the definitions of a "hypothesis" as one set of
possible population values and an "alternative," a
different set. There are many statistical procedures for
determining, on the basis of a sample, whether the
true population characteristic belongs to the set of
values in the hypothesis or the alternative.
Statistical inference is grounded in probability and in
idealized concepts of the group under study, called the
population, and the sample. The statistician may view
the population as a set of balls from which the sample
is selected at random, that is, in such a way that each
ball has the same chance as every other one for
inclusion in the sample.
Notice that to be able to estimate the population
parameters, the sample size n must be greater than
one. For example, with a sample size of one, the
variation (s²) within the sample is 0/1 = 0. An estimate
for the variation (σ²) within the population would be
0/0, which is an indeterminate quantity, meaning that
it is impossible to compute. For working with zero correctly,
visit the Web site The Zero Saga & Confusions With Numbers.
Probability is the tool used for anticipating what the
distribution of data should look like under a given
model. Random phenomena are not haphazard: they
display an order that emerges only in the long run and
is described by a distribution. The mathematical
description of variation is central to statistics. The
probability required for statistical inference is not
primarily axiomatic or combinatorial, but is oriented
toward describing data distributions.
Statistics is a tool that enables us to impose order on
the disorganized cacophony of the real world of
modern society. The business world has grown both in
size and competition. Corporations must undertake risky
ventures, hence the growth in popularity of, and need
for, business statistics.
Business statistics has grown out of the art of
constructing charts and tables! It is a science of basing
decisions on numerical data in the face of uncertainty.
Business statistics is a scientific approach to decision
making under risk. In practicing business statistics, we
search for an insight, not the solution. Our search is for
the one solution that meets all the business's needs
with the lowest level of risk. Business statistics can
take a normal business situation and with the proper
data gathering, analysis, and re-search for a solution,
turn it into an opportunity.
While business statistics cannot replace the knowledge
and experience of the decision maker, it is a valuable
tool that the manager can employ to assist in the
decision making process in order to reduce the
inherent risk.
Business Statistics provides justifiable answers to the
following concerns for every consumer and producer:
1. What is your or your customer's Expectation of
the product/service you buy or that you sell? That
is, what is a good estimate for μ?
2. Given the information about your or your
customer's expectation, what is the Quality of the
product/service you buy or you sell? That is, what
is a good estimate for σ?
3. Given the information about your or your
customer's expectation, and the quality of the
product/service you buy or you sell, how does the
product/service Compare with other existing
similar types? That is, comparing several μ's.
Visit also the following Web sites:
What is Statistics?
How to Study Statistics
Decision Analysis
Kinds of Lies: Lies, Damned Lies and Statistics
"There are three kinds of lies -- lies, damned lies, and
statistics." quoted in Mark Twain's autobiography.
It is already an accepted fact that "Statistical thinking
will one day be as necessary for efficient citizenship as
the ability to read and write."
The following are some examples of how statistics
could be misused in advertising, which can be
described as the science of arresting human
unintelligence long enough to get money from it. The
founder of Revlon said: "In the factory we make cosmetics;
in the store we sell hope."
In most cases, the deception of advertising is achieved
by omission:
1. The Incredible Expansion Toyota: "How can it
be that an automobile that's a mere nine inches
longer on the outside give you over two feet more
room on the inside? Maybe it's the new math!"
Toyota Camry Ad.
Where is the fallacy in this statement? Treating
volume as length! For example: 3 x 6 x 4 = 72 cubic
feet, while 3 x 6 x 4.75 = 85.5 cubic feet. The gain
could be even more than 2 feet!
2. Pepsi Cola Ad.: " In recent side-by-side blind
taste tests, nationwide, more people preferred
Pepsi over Coca-Cola".
The questions are: Was it just some of the taste
tests? What was the sample size? It does not say
"In all recent…"
3. Correlation? Consortium of Electric Companies
Ad. "96% of streets in the US are under-lit and,
moreover, 88% of crimes take place on under-lit
streets".
4. Dependent or Independent Events? "If the
probability of someone carrying a bomb on a
plane is .001, then the chance of two people
carrying a bomb is .000001. Therefore, I should
start carrying a bomb on every flight."
5. Paperboard Packaging Council's
concerns: "University studies show paper milk
cartons give you more vitamins to the gallon."
How was the experiment designed? The research
was sponsored by the council! Paperboard sales are
declining!
6. All the vitamins or just one? "You'd have to
eat four bowls of Raisin Bran to get the vitamin
nutrition in one bowl of Total".
7. Six Times as Safe: "Last year 35 people
drowned in boating accidents. Only 5 were
wearing life jackets. The rest were not. Always
wear a life jacket when boating".
What percentage of boaters wear life jackets? This
is a question of conditional probability.
8. A Tax Accountant Firm Ad.: "One of our
officers would accompany you in the case of
Audit".
This sounds like a unique selling proposition, but
it conceals the fact that what is offered is already
required by US law.
9. Dunkin Donuts Ad.: "Free 3 muffins when you
buy three at the regular 1/2 dozen price."
References and Further Readings:
Dewdney A., 200% of Nothing, John Wiley, New
York, 1993. Based on his articles about math abuse in
Scientific American, Dewdney lists the many ways we
are manipulated with fancy mathematical footwork and
faulty thinking in print ads, the news, company reports
and product labels. He shows how to detect the full
range of math abuses and defend against them.
Schindley W., The Informed Citizen: Argument and
Analysis for Today, Harcourt Brace, 1996. This
rhetoric/reader explores the study and practice of
writing argumentative prose. The focus is on exploring
current issues in communities, from the classroom to
cyberspace. The "interacting in communities" theme
and the high-interest readings engage students, while
helping them develop informed opinions, effective
arguments, and polished writing.
Visit also the Web site: Glossary of Mathematical
Mistakes.
Belief, Opinion, and Fact
The letters in your course number: OPRE 504, stand
for OPerations RE-search. OPRE is a science of
making decisions (based on some numerical and
measurable scales) by searching, and re-searching for
a solution. I refer you to What Is OR/MS? for a deeper
understanding of what OPRE is all about. Decision
making under uncertainty must be based on facts, not
on personal opinion or belief.
Belief, Opinion, and Fact
                 Belief           Opinion              Fact
Self says        "I'm right"      "This is my view"    "This is a fact"
Says to others   "You're wrong"   "That is yours"      "I can prove it to you"
Sensible decisions are always based on facts. We
should not confuse facts with beliefs or opinions.
Beliefs are defined as someone's own understanding or
needs. In belief, "I am" always right and "you" are
wrong. There is nothing that can be done to convince
the person that what they believe in is wrong. Opinions
are slightly less extreme than beliefs. An opinion
means that a person has certain views that they think
are right. They also know that others are entitled to
their own opinions. People respect others' opinions and
in turn expect the same. Contrary to beliefs and
opinions are facts. Facts are the basis of decisions. A
fact is something that is right, and one can prove it to
be true based on evidence and logical arguments.
Examples for belief, opinion, and facts can be found in
religion, economics, and econometrics, respectively.
With respect to belief, Henri Poincaré said "Doubt
everything or believe everything: these are two equally
convenient strategies. With either we dispense with the
need to think."
How to Assign Probabilities?
Probability is an instrument to measure the likelihood
of the occurrence of an event. There are three major
approaches to assigning probabilities, as follows:
1. Classical Approach: Classical probability is
predicated on the condition that the outcomes of
an experiment are equally likely to happen. The
classical probability utilizes the idea that the lack
of knowledge implies that all possibilities are
equally likely. The classical probability is applied
when the events have the same chance of
occurring (called equally likely events), and the
set of events are mutually exclusive and
collectively exhaustive. The classical probability is
defined as:
P(X) = Number of favorable outcomes / Total
number of possible outcomes
2. Relative Frequency Approach: Relative probability
is based on accumulated historical or
experimental data. Frequency-based probability is
defined as:
P(X) = Number of times an event occurred / Total
number of opportunities for the event to occur.
Note that relative probability is based on the
idea that what has happened in the past will
continue to hold (see the sketch after this list).
3. Subjective Approach: The subjective probability is
based on personal judgment and experience. For
example, medical doctors sometimes assign
subjective probability to the length of life
expectancy for a person who has cancer.
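As a small illustration of the classical and relative-frequency approaches above, the following Python sketch (the die-rolling setup, seed, and number of trials are arbitrary choices for illustration) compares the classical probability of rolling a four with a frequency-based estimate from simulated rolls:

import random

random.seed(42)
trials = 60000

# Classical approach: one favorable outcome out of six equally likely outcomes
classical = 1 / 6

# Relative-frequency approach: proportion of simulated rolls that show a four
rolls = [random.randint(1, 6) for _ in range(trials)]
frequency_based = rolls.count(4) / trials

print("classical probability      :", round(classical, 4))
print("relative-frequency estimate:", round(frequency_based, 4))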
General Laws of Probability
1. General Law of Addition: When two or more
events will happen at the same time, and the
events are not mutually exclusive, then:
P(X or Y) = P(X) + P(Y) - P(X and Y)
2. Special Law of Addition: When two or more
events will happen at the same time, and the
events are mutually exclusive, then:
P(X or Y) = P(X) + P(Y)
3. General Law of Multiplication: When two or
more events will happen at the same time, and
the events are dependent, then the general rule
of multiplicative law is used to find the joint
probability:
P(X and Y) = P(X) . P(Y|X),
where P(Y|X) is a conditional probability.
4. Special Law of Multiplication: When two or
more events will happen at the same time, and
the events are independent, then the special rule
of multiplication law is used to find the joint
probability:
P(X and Y) = P(X) . P(Y)
5. Conditional Probability Law: A conditional
probability is denoted by P(X|Y). This phrase is
read: the probability that X will occur given
that Y is known to have occurred.
Conditional probabilities are based on knowledge
of one of the variables. The conditional probability
of an event, such as X, occurring given that
another event, such as Y, has occurred is
expressed as:
P(X|Y) = P(X and Y) / P(Y)
provided P(Y) is not zero. Note that when using
the conditional law of probability, you always
divide the joint probability by the probability of
the event after the word given. Thus, to get P(X
given Y), you divide the joint probability of X and
Y by the unconditional probability of Y. In other
words, the above equation is used to find the
conditional probability for any
two dependent events.
A simple form of Bayes' Theorem is:
P(X|Y) = P(Y|X). P(X) / P(Y)
If two events, such as X and Y,
are independent then:
P(X|Y) = P(X),
and
P(Y|X) = P(Y)
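As a concrete check of these laws (the two-dice events below are made-up for illustration), the following Python sketch enumerates all outcomes of rolling two fair dice and verifies the addition law, the special multiplication law for independent events, the conditional-probability formula, and Bayes' theorem numerically:

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))     # the 36 equally likely rolls of two fair dice

def P(event):                                       # classical probability of an event
    return Fraction(sum(1 for a, b in outcomes if event(a, b)), len(outcomes))

def X(a, b): return a + b == 7                      # event X: the sum is 7
def Y(a, b): return a == 3                          # event Y: the first die shows 3

p_x, p_y = P(X), P(Y)
p_x_and_y = P(lambda a, b: X(a, b) and Y(a, b))
p_x_or_y = P(lambda a, b: X(a, b) or Y(a, b))
p_x_given_y = p_x_and_y / p_y                       # conditional probability law
p_y_given_x = p_x_and_y / p_x

assert p_x_or_y == p_x + p_y - p_x_and_y            # general law of addition
assert p_x_and_y == p_x * p_y                       # special law of multiplication (X, Y independent here)
assert p_x_given_y == p_y_given_x * p_x / p_y       # Bayes' theorem
print("P(X) =", p_x, " P(Y) =", p_y, " P(X|Y) =", p_x_given_y)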
Mutually Exclusive versus Independent Events
Mutually Exclusive (ME): Events A and B are ME if
both cannot occur simultaneously. That is, P[A and B]
= 0.
Independence (Ind.): Events A and B are
independent if having the information that B already
occurred does not change the probability that A will
occur. That is, P[A given B occurred] = P[A].
If two events are ME, they are also Dependent: P[A
given B] = P[A and B]/P[B], and since P[A and B] = 0
(by ME), then P[A given B] = 0, which differs from P[A].
Similarly, if two events are Independent, then they are not ME.
If two events are Dependent, then they may or may not
be ME.
If two events are not ME, then they may or may not be
Independent.
The following Figure contains all possibilities. The
notations used in this table are as follows: X means
does not imply, question mark ? means it may or may
not imply, while the check mark means it implies.
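As a small numerical check of these relationships (the single-die events A, B, and C below are invented for illustration), a Python sketch might look like this:

from fractions import Fraction

outcomes = set(range(1, 7))                      # one fair die
def P(event):                                    # classical probability of a set of outcomes
    return Fraction(len(event & outcomes), len(outcomes))

A = {1, 2}            # event A
B = {5, 6}            # event B, mutually exclusive with A
C = {2, 4, 6}         # event C (even numbers), independent of A

# Mutually exclusive pair: P(A and B) = 0, so P(A|B) = 0 differs from P(A); A and B are dependent
assert P(A & B) == 0
assert P(A & B) / P(B) != P(A)

# Independent pair: conditioning on C does not change the probability of A
assert P(A & C) / P(C) == P(A)                   # P(A|C) = P(A) = 1/3
print("P(A) =", P(A), " P(A|B) =", P(A & B) / P(B), " P(A|C) =", P(A & C) / P(C))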
Bernstein was the first to discover that (probabilistic)
pairwise independence and mutual independence for a
collection of events A1,..., An are different notions.
Different Schools of Thought in Inferential Statistics
There are a few different schools of thought in statistics.
They were introduced sequentially in time, by necessity.
The Birth Process of a New School of Thought
The process of devising a new school of thought in any
field has always taken a natural path. Birth of new
schools of thought in statistics is not an exception. The
birth process is outlined below:
Given an already established school, one must work
within the defined framework.
A crisis appears, i.e., some inconsistencies in the
framework result from its own laws.
Response behavior:
1. Reluctance to consider the crisis.
2. Try to accommodate and explain the crisis within
the existing framework.
3. The conversion of some well-known scientists attracts
followers to the new school.
The following Figure illustrates the three major schools
of thought; namely, the Classical (attributed
to Laplace), Relative Frequency (attributed to Fisher),
and Bayesian (attributed to Savage). The arrows in this
figure represent some of the main criticisms among
Objective, Frequentist, and Subjective schools of
thought. To which school do you belong? Read the
conclusion in this figure.
Bayesian, Frequentist, and Classical Methods
The problem with the Classical Approach is that what
constitutes an outcome is not objectively determined.
One person's simple event is another person's
compound event. One researcher may ask, of a newly
discovered planet, "what is the probability that life
exists on the new planet?" while another may ask
"what is the probability that carbon-based life exists on
it?"
Bruno de Finetti, in the introduction to his two-volume
treatise on Bayesian ideas, clearly states that
"Probabilities Do not Exist". By this he means that
probabilities are not located in coins or dice; they are
not characteristics of things like mass, density, etc.
Some Bayesian approaches consider probability theory
as an extension of deductive logic to handle
uncertainty. It purports to deduce from first principles
the uniquely correct way of representing your beliefs
about the state of things, and updating them in the
light of the evidence. The laws of probability have the
same status as the laws of logic. This Bayesian
approach is explicitly "subjective" in the sense that it
deals with the plausibility which a rational agent ought
to attach to the propositions she considers, "given her
current state of knowledge and experience." By
contrast, at least some non-Bayesian approaches
consider probabilities as "objective" attributes of things
(or situations) which are really out there (availability of
data).
A Bayesian and a classical statistician analyzing the
same data will generally reach the same conclusion.
However, the Bayesian is better able to quantify the
true uncertainty in his analysis, particularly when
substantial prior information is available. Bayesians are
willing to assign probability distribution function(s) to
the population's parameter(s) while frequentists are
not.
From a scientist's perspective, there are good grounds
to reject Bayesian reasoning. The problem is that
Bayesian reasoning deals not with objective, but
subjective probabilities. The result is that any
reasoning using a Bayesian approach cannot be
publicly checked -- something that makes it, in effect,
worthless to science, like non-replicable experiments.
Bayesian perspectives often shed a helpful light on
classical procedures. It is necessary to go into a
Bayesian framework to give confidence intervals the
probabilistic interpretation which practitioners often
want to place on them. This insight is helpful in
drawing attention to the point that another prior
distribution would lead to a different interval.
A Bayesian may cheat by basing the prior distribution
on the data; a Frequentist can base the hypothesis to
be tested on the data. For example, the role of a
protocol in clinical trials is to prevent this from
happening by requiring the hypothesis to be specified
before the data are collected. In the same way, a
Bayesian could be obliged to specify the prior in a
public protocol before beginning a study. In a collective
scientific study, this would be somewhat more complex
than for Frequentist hypotheses because priors must
be personal for coherence to hold.
A suitable quantity that has been proposed to measure
inferential uncertainty, i.e., to handle the a priori
unexpected, is the likelihood function itself.
If you perform a series of identical random
experiments (e.g., coin tosses), the underlying
probability distribution that maximizes the probability
of the outcome you observed is the probability
distribution proportional to the results of the
experiment.
This has the direct interpretation of telling how
(relatively) well each possible explanation (model),
whether obtained from the data or not, predicts the
observed data. If the data happen to be extreme
("atypical") in some way, so that the likelihood points
to a poor set of models, this will soon be picked up in
the next rounds of scientific investigation by the
scientific community. No long run frequency guarantee
nor personal opinions are required.
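A minimal sketch of the likelihood idea from the preceding paragraphs for coin tosses (the observed counts of 7 heads and 3 tails are invented for illustration): the likelihood L(p) = p^h (1-p)^t of a heads-probability p, given h heads and t tails, is maximized at the observed frequency h/(h+t).

# Likelihood of a coin's heads-probability p, given h heads and t tails (illustrative counts)
h, t = 7, 3

def likelihood(p):
    return p ** h * (1 - p) ** t

# Scan a grid of candidate values of p and pick the one with the highest likelihood
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=likelihood)
print("maximum-likelihood estimate of p:", best_p)      # close to h/(h+t) = 0.7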
There is a sense in which the Bayesian approach is
oriented toward making decisions and the frequentist
hypothesis testing approach is oriented toward science.
For example, there may not be enough evidence to
show scientifically that agent X is harmful to human
beings, but one may be justified in deciding to avoid it
in one's diet.
Since the probability (or the distribution of possible
probabilities) is continuous, the probability that the
probability is any specific point estimate is really zero.
This means that in a vacuum of information, we can
make no guess about the probability. Even if we have
information, we can really only guess at a range for the
probability.
Further Readings:
Lad F., Operational Subjective Statistical Methods,
Wiley, 1996. Presents a systematic treatment of
subjectivist methods along with a good discussion of
the historical and philosophical backgrounds of the
major approaches to probability and statistics.
von Plato J., Creating Modern Probability, Cambridge
University Press, 1994. This book provides a historical
point of view on the subjectivist and objectivist
probability schools of thought.
Weatherson B., Begging the question and
Bayesians, Studies in History and Philosophy of
Science, 30(4), 687-697, 1999.
Zimmerman H., Fuzzy Set Theory, Kluwer Academic
Publishers, 1991. Fuzzy logic approaches to probability
(based on L.A. Zadeh and his followers) present a
difference between "possibility theory" and probability
theory.
For more information, visit the Web sites Bayesian
Inference for the Physical Sciences, Bayesians vs. Non-
Bayesians, Society for Bayesian Analysis, Probability
Theory As Extended Logic, and Bayesians worldwide.
Type of Data and Levels of Measurement
Information can be collected in statistics using
qualitative or quantitative data.
Qualitative data, such as eye color of a group of
individuals, is not computable by arithmetic relations.
They are labels that advise in which category or class
an individual, object, or process falls. They are called
categorical variables.
Quantitative data sets consist of measures that take
numerical values for which descriptions such as means
and standard deviations are meaningful. They can be
put into an order and further divided into two groups:
discrete data or continuous data. Discrete data are
countable data, for example, the number of defective
items produced during a day's production. Continuous
data, when the parameters (variables) are measurable,
are expressed on a continuous scale. For example,
measuring the height of a person.
The first activity in statistics is to measure or count.
Measurement/counting theory is concerned with the
connection between data and reality. A set of data is a
representation (i.e., a model) of the reality based on
numerical and measurable scales. Data are called
"primary type" data if the analyst has been involved in
collecting the data relevant to his/her investigation.
Otherwise, it is called "secondary type" data.
Data come in the forms of Nominal, Ordinal, Interval
and Ratio (remember the French word NOIR for color
black). Data can be either continuous or discrete.
Level of Measurements
_________________________________________
                 Ranking?    Numerical difference
Nominal          no          no
Ordinal          yes         no
Interval/Ratio   yes         yes
Both the zero point and the unit of measurement are
arbitrary in the Interval scale. While the unit of
measurement is arbitrary in the Ratio scale, its zero
point is a natural attribute. Categorical variables are
measured on an ordinal or nominal scale.
Measurement theory is concerned with the connection
between data and reality. Both statistical theory and
measurement theory are necessary to make inferences
about reality.
Since statisticians live for precision, they prefer
Interval/Ratio levels of measurement.
Visit the Web site Measurement theory: Frequently
Asked Questions
Number of Class Intervals in a Histogram
Before we can construct our frequency distribution we
must determine how many classes we should use. This
is purely arbitrary, but too few classes or too many
classes will not provide as clear a picture as can be
obtained with some more nearly optimum number. An
empirical relationship, known as Sturges' Rule, may be
used as a guide to determine the optimal number of
classes (k):
k = the smallest integer greater than or equal to
1 + 3.322 Log(n),
where k is the number of classes, Log is in base 10, and n
is the total number of numerical values which
comprise the data set.
Therefore, the class width is:
(highest value - lowest value) / (1 + 3.322 Log(n)),
where n is the total number of items in the data set.
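A minimal sketch of applying Sturges' Rule in Python (the data list is an arbitrary illustrative sample, and the class width here simply divides the range by the resulting number of classes):

import math

data = [23, 45, 12, 67, 34, 89, 21, 56, 43, 78, 90, 15,
        38, 52, 61, 29, 47, 73, 84, 19]                  # illustrative values
n = len(data)

k = math.ceil(1 + 3.322 * math.log10(n))                 # Sturges' Rule: number of classes
width = (max(data) - min(data)) / k                      # corresponding class width

print("number of classes k:", k)
print("class width        :", round(width, 2))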
To have an "optimum" you need some measure of
quality -- presumably in this case, the "best" way to
display whatever information is available in the data.
The sample size contributes to this; so the usual
guidelines are to use between 5 and 15 classes, with
more classes possible if you have a larger sample. You
should take into account a preference for tidy class
widths, preferably a multiple of 5 or 10, because this
makes it easier to understand.
Beyond this it becomes a matter of judgement. Try out
a range of class widths, and choose the one that works
best. (This assumes you have a computer and can
generate alternative histograms fairly readily.)
There are often management issues that come into
play as well. For example, if your data is to be
compared to similar data -- such as prior studies, or
from other countries -- you are restricted to the
intervals used therein.
If the histogram is very skewed, then unequal classes
should be considered. Use narrow classes where the
class frequencies are high, wide classes where they are
low.
The following approaches are common:
Let n be the sample size; then the number of class
intervals could be
MIN { √n, 10 Log(n) }.
The Log is the logarithm in base 10. Thus, for 200
observations you would use 14 intervals, but for 2000
you would use 33.
Alternatively,
1. Find the range (highest value - lowest value).
2. Divide the range by a reasonable interval size: 2,
3, 5, 10 or a multiple of 10.
3. Aim for no fewer than 5 intervals and no more
than 15.
Visit also the Web site Histogram Applet,
and Histogram Generator
How to Construct a BoxPlot
A BoxPlot is a graphical display that has many
characteristics. It includes the presence of possible
outliers. It illustrates the range of data. It shows a
measure of dispersion such as the upper quartile, lower
quartile and interquartile range (IQR) of the data set
as well as the median as a measure of central location
which is useful for comparing sets of data. It also gives
an indication of the symmetry or skewness of the
distribution. The main reason for the popularity of
boxplots is that they offer a lot of information in a
compact way.
Steps to Construct a BoxPlot:
1. Horizontal lines are drawn at the median and at
the upper and lower quartiles. These horizontal
lines are joined by vertical lines to produce the
box.
2. A vertical line is drawn up from the upper
quartile to the most extreme data point that is
within a distance of 1.5 (IQR) of the upper
quartile. A similarly defined vertical line is drawn
down from the lower quartile.
3. Each data point beyond the end of a vertical
line is marked with an asterisk (*).
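A rough sketch of these steps in Python (the data and the quartile convention used by statistics.quantiles are illustrative assumptions; other quartile definitions give slightly different fences):

import statistics

data = [2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 30]        # illustrative values; 30 is a likely outlier

q1, median, q3 = statistics.quantiles(data, n=4)          # lower quartile, median, upper quartile
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

upper_whisker = max(x for x in data if x <= upper_fence)  # most extreme point within 1.5*IQR above Q3
lower_whisker = min(x for x in data if x >= lower_fence)  # most extreme point within 1.5*IQR below Q1
outliers = [x for x in data if x > upper_fence or x < lower_fence]

print("Q1:", q1, " median:", median, " Q3:", q3, " IQR:", iqr)
print("whiskers:", lower_whisker, "to", upper_whisker, " outliers:", outliers)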
Probability, Chance, Likelihood, and Odds
"Probability" has an exact technical meaning -- well, in
fact it has several, and there is still debate as to which
term ought to be used. However, for most events for
which probability is easily computed e.g. rolling of a die
the probability of getting a four [::], almost all agree
on the actual value (1/6), if not the philosophical
interpretation. A probability is always a number
between 0 [not "quite" the same thing as impossibility:
it is possible that "if" a coin were flipped infinitely
many times, it would never show "tails", but the
probability of an infinite run of heads is 0] and 1
[again, not "quite" the same thing as certainty but
close enough].
The word "chance" or "chances" is often used as an
approximate synonym of "probability", either for
variety or to save syllables. It would be better practice
to leave "chance" for informal use, and say
"probability" if that is what is meant.
In cases where the probability of an observation is
described by a parametric model, the "likelihood" of a
parameter value given the data is defined to be the
probability of the data given the parameter. One
occasionally sees "likely" and "likelihood"; however,
these terms are often used casually as synonyms for
"probable" and "probability".
"Odds" is a probabilistic concept related to probability.
It is the ratio of the probability (p) of an event to the
probability (1-p) that it does not happen: p/(1-p). It is
often expressed as a ratio, often of whole numbers;
e.g., "odds" of 1 to 5 in the die example above, but for
technical purposes the division may be carried out to
yield a positive real number (here 0.2). The logarithm
of the odds ratio is useful for technical purposes, as it
maps the range of probabilities onto the (extended)
real numbers in a way that preserves symmetry
between the probability that an event occurs and the
probability that it does not occur.
Odds against an event are the ratio of nonevents to events. If the event rate for a disease is 0.1 (10 per cent), its nonevent rate is 0.9 and therefore the odds against it are 9:1. Note that
this is not the same expression as the inverse of event
rate.
Another way to compare probabilities and odds is using
"part-whole thinking" with a binary (dichotomous) split
in a group. A probability is often a ratio of a part to a
whole; e.g., the ratio of the part [those who survived 5
years after being diagnosed with a disease] to the
whole [those who were diagnosed with the disease].
Odds are often a ratio of a part to a part; e.g., the
odds against dying are the ratio of the part that
succeeded [those who survived 5 years after being
diagnosed with a disease] to the part that 'failed'
[those who did not survive 5 years after being
diagnosed with a disease].
Obviously, probability and odds are intimately related:
Odds = p / (1-p). Note that probability is always
between zero and one, whereas odds range from zero
to infinity.
Aside from their value in betting, odds allow one to
specify a small probability (near zero) or a large
probability (near one) using large whole numbers
(1,000 to 1 or a million to one). Odds magnify small
probabilities (or large probabilities) so as to make the
relative differences visible. Consider two probabilities:
0.01 and 0.005. They are both small. An untrained
observer might not realize that one is twice as much as
the other. But if expressed as odds (99 to 1 versus 199
to 1) it may be easier to compare the two situations by
focusing on large whole numbers (199 versus 99)
rather than on small ratios or fractions.
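A tiny Python sketch of these conversions (the probabilities used are only illustrative):

import math

def odds(p):
    # odds in favor of an event with probability p
    return p / (1 - p)

def prob(o):
    # probability corresponding to odds o in favor
    return o / (1 + o)

print(odds(1 / 6))                 # 0.2, i.e., odds of 1 to 5 for rolling a four
print(prob(0.2))                   # back to 1/6
print(math.log(odds(0.01)), math.log(odds(0.005)))   # log-odds of 0.01 versus 0.005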
Visit also the Web site Counting and Combinatorial
What Is "Degrees of Freedom"
Recall that in estimating the population's variance, we
used (n-1) rather than n, in the denominator. The
factor (n-1) is called "degrees of freedom."
Estimation of the Population Variance: Variance in a
population is defined as the average of squared
deviations from the population mean. If we draw a
random sample of n cases from a population where the
mean is known, we can estimate the population
variance in an intuitive way. We sum the deviations of
scores from the population mean and divide this sum
by n. This estimate is based on n independent pieces of
information and we have n degrees of freedom. Each of
the n observations, including the last one, is
unconstrained ('free' to vary).
When we do not know the population mean, we can
still estimate the population variance, but now we
compute deviations around the sample mean. This
introduces an important constraint because the sum of
the deviations around the sample mean is known to be
zero. If we know the value for the first (n-1)
deviations, the last one is known. There are only n-1
independent pieces of information in this estimate of
variance.
If you study a system with n parameters xi, i = 1, ..., n, you can represent it in an n-dimensional space. Any point of this space represents a potential state of your system. If your n parameters could vary independently, then your system would be fully described in an n-dimensional hyper-volume. Now, imagine you have one constraint between the parameters (an equation relating your n parameters); then your system would be described by an (n-1)-dimensional hyper-surface. For example, in three-dimensional space, a linear relationship defines a plane, which is 2-dimensional.
In statistics, your n parameters are your n data. To
evaluate variance, you first need to infer the mean
E(X). So when you evaluate the variance, you've got
one constraint on your system (which is the expression
of the mean), and it only remains (n-1) degrees of
freedom to your system.
Therefore, we divide the sum of squared deviations by
n-1 rather than by n when we have sample data. On
average, deviations around the sample mean are
smaller than deviations around the population mean.
This is because our sample mean is always in the
middle of our sample scores; in fact the minimum
possible sum of squared deviations for any sample of
numbers is around the mean for that sample of
numbers. Thus, if we sum the squared deviations from
the sample mean and divide by n, we have an
underestimate of the variance in the population (which
is based on deviations around the population mean).
If we divide the sum of squared deviations by n-1
instead of n, our estimate is a bit larger, and it can be
shown that this adjustment gives us an unbiased
estimate of the population variance. However, for large
n, say, over 30, it does not make too much of
difference if we divide by n, or n-1.
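The effect of dividing by n versus n - 1 can be seen in a small simulation. The sketch below (illustrative only; the uniform-digit population and the sample size are arbitrary choices) draws many samples and averages the two estimates.

import random

random.seed(1)
population_var = 8.25            # variance of the discrete uniform population {0, ..., 9}
n, reps = 5, 20000
avg_biased, avg_unbiased = 0.0, 0.0
for _ in range(reps):
    sample = [random.randint(0, 9) for _ in range(n)]
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    avg_biased += ss / n          # divide by n: underestimates on average
    avg_unbiased += ss / (n - 1)  # divide by n - 1: unbiased on average
print(avg_biased / reps, avg_unbiased / reps, population_var)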
Degrees of Freedom in ANOVA: You will see the key phrase "degrees of freedom" also appearing in the Analysis of Variance (ANOVA) tables. If I tell you about 4 numbers, but don't say what they are, the average could be anything. I have 4 degrees of freedom in the data set. If I tell you 3 of those numbers and the average, you can deduce the fourth number. The data set, given the average, has 3 degrees of freedom. If I tell you the average and the standard deviation of the numbers, I have given you 2 pieces of information, and reduced the degrees of freedom from 4 to 2. You only need to know 2 of the numbers' values to deduce the other 2.
In an ANOVA table, degree of freedom (df) is the
divisor in SS/df which will result in an unbiased
estimate of the variance of a population.
df = N - k, where N is the sample size, and k is a small
number, equal to the number of "constraints", the
number of "bits of information" already "used up".
Degrees of freedom are an additive quantity; the total amount can be "partitioned" into various components.
For example, suppose we have a sample of size 13 and
calculate its mean, and then the deviations from the
mean, only 12 of the deviations are free to vary: once
one has found 12 of the deviations, the thirteenth one
is determined. Therefore, if one is estimating a
population variance from a sample, k = 1.
In bivariate correlation or regression situations, k = 2:
the calculation of the sample means of each variable
"uses up" two bits of information, leaving N - 2
independent bits of information.
In a one-way analysis of variance (ANOVA) with g
groups, there are three ways of using the data to
estimate the population variance. If all the data are
pooled, the conventional SST/(n-1) would provide an
estimate of the population variance.
If the treatment groups are considered separately, the
sample means can also be considered as estimates of
the population mean, and thus SSb/(g - 1) can be used
as an estimate. The remaining ("within-group", "error")
variance can be estimated from SSw/(n - g). This
example demonstrates the partitioning of df: df total =
n - 1 = df(between) + df(within) = (g - 1) + (n - g).
Therefore, the simple 'working definition' of df is 'sample size minus the number of estimated parameters'. A fuller answer would have to explain why there are situations in which the degrees of freedom is not an integer. After all is said, however, the best explanation is mathematical: we use df to obtain an unbiased estimate.
In summary, the concept of degrees of freedom is used for the following two different purposes:
1. The parameter(s) of certain distributions, such as the F- and t-distributions, are called degrees of freedom; therefore, degrees of freedom could be positive non-integer number(s).
2. Degrees of freedom are used to obtain unbiased estimates for the population parameters.
Outlier Removal
Because of the potentially large variance, outliers could
be the outcome of sampling. It's perfectly correct to
have such an observation that legitimately belongs to
the study group by definition. Lognormally distributed
data (such as international exchange rate), for
instance, will frequently exhibit such values.
Therefore, you must be very careful and cautious:
before declaring an observation "an outlier," find out
why and how such observation occurred. It could even
be an error at the data entering stage.
First, construct the BoxPlot of your data. Form the Q1, Q2, and Q3 points, which divide the sample into four equally sized groups (Q2 = median). Let IQR = Q3 - Q1. Outliers are defined as those points outside the values Q3 + k*IQR and Q1 - k*IQR. For most cases one sets k = 1.5.
Another alternative is the following algorithm:
a) Compute the mean and standard deviation (sigma) of the whole sample.
b) Define a set of limits off the mean: mean + k sigma, mean - k sigma. (Allow the user to enter k; a typical value for k is 2.)
c) Remove all sample values outside the limits.
Now, iterate N times through the algorithm, each time
replacing the sample set with the reduced samples
after applying step (c).
Usually we need to iterate through this algorithm 4
times.
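A minimal Python sketch of this iterative screening (the cut-off k, the iteration count, and the sample values are illustrative):

import statistics

def trim_outliers(data, k=2.0, iterations=4):
    sample = list(data)
    for _ in range(iterations):
        m = statistics.mean(sample)
        s = statistics.stdev(sample)
        kept = [x for x in sample if m - k * s <= x <= m + k * s]   # step (c)
        if len(kept) == len(sample):     # nothing removed; stop early
            break
        sample = kept
    return sample

print(trim_outliers([9, 10, 10, 11, 12, 9, 10, 55]))   # the value 55 is screened out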
As mentioned earlier, a common "standard" is any
observation falling beyond 1.5 (interquartile range)
i.e., (1.5 IQRs) ranges above the third quartile or
below the first quartile. The following SPSS program helps you in determining the outliers.
$SPSS/OUTPUT=LIER.OUT
TITLE
'DETERMINING IF OUTLIERS EXIST'
DATA LIST
FREE FILE='A' / X1
VAR LABELS
X1 'INPUT DATA'
LIST CASE
CASE=10/VARIABLES=X1/
CONDESCRIPTIVE
X1(ZX1)
LIST CASE
CASE=10/VARIABLES=X1,ZX1/
SORT CASES BY ZX1(A)
LIST CASE
CASE=10/VARIABLES=X1,ZX1/
FINISH
Statistical Summaries
Representative of a Sample: Measures of Central Tendency
How do you describe the "average" or "typical" piece of information in a set of data? Different procedures are used to summarize the most representative information, depending on the type of question asked and the nature of the data being summarized.
Measures of location give information about
the location of the central tendency within a group of
numbers. The measures of location presented in this
unit for ungrouped (raw) data are the mean, the
median, and the mode.
Mean: The arithmetic mean (or the average or simple
mean) is computed by summing all numbers in an
array of numbers (xi) and then dividing by the number
of observations (n) in the array.
The mean uses all of the observations, and each
observation affects the mean. Even though the mean is
sensitive to extreme values, i.e., extremely large or
small data can cause the mean to be pulled toward the
extreme data, it is still the most widely used measure
of location. This is due to the fact that the mean has
valuable mathematical properties that make it
convenient for use with inferential statistical analysis.
For example, the sum of the deviations of the numbers
in a set of data from the mean is zero, and the sum of
the squared deviations of the numbers in a set of data
from the mean is the minimum value.
Weighted Mean: In some cases, the data in the
sample or population should not be weighted equally,
rather each value should be weighted according to its
importance.
Median: The median is the middle value in
an ordered array of observations. If there is an even
number of observations in the array, the median is
the average of the two middle numbers. If there is an
odd number of data in the array, the median is
the middle number.
The median is often used to summarize the distribution
of an outcome. If the distribution is skewed, the
median and the IQR may be better than other
measures to indicate where the observed data are
concentrated.
Generally, the median provides a better measure of
location than the mean when there are some extremely
large or small observations; i.e., when the data are
skewed to the right or to the left. For this reason,
median income is used as the measure of location for
the U.S. household income. Note that if the median
is less than the mean, the data set is skewed to the
right. If the median is greater than the mean, the
data set is skewed to the left.
Mode: The mode is the most frequently occurring
value in a set of observations. Why use the mode? The
classic example is the shirt/shoe manufacturer who
wants to decide what sizes to introduce. Data may
have two modes. In this case, we say the data
are bimodal, and sets of observations with more than
two modes are referred to as multimodal. Note that
the mode does not have important mathematical
properties for future use. Also, the mode is not a
helpful measure of location, because there can be more
than one mode or even no mode.
Whenever more than one mode exists, the population from which the sample came is a mixture of more than one population. Almost all standard statistical analyses assume that the population is homogeneous, meaning that its density is unimodal.
Notice that Excel is a very limited statistical package. For example, it displays only one mode, the first one, which is very misleading. However, you may find out if there are others by inspection only, as follows: create a frequency distribution, invoke the menu sequence Tools, Data Analysis, Frequency, and follow the instructions on the screen. You will see the frequency distribution and can then find the mode visually. Unfortunately, Excel does not draw a Stem-and-Leaf diagram. Commercial off-the-shelf packages, such as SAS and SPSS, display a Stem-and-Leaf diagram, which is a frequency distribution of a given data set.
Quartiles & Percentiles: Quantiles are values that
separate a ranked data set into four equal classes.
Whereas percentiles are values that separate a ranked
the data into 100 equal classes. The widely used
quartiles are the 25th, 50th, and 75th percentiles.
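These summaries are easy to compute; a brief Python sketch (the data are made up, the numpy package is assumed for the percentiles, and statistics.multimode needs Python 3.8 or later):

import statistics
import numpy as np

data = [2, 3, 3, 5, 7, 8, 8, 8, 9, 12]
print(statistics.mean(data))               # arithmetic mean
print(statistics.median(data))             # middle value of the ordered data
print(statistics.multimode(data))          # all modes, not just the first one
print(np.percentile(data, [25, 50, 75]))   # the three quartiles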
Selecting Among the Mean, Median, and Mode
It is a common mistake to specify the wrong index for central tendency.
The first consideration is the type of data: if the variable is categorical, the mode is the single measure that best describes the data.
The second consideration in selecting the index is to
ask whether the total of all observations is of any
interest. If the answer is yes, then the mean is the
proper index of central tendency.
If the total is of no interest, then depending on
whether the histogram is symmetric or skewed one
must use either mean or median, respectively.
In all cases the histogram must be unimodal.
Suppose that four people want to get together to play
poker. They live on 1st Street, 3rd Street, 7th Street,
and 15th Street. They want to select a house that
involves the minimum amount of driving for all parties
concerned.
Let's suppose that they decide to minimize the
absolute amount of driving. If they met at 1st Street,
the amount of driving would be 0 + 2 + 6 + 14 = 22
blocks. If they met at 3rd Street, the amount of driving
would be 2 + 0+ 4 + 12 = 18 blocks. If they met at
7th Street, 6 + 4 + 0 + 8 = 18 blocks. Finally, at 15th
Street, 14 + 12 + 8 + 0 = 34 blocks.
So the two houses that would minimize the amount of
driving would be 3rd or 7th Street. Actually, if they
wanted a neutral site, any place on 4th, 5th, or
6th Street would also work.
Note that any value between 3 and 7 could be defined
as the median of 1, 3, 7, and 15. So the median is the
value that minimizes the absolute distance to the data
points.
Now the person at 15th is upset at always having to do
more driving. So the group agrees to consider a
different rule. They decide to minimize the square of the distance driven. This is the least squares principle. By
squaring, we give more weight to a single very long
commute than to a bunch of shorter commutes. With
this rule, the 7th Street house (36 + 16 + 0 + 64 =
116 square blocks) is preferred to the 3rd Street house
(4 + 0 + 16 + 144 = 164 square blocks). If you consider any location, and not just the houses themselves, then the point at 6.5 (midway between 6th and 7th Street) is the location that minimizes the square of the distances driven.
Find the value of x that minimizes
(1 - x)2 + (3 - x)2 +(7 - x)2 + (15 - x)2.
The value that minimizes the sum of squared values is
6.5 which is also equal to the arithmetic mean of 1, 3,
7, and 15. With calculus, it's easy to show that this
holds in general.
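The claim can be checked numerically; a short Python sketch using the four street numbers above:

points = [1, 3, 7, 15]

def total_abs(x):
    # total absolute distance driven if everyone meets at location x
    return sum(abs(p - x) for p in points)

def total_sq(x):
    # total squared distance driven if everyone meets at location x
    return sum((p - x) ** 2 for p in points)

print(total_abs(3), total_abs(5), total_abs(7))   # 18 18 18: any point between 3 and 7 minimizes
print(total_sq(6), total_sq(6.5), total_sq(7))    # 116 115.0 116: the mean, 6.5, minimizes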
For moderately asymmetrical distributions the mode,
median and mean satisfy the formula: mode=3
(median) - 2(mean).
Consider a small sample of scores with an even
number of cases, for example, 1, 2, 4, 7, 10, and 12.
The median is 5.5, the midpoint of the interval
between the scores of 4 and 7.
As we discussed above, it is true that the median is a
point around which the sum absolute deviations is
minimized. In this example the sum of absolute
deviation is 22. However, it is not a unique point. Any
point in the 4 to 7 region will have the same value of
22 for the sum of the absolute deviations.
Indeed, medians are tricky. The 50%-50% (above-below) interpretation is not quite correct. For example, 1, 1, 1, 1, 1, 1, 8 has no median in that sense. The convention says that the median is 1; however, about 14% of the data lie strictly above it, and 100% of the data are greater than or equal to the median. This generalizes to other percentiles.
We will make use of this idea in regression analysis. In
an analogous argument, the regression line is a unique
line which minimizes the sum of the squared deviations
from it. There is no unique line which minimizes the
sum of the absolute deviations from it.
Quality of a Sample: Measures of Dispersion
Average by itself is not a good indication of quality.
You need to know the variance to make any educated
assessment. We are reminded of the dilemma of the
six-foot tall statistician who drowned in a stream that
had an average depth of three feet.
These are statistical procedures for describing the
nature and extent of differences among the information
in the distribution. A measure of variability is generally
reported with a measure of central tendency.
Statistical measures of variation are numerical values
that indicate the variability inherent in a set of data
measurements. Note that a small value for a measure
of dispersion indicates that the data are concentrated
around the mean; therefore, the mean is a good
representative of the data set. On the other hand, a
large measure of dispersion indicates that the mean is
not a good representative of the data set. Also,
measures of dispersion can be used when we want to
compare the distributions of two or more sets of
data. Quality of a data set is measured by its
variability: Larger variability indicates lower
quality. That is why high variation makes the manager very worried. Your job, as a statistician, is to measure the variation, and if it is too high and unacceptable, then it is the job of the technical staff, such as engineers, to fix the process.
The decision situations with flat uncertainty have the
largest risk. For simplicity, consider the case when
there are only two outcomes one with probability of p.
Then, the variation in the outcomes is p(1-p). This
variation is the largest if we set p = 50%. That is,
equal chance for each outcome. In such a case, the
quality of information is at its lowest level.
Remember, quality of information and variation
are inversely related. Larger the variation in the
data, the lower the quality of the data (i.e.,
information). Remember that the Devil is in the
Deviations.
The four most common measures of variation are
the range, variance, standard deviation,
and coefficient of variation.
Range: The range of a set of observations is the
absolute value of the difference between the largest
and smallest values in the set. It measures the size of
the smallest contiguous interval of real numbers that
encompasses all of the data values. It is not useful
when extreme values are present. It is based solely on
two values, not on the entire data set. In addition, it
cannot be defined for open-ended distributions such as
Normal distribution.
The Normal distribution does not have a (finite) range. A student put it this way: "since the tails of the normal density function never touch the x-axis, for an observation to contribute to forming such a curve, arbitrarily large positive and negative values must be possible." Such remote values are always possible, but increasingly improbable. This encapsulates the asymptotic behavior of the normal density very well.
Variance: An important measure of variability is
variance. Variance is the average of the squared
deviations of each observation in the set from the
arithmetic mean of all of observations.
Variance = Σ (xi - x̄)² / (n - 1),   n ≥ 2.
The variance is a measure of spread or dispersion
among values in a data set. Therefore, the greater the
variance, the lower the quality.
The variance is not expressed in the same units as
the observations. In other words, the variance is
hard to understand because the deviations from the
mean are squared, making it too large for logical
explanation. This problem can be solved by working
with the square root of the variance, which is called
the standard deviation.
Standard Deviation: Both variance and standard
deviation provide the same information; one can
always be obtained from the other. In other words,
the process of computing a standard deviation always
involves computing a variance. Since standard
deviation is the square root of the variance, it is always
expressed in the same units as the raw data:
For a large, roughly normal data set (more than 30 observations, say), approximately 68% of the data will fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% (almost all) fall within three standard deviations (S) from the mean.
Standard Error: The standard error is a statistic indicating the accuracy of an estimate. That is, it tells us how different the estimate (such as the sample mean x̄) is likely to be from the population parameter (such as μ). It is, therefore, the standard deviation of the sampling distribution of the estimator (for example, the sampling distribution of x̄).
Coefficient of Variation: The Coefficient of Variation (CV) is the relative deviation with respect to size: CV = S / x̄, often expressed as a percentage. CV is independent of the unit of measurement. In estimating a parameter, when the CV is less than, say, 10%, the estimate is usually considered acceptable. The inverse of CV, namely 1/CV, is called the Signal-to-noise Ratio.
The coefficient of variation is used to represent the
relationship of the standard deviation to the mean,
telling how much representative the mean is of the
numbers from which it came. It expresses the standard
deviation as a percentage of the mean; i.e., it reflects
the variation in a distribution relative to the mean.
Z Score: how many standard deviations a given point
(i.e. observations) is above or below the mean. In
other words, a Z score represents the number of
standard deviations an observation (x) is above or
below the mean. The larger the Z value in absolute terms, the further away the value is from the mean. Note that values beyond three standard deviations are very unlikely.
Note that if a Z score is negative, the observation (x) is
below the mean. If the Z score is positive, the
observation (x) is above the mean. The Z score is
found as:
Z = (x - mean of X) / standard deviation of X
The Z score is a measure of the number of standard
deviations that an observation is above or below the
mean. Since the standard deviation is never negative,
a positive Z score indicates that the observation is
above the mean, a negative Z score indicate that the
observation is below the mean. Note that Z is a
dimensionless value, and is therefore a useful measure
by which to compare data values from two different
populations even those measured by different units.
Z-Transformation: Applying the formula z = (X - μ) / σ will always produce a transformed variable with a mean of zero and a standard deviation of one.
However, the shape of the distribution will not be
affected by the transformation. If X is not normal then
the transformed distribution will not be normal either.
In the following SPSS command variable x is
transformed to zx.
descriptives variables=x(zx)
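Outside SPSS, the same transformation is a one-liner; a Python sketch (the sample values are illustrative):

import statistics

def z_scores(data):
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - m) / s for x in data]     # subtract the mean, divide by the standard deviation

zx = z_scores([1, 2, 3, 6])
print(zx)
print(round(statistics.mean(zx), 10), round(statistics.stdev(zx), 10))   # 0.0 and 1.0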
You have heard the terms z value, z test, z
transformation, and z score. Do all of these terms
mean the same thing? Certainly not:
The z value refers to the critical value (a point on the horizontal axis) of the Normal(0, 1) density function, for a given area to the left of that z-value.
The z test refers to the procedures for testing the equality of the mean(s) of one (or two) population(s).
z score of a given observation x in a sample of size n,
is simply (x - average of the sample) divided by the
standard deviation of the sample.
The z transformation of a set of observations of size n
is simply (each observation - average of all
observation) divided by the standard deviation among
all observations. The aim is to produce a transformed
data set with a mean of zero and a standard deviation
of one. This makes the transformed set dimensionless
and manageable with respect to its magnitudes. It is also used in comparing several data sets measured using different scales of measurement.
Pearson coined the term "standard deviation" sometime near 1900. The idea of using squared deviations goes back to Laplace in the early 1800's.
Finally, notice again that transforming raw scores to z scores does NOT normalize the data.
Guess a Distribution to Fit Your Data: Skewness
& Kurtosis
The pair of statistical measures skewness and kurtosis is a tool used in selecting a distribution (or distributions) to fit your data. To make an inference with respect to the population distribution, you may first compute the skewness and kurtosis from your random sample from the entire population. Then, locating a point with these coordinates on one of the widely used Skewness-Kurtosis Charts (available from your instructor upon request), guess a couple of possible distributions to fit your data. Finally, you might use a goodness-of-fit test to rigorously come up with the best candidate fitting your data. Note that removing outliers changes the computed skewness and kurtosis.
Skewness: Skewness is a measure of the degree to
which the sample population deviates from symmetry
with the mean at the center.
Skewness = Σ (xi - x̄)³ / [ (n - 1) S³ ],   n ≥ 2.
Skewness will take on a value of zero when the
distribution is a symmetrical curve. A positive value
indicates the observations are clustered more to the
left of the mean with most of the extreme values to the
right of the mean. A negative skewness indicates
clustering to the right. In this case we have: Mean ≤ Median ≤ Mode. The reverse order holds for observations with positive skewness.
Kurtosis: Kurtosis is a measure of the relative
peakedness of the curve defined by the distribution of
the observations.
Kurtosis = Σ (xi - x̄)⁴ / [ (n - 1) S⁴ ],   n ≥ 2.
Standard normal distribution has kurtosis of +3. A
kurtosis larger than 3 indicates the distribution is more
peaked than the standard normal distribution.
Coefficient of Excess Kurtosis = Kurtosis - 3.
A less than 3 kurtosis value means that the distribution
is flatter than the standard normal distribution.
Skewness and kurtosis can be used to check for normality via the Jarque-Bera test. For large n, under the normality condition the quantity
n [ Skewness² / 6 + (Kurtosis - 3)² / 24 ]
follows a chi-square distribution with d.f. = 2.
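A Python sketch of this quantity (assuming the scipy package is available; scipy also provides its own jarque_bera function, used here only as a cross-check):

import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(size=500)       # illustrative sample
skew = stats.skew(x)                                 # moment-based skewness
kurt = stats.kurtosis(x, fisher=False)               # kurtosis on the "normal = 3" scale
jb = len(x) * (skew**2 / 6 + (kurt - 3)**2 / 24)
p_value = stats.chi2.sf(jb, df=2)                    # upper tail of chi-square with 2 d.f.
print(jb, p_value)
print(stats.jarque_bera(x))                          # built-in version for comparison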
Further Reading:
Tabachnick B., and L. Fidell, Using Multivariate
Statistics, HarperCollins, 1996. Has a good discussion
on applications and significance tests for skewness and
kurtosis.
Numerical Example & Discussions
A Numerical Example: Given the following, small (n
= 4) data set, compute the descriptive statistics: x1 =
1, x2 = 2, x3 = 3, and x4 = 6.
i      xi    (xi - x̄)   (xi - x̄)²   (xi - x̄)³   (xi - x̄)⁴
1       1       -2          4           -8          16
2       2       -1          1           -1           1
3       3        0          0            0           0
4       6        3          9           27          81
Sum    12        0         14           18          98
The mean is x̄ = 12 / 4 = 3, the variance is s² = 14 / 3 = 4.67, the standard deviation is s = (14/3)^0.5 = 2.16, the skewness is 18 / [3 (2.16)³] = 0.5952, and finally, the kurtosis is 98 / [3 (2.16)⁴] = 1.5.
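The arithmetic above can be reproduced with a few lines of Python; note that these formulas follow the (n - 1) conventions used in these notes, which is one reason packages such as SPSS report slightly different values.

x = [1, 2, 3, 6]
n = len(x)
mean = sum(x) / n                                                 # 3.0
s2 = sum((v - mean) ** 2 for v in x) / (n - 1)                    # 4.67
s = s2 ** 0.5                                                     # 2.16
skewness = sum((v - mean) ** 3 for v in x) / ((n - 1) * s ** 3)   # about 0.595
kurtosis = sum((v - mean) ** 4 for v in x) / ((n - 1) * s ** 4)   # about 1.5
print(mean, s2, s, skewness, kurtosis)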
A Short Discussion
Deviations about the mean of a distribution is the
basis for most of the statistical tests we will learn.
Since we are measuring how much a set of scores is
dispersed about the mean , we are
measuring variability. We can calculate the deviations
about the mean and express it as variance 2or
standard deviation . It is very important to have a
firm grasp of this concept because it will be a
central concept throughout your statistics
course.
Both variance 2 and standard deviation  measure
variability within a distribution. Standard deviation  is
a number that indicates how much on average each of
the values in the distribution deviates from the
mean (or center) of the distribution. Keep in mind
that variance 2 measures the same thing as standard
deviation  (dispersion of scores in a distribution).
Variance 2, however, is the average squared
deviations about the mean. Thus, variance 2 is the
square of the standard deviation .
The expected value and variance of x̄ are μ and σ²/n, respectively.
The expected value and variance of S² are σ² and 2σ⁴ / (n - 1), respectively.
x̄ and S² are the best estimators for μ and σ². They are Unbiased (you may update your estimate); Efficient (they have the smallest variation among other estimators); Consistent (increasing the sample size provides a better estimate); and Sufficient (you do not need to have the whole data set; what you need are Σxi and Σxi² for the estimations). Note also that the above variance of S² is justified only in the case where the population distribution tends to be normal; otherwise one may use bootstrapping techniques.
In general, it is believed that the pattern of mode, median, and mean goes from lower to higher in positively skewed data sets, and follows just the opposite pattern in negatively skewed data sets. However, for example, the following 23 numbers have mean = 2.87 and median = 3, yet the data are positively skewed:
4, 2, 7, 6, 4, 3, 5, 3, 1, 3, 1, 2, 4, 3, 1, 2, 1, 1, 5, 2, 2, 3, 1
and the following 10 numbers have mean = median = mode = 4, but the data set is left skewed:
1, 2, 3, 4, 4, 4, 5, 5, 6, 6
Note also that most commercial software do not correctly compute skewness and kurtosis. There is no
easy way to determine confidence intervals about a
computed skewness or kurtosis value from a small to
medium sample. The literature gives tables based on
asymptotic methods for sample sets larger than 100
for normal distributions only.
You may have noticed that using the above numerical
example on some computer packages such as SPSS,
the skewness and the kurtosis are different from what
we have computed. For example, the SPSS output for
the skewness is 1.190. However, for a large sample size n, the results are nearly identical.
Reference and Further Readings:
David H., Early Sample Measures of
Variability, Statistical Science, 13, 1998, 368-377. This
article provides a good historical accounts of statistical
measures.
Groeneveld R., A class of quantile measures for kurtosis, The American Statistician, 325, Nov. 1998.
Hosking J., Moments or L moments? An example comparing two measures of distributional shape, The American Statistician, Vol. 46, 186-189, 1992.
Parameters' Estimation and Quality of a 'Good'
Estimate
Estimation is the process by which sample data are
used to indicate the value of an unknown quantity in a
population.
Results of estimation can be expressed as a single
value, known as a point estimate; or a range of values,
known as a confidence interval.
Whenever we use point estimation, we calculate the margin of error associated with that point estimate. For example, for the estimation of the population mean μ, the margin of error is calculated as follows: ± 1.96 SE(x̄).
In newspapers and television reports on public opinion polls, the margin of error is the margin of "sampling error". There are many nonsampling errors that can and do affect the accuracy of polls; here we talk about sampling error. Because subgroups have larger sampling error than the full sample, reports must include a statement such as: "Other sources of error include, but are not limited to, individuals refusing to participate in the interview and inability to connect with the selected number. Every feasible effort is made to obtain a response and reduce the error, but the reader (or the viewer) should be aware that some error is inherent in all research."
To estimate means to esteem (to give value to). An
estimator is any quantity calculated from the sample
data which is used to give information about an
unknown quantity in the population. For example, the sample mean is an estimator of the population mean μ. Estimators of population parameters are sometimes distinguished from the true value by using the symbol 'hat'. For example, the true population standard deviation σ is distinguished from σ̂, the population standard deviation estimated from a sample.
Example: The usual estimator of the population mean is x̄ = Σ xi / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample. If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.
A "Good" estimator is the one which provides an
estimate with the following qualities:
Unbiasedness: An estimate is said to be an unbiased estimate of a given parameter when the expected value of the estimator can be shown to be equal to the parameter being estimated. For example, the mean of a sample is an unbiased estimate of the mean of the population from which the sample was drawn. Unbiasedness is a good quality for an estimate, since in such a case, using a weighted average of several estimates provides a better estimate than each one of those estimates. Therefore, unbiasedness allows us to upgrade our estimates. For example, if your estimates of the population mean µ are, say, 10 and 11.2 from two independent samples of sizes 20 and 30 respectively, then the estimate of the population mean µ based on both samples is [20 (10) + 30 (11.2)] / (20 + 30) = 10.72.
Consistency: The standard deviation of an estimate is called the standard error of that estimate. A larger standard error means more error in your estimate.
It is a commonly used index of the error entailed in
estimating a population parameter based on the
information in a random sample of size n from the
entire population.
An estimator is said to be "consistent" if increasing the
sample size produces an estimate with smaller
standard error. Therefore, your estimate is "consistent"
with the sample size. That is, spending more money
(to obtain a larger sample) produces a better estimate.
Efficiency: An efficient estimate is one which has the smallest standard error among all unbiased estimators based on the same sample size.
Sufficiency: A sufficient estimator based on a statistic
contains all the information which is present in the raw
data. For example, the sum of your data is sufficient to
estimate the mean of the population. You don't have to know the data set itself. This saves a lot of money if the data have to be transmitted over a telecommunication network: simply send out the total and the sample size.
A sufficient statistic t for a parameter θ is a function of the sample data x1, ..., xn which contains all information in the sample about the parameter θ. More formally, sufficiency is defined in terms of the likelihood function for θ. For a sufficient statistic t, the likelihood L(x1, ..., xn | θ) can be written as
g(t | θ) · k(x1, ..., xn).
Since the second term does not depend on θ, t is said to be a sufficient statistic for θ.
Another way of stating this for the usual problems is
that one could construct a random process starting
from the sufficient statistic, which will have exactly the
same distribution as the full sample for all states of
nature.
To illustrate, let the observations be independent
Bernoulli trials with the same probability of success.
Suppose that there are n trials, and that person A
observes which observations are successes, and person
B only finds out the number of successes. Then if B
places these successes at random points without
replication, the probability that B will now get any
given set of successes is exactly the same as the
probability that A will see that set, no matter what the
true probability of success happens to be.
The widely used estimator of the population mean µ is x̄ = Σ xi / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample; it has all of the above properties. Therefore, it is a "good" estimator.
If you want an estimate of central tendency as a
parameter for a test or for comparison, then small
sample sizes are unlikely to yield any stable estimate.
The mean is sensible in a symmetrical distribution, as a
measure of central tendency, but, e.g., with ten cases
you will not be able to judge whether you have a
symmetrical distribution. However, the mean estimate is useful if you are trying to estimate the population sum, or some other function of the expected value of
the distribution. Would the median be a better
measure? In some distributions (e.g., shirt size) the
mode may be better. Box-plot will indicate outliers in
the data set. If there are outliers, median is better
than mean as a measure of the central tendency.
If you have a yes/no question you probably want to
calculate a proportion p of yeses (or noes). Under
simple random sampling, the variance of p is p(1-p)/n, ignoring the finite population correction. Now an approximate 95% confidence interval is p ± 1.96 [p(1-p)/n]^½. A conservative interval can be calculated assuming p(1-p) takes its maximum value, which it does when p = 1/2. Replace 1.96 by 2, put p = 1/2, and you have a 95% confidence interval of p ± 1/n^½. This approximation works well as long as p is not too close to 0 or 1. This useful approximation allows you to calculate approximate 95% confidence intervals.
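A small Python sketch of this interval (the counts are hypothetical):

def proportion_ci(successes, n, z=1.96):
    p = successes / n
    half_width = z * (p * (1 - p) / n) ** 0.5     # margin of error for the proportion
    return p - half_width, p + half_width

print(proportion_ci(520, 1000))     # roughly 0.52 +/- 0.031
print(1 / 1000 ** 0.5)              # conservative half-width 1/sqrt(n), about 0.032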
Conditions Under Which Most Statistical Testings Apply
Don't just learn formulas and number-crunching: learn
about the conditions under which statistical testing
procedures apply. The following conditions are common
to almost all tests:
1. homogeneous population (see if there are more
than one mode)
2. sample must be random (to test this, perform the
Runs Test).
3. In addition to requirement No. 1, each population
has a normal distribution (perform Test for
Normality)
4. Homogeneity of variances: variation in each population is almost the same as in the others. For 2 populations use the F-test. For 3 or more populations, there is a practical rule known as the "Rule of 2": divide the highest sample variance by the lowest sample variance. If the sample sizes are almost the same and this ratio is less than 2, then the variations of the populations are taken to be almost the same.
Notice: This important condition in analysis of
variance (ANOVA and the t-test for mean
differences) is commonly tested by the Levene or
its modified test known as the Brown-Forsythe
test. Unfortunately, both tests rely on the
homogeneity of variances assumption!
These assumptions are crucial, not for the method/computation, but for the testing based on the resultant statistic. Otherwise, we can do, for example, ANOVA and regression without any assumptions, and the numbers come out the same -- simple computations give us least-squares fits, partitions of variance, regression coefficients, and so on. Only when testing do we need the assumptions about independence and the homogeneous distribution of the error terms, known as residuals.
Homogeneous Population
Homogeneous Population: A homogeneous population
is a statistical population which has a unique mode.
To determine if a given population is homogeneous or not, construct the histogram of a random sample from the entire population. If there is more than one mode, then you have a mixture of populations. Know that to perform any statistical testing, you need to make sure you are dealing with a homogeneous population.
Test for Randomness: The Runs Test
A "run" is a maximal subsequence of like elements.
Consider the following sequence (D for Defective, N for
non-defective items) out of a production line:
DDDNNDNDNDDD. Number of runs is R = 7, with n1 =
8, and n2 = 4 which are number of D's and N's
(whichever).
A sequence is random if it is neither "over-mixed" nor "under-mixed". An example of an over-mixed sequence is DDDNDNDNDNDD, with R = 9, while an under-mixed one looks like DDDDDDDDNNNN with R = 2. By contrast, the first sequence above seems to be random.
The Runs Test, which is also known as the Wald-Wolfowitz Test, is designed to test the randomness of a given sample at the 100(1-α)% confidence level. To conduct a runs test on a sample, perform the following steps:
Step 1: compute the mean of the sample.
Step 2: going through the sample sequence, replace any observation with + or - depending on whether it is above or below the mean. Discard any ties.
Step 3: compute R, n1, and n2.
Step 4: compute the expected mean and variance of R, as follows:
μ = 1 + 2n1n2 / (n1 + n2),
σ² = 2n1n2 (2n1n2 - n1 - n2) / [ (n1 + n2)² (n1 + n2 - 1) ].
Step 5: Compute z = (R - μ) / σ.
Step 6: Conclusion:
If z ≥ Zα, then there might be cyclic, seasonal behavior (over-mixing).
If z ≤ -Zα, then there might be a trend (under-mixing).
In either case, that is if z ≤ -Zα or z ≥ Zα, reject randomness.
Note: This test is valid for cases for which both n1 and
n2 are large, say greater than 10. For small sample
sizes special tables must be used.
The SPSS command for the runs test:
NPAR TEST RUNS(MEAN) X (the name of the variable).
For example, suppose for a given sample of size 50 we have R = 24, n1 = 14 and n2 = 36. Test for randomness at α = 0.05.
Plugging these into the above formulas, we have μ = 21.16, σ = 2.81, and z = (24 - 21.16) / 2.81 = 1.01. From the Z-table, we have Zα = 1.645. Since z lies between -1.645 and 1.645, we cannot reject the randomness of this sample.
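A compact Python sketch of these steps for a two-symbol sequence (for numerical data one would first code each observation as + or - relative to the mean, as in Step 2):

def runs_test(sequence):
    symbols = list(sequence)
    runs = 1 + sum(1 for a, b in zip(symbols, symbols[1:]) if a != b)   # R
    n1 = symbols.count(symbols[0])
    n2 = len(symbols) - n1
    mu = 1 + 2 * n1 * n2 / (n1 + n2)                                    # expected number of runs
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mu) / var ** 0.5
    return runs, n1, n2, z

print(runs_test("DDDNNDNDNDDD"))    # R = 7, n1 = 8, n2 = 4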
Visit the Web site Test for Randomness
Lilliefors Test for Normality
The following SPSS program computes the Kolmogorov-Smirnov-Lilliefors statistic, called LS. It can easily be converted and run on other platforms.
$SPSS/OUTPUT=L.OUT
TITLE
'K-S LILLIEFORS TEST FOR NORMALITY'
DATA LIST
FREE FILE='L.DAT'/X
VAR LABELS
X 'SAMPLE VALUES'
LIST CASE
CASE=20/VARIABLES=ALL
CONDESCRIPTIVE X(ZX)
LIST CASE CASE=20/VARIABLES=X ZX/
SORT CASES BY ZX(A)
RANK VARIABLES=ZX/RFRACTION INTO CRANK/TIES=HIGH
COMPUTE Y=CDFNORM(ZX)
COMPUTE SPROB=CRANK
COMPUTE DA=Y-SPROB
COMPUTE DB=Y-LAG(SPROB,1)
COMPUTE DAABS=ABS(DA)
COMPUTE DBABS=ABS(DB)
COMPUTE LS=MAX(DAABS,DBABS)
LIST VARIABLES=X,ZX,Y,SPROB,DA,DB
LIST VARIABLES=LS
SORT CASES BY LS(D)
LIST CASES CASE=1/VARIABLES=LS
FINISH
The output is the statistic LS, which should be compared with the following critical values after setting a significance level α (the critical value is a function of the sample size n).
Critical Values for the Lilliefors Test
Significance Level     Critical Value
α = 0.15               0.775 / ( n^½ - 0.01 + 0.85 n^-½ )
α = 0.10               0.819 / ( n^½ - 0.01 + 0.85 n^-½ )
α = 0.05               0.895 / ( n^½ - 0.01 + 0.85 n^-½ )
α = 0.025              0.995 / ( n^½ - 0.01 + 0.85 n^-½ )
A normal probability plot will also help you detect a systematic departure from normality, which shows up as a curve. In SAS, do a PROC UNIVARIATE with the NORMAL and PLOT options. The Bera-Jarque test, which is widely used by econometricians, might also be applicable.
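If Python is available, the statsmodels package offers a ready-made version of this test (shown here as a sketch with simulated data):

import numpy as np
from statsmodels.stats.diagnostic import lilliefors

x = np.random.default_rng(3).normal(loc=10, scale=2, size=60)   # illustrative sample
ks_stat, p_value = lilliefors(x, dist='norm')                    # Kolmogorov-Smirnov-Lilliefors statistic
print(ks_stat, p_value)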
Further Reading
Statistical inference by normal probability paper, by T.
Takahashi, Computers & Industrial Engineering, Vol.
37, Iss. 1 - 2, pp 121-124, 1999.
Bonferroni Method
One may combine several t-tests by using the
Bonferroni method. It works reasonably well when
there are only a few tests, but as the number of
comparisons increases above 8, the value of 't'
required to conclude that a difference exists becomes
much larger than it really needs to be and the method
becomes overly conservative.
One way to make the Bonferroni t test less conservative is to use the estimate of the population variance computed from within the groups in the analysis of variance:
t = ( x̄1 - x̄2 ) / ( VW / n1 + VW / n2 )^½,
where VW is the population variance computed from within the groups.
Chi-square Tests
The Chi-square is a distribution, as is the Normal and
others. The Normal (or Gaussian or bell-shaped) often
occurs naturally in real life. When we know the mean
and variance of a Normal then it allows us to find
probabilities. So if, for example, you knew some things about the average height of women in the nation (including the fact that heights are distributed normally), you could measure all the women in your extended family, find the average height, and determine a probability associated with your result; if
the probability of getting your result, given your
knowledge of women nationwide, is high, then your
family's female height cannot be said to be different
from average. If that probability is low, then your
result is rare (given the knowledge about women
nationwide), and you can say your family is different.
You've just completed a test of the hypothesis that the
average height of women in your family is different
from the overall average.
There are other (similar) tests where finding that
probability means NOT using Normal distribution. One
of these is a Chi-square test. For instance, if you tested
the variance of your family's female heights (which is
analogous to your previous test of the mean), you
can't assume that the normal distribution is
appropriate to use. This should make sense, since the
Normal is bell-shaped, and variances have a lower limit
of zero. So, while a variance could be any huge
number, it gets bounded on the low side by zero. If
you were to test whether the variance of heights in
your family is different from the nation, a Chi-square
test happens to be appropriate, given the original conditions above. The formula and procedure are in your textbook.
Crosstables: The variance is not the only thing for which you use a Chi-square test. Often it is used to test the relationship between two categorical variables, or the independence of two variables, such as cigarette smoking and drug use. If you were to survey 1000 people on whether or not they smoke and whether or not they use drugs, each person gives one of four answers: (no, no), (no, yes), (yes, no), (yes, yes).
By compiling the number of people in each category, you can ultimately test whether drug usage is independent of cigarette smoking by using the Chi-square distribution (this is approximate, but works well). Again, the methodology for this is in your textbook. The degrees of freedom equal (number of rows - 1)(number of columns - 1). That is, only this many cell counts are free to vary; the rest are determined by the row and column totals.
Don't forget the conditions for the validity of the Chi-square test, which requires expected values greater than 5 in 80% or more of the cells. Otherwise, one could use an "exact" test, using either a permutation or resampling approach. Both SPSS and SAS are capable of doing such tests.
For a 2-by-2 table, you should use the Yates correction
to the chi-square. Chi-square distribution is used as an
approximation of the binomial distribution. By applying
a continuity correction we get a better approximation
to the binomial distribution for the purposes of
calculating tail probabilities.
Use a relative risk measure such as the risk ratio or
odds ratio. In the 2-by-2 table:
a  b
c  d
the most usual measures are:
Rate difference: a/(a+c) - b/(b+d)
Rate ratio: [a/(a+c)] / [b/(b+d)]
Odds ratio: ad / (bc)
The rate difference and rate ratio are appropriate when
you are contrasting two groups, whose sizes (a+c and
b+d) are given. The odds ratio is for when the issue is
association rather than difference. Confidence interval methods are available for all of these, though not as widely implemented in software as they should be. If the
hypothesis test is highly significant, the confidence
interval will be well away from the null hypothesis
value (0 for the rate difference, 1 for the rate ratio or
odds ratio).
The risk ratio is the ratio of the proportion (a/(a+b)) to
the proportion (c/(c+d)):
RR = (a / (a + b)) / (c / (c + d))
RR is thus a measure of how much larger the proportion in the first row is compared to the second row; it ranges from 0 to infinity, with RR less than 1.00 indicating a 'negative' association [a/(a+b) < c/(c+d)], RR = 1.00 indicating no association [a/(a+b) = c/(c+d)], and RR greater than 1.00 indicating a 'positive' association [a/(a+b) > c/(c+d)]. The further from 1.00, the stronger the association. Most statistics packages will calculate the RR and confidence intervals for you. A related measure is the odds ratio (or cross-product ratio), which is (a/b)/(c/d).
You could also look at the phi (φ) statistic, which is:
φ = (χ² / N)^½
where χ² is the Pearson chi-square and N is the sample size. This statistic ranges between 0 and 1 and can be interpreted like the correlation coefficient.
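These measures are straightforward to compute from the four cell counts; a Python sketch with hypothetical counts:

def two_by_two_measures(a, b, c, d):
    rate_difference = a / (a + c) - b / (b + d)
    rate_ratio = (a / (a + c)) / (b / (b + d))
    odds_ratio = (a * d) / (b * c)
    risk_ratio = (a / (a + b)) / (c / (c + d))
    return rate_difference, rate_ratio, odds_ratio, risk_ratio

# hypothetical 2-by-2 counts, e.g. rows = smoker yes/no, columns = drug use yes/no
print(two_by_two_measures(80, 120, 40, 760))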
Visit Critical Values for the Chi- square Distribution
Visit also, the Web sites Exact Unconditional
Tests, Statistical tests
Reference:
Fleiss J., Statistical Methods for Rates and Proportions,
Wiley, 1981.
Goodness-of-fit Test for Discrete Random
Variables
There are other tests which might use the Chi-square, such as the goodness-of-fit test for discrete random variables. Again, don't forget the conditions for the validity of the Chi-square test: the expected values should be greater than 5 in 80% or more of the cells. The Chi-square here is a statistical test that measures "goodness-of-fit"; in other words, it measures how much the observed or actual frequencies differ from the expected or predicted frequencies. Using a Chi-square table will enable you to discover how significant the difference is. A null hypothesis in the context of the Chi-square test is the model that you use to calculate your expected or predicted values. If the value you get from calculating the Chi-square statistic is sufficiently high (as compared to the values in the Chi-square table), it tells you that your null hypothesis is probably wrong.
Let Y1, Y2, ..., Yn be a set of independent and identically distributed random variables. Assume that the probability distribution of the Yi's has the density function fo(y). We can divide the set of all possible values of Yi, i ∈ {1, 2, ..., n}, into m non-overlapping intervals D1, D2, ..., Dm. Define the probability values p1, p2, ..., pm as:
p1 = P(Yi ∈ D1)
p2 = P(Yi ∈ D2)
:
pm = P(Yi ∈ Dm)
Since the union of the mutually exclusive intervals D1, D2, ..., Dm is the set of all possible values for the Yi's, (p1 + p2 + .... + pm) = 1. Define the set of discrete random variables X1, X2, ..., Xm, where
X1 = number of Yi's whose value ∈ D1
X2 = number of Yi's whose value ∈ D2
:
Xm = number of Yi's whose value ∈ Dm
and (X1 + X2 + .... + Xm) = n. Then the set of discrete random variables X1, X2, ..., Xm will have a multinomial probability distribution with parameters n and the set of probabilities {p1, p2, ..., pm}. If the intervals D1, D2, ..., Dm are chosen such that npi ≥ 5 for i = 1, 2, ..., m, then
C = Σ (Xi - npi)² / npi, where the sum is over i = 1, 2, ..., m,
is distributed as χ² with m - 1 degrees of freedom.
For the goodness-of-fit sample test, we formulate the null and alternative hypotheses as
Ho: fY(y) = fo(y)
H1: fY(y) ≠ fo(y)
At the α level of significance, Ho will be rejected in favor of H1 if
C = Σ (Xi - npi)² / npi is greater than the critical value of χ² with m - 1 degrees of freedom.
However, it is possible that in a goodness-of-fit test, one or more of the parameters of fo(y) are unknown. Then the probability values p1, p2, ..., pm will have to be estimated by assuming that Ho is true and calculating their estimated values from the sample data. That is, another set of probability values p'1, p'2, ..., p'm will need to be computed so that the values (np'1, np'2, ..., np'm) are the estimated expected values of the multinomial random variable (X1, X2, ..., Xm). In this case, the random variable C will still have a chi-square distribution, but its degrees of freedom will be reduced. In particular, if the density function fo(y) has r unknown parameters,
C = Σ (Xi - np'i)² / np'i is distributed as χ² with m - 1 - r degrees of freedom.
For this goodness-of-fit test, we formulate the null and alternative hypotheses as
Ho: fY(y) = fo(y)
H1: fY(y) ≠ fo(y)
At the α level of significance, Ho will be rejected in favor of H1 if C is greater than the critical value of χ² with m - 1 - r degrees of freedom.
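As a sketch (assuming the scipy package), here is a goodness-of-fit test of whether a die is fair, using 120 hypothetical rolls; no parameters are estimated, so the degrees of freedom are m - 1 = 5:

from scipy import stats

observed = [25, 17, 15, 23, 24, 16]        # hypothetical counts of faces 1..6
expected = [120 / 6] * 6                    # npi = 20 for each face, all >= 5
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2, p_value)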
Using chi-square in a 2x2 table requires Yates's correction. One first subtracts 0.5 from the absolute differences between observed and expected frequencies for each cell before squaring, dividing by the expected frequency, and summing. The formula for the chi-square value in a 2x2 table can be derived from the normal-theory comparison of the two proportions in the table, using the total incidence to produce the standard errors. The rationale of the correction is a better equivalence between the area under the normal curve and the probabilities obtained from the discrete frequencies. In other words, the simplest correction is to move the cut-off point for the continuous distribution from the observed value of the discrete distribution to midway between that and the next value in the direction of the null hypothesis expectation. Therefore, the correction essentially applies only to 1 df tests, where the "square root" of the chi-square looks like a "normal/t-test" statistic and where a direction can be attached to the 0.5 adjustment.
For more, visit the Web sites Chi-Square Lesson,
and Exact Unconditional Tests.
Statistics with Confidence
In practice, a confidence interval is used to express the
uncertainty in a quantity being estimated. There is
uncertainty because inferences are based on a random
sample of finite size from the entire population or
process of interest. To judge the statistical procedure
we can ask what would happen if we were to repeat
the same study, over and over, getting different data
(and thus different confidence intervals) each time.
In most studies investigators are usually interested in
determining the size of difference of a measured
outcome between groups, rather than a simple
indication of whether or not it is statistically significant.
Confidence intervals present a range of values, on the
basis of the sample data, in which the population value
for such a difference may lie.
Know that a confidence interval computed from one
sample will be different from a confidence interval
computed from another sample.
Understand the relationship between sample size and
width of confidence interval.
Know that sometimes the computed confidence interval
does not contain the true mean value (that is, it is
incorrect) and understand how this coverage rate is
related to confidence level.
Just a word of interpretive caution. Let's say you
compute a 95% confidence interval for a mean . The
way to interpret this is to imagine an infinite number of
samples from the same population, 95% of the
computed intervals will contain the population
mean . However, it is wrong to state, "I am 95%
confident that the population mean falls within the
interval."
Again, the usual definition of a 95% confidence interval
is an interval constructed by a process such that the
interval will contain the true value 95% of the time.
This means that "95%" is a property of the process,
not the interval.
Is the probability of occurrence of the population mean
greater in the confidence interval center and lowest at
the boundaries? Does the probability of occurrence of
the population mean in a confidence interval vary in a
measurable way from the center to the boundaries? In
a general sense, normality is assumed, and then the
interval between CI limits is represented by a bell
shaped t distribution. The expectation (E) of another
value is highest at the calculated mean value, and
decreases as the values approach the CI interval limits.
An approximation for the single-measurement tolerance interval is √n times the confidence interval of the mean.
Determining sample size: At the planning stage of a
statistical investigation the question of sample size (n)
is critical. The above figure also provides a practical
guide to sample size determination in the context of
statistical estimations and statistical significance tests.
The confidence level of conclusions drawn from a set of data depends on the size of the data set. The larger the sample, the higher the associated confidence. However, larger samples also require more effort and resources. Thus, your goal must be to find the smallest sample size that will provide the desired confidence. In the above figure, formulas are presented for determining the sample size required to achieve a given level of accuracy and confidence.
In estimating the sample size, when the standard deviation is not known, one may use 1/4 of the range, for samples of size over 30, as a "good" estimate for the standard deviation. It is good practice to compare the result with IQR/1.349.
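As a sketch of the idea (the notes' own figure with the formulas is not reproduced here), the common textbook formula n = (z·sigma/E)² for estimating a mean with margin of error E can be coded as follows; the numbers are only illustrative, with sigma taken as range/4.

import math

def sample_size_for_mean(sigma, margin_of_error, z=1.96):
    # n = (z * sigma / E)^2, rounded up to the next whole observation
    return math.ceil((z * sigma / margin_of_error) ** 2)

print(sample_size_for_mean(sigma=40 / 4, margin_of_error=2))   # assumed range of 40, E = 2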
A Note on Multiple Comparisons via the Individual Intervals: Notice that if the confidence intervals from two samples do not overlap, there is a statistically significant difference, say at 5%. However, the converse is not true: two confidence intervals can overlap quite a lot, yet there can still be a significant difference between them. One should examine the confidence interval for the difference explicitly. Even if the C.I.'s are overlapping, it is hard to find the exact overall confidence level; however, the sum of the individual error rates can serve as an upper limit on the overall error rate. This is evident from the fact that P(A or B) ≤ P(A) + P(B).
Further Reading
Hahn G. and W. Meeker, Statistical Intervals: A Guide
for Practitioners, Wiley, 1991.
Also visit the Web sites Confidence Interval
Applet, statpage.
Entropy Measure
Inequality coefficients used in sociology, economy,
biostatistics, ecology, physics, image analysis and
information processing are analyzed in order to shed
light on economic disparity world-wide. Variability of
categorical data is measured by the entropy function:
E = - Σ pi ln(pi),
where the sum is over all categories and pi is the
relative frequency of the ith category. It is interesting
to note that this quantity is maximized when all the pi's
are equal.
For an r×c contingency table it is:
E = Σ pij ln(pij) - Σi (Σj pij) ln(Σj pij) - Σj (Σi pij) ln(Σi pij),
where the first sum is over all i and j, and the marginal sums are over j and over i, respectively.
Another measure is the Kullback-Leibler distance
(related to information theory):
Σ (Pi - Qi) log(Pi/Qi) = Σ Pi log(Pi/Qi) + Σ Qi log(Qi/Pi),
or the variation distance:
Σ | Pi - Qi | / 2,
where Pi and Qi are the probabilities for the ith
category in the two populations.
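As a small numerical sketch of these measures (assuming NumPy is available; the probability vectors below are hypothetical):

import numpy as np

def entropy(p):
    # E = -sum p_i ln(p_i); maximized when all categories are equally likely
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                # convention: 0*ln(0) = 0
    return -np.sum(p * np.log(p))

def symmetrized_kl(p, q):
    # Kullback-Leibler (Jeffreys) distance: sum (P_i - Q_i) ln(P_i/Q_i)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) * np.log(p / q))

def variation_distance(p, q):
    # Variation distance: sum |P_i - Q_i| / 2
    return 0.5 * np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))

p = [0.2, 0.3, 0.5]
q = [0.3, 0.3, 0.4]
print(entropy(p), symmetrized_kl(p, q), variation_distance(p, q))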
For more on entropy visit the Web sites Entropy on
WWW, Entropy and Inequality Measures,
and Biodiversity.
What Is Central Limit Theorem?
The central limit theorem (CLT) is a "limit" that is
"central" to statistical practice. For practical purposes,
the main idea of the CLT is that the average (center of
data) of a sample of observations drawn from some
population is approximately distributed as a normal
distribution if certain conditions are met. In theoretical
statistics there are several versions of the central limit
theorem depending on how these conditions are
specified. These are concerned with the types of
conditions made about the distribution of the parent
population (population from which the sample is
drawn) and the actual sampling procedure.
One of the simplest versions of the theorem says that
if we take a random sample of size n from the entire
population, then the sample mean, which is a random
variable defined by Σ xi / n, has a histogram which
converges to a normal distribution shape if n is large
enough (say, more than 30). Equivalently, the distribution
of the sample mean approaches a normal distribution
as the sample size increases.
In applications of the central limit theorem to practical
problems in statistical inference, however, statisticians
are more interested in how closely the approximate
distribution of the sample mean follows a normal
distribution for finite sample sizes, than the limiting
distribution itself. Sufficiently close agreement with a
normal distribution allows statisticians to use normal
theory for making inferences about population
parameters (such as the mean ) using the sample
mean, irrespective of the actual form of the parent
population.
It can be shown that, if the parent population has
mean µ and finite standard deviation σ, then the
sample mean distribution has the same mean µ but a
smaller standard deviation, namely σ divided by n½.
You know by now that, whatever the parent population
is, the standardized variable will have a distribution
with a mean = 0 and standard deviation =1 under
random sampling. Moreover, if the parent population is
normal, then z is distributed exactly as a standard
normal variable. The central limit theorem states the
remarkable result that, even when the parent
population is non-normal, the standardized variable is
approximately normal if the sample size is large
enough. It is generally not possible to state conditions
under which the approximation given by the central
limit theorem works and what sample sizes are needed
before the approximation becomes good enough. As a
general guideline, statisticians have used the
prescription that if the parent distribution is symmetric
and relatively short-tailed, then the sample mean
reaches approximate normality for smaller samples
than if the parent population is skewed or long-tailed.
Under certain conditions, in large samples, the
sampling distribution of the sample mean can be
approximated by a normal distribution. The sample
size needed for the approximation to be adequate
depends strongly on the shape of the parent
distribution. Symmetry (or lack thereof) is particularly
important.
For a symmetric parent distribution, even if very
different from the shape of a normal distribution, an
adequate approximation can be obtained with small
samples (e.g., 10 or 12 for the uniform distribution).
For symmetric short-tailed parent distributions, the
sample mean reaches approximate normality for
smaller samples than if the parent population is
skewed and long-tailed. In some extreme cases (e.g.,
a binomial distribution with p very close to 0 or 1),
sample sizes far exceeding the typical guidelines
(e.g., 30 or 60) are needed for an adequate
approximation. For some distributions without first and
second moments (e.g., Cauchy), the central limit
theorem does not hold.
For some distributions, extremely large (impractical)
samples would be required to approach a normal
distribution. In manufacturing, for example, when
defects occur at a rate of less than 100 parts per
million, using a Beta distribution yields an honest CI
for the total number of defects in the population.
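A quick way to see the theorem at work is a small simulation. The following Python sketch (assuming NumPy is available) draws many samples of size 30 from a strongly skewed exponential parent and looks at the distribution of the sample means.

import numpy as np

rng = np.random.default_rng(0)

n, replications = 30, 10_000
# Parent population: exponential with mean 1 and standard deviation 1 (skewed)
sample_means = rng.exponential(scale=1.0, size=(replications, n)).mean(axis=1)

# By the CLT the sample mean is roughly normal with mean 1 and sd 1/sqrt(n)
print(sample_means.mean())            # close to 1
print(sample_means.std(ddof=1))       # close to 1/sqrt(30), about 0.18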
Review also Central Limit Theorem Applet, Sampling
Distribution Simulation, and CLT.
What Is a Sampling Distribution
The sampling distribution describes probabilities
associated with a statistic when a random sample is
drawn from the entire population.
The sampling distribution is the probability distribution
or probability density function of the statistic.
Derivation of the sampling distribution is the first step
in calculating a confidence interval or carrying out a
hypothesis test for a parameter.
Example: Suppose that x1, ..., xn are a simple
random sample from a normally distributed population
with expected value µ and known variance σ². Then the
sample mean, a statistic used to give information
about the population parameter µ, is normally distributed
with expected value µ and variance σ²/n.
The main idea of statistical inference is to take a
random sample from the entire population and then to
use the information from the sample to make
inferences about particular population characteristics
such as the mean µ (a measure of central tendency), the
standard deviation σ (a measure of spread), or the
proportion of units in the population that have a certain
characteristic. Sampling saves money, time, and effort.
Additionally, a sample can, in some cases, provide as
much or more accuracy than a corresponding study
that would attempt to investigate an entire population;
careful collection of data from a sample will often
provide better information than a less careful study
that tries to look at everything.
One must also study the behavior of the mean of
sample values from different specified populations.
Because a sample examines only part of a population,
the sample mean will not exactly equal the
corresponding mean of the population, µ. Thus, an
important consideration for those planning and
interpreting sampling results is the degree to which
sample estimates, such as the sample mean, will agree
with the corresponding population characteristic.
In practice, only one sample is usually taken (in some
cases a small "pilot sample" is used to test the
data-gathering mechanisms and to get preliminary
information for planning the main sampling scheme).
However, for purposes of understanding the degree to
which sample means will agree with the corresponding
population mean µ, it is useful to consider what would
happen if 10, or 50, or 100 separate sampling studies,
of the same type, were conducted. How consistent
would the results be across these different studies? If
we could see that the results from each of the samples
would be nearly the same (and nearly correct!), then
we would have confidence in the single sample that will
actually be used. On the other hand, seeing that
answers from the repeated samples were too variable
for the needed accuracy would suggest that a different
sampling plan (perhaps with a larger sample size)
should be used.
A sampling distribution is used to describe the
distribution of outcomes that one would observe from
replication of a particular sampling plan.
Know that estimates computed from one sample will be
different from estimates that would be computed from
another sample.
Understand that estimates are expected to differ from
the population characteristics (parameters) that we are
trying to estimate, but that the properties of sampling
distributions allow us to quantify, based on probability,
how they will differ.
Understand that different statistics have different
sampling distributions with distribution shape
depending on (a) the specific statistic, (b) the sample
size, and (c) the parent distribution.
Understand the relationship between sample size and
the distribution of sample estimates.
Understand that the variability in a sampling
distribution can be reduced by increasing the sample
size.
See that in large samples, many sampling distributions
can be approximated with a normal distribution.
To learn more, visit the Web sites Sample,
and Sampling Distribution Applet
Applications of and Conditions for Using
Statistical Tables
Some widely used applications of the popular statistical
tables can be categorized as follows:
Z - Table: Tests concerning µ for one or two populations
based on their large-size random sample(s)
(say, n ≥ 30, to invoke the Central Limit Theorem).
Tests concerning proportions, with large random
sample size n (say, n ≥ 50, to invoke a convergence
theorem).
Conditions for using this table: A test for randomness
of the data is needed before using this table. A test for
normality of the sample distribution is also needed if
the sample size is small, since then it may not be possible to
invoke the Central Limit Theorem.
T - Table: Tests concerning µ for one or two populations
based on small random sample size(s).
Tests concerning regression coefficients (slope and
intercept), df = n - 2.
Notes: As you know by now, in tests of hypotheses
concerning µ, and in construction of confidence intervals
for it, we start with σ known, since the critical value
(and the p-value) of the Z-Table distribution can be
used. Considering the more realistic situation when
we don't know σ, the T-Table is used. In both cases we
need to verify the normality of the population's
distribution; however, if the sample size n is very
large, we can in fact switch back to the Z-Table by
virtue of the central limit theorem. For a perfectly normal
population, the t-distribution corrects for any errors
introduced by estimating σ with s when doing
inference.
Note also that, in hypothesis testing concerning the
parameter of binomial and Poisson distributions for
large sample sizes, the standard deviation is known
under the null hypotheses. That's why you may use the
normal approximations to both of these distributions.
Conditions for using this table: A test for randomness
of the data is needed before using this table. A test for
normality of the sample distribution is also needed if
the sample size is small, since then it may not be possible to
invoke the Central Limit Theorem.
Chi-Square - Table: Tests concerning σ² for one
population based on a random sample from the entire
population.
Contingency tables (test for independency of
categorical data).
Goodness-of-fit test for discrete random variables.
Conditions for using this table: Tests for
randomness of the data and normality of the sample
distribution are needed before using this table.
F - Table: ANOVA: Tests concerning µ for three or
more populations based on their random samples.
Tests concerning σ² for two populations based on their
random samples.
Overall assessment in regression analysis using the F-value.
Conditions for using this table: Tests for
randomness of the data and normality of the sample
distribution are needed before using this table for
ANOVA. Same conditions must be satisfied for the
residuals in regression analysis.
The following chart summarizes the application of
statistical tables with respect to tests of hypotheses and
construction of confidence intervals for the mean µ and
variance σ² in one, or in comparison of two or more,
populations.
Further Reading:
Kagan. A., What students can learn from tables of
basic distributions, Int. Journal of Mathematical
Education in Science & Technology, 30(6), 1999.
Statistical Tables on the Web:
The following Web sites provide critical values useful in
statistical testing and construction of confidence
intervals. The results are identical to those given in
statistics textbooks; however, in most cases they are
more extensive (and therefore more accurate).
Normal Curve Area
Normal Calculator
Normal Probability Calculation
Critical Values for the t-Distribution
Critical Values for the F-Distribution
Critical Values for the Chi-square Distribution
Read also
Kanji G., 100 Statistical Tests, Sage Publisher, 1995.
Relationships Among Distributions and Unification of
Statistical Tables
Particular attention must be paid to a first course in
statistics. When I first began studying statistics, it
bothered me that there were different tables for
different tests. It took me a while to learn that this is
not as haphazard as it appeared. Binomial, Normal,
Chi-square, t, and F distributions that you will learn
about are actually closely connected.
A problem with elementary statistical textbooks is that
they not only fail to provide information of this kind,
which would permit a useful understanding of the principles
involved, but they usually don't make these conceptual
links explicit. If you want to understand the
connections between statistical concepts, then you
should practice making these connections. Learning
by doing statistics lends itself to active rather than
passive learning. Statistics is a highly interrelated set
of concepts, and to be successful at it, you must learn
to make these links conscious in your mind.
Students often ask: Why are T-table values with d.f. = 1
so much larger compared with those for other d.f. values?
Some tables are limited; what should I do when the
sample size is too large? How can I become familiar with
the tables and their differences? Is there any kind of
integration among the tables? Are there any connections
between tests of hypotheses and confidence intervals
under different scenarios, for example, testing with
respect to one, two, or more than two populations? And so
on.
Further Reading:
Kagan. A., What students can learn from tables of
basic distributions, Int. Journal of Mathematical
Education in Science & Technology, 30(6), 1999.
The following two Figures demonstrate useful
relationships among distributions and a unification of
statistical tables:
Unification of Common Statistical Tables,
needs Acrobat to view
Relationship Among Commonly Used Distributions in
Testing, needs Acrobat to view
Normal Distribution
Up to this point we have been concerned with how
empirical scores are distributed and how best to
describe the distribution. We have discussed several
different measures, but the mean will be the
measure that we use to describe the center of the
distribution and the standard deviation will be the
measure we use to describe the spread of the
distribution. Knowing these two facts gives us ample
information to make statements about the probability
of observing a certain value within that distribution. If I
know, for example, that the average I.Q. score is 100
with a standard deviation of σ = 20, then I know that
someone with an I.Q. of 140 is very smart. I know this
because 140 deviates from the mean by twice the
average amount of the rest of the scores in the
distribution. Thus, it is unlikely to see a score as
extreme as 140, because most of the I.Q. scores are
clustered around 100 and on average only deviate 20
points from the mean µ.
Many applications arise from the central limit theorem
(the average of n observations approaches a normal
distribution, irrespective of the form of the original
distribution, under quite general conditions).
Consequently, the normal distribution is an appropriate
model for many, but not all, physical phenomena:
the distribution of physical measurements on living
organisms, intelligence test scores, product
dimensions, average temperatures, and so on.
Know that the Normal distribution satisfies seven
requirements: the graph is a bell-shaped curve; the
mean, median, and mode are equal and located at the
center of the distribution; it has only one mode; it is
symmetric about the mean; it is continuous; it never
touches the x-axis; and the area under the curve equals one.
Many methods of statistical analysis presume a normal
distribution.
Normal Curve Area.
What Is So Important About the Normal
Distributions?
Normal Distribution (called also Gaussian) curves,
which have a bell-shaped appearance (it is sometimes
even referred to as the "bell-shaped curves") are very
important in statistical analysis. In any normal
distribution, observations are distributed
symmetrically around the mean: 68% of all values
under the curve lie within one standard deviation of the
mean and 95% lie within two standard deviations.
There are many reasons for their popularity. The
following are the most important reasons for its
applicability:
1. One reason the normal distribution is important is
that a wide variety of naturally occurring
random variables such as heights and weights
of all creatures are distributed evenly around a
central value, average, or norm (hence, the name
normal distribution). Although the distributions
are only approximately normal, they are usually
quite close.
When there are many factors
influencing the outcome of a random phenomenon,
the underlying distribution tends to be approximately
normal. For example, the height of a tree is
determined by the "sum" of such factors as rain,
soil quality, sunshine, disease, etc.
As Francis Galton wrote in 1889, "Whenever a
large sample of chaotic elements are taken in
hand and marshaled in the order of their
magnitude, an unsuspected and most beautiful
form of regularity proves to have been latent all
along."
Visit the Web sites Quincunx (with 5 influencing
factors), Central Limit Theorem (with
8 influencing factors), or BallDrop for demos.
2. Almost all statistical tables are limited by the
size of their parameters. However, when these
parameters are large enough one may use normal
distribution for calculating the critical values for
these tables. Visit Relationship Among Statistical
Tables and Their Applications (pdf version).
3. If the mean and standard deviation of a normal
distribution are known, it is easy to convert back
and forth from raw scores to percentiles.
4. It's characterized by two independent
parameters--mean and standard deviation.
Therefore many effective transformations can
be applied to convert almost any shaped
distribution into a normal one.
5. The most important reason for popularity of
normal distribution is the Central Limit
Theorem (CLT). The distribution of the sample
averages of a large number of independent
random variables will be approximately
normal regardless of the distributions of the
individual random variables. Visit also the Web
sites Central Limit Theorem Applet, Sampling
Distribution Simulation, and CLT, for some
demos.
6. The other reason the normal distributions are so
important is that the normality condition is
required by almost all kinds of
parametric statistical tests. The CLT is a useful
tool when you are dealing with a population with
unknown distribution. Often, you may analyze the
mean (or the sum) of a sample of size n. For
example instead of analyzing the weights of
individual items you may analyze the batch of
size n, that is, the packages each containing n
items.
What is a Linear Least Squares Model?
Many problems in analyzing data involve describing
how variables are related. The simplest of all models
describing the relationship between two variables is a
linear, or straight-line, model. Linear regression is
always linear in the coefficients being estimated, not
necessarily linear in the variables.
The simplest method of drawing a linear model is to
"eye-ball" a line through the data on a plot, but a more
elegant, and conventional, method is that of least
squares, which finds the line minimizing the sum of
squared vertical distances between the observed points
and the fitted line.
Realize that fitting the "best" line by eye is difficult,
especially when there is a lot of residual variability in
the data.
Know that there is a simple connection between the
numerical coefficients in the regression equation and
the slope and intercept of regression line.
Know that a single summary statistic like a correlation
coefficient does not tell the whole story. A scatterplot is
an essential complement to examining the relationship
between the two variables.
Again, the regression line is a group of estimates for
the variable plotted on the Y-axis. It has the form y =
a + mx, where m is the slope of the line. The slope is the
rise over the run: if a line goes up 2 for each 1 it goes over,
then its slope is 2.
Formulas:
xbar = Σ x(i)/n. This is just the mean of the x values.
ybar = Σ y(i)/n. This is just the mean of the y values.
Sxx = Σ (x(i) - xbar)² = Σ x(i)² - [Σ x(i)]²/n
Syy = Σ (y(i) - ybar)² = Σ y(i)² - [Σ y(i)]²/n
Sxy = Σ (x(i) - xbar)(y(i) - ybar) = Σ x(i)·y(i) - [Σ x(i)][Σ y(i)]/n
Slope: m = Sxy / Sxx
Intercept: b = ybar - m·xbar
The least squares regression line is:
y-predicted = yhat = mx + b
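The following Python sketch (assuming NumPy is available) applies these formulas to a small hypothetical data set:

import numpy as np

def least_squares_line(x, y):
    # Slope and intercept from the formulas above: m = Sxy/Sxx, b = ybar - m*xbar
    x, y = np.asarray(x, float), np.asarray(y, float)
    xbar, ybar = x.mean(), y.mean()
    sxx = np.sum((x - xbar) ** 2)
    sxy = np.sum((x - xbar) * (y - ybar))
    m = sxy / sxx
    b = ybar - m * xbar
    return m, b

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
m, b = least_squares_line(x, y)
print(m, b)        # slope about 1.96, intercept about 0.14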
The regression line goes through the mean-mean point,
that is, the point at the mean of the x values and the
mean of the y values. If you drew lines from the
mean-mean point out to each of the data points on the
scatter plot, each of those lines would have a slope.
The regression slope is the weighted mean of those
slopes, where the weights are the runs squared.
If you put in each x, the regression line gives you
an estimate for each y. Each estimate makes
an error. Some errors are positive and some are
negative. The sum of squares of the errors plus the
sum of squares of the estimates adds up to the sum of
squares of Y. The regression line is the line that
minimizes the variance of the errors. (The mean error
is zero, so this means that it minimizes the sum of the
squared errors.)
The reason for finding the best line is so that you can
make a reasonable prediction of what y will be if x is
known (not vice versa).
r2 is the variance of the estimates divided by the
variance of Y. r is ± the square root of r2. r is the size
of the slope of the regression line, in terms of standard
deviations. In other words, it is the slope if we use the
standardized X and Y. It is how many standard
deviations of Y you would go up, when you go one
standard deviation of X to the right.
Visit also the Web sites Simple Regression, Linear
Regression, Putting Points
Coefficient of Determination
Another measure of the closeness of the points to the
regression line is the Coefficient of Determination.
r2 = Syhat yhat / Syy
which is the amount of the squared deviation which is
explained by the points on the least squares regression
line.
When you have regression equations based on theory,
you should compare:
1. R-squared, that is, the percentage of variance
[in fact, sum of squares] in Y accounted for by the
variance in X captured by the model.
2. When you want to compare models of different
size (different numbers of independent variables
p and/or different sample sizes n) you must use
the Adjusted R-Squared, because the usual R-Squared
tends to grow with the number of independent
variables:
R²adj = 1 - (n - 1)(1 - R²)/(n - p - 1)
3. Prediction error or standard error.
4. Trends in error, 'observed - predicted', as a function
of control variables such as time; systematic
trends are not uncommon.
5. Extrapolations to interesting extreme conditions
of theoretical significance.
6. t-statistics on individual parameters.
7. Values of the parameters and their substantive
(content) underpinnings.
8. The F(df1, df2) value for overall assessment, where df1
(the numerator degrees of freedom) is the number of
linearly independent predictors in the assumed
model minus the number of linearly independent
predictors in the restricted model (i.e., the number
of linearly independent restrictions imposed on
the assumed model), and df2 (the denominator
degrees of freedom) is the number of
observations minus the number of linearly
independent predictors in the assumed model.
Homoscedasticity and Heteroscedasticity:
Homoscedasticity (homo = same, skedasis = scattering)
is a word used to describe the distribution of data
points around the line of best fit. The opposite term is
heteroscedasticity. Briefly, homoscedasticity means that
data points are distributed equally about the line of
best fit; that is, homoscedasticity means constancy
of variances for/over all the levels of the factors.
Heteroscedasticity means that the data points cluster or
clump above and below the line in a non-equal pattern.
You should find a discussion of these terms in any
decent statistics text that deals with least squares
regression. See, e.g., Testing Research Hypotheses
with the GLM, by McNeil, Newman and Kelly, 1996,
pages 174-176.
Finally, in statistics for business there exists an opinion
that with more than 4 parameters one can fit an
elephant, so that if one attempts to fit a curve that
depends on many parameters, the result should not be
regarded as very reliable.
If m1 and m2 are the slopes of the two regressions of y on x
and of x on y, respectively, then R² = m1·m2.
Logistic regression: Standard logistic regression is a
method for modeling binary data (e.g., does a person
smoke or not, does a person survive a disease or
not). Polytomous (multinomial) logistic regression models
more than two options (e.g., does a person take the bus,
drive a car, or take the subway; does an office use
WordPerfect, Word, or another package).
Test for equality of two slopes: Let m1 represent the
regression coefficient for explanatory variable X1 in
sample 1 with size n1. Let m2 represent the regression
coefficient for X1 in sample 2 with size n2. Let S1 and
S2 represent the associated standard error estimates.
Then, the quantity
(m1 - m2) / SQRT(S1² + S2²)
has the t distribution with df = n1 + n2 - 4.
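A minimal Python sketch of this test (assuming SciPy is available; the slopes, standard errors, and sample sizes below are hypothetical):

from math import sqrt
from scipy.stats import t

def compare_slopes(m1, se1, n1, m2, se2, n2):
    # t statistic and two-sided p-value for H0: slope1 = slope2, with df = n1 + n2 - 4
    t_stat = (m1 - m2) / sqrt(se1 ** 2 + se2 ** 2)
    df = n1 + n2 - 4
    p_value = 2 * t.sf(abs(t_stat), df)
    return t_stat, df, p_value

print(compare_slopes(m1=1.8, se1=0.30, n1=25, m2=1.1, se2=0.25, n2=30))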
Regression when both X and Y are in error: Simple
linear least-squares regression has among its conditions
that the data for the independent (X) variables are
known without error. In fact, the estimated results are
conditioned on whatever errors happened to be present
in the independent dataset. When the X-data have an
error associated with them, the result is to bias the
slope downwards. A procedure known as Deming
regression can handle this problem quite well; biased
slope estimates (due to error in X) can be avoided
using Deming regression.
Reference:
Cook and Weisberg, An Introduction to Regression
Graphics, Wiley, 1994
Regression Analysis: Planning, Development, and
Maintenance
I – Planning:
1. Define the problem, select response, suggest
variables
2. Are the proposed variables fundamental to the
problem, and are they measurable?
Can one get a complete set of observations at the
same time? Ordinary regression analysis does not
assume that the independent variables are
measured without error; however, the results are
conditioned on whatever errors happened to be
present in the independent dataset.
3. Is the problem potentially solvable?
4. Correlation matrix and first regression runs (for a
subset of data).
Find the basic statistics and the correlation matrix.
How difficult might this problem be?
Compute the Variance Inflation Factor, VIF = 1/(1 - rij),
for i, j = 1, 2, 3, ..., i ≠ j. For moderate VIF, say
between 2 and 8, you might be able to come up
with a 'good' model.
Inspect the rij's; one or two must be large. If all are
small, perhaps the ranges of the X variables are
too small.
5. Establish goals, prepare a budget and time table.
a - The final equation should have R² = 0.8 (say).
b - Coefficient of variation of, say, less than 0.10.
c - The number of predictors should not exceed p
(say, 3); for example, for p = 3 we need at least
30 points.
d - All estimated coefficients must be significant
at α = 0.05 (say).
e - No pattern in the residuals.
6. Are goals and budget acceptable?
II – Development of the Model:
1. Collect data, plot, try models, check the
quality of the data, check the assumptions.
2. Consult experts for criticism.
Plot new variables and examine the same fitted
model.
Transformed predictor variables may also be used.
3. Are the goals met?
Have you found "the best" model?
III – Validation and Maintenance of the Model:
1. Are the parameters stable over the sample space?
2. Is there lack of fit?
Are the coefficients reasonable?
Are any obvious variables missing?
Is the equation usable for control or for
prediction?
3. Maintenance of the Model.
One needs a control chart to check the model
periodically by statistical techniques.
Predicting Market Response
As applied researchers in business and economics,
faced with the task of predicting market response, we
seldom know the functional form of the response.
Perhaps market response is a nonlinear monotonic, or
even a non-monotonic function of explanatory
variables. Perhaps it is determined by interactions of
explanatory variables. An interaction is logically
independent of its components.
When we try to represent complex market relationships
within the context of a linear model, using appropriate
transformations of explanatory and response variables,
we learn how hard the work of statistics can be.
Finding reasonable models is a challenge, and
justifying our choice of models to our peers can be
even more of a challenge. Alternative specifications
abound.
Modern regression methods, such as generalized
additive models, multivariate adaptive regression
splines, and regression trees, have one clear
advantage: They can be used without specifying a
functional form in advance. These data-adaptive,
computer- intensive methods offer a more flexible
approach to modeling than traditional statistical
methods. How well do modern regression methods
perform in predicting market response? Based on the
results of simulation studies, some perform quite well.
How to Compare Two Correlation Coefficients?
The statistical test for H0: ρ1 = ρ2 is the following.
Compute
t = (z1 - z2) / [ 1/(n1 - 3) + 1/(n2 - 3) ]½, for n1, n2 > 3,
where
z1 = 0.5 ln( (1+r1)/(1-r1) ),
z2 = 0.5 ln( (1+r2)/(1-r2) ), and
n1 = the sample size associated with r1, and n2 = the sample size
associated with r2.
The distribution of the statistic t is approximately
N(0,1). So, you should reject H0 if |t| ≥ 1.96 at the
95% confidence level.
r is invariant under (positive) scaling and (any) shift. That is,
ax + c and by + d have the same r as x and y, for any
positive a and b.
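A small Python sketch of this procedure (assuming SciPy is available; the correlations and sample sizes are hypothetical):

from math import log, sqrt
from scipy.stats import norm

def compare_correlations(r1, n1, r2, n2):
    # Fisher z test for H0: rho1 = rho2; reject at the 5% level if |t| >= 1.96
    z1 = 0.5 * log((1 + r1) / (1 - r1))
    z2 = 0.5 * log((1 + r2) / (1 - r2))
    t = (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p_value = 2 * norm.sf(abs(t))               # the statistic is approximately N(0, 1)
    return t, p_value

print(compare_correlations(r1=0.60, n1=50, r2=0.30, n2=60))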
Procedures for Statistical Decision Making
The two most widely used measuring tools and
decision procedures in statistical decision making, are
Classical and Bayesian Approaches.
Classical Approach: the probability of finding this sample
statistic, or any statistic more unlikely, assuming the null
hypothesis is true. A small p-value by itself is not sufficient
evidence to reject the null hypothesis and accept the
alternative; the decision also requires a pre-set significance level.
As indicated in the above Figure, a type-I
error occurs when, based on your data, you reject
the null hypothesis when in fact it is true. The
probability of a type I error is the level of
significance of the test of hypothesis, and is
denoted by α.
A type II error occurs when you do not reject the
null hypothesis when it is in fact false. The
probability of a type-II error is denoted by β.
The quantity 1 - β is known as the Power of a
Test. The type-II error can be evaluated for any
specific alternative hypothesis stated in the form
"not equal to" as a competing hypothesis.
Bayesian Approach: Difference in expected gain
(loss) associated with taking various actions each
having an associated gain (loss) and a given Bayesian
statistical significance. This is standard Min/Max
decision theory using Bayesian strength of belief
assessments in the truth of the alternate hypothesis.
One would choose the action which minimizes expected
loss or maximizes expected gain (the risk function).
Hypothesis Testing: Rejecting a Claim
To perform a hypothesis testing, one must be very
specific about the test one wishes to perform. The null
hypothesis must be clearly stated, and the data must
be collected in a repeatable manner. Usually, the
sampling design will involve random, stratified random,
or regular distribution of study plots. If there is any
subjectivity, the results are technically not valid. All of
the analyses, including the sample size, significance
level, the time, and the budget, must be planned in
advance, or else the user runs the risk of "data diving".
Hypothesis testing is mathematical proof by
contradiction. For example, for a Student's t test
comparing 2 groups, we assume that the two groups
come from the same population (same means,
standard deviations, and in general same
distributions). Then we try our best to prove that
this assumption is false. Rejecting H0 means that either
H0 is false, or a rare event (one with probability α) has occurred.
The real question in statistics is not whether a null
hypothesis is correct, but whether it is close enough to
be used as an approximation.
Selecting Statistics
In most statistical tests concerning , we start by assuming
the 2 & higher moments (skewness, kurtosis) are equal. Then we
hypothesize that the 's are equal. Null hypothesis.
The "null" suggests no difference between group means, or no
relationship between quantitative variables, and so on.
Then we test with a calculated t-value. For simplicity, suppose we
have a 2 sided test. If the calculated t is close to 0, we say good,
as we expected. If the calculated t is far from 0, we say, "the
chance of getting this value of t, given my assumption of equal
populations, is so small that I will not believe the assumption. We
will say that the populations are not equal, specifically the means
are not equal."
Sketch a normal distribution with mean µ1 - µ2 and
standard deviation s. If the null hypothesis is true, then the mean
is 0. We calculate the 't' value, as per the equation. We look up a
"critical" value of t. The probability of calculating a t value more
extreme (+ or -) than this, given that the null hypothesis is
true, is equal to or less than the α risk we used in pulling the critical
value from the table. Mark the calculated t, and the critical t (both
sides), on the sketch of the distribution. Now, if the calculated t is
more extreme than the critical value, we say, "the chance of
getting this t by sheer chance, when the null hypothesis is true,
is so small that I would rather say the null hypothesis is false,
and accept the alternative, that the means are not equal." When
the calculated value is less extreme than the critical value, we
say, "I could get this value of t by sheer chance often enough
that I will not write home about it. I cannot detect a difference in
the means of the two groups at the α significance level."
In this test we need (among others) the condition that the
population variances are equal (i.e., the treatment impacts central
tendency but not variability). However, this test is robust to
violations of that condition if the n's are large and almost the same
size. A counter-example would be to try a t-test between (11, 12,
13) and (20, 30, 40). The pooled and unpooled tests both give t
statistics of 3.10, but the degrees of freedom are different: 4
(pooled) or about 2 (unpooled). Consequently, the pooled test
gives p = .036 and the unpooled p = .088. We could go down to
n = 2 and get something still more extreme.
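You can reproduce this counter-example with a short Python sketch (assuming SciPy is available); the t statistic is negative here only because the first group has the smaller mean.

from scipy import stats

group1 = [11, 12, 13]
group2 = [20, 30, 40]

pooled = stats.ttest_ind(group1, group2, equal_var=True)     # df = 4
welch = stats.ttest_ind(group1, group2, equal_var=False)     # df about 2 (unpooled)

print(pooled.statistic, pooled.pvalue)     # t about -3.10, p about 0.036
print(welch.statistic, welch.pvalue)       # t about -3.10, p about 0.088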
The Classical Approach to the Test of Hypotheses
In this treatment there are two parties: one party (or a person)
sets out the null hypothesis (the claim), an alternative hypothesis
is proposed by the other party, and a significance level α and a
sample size n are agreed upon by both parties. The second step
is to compute the relevant statistic based on the null hypothesis
and the random sample of size n. Finally, one determines the
critical region (i.e., the rejection region). The conclusion based on this
approach is as follows:
If the computed statistic falls within the rejection region,
then reject the null hypothesis; otherwise, do not reject the null
hypothesis (the claim).
You may ask: How does one determine the critical value (such as a
z-value) for the rejection interval, for one- and two-tailed
hypotheses? What is the rule?
First you have to choose a significance level α. Knowing that the
null hypothesis is always in "equality" form, the alternative
hypothesis has one of three possible forms: "greater-than",
"less-than", or "not equal to". The first two forms correspond to
one-tail hypotheses while the last one corresponds to a two-tail
hypothesis.
If your alternative is in the form of "greater-than",
then z is the value that gives you an area in the right
tail of the distribution that is equal to α.
If your alternative is in the form of "less-than",
then z is the value that gives you an area in the left
tail of the distribution that is equal to α.
If your alternative is in the form of "not equal to", then
there are two z values, one positive and the other negative.
The positive z is the value that gives you an area of α/2
in the right tail of the distribution, while the negative z is
the value that gives you an area of α/2 in the left tail of the
distribution.
This is a general rule; to implement this process in
determining the critical value for any test of hypothesis, you
must first master reading the statistical tables well because, as
you see, not all tables in your textbook are presented in the same
format.
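For the Z-Table case, this rule can be checked with a short Python sketch (assuming SciPy is available):

from scipy.stats import norm

alpha = 0.05

# "greater-than" alternative: area alpha in the right tail
z_upper = norm.ppf(1 - alpha)            # about 1.645

# "less-than" alternative: area alpha in the left tail
z_lower = norm.ppf(alpha)                # about -1.645

# "not equal to" alternative: area alpha/2 in each tail
z_two_sided = norm.ppf(1 - alpha / 2)    # about 1.96 (use + and - this value)

print(z_upper, z_lower, z_two_sided)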
The Meaning and Interpretation of P-values (what the data say)
The p-value, which directly depends on a given sample, attempts
to provide a measure of the strength of the results of a test of
the null hypothesis, in contrast to a simple reject / do-not-reject
decision in the classical approach to the test of hypotheses. If the null
hypothesis is true and the chance of random variation is the only
reason for sample differences, then the p-value is a quantitative
measure to feed into the decision-making process as evidence.
The following table provides a reasonable interpretation of p-values:
P-value              Interpretation
P ≤ 0.01             very strong evidence against H0
0.01 < P ≤ 0.05      moderate evidence against H0
0.05 < P ≤ 0.10      suggestive evidence against H0
P > 0.10             little or no real evidence against H0
This interpretation is widely accepted, and many scientific
journals routinely publish papers using such an interpretation for
the result of test of hypothesis.
For a fixed sample size, when the number of realizations is
decided in advance, the distribution of p is uniform (assuming the
null hypothesis). We would express this as P(p ≤ x) = x. That
means the criterion p ≤ 0.05 achieves an α of 0.05.
Understand that the distribution of p-values under null hypothesis
H0 is uniform, and thus does not depend on a particular form of
the statistical test. In a statistical hypothesis test, the P value is
the probability of observing a test statistic at least as extreme as
the value actually observed, assuming that the null hypothesis is
true. The value of p is defined with respect to a distribution.
Therefore, we could call it "model-distributional hypothesis"
rather than "the null hypothesis".
In short, the p-value is the probability of the observed (or more
extreme) data computed under the assumption that the null
hypothesis is true. The p-value is determined by the observed
value; however, this makes it difficult to even state the inverse of p.
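The uniformity of p-values under the null hypothesis can be checked by simulation; here is a small Python sketch (assuming NumPy and SciPy are available):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Many two-sample t tests in which H0 is true (both groups come from N(0, 1))
p_values = np.array([
    stats.ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
    for _ in range(5_000)
])

# Under H0 the p-values are roughly uniform, so P(p <= x) is about x
print(np.mean(p_values <= 0.05))     # close to 0.05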
Reference:
Arsham H., Kuiper's P-value as a Measuring Tool and Decision
Procedure for the Goodness-of-fit Test, Journal of Applied
Statistics, Vol. 15, No.3, 131-135, 1988.
Blending the Classical and the P-value Based Approaches
in Test of Hypotheses
A p-value is a measure of how much evidence you have against
the null hypothesis. Notice that the null hypothesis is always in "="
form and does not contain any form of inequality. The smaller
the p-value, the more evidence you have. In this setting the p-value
is based on the null hypothesis and has nothing to do with the
alternative hypothesis, and therefore with the rejection region. In
recent years, some authors have tried to use a mixture of the classical
approach (which is based on the critical value obtained from a given α
and the computed statistic) and the p-value approach.
This is a blend of two different schools of thought. In this setting,
some textbooks compare the p-value with the significance level to
make a decision on a given test of hypothesis. The larger the p-value is
when compared with α (in a one-sided alternative hypothesis,
and with α/2 for a two-sided alternative hypothesis), the less
evidence we have for rejecting the null hypothesis. In such a comparison, if
the p-value is less than some threshold (usually 0.05, sometimes
a bit larger like 0.1 or a bit smaller like 0.01), then you reject the
null hypothesis. The following paragraphs deal with such a
combined approach.
Use of P-value and : In this setting, we must also consider the
alternative hypothesis in drawing the rejection interval (region) .
There is only one p-value to compare with  (or /2). Know that,
for any test of hypothesis, there is only one p-value. The
following outlines the computation of the p-value and the decision
process involving in a given test of hypothesis:
1. P-value for one-sided alternative hypotheses: The p-value
is defined as the area in the right tail of the
distribution if the rejection region is on the right tail; if
the rejection region is on the left tail, then the p-value
is the area in the left tail.
2. P-value for two-sided alternative hypotheses: If the
alternative hypothesis is two-sided (that is, the rejection
regions are both on the left and on the right tails),
then the p-value is the area in the right tail or in the
left tail of the distribution, depending on whether the computed
statistic is closer to the right rejection region or to the left
rejection region. For symmetric densities (such as t)
the left- and right-tail p-values are the same. However,
for non-symmetric densities (such as Chi-square), use
the smaller of the two (this makes the test more
conservative). Notice that, for two-sided
alternative hypotheses, the p-value is never greater
than 0.5.
3. After finding the p-value as defined here, you compare
it with a preset α value for one-sided tests, and
with α/2 for two-sided tests. The larger the p-value is when
compared with α (in a one-sided alternative hypothesis,
and with α/2 for a two-sided alternative hypothesis), the less
evidence we have for rejecting the null hypothesis.
To avoid looking-up the p-values from the limited statistical
tables given in your textbook, most professional statistical
packages such as SPSS provide the two-tail p-value. Based on
where the rejection region is, you must find out what p-value to
use.
Unfortunately, some textbooks contain misleading statements
about the p-value and its applications; for example, in many textbooks
you find the authors doubling the p-value to compare it
with α when dealing with a two-sided test of hypotheses.
One wonders how they do it in the case when "their" p-value
exceeds 0.5. Notice that, while it is correct to compare the p-value
with α for a one-sided test of hypotheses, for two-sided
hypotheses one must compare the p-value with α/2,
NOT α with 2 times the p-value, as unfortunately some textbooks
advise. While the decision is the same, there is a clear
distinction here and an important difference which the careful
reader will note.
When We Should POOL Variance Estimates?
Variance estimates should be pooled only if there is a good
reason for doing so, and then (depending on that reason) the
conclusions might have to be made explicitly conditional on the
validity of the equal-variance model. There are several different
good reasons for pooling:
(a) to get a single stable estimate from several relatively small
samples, where variance fluctuations seem not to be systematic;
or
(b) for convenience, when all the variance estimates are near
enough to equality; or
(c) when there is no choice but to model variance (as in simple
linear regression with no replicated X values), and deviations
from the constant-variance model do not seem systematic; or
(d) when group sizes are large and nearly equal, so that there is
essentially no difference between the pooled and unpooled
estimates of standard errors of pairwise contrasts, and degrees of
freedom are nearly asymptotic.
Note that this last rationale can fall apart for contrasts other than
pairwise ones. One is not really pooling variance in case (d),
rather one is merely taking a shortcut in the computation of
standard errors of pairwise contrasts.
If you calculate the test without the assumption, you have to
determine the degrees of freedom (or let the statistics package
do it). The formula works in such a way that df will be less if the
larger sample variance is in the group with the smaller number of
observations. This is the case in which the two tests will differ
considerably. A study of the formula for the df is most
enlightening and one must understand the correspondence
between the unfortunate design (having the most observations in
the group with little variance) and the low df and accompanying
large t-value.
Example: When doing t tests for differences in means of
populations (a classic independent samples case):
1. Use the standard error formula for differences in
means that does not make any assumption about
equality of population variances [i.e., (VAR1/n1 +
VAR2/n2)½].
2. Use the "regular" way to calculate df in a t test (n11)+(n2-1), n1, n2
2.
3. If the total N is less than 50, and one sample is 1/2 the size
of the other (or less), and the smaller sample has a
standard deviation at least twice as large as the other
sample, then replace #2 with the formula for the adjusted df
value. Otherwise, don't worry about the problem of
having an actual α level that is much different from
what you have set it to be.
In the Statistics With Confidence Section we are concerned with
the construction of confidence interval where the equality of
variances condition is an important issue.
Visit also the Web sites Statistics, Statistical tests.
Remember that in the t tests for differences in means there is a
condition of equal population variances that must be examined.
One way to test for possible differences in variances is to do an F
test. However, the F test is very sensitive to violations of the
normality condition; i.e., if the populations appear not to be normal,
then the F test will tend to over-reject the null hypothesis of no
difference in population variances.
SPSS program for T-test, Two-Population Independent
Means:
$SPSS/OUTPUT=CH2DRUG.OUT
TITLE
' T-TEST, TWO INDEPENDENT MEANS '
DATA LIST
FREE FILE='A.IN'/drug walk
VAR LABELS
DRUG 'DRUG OR PLACEBO'
WALK 'DIFFERENCE IN TWO WALKS'
VALUE LABELS
DRUG 1 'DRUG' 2 'PLACEBO'
T-TEST GROUPS=DRUG(1,2)/VARIABLES=WALK
NPAR TESTS
M-W=WALK BY DRUG(1,2)/
NPAR TESTS
K-S=WALK BY DRUG(1,2)/
NPAR TESTS
K-W=WALK BY DRUG(1,2)/
SAMPLE 10 FROM 20
CONDESCRIPTIVES DRUG(ZDRUG),WALK(ZWALK)
LIST CASE
CASE =10/VARIABLES=DRUG,ZDRUG,WALK,ZWALK
FINISH
SPSS program for T-test, Two-Population Dependent
Means:
$ SPSS/OUTPUT=A.OUT
TITLE
' T-TEST, 2 DEPENDENT MEANS'
FILE HANDLE
MC/NAME='A.IN'
DATA LIST
FILE=MC/YEAR1,YEAR2,(F4.1,1X,F4.1)
VAR LABELS
YEAR1 'AVERAGE LENGTH OF STAY IN YEAR 1'
YEAR2 'AVERAGE LENGTH OF STAY IN YEAR 2'
LIST CASE
CASE=11/VARIABLES=ALL/
T-TEST PAIRS=YEAR1 YEAR2
NONPAR COR YEAR1,YEAR2
NPAR TESTS WILCOXON=YEAR1,YEAR2/
NPAR TESTS SIGN=YEAR1,YEAR2/
NPAR TESTS KENDALL=YEAR1,YEAR2/
FINISH
Visit also the Web site Statistical tests.
Analysis of Variance (ANOVA)
The tests we have learned up to this point allow us to test
hypotheses that examine the difference between only two means.
Analysis of Variance or ANOVA will allow us to test the difference
between 2 or more means. ANOVA does this by examining the
ratio of variability between two conditions and variability within
each condition. For example, say we give a drug that we believe
will improve memory to a group of people and give a placebo to
another group of people. We might measure memory
performance by the number of words recalled from a list we ask
everyone to memorize. A t-test would compare the likelihood of
observing the difference in the mean number of words recalled
for each group. An ANOVA test, on the other hand, would
compare the variability that we observe between the two
conditions to the variability observed within each condition. Recall
that we measure variability as the sum of the squared differences of
each score from the mean. When we actually calculate an ANOVA we
will use a short-cut formula.
Thus, when the variability that we predict (between the two
groups) is much greater than the variability we don't predict
(within each group), then we will conclude that our treatments
produce different results.
An Illustrative Numerical Example for ANOVA
Introducing ANOVA in simplest forms by numerical illustration.
Example: Consider the following (small, integer-valued, indeed for
illustration while saving space) random samples from three
different populations.
With the null hypothesis H0: µ1 = µ2 = µ3, and Ha: at least
two of the means are not equal. At the significance level α =
0.05, the critical value from the F-table is
F 0.05, 2, 12 = 3.89.
        Sample 1   Sample 2   Sample 3
            2          3          5
            3          4          5
            1          3          5
            3          5          3
            1          0          2
SUM        10         15         20
Mean        2          3          4
Demonstrate that SST = SSB + SSW.
Computation of the sample SST: With the grand mean = 3, first
take the difference between each observation and the
grand mean, and then square it for each data point.
        Sample 1   Sample 2   Sample 3
            1          0          4
            0          1          4
            4          0          4
            0          4          0
            4          9          1
SUM         9         14         13
Therefore SST=36 with d.f = 15-1 = 14
Computation of sample SSB:
Second, let all the data in each sample have the same value as
the mean in that sample. This removes any variation WITHIN.
Compute SS differences from the grand mean.
        Sample 1   Sample 2   Sample 3
            1          0          1
            1          0          1
            1          0          1
            1          0          1
            1          0          1
SUM         5          0          5
Therefore SSB = 10, with d.f = 3-1 = 2
Computation of sample SSW:
Third, compute the SS difference within each sample using their
own sample means. This provides SS deviation WITHIN all
samples.
        Sample 1   Sample 2   Sample 3
            0          0          1
            1          1          1
            1          0          1
            1          4          1
            1          9          4
SUM         4         14          8
SSW = 26 with d.f = 3(5-1) = 12
Results are: SST = SSB + SSW, and d.fSST = d.fSSB + d.fSSW, as
expected.
Now, construct the ANOVA table for this numerical example by
plugging the results of your computation in the ANOVA Table.
The ANOVA Table
Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Samples              10                  2               5.00           2.30
Within Samples               26                 12               2.17
Total                        36                 14
Conclusion: There is not enough evidence to reject the null
hypothesis Ho.
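The same F-statistic can be verified with a short Python sketch (assuming SciPy is available), using the three samples above:

from scipy import stats

sample1 = [2, 3, 1, 3, 1]
sample2 = [3, 4, 3, 5, 0]
sample3 = [5, 5, 5, 3, 2]

f_stat, p_value = stats.f_oneway(sample1, sample2, sample3)
print(f_stat, p_value)               # F about 2.31, below the critical value 3.89

print(stats.f.ppf(0.95, 2, 12))      # the critical value F(0.05; 2, 12), about 3.89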
Logic Behind ANOVA: First, let us try to explain the logic and then
illustrate it with a simple example. In performing ANOVA test, we
are trying to determine if a certain number of population means
are equal. To do that, we measure the difference of the sample
means and compare that to the variability within the sample
observations. That is why the test statistic is the ratio of the
between-sample variation (MST) and the within-sample variation
(MSE). If this ratio is close to 1, there is evidence that the
population means are equal.
Here's a hypothetical example: many people believe that men get
paid more in the business world than women, simply because
they are male. To justify or reject such a claim, you could look at
the variation within each group (one group being women's
salaries and the other being men's salaries) and compare that to
the variation between the means of randomly selected samples of
each population. If the variation in the women's salaries is much
larger than the variation between the men's and women's mean
salaries, one could say that because the variation is so large
within the women's group, this may not be a gender-related
problem.
Now, getting back to our numerical example, we notice that:
given the test conclusion and the ANOVA test's conditions, we
may conclude that these three populations are in fact the same
population. Therefore, the ANOVA technique could be used as a
measuring tool and statistical routine for quality control as
described below using our numerical example.
Construction of the Control Chart for the Sample
Means: Under the null hypothesis the ANOVA concludes that µ1
= µ2 = µ3; that is, we have a "hypothetical parent population."
The question is, what is its variance? The estimated variance is
36 / 14 = 2.57. Thus, the estimated standard deviation is 1.60 and the
estimated standard deviation for the means is 1.6 / 5½ = 0.72.
Under the conditions of ANOVA, we can construct a control chart
with the warning limits = 3 ± 2(0.72) and the action limits = 3 ±
3(0.72). The following figure depicts the control chart.
Visit also the Web site Statistical tests.
Bartlett's Test: The Analysis of Variance requires that certain
conditions be met if the statistical tests are to be valid. One of
the conditions we make is that the errors (residuals) all come
from the same normal distribution. Thus we have to test not only
for normality, but we must also test the homogeneity of the
variances. We can do this by subdividing the data into
appropriate groups, computing the variances in each of the
groups, and testing that they are consistent with being sampled
from a normal distribution. The statistical test for homogeneity of
variance is due to Bartlett, and is a modification of the
Neyman-Pearson likelihood ratio test.
Bartlett's Test of Homogeneity of Variances for r independent
samples is a test to check for equal variances between
independent samples of data. The subgroup sizes do not have to
be equal. This test assumes that each sample was randomly and
independently drawn from a normal population.
SPSS program for ANOVA: More Than Two Independent
Means:
$SPSS/OUTPUT=4-1.OUT1
TITLE
'ANALYSIS OF VARIANCE - 1st ITERATION'
DATA LIST
FREE FILE='A.IN'/GP Y
ONEWAY Y BY GP(1,5)/RANGES=DUNCAN
/STATISTICS DESCRIPTIVES HOMOGENEITY
STATISTICS 1
MANOVA Y BY GP(1,5)/PRINT=HOMOGENEITY(BARTLETT)/
NPAR TESTS K-W Y BY GP(1,5)/
FINISH
ANOVA, like the two-population t-test, can go wrong when the equality
of variances condition is not met.
Homogeneity of Variance: Checking the equality of
variances. For 3 or more populations, there is a practical rule
known as the "Rule of 2". According to this rule, one divides the
highest sample variance by the lowest sample variance. Given that
the sample sizes are almost the same, if the value of this ratio is
less than 2, then the variances of the populations are considered
to be almost the same.
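A minimal Python sketch of the Rule of 2 check (assuming NumPy is available; the samples below are hypothetical):

import numpy as np

def rule_of_2(*samples):
    # Ratio of the largest to the smallest sample variance
    variances = [np.var(s, ddof=1) for s in samples]
    return max(variances) / min(variances)

# Hypothetical samples of equal size; compare the printed ratio with 2
print(rule_of_2([25, 21, 20, 18, 23], [17, 14, 13, 16, 19], [22, 19, 24, 18, 21]))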
Example: Consider the following three random samples from
three populations, P1, P2, P3:
  P1   P2   P3
  25   25   17
  21    8   10
  20   18   13
   6    5   22
  25   10   17
  25   19   21
  15   16   24
  23   14   16
  12   14    6
  16   13    6
The summary statistics and the ANOVA table are computed to be:

Variable    N     Mean    St.Dev   SE Mean
P1         10    16.90      7.87      2.49
P2         10    19.80      3.52      1.11
P3         10    11.50      3.81      1.20

Analysis of Variance
Source    DF       SS       MS       F     p-value
Factor     2     79.40    39.70    4.38      0.023
Error     27    244.90     9.07
Total     29    324.30
With an F = 4.38 and a p-value of 0.023, we reject the null hypothesis
at α = 0.05. This is not good news, since ANOVA, like the two-sample
t-test, can go wrong when the equality of variances condition is not
met, and here the Rule of 2 is not satisfied.
Visit also the Web site Statistical tests.
SPSS program for ANOVA: More Than Two Independent
Means:
$SPSS/OUTPUT=A.OUT
TITLE
'ANALYSIS OF VARIANCE - 1st ITERATION'
DATA LIST
FREE FILE='A.IN'/GP Y
ONEWAY Y BY GP(1,5)/RANGES=DUNCAN
STATISTICS 1
MANOVA Y BY GP(1,5)/PRINT=HOMOGENEITY(BARTLETT)/
NPAR TESTS K-W Y BY GP(1,5)/
FINISH
CHI square test: Dependency
$SPSS/OUTPUT=A.OUT
TITLE
'PROBLEM 4.2 CHI SQUARE; TABLE 4.18'
DATA LIST
FREE FILE='A.IN'/FREQ SAMPLE NOM
WEIGHT BY FREQ
VARIABLE LABELS
SAMPLE
'SAMPLE 1 TO 4'
NOM
'LESS OR MORE THAN 8'
VALUE LABELS
SAMPLE 1 'SAMPLE1' 2 'SAMPLE2' 3 'SAMPLE3' 4 'SAMPLE4'/
NOM
1 'LESS THAN 8' 2 'GT/EQ TO 8'/
CROSSTABS TABLES=NOM BY SAMPLE/
STATISTIC 1
FINISH
Non-parametric ANOVA:
$SPSS/OUTPUT=A.OUT
DATA LIST
FREE FILE='A.IN'/GP Y
NPAR TESTS K-W Y BY GP(1,4)
FINISH
Power of a Test
The power of a test is the probability of correctly rejecting a false null
hypothesis. This probability is inversely related to the probability
of making a Type II error. Recall that we choose the probability of
making a Type I error when we set α. If we decrease the
probability of making a Type I error, we increase the probability of
making a Type II error. Therefore, there are basically two errors
possible when conducting a statistical analysis, types I and II:
Type I error - the risk (i.e., probability) of rejecting the null
hypothesis when it is in fact true.
Type II error - the risk of not rejecting the null hypothesis
when it is in fact false.
Power and Alpha (α)
Thus, the probability of correctly retaining a true null hypothesis
has the same relationship to Type I errors as the probability of
correctly rejecting a false null hypothesis has to Type II errors.
Yet, as mentioned above, if we decrease the odds of making one
type of error we increase the odds of making the other. What is the
relationship between Type I and Type II errors? For a fixed sample
size, decreasing the probability of one type of error increases the
probability of the other.
Power and the True Difference Between Population Means
Anytime we test whether a sample differs from a population, or
whether two samples come from two separate populations, there is
the condition that each of the populations being compared has its
own mean and standard deviation (even if we do not know them).
The distance between the two population means affects the power
of our test.
Power as a Function of Sample Size and Variance σ²:
Anything that affects the extent to which the two distributions
overlap (share common values) will increase β, the likelihood of
making a Type II error.
Four factors influence power:
- effect size (for example, the difference between the means)
- the standard error (σ)
- the significance level α
- the number of observations, or the sample size n
A Numerical Example: The following Figure provides an
illustrative numerical example:
Not rejecting the null hypothesis when it is false is defined as a
Type II error, and its probability is denoted by β. In the above
Figure this β region lies to the left of the critical value, below the
statistic's density under the alternative hypothesis Ha. Thus β is
the probability of incorrectly not rejecting a false null hypothesis,
also called a miss. Related to the value of β is the power of the
test: the power is defined as the probability of rejecting the null
hypothesis given that a specific alternative is true, and is
computed as (1 - β).
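To make the relationship among α, β, and power concrete, here is a small Python sketch (my own illustration, not from the course text) for a one-sided, one-sample z-test with known σ; the function name z_test_power and the example numbers are arbitrary.

# Power of a one-sided one-sample z-test -- illustrative sketch.
# H0: mu = mu0  versus  Ha: mu = mu0 + delta (delta > 0), sigma known.
from scipy.stats import norm

def z_test_power(delta, sigma, n, alpha=0.05):
    """Probability of rejecting H0 when the true mean exceeds mu0 by delta."""
    z_crit = norm.ppf(1 - alpha)            # critical value under H0
    shift = delta / (sigma / n ** 0.5)      # how far Ha shifts the test statistic
    return 1 - norm.cdf(z_crit - shift)     # power = 1 - beta

# Example: effect of half a standard deviation, n = 25, alpha = 0.05
print(round(z_test_power(delta=0.5, sigma=1.0, n=25, alpha=0.05), 3))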
A Short Discussion: Consider testing a simple null against a simple
alternative. In the Neyman-Pearson setup, an upper bound is set
for the probability of a Type I error (α), and then it is desirable to
find tests with a low probability of Type II error (β) given this
bound. The usual justification is that "we are more concerned
about a Type I error, so we set an upper limit on the α we can
tolerate." I have seen this sort of reasoning in elementary texts
and also in some advanced ones. It does not seem to make much
sense: when the sample size is large, for most standard tests the
ratio β/α tends to 0. If we care more about Type I error than Type
II error, why should this concern dissipate with increasing sample
size?
This is indeed a drawback of the classical theory of testing
statistical hypotheses. A second drawback is that the choice lies
between only two test decisions: reject the null or accept the null.
It is worth considering approaches that overcome these
deficiencies. This can be done, for example, with the concept of
profile tests at a 'level' a. Neither the Type I nor the Type II error
rate is considered separately; instead the decision is based on the
ratio of the likelihoods under the two hypotheses. For example, we
accept the alternative hypothesis Ha and reject the null H0 if an
event is observed which is at least a-times more probable under
Ha than under H0. Conversely, we accept H0 and reject Ha if an
event is observed which is at least a-times more probable under
H0 than under Ha. This is a symmetric concept which is formulated
within the classical approach. Furthermore, more than two
decisions can also be formulated.
Visit also, the Web site Sample Size Calculations
Parametric vs. Non-Parametric vs. Distribution-free Tests
One must use a statistical technique called nonparametric if it
satisfies at least one of the following five types of criteria:
1. The data entering the analysis are enumerative - that
is, count data representing the number of observations
in each category or cross-category.
2. The data are measured and/or analyzed using a
nominal scale of measurement.
3. The data are measured and/or analyzed using an
ordinal scale of measurement.
4. The inference does not concern a parameter in the
population distribution - as, for example, the
hypothesis that a time-ordered set of observations
exhibits a random pattern.
5. The probability distribution of the statistic upon which
the analysis is based is not dependent upon specific
information or assumptions about the population(s)
from which the sample(s) are drawn, but only on
general assumptions, such as a continuous and/or
symmetric population distribution.
By this definition, the distinction of nonparametric is accorded
either because of the level of measurement used or required for
the analysis, as in types 1 through 3; the type of inference, as in
type 4, or the generality of the assumptions made about the
population distribution, as in type 5.
For example, one may use the Mann-Whitney rank test as a
nonparametric alternative to Student's t-test when one does not
have normally distributed data.
Mann-Whitney: To be used with two independent groups
(analogous to the independent groups t-test)
Wilcoxon: To be used with two related (i.e., matched or repeated)
groups (analogous to the related samples t-test)
Kruskal-Wallis: To be used with two or more independent
groups (analogous to the single-factor between-subjects ANOVA)
Friedman: To be used with two or more related groups
(analogous to the single-factor within-subjects ANOVA)
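For illustration only (the data below are hypothetical), all four of these tests happen to be available in scipy.stats:

# Nonparametric tests in scipy -- illustrative calls with hypothetical data.
from scipy import stats

group1 = [12, 15, 14, 10, 18, 20, 11]     # hypothetical independent samples
group2 = [22, 19, 24, 17, 25, 21, 23]
group3 = [13, 16, 15, 14, 19, 18, 17]

before = [5.1, 6.0, 4.8, 7.2, 6.5, 5.9]   # hypothetical paired (repeated) data
after = [5.6, 6.4, 5.0, 7.1, 7.0, 6.3]

print(stats.mannwhitneyu(group1, group2))               # two independent groups
print(stats.wilcoxon(before, after))                    # two related (paired) groups
print(stats.kruskal(group1, group2, group3))            # 2+ independent groups
print(stats.friedmanchisquare(group1, group2, group3))  # 2+ related groups (equal n)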
Non-parametric vs. Distribution-free Tests:
Non-parametric tests are those used when some specific
conditions for the ordinary tests are violated.
Distribution-free tests are those for which the procedure is valid
for all shapes of the population distribution.
For example, the chi-square test concerning the variance of a
given population is parametric since this test requires that the
population distribution be normal. The chi-square test of
independence does not assume normality, or even that the data
are numerical. The Kolmogorov-Smirnov goodness-of-fit test is a
distribution-free test which can be applied to test any distribution.
Pearson's and Spearman's Correlations
There are measures that describe the degree to which two
variables are linearly related. For most of these measures, the
correlation is expressed as a coefficient that ranges from -1.00 to
+1.00. A value of +1.00 or -1.00 indicates a perfect linear
relationship, such that knowing the value of one variable allows
perfect prediction of the value of the other; a value of 0.00
indicates no predictability by a linear model. Negative values
indicate that when the value of one variable is high the other
tends to be low (and vice versa), and positive values indicate that
when one variable is high the other tends to be high as well. In
this respect the correlation coefficient has an interpretation
loosely similar to that of the derivative you learned in calculus (a
deterministic course): it describes how one variable moves with
another.
Pearson's product-moment correlation is an index of the linear
relationship between two variables.
Formulas:
xbar = Σxi / n   (this is just the mean of the x values)
ybar = Σyi / n   (this is just the mean of the y values)
Sxx = Σ(xi - xbar)² = Σxi² - (Σxi)²/n
Syy = Σ(yi - ybar)² = Σyi² - (Σyi)²/n
Sxy = Σ(xi - xbar)(yi - ybar) = Σxi·yi - (Σxi)(Σyi)/n
The Pearson correlation is
r = Sxy / (Sxx·Syy)^0.5
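As a quick check of these formulas, here is a small Python sketch (the x and y values are made up for demonstration):

# Pearson's r computed from Sxx, Syy, Sxy -- illustrative sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical data
y = [2.1, 2.9, 3.2, 4.8, 5.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
r = Sxy / (Sxx * Syy) ** 0.5
print(round(r, 4))                      # close to +1 for these nearly linear data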
If there is a positive relationship, then an individual whose score
on variable x is above the mean of x is likely to have a score on
variable y that is above the mean of y, and vice versa. With a
negative relationship, an x score above the mean of x tends to go
with a y score below the mean of y. The coefficient is a measure
of the relationship between variables and an index of the
proportion of individual differences in one variable that can be
associated with the individual differences in another variable. In
essence, the product-moment correlation coefficient is the mean
of the cross-products of standard scores. If you have three values
of r, say .40, .60, and .80, you cannot say that the difference
between r = .40 and r = .60 is the same as the difference between
r = .60 and r = .80, or that r = .80 is twice as large as r = .40,
because the scale of values for the correlation coefficient is not
interval or ratio, but ordinal. All you can say is that, for example,
a correlation coefficient of +.80 indicates a high positive linear
relationship and a correlation coefficient of +.40 indicates a
somewhat lower positive linear relationship. The correlation can
tell us how much of the total variance of one variable can be
associated with the variance of another variable: the square of
the correlation coefficient equals the proportion of the total
variance in y that can be associated with the variance in x.
However, in engineering/manufacturing/development work, an r of
0.7 is often considered weak, and +0.9 is desirable. When the
correlation coefficient is around +0.9, it is time to make a
prediction and run confirmation trial(s). Note that the correlation
coefficient captures only linear association: if the data form a
symmetric quadratic hump, a linear correlation of x and y will
produce an r of 0! So one must be careful and look at the data.
The Spearman rank-order correlation coefficient is used as a
nonparametric version of Pearson's. It is expressed as:
rs = 1 - 6Σd² / [n(n² - 1)],
where d is the difference between the ranks of each X and Y pair.
The Spearman correlation coefficient can be derived algebraically
from the Pearson correlation formula by making use of sums of
series. The Pearson formula contains the expressions Σx(i), Σy(i),
Σx(i)², Σy(i)², and Σx(i)·y(i). In the Spearman case the x(i)'s and
y(i)'s are ranks, so the sums of the ranks and the sums of the
squared ranks are entirely determined by the number of cases N
(provided there are no ties):
Σi = N(N+1)/2,   Σi² = N(N+1)(2N+1)/6.
The Spearman formula is then equal to:
rs = [12P - 3N(N+1)²] / [N(N² - 1)],
where P is the sum of the products of each pair of ranks, Σx(i)·y(i).
This reduces to:
rs = 1 - 6Σd² / [N(N² - 1)],
where d is the difference between the ranks of each x(i) and y(i) pair.
An important consequence of this is that if you enter ranks into a
Pearson formula, you get precisely the same numerical value as
that obtained by entering the ranks into the Spearman formula.
This comes as a bit of a shock to those who like to adopt
simplistic slogans such as "Pearson is for interval data, Spearman
is for ranked data". Spearman doesn't work too well if there are
lots of tied ranks. That's because the formula for calculating the
sums of squared ranks no longer holds true. If one has lots of
tied ranks, use the Pearson formula.
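The equivalence is easy to check numerically; here is a hedged Python sketch with made-up, untied data:

# Pearson applied to ranks equals Spearman (no ties) -- illustrative sketch.
from scipy import stats

x = [3.1, 5.4, 1.2, 8.8, 6.0, 2.5]      # hypothetical data, no tied values
y = [10.0, 14.2, 7.5, 20.1, 13.8, 9.9]

rx = stats.rankdata(x)
ry = stats.rankdata(y)

pearson_on_ranks, _ = stats.pearsonr(rx, ry)
spearman, _ = stats.spearmanr(x, y)
print(round(pearson_on_ranks, 6), round(spearman, 6))   # identical values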
Visit also the Web sites: Correlation Pearsons r, Spearman's Rank
Correlation
Independence vs. Correlated
In the sense that it is used in statistics, i.e., as an assumption in
applying a statistical test, a random sample from the entire
population provides a set of random variables X1,...., Xn that are
identically distributed and mutually independent (mutually
independent is stronger than pairwise independence). The
random variables are mutually independent if their joint
distribution is equal to the product of their marginal distributions.
In the case of joint normality, independence is equivalent to zero
correlation but not in general. Independence implies zero
correlation (provided the random variables have second moments),
but not conversely. Note that not all random variables have a first
moment, let alone a second moment, and hence a correlation
coefficient may not even exist.
However if the correlation coefficient of two random variables
(theoretical) is not zero then the random variables are not
independent.
Correlation, and Level of Significance
It is intuitive that with very few data points, a high correlation
may not be statistically significant. You may see statements such
as "the correlation between x and y is significant at the α = .005
level" and "the correlation is significant at the α = .05 level." The
question is: how are these numbers determined?
For a simple correlation, you can view the test as a test on r².
The formula for F, where F is the square of the t-statistic, becomes
F = (n - 2) r² / (1 - r²),   n ≥ 2.
As you can see, this is monotonic in r² and in n. If the degrees of
freedom (n - 2) is large, then the F-test is very closely
approximated by the chi-square with one degree of freedom, so
that a value of 3.84 is what is needed to reach the α = 5% level.
The cutoff value of F changes little enough that the same value,
3.84, gives a pretty good estimate even when n is small. You can
look up an F-table or a chi-square table to see the cutoff values
needed for other α levels.
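A small Python sketch of this calculation (the values of r and n below are arbitrary illustrative choices):

# Significance of a simple correlation via F = (n-2) r^2 / (1 - r^2) -- sketch.
from scipy import stats

r, n = 0.45, 30                          # hypothetical correlation and sample size
F = (n - 2) * r ** 2 / (1 - r ** 2)
p_value = stats.f.sf(F, 1, n - 2)        # upper-tail area of F(1, n-2)
print("F = %.3f, p-value = %.4f" % (F, p_value))
# Equivalent t-statistic: t = r * sqrt((n-2)/(1-r^2)), with t^2 = F.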
Resampling Techniques: Jackknifing, and Bootstrapping
Statistical inference techniques that do not require distributional
assumptions about the statistics involved. These modern nonparametric methods use large amounts of computation to explore
the empirical variability of a statistic, rather than making a priori
assumptions about this variability, as is done in the traditional
parametric t- and z- tests. Monte Carlo simulation allows for the
evaluation of the behavior of a statistic when its mathematical
analysis is intractable. Bootstrapping and jackknifing allow
inferences to be made from a sample when traditional parametric
inference fails. These techniques are especially useful to deal with
statistical problems such as small sample size, statistics with no
well-developed distributional theory, and violations of the
conditions for parametric inference. Both are computer intensive.
Bootstrapping involves taking repeated samples of size n, with
replacement, from the original sample. Jackknifing involves
systematically carrying out n steps, omitting one case from the
sample at a time (or, more generally, n/k steps omitting k cases at
a time); computations that compare "included" versus "omitted"
results can be used, especially, to reduce the bias of estimation.
Bootstrapping thus means you take repeated samples from a
sample and then make statements about the population; it entails
sampling with replacement from the sample. Both techniques have
applications in reducing bias in estimation.
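A minimal Python sketch of both ideas (the data, the number of bootstrap replications, and the choice of the mean as the statistic are all arbitrary illustrative choices):

# Bootstrap and jackknife for the mean -- illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([12.0, 15.5, 9.8, 20.1, 13.4, 17.9, 11.2, 16.6])  # hypothetical

# Bootstrap: resample with replacement from the sample, many times.
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(2000)]
print("bootstrap 95% CI for the mean:",
      np.percentile(boot_means, [2.5, 97.5]))

# Jackknife: leave one observation out at a time.
jack_means = [np.delete(sample, i).mean() for i in range(sample.size)]
print("jackknife estimate of the mean:", np.mean(jack_means))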
Resampling -- including the bootstrap, permutation, and other
non-parametric tests -- is a method for hypothesis tests,
confidence limits, and other applied problems in statistics and
probability. It involves no formulas or tables, and essentially the
same resampling procedure applies to all such tests.
Following the first publication of the general technique (and the
bootstrap) in 1969 by Julian Simon and subsequent independent
development by Bradley Efron, resampling has become an
alternative approach for testing hypotheses.
There are other findings, "The bootstrap started out as a good
notion in that it presented, in theory, an elegant statistical
procedure that was free of distributional conditions.
Unfortunately, it doesn't work very well, and the attempts to
modify it make it more complicated and more confusing than the
parametric procedures that it was meant to replace."
For the pros and cons of the bootstrap, read
Young G., Bootstrap: More than a Stab in the Dark?, Statistical
Science, 9, 382-395, 1994.
Visit also the Web sites
Resampling, and
Bootstrapping with SAS.
Sampling Methods
From the food you eat to the TV you watch, from political
elections to school board actions, much of your life is regulated
by the results of sample surveys. In the information age of today
and tomorrow, it is increasingly important that sample survey
design and analysis be understood by many so as to produce
good data for decision making and to recognize questionable data
when it arises. Relevant topics are: Simple Random Sampling,
Stratified Random Sampling, Cluster Sampling, Systematic
Sampling, Ratio and Regression Estimation, Estimating a
Population Size, Sampling a Continuum of Time, Area or Volume,
Questionnaire Design, Errors in Surveys.
A sample is a group of units selected from a larger group (the
population). By studying the sample it is hoped to draw valid
conclusions about the larger group.
A sample is generally selected for study because the population is
too large to study in its entirety. The sample should be
representative of the general population. This is often best
achieved by random sampling. Also, before collecting the sample,
it is important that the researcher carefully and completely
defines the population, including a description of the members to
be included.
Random sampling of size n from a population of size N: an
unbiased estimate of the variance of the sample mean xbar is
Var(xbar) = S²(1 - n/N)/n,
where n/N is the sampling fraction. For a sampling fraction of less
than 10% the finite population correction factor (N - n)/(N - 1) is
almost 1. The population total T is estimated by N·xbar; its
variance is N²·Var(xbar).
For 0-1 (binary) variables, the variance of the sample proportion
Pbar is
S² = Pbar·(1 - Pbar)·(1 - n/N)/(n - 1).
For the ratio r = Σxi/Σyi = xbar/ybar, the variance of r is
[(N - n)(r²·S²x + S²y - 2r·Cov(x, y))] / [n(N - 1)].
Stratified Sampling: the stratified mean is
xbars = Σ Wt·xbart,  over t = 1, 2, ..., L (strata),
where Wt = Nt/N and xbart = Σxit/nt. Its variance is:
Σ W²t (Nt - nt) S²t / [nt(Nt - 1)].
The population total T is estimated by N·xbars; its variance is
Σ N²t (Nt - nt) S²t / [nt(Nt - 1)].
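A hedged Python sketch of these stratified-sampling formulas; the stratum sizes, sample sizes, means, and variances below are entirely hypothetical:

# Stratified sampling estimates -- illustrative sketch with hypothetical strata.
# Each tuple is (N_t, n_t, stratum sample mean, stratum sample variance S_t^2).
strata = [(500, 50, 12.0, 4.0),
          (300, 30, 20.0, 9.0),
          (200, 20, 35.0, 16.0)]

N = sum(Nt for Nt, _, _, _ in strata)
W = [Nt / N for Nt, _, _, _ in strata]                     # stratum weights Wt = Nt/N

xbar_s = sum(w * m for w, (_, _, m, _) in zip(W, strata))  # stratified mean
var_s = sum(w ** 2 * (Nt - nt) * S2 / (nt * (Nt - 1))
            for w, (Nt, nt, _, S2) in zip(W, strata))      # its variance

print("stratified mean =", round(xbar_s, 3))
print("variance of the stratified mean =", round(var_s, 5))
print("estimated population total =", round(N * xbar_s, 1))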
Since the survey usually measures several attributes for each
population member, it is impossible to find an allocation that is
simultaneously optimal for each of those variables. Therefore, in
such a case we use the popular method of allocation which uses
the same sampling fraction in each stratum (proportional
allocation). This yields the optimal allocation when the variations
within the strata are all the same.
Determination of the sample size n for binary data: n is the
smallest integer greater than or equal to
[t² N p(1 - p)] / [t² p(1 - p) + e² (N - 1)],
with N being the total number of cases, n the sample size, e the
acceptable (expected) error, t the value taken from the
t-distribution corresponding to a certain confidence level, and p
the probability of the event.
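A short Python sketch of this sample-size formula (the helper name sample_size and the example numbers are my own; t = 1.96 is used as the usual normal approximation for roughly 95% confidence):

# Sample size for binary data -- illustrative sketch of the formula above.
import math

def sample_size(N, p, error, t=1.96):
    """Smallest n satisfying the formula; t = 1.96 for about 95% confidence."""
    n = (t ** 2 * N * p * (1 - p)) / (t ** 2 * p * (1 - p) + error ** 2 * (N - 1))
    return math.ceil(n)

# Example: population of 5000, p = 0.5 (worst case), 5% acceptable error
print(sample_size(N=5000, p=0.5, error=0.05))   # roughly 357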
Cross-Sectional Sampling: A cross-sectional study is the
observation of a defined population at a single point in time or
during a single time interval; exposure and outcome are
determined simultaneously.
For more information on sampling methods, visit the Web sites :
Sampling
Sampling In Research
Sampling, Questionnaire Distribution and Interviewing
SRMSNET: An Electronic Bulletin Board for Survey Researchers
Sampling and Surveying Handbook
Warranties: Statistical Planning and Analysis
In today's marketplace, a warranty has become an increasingly
important component of the product package, and most consumer
and industrial products are sold with a warranty. The warranty
serves many purposes. It provides protection for both buyer and
serves many purposes. It provides protection for both buyer and
manufacturer. For a manufacturer, a warranty also serves to
communicate information about product quality, and, as such,
may be used as a very effective marketing tool.
Warranty decisions involve both technical and commercial
considerations. Because of the possible financial consequences of
these decisions, effective warranty management is critical for the
financial success of a manufacturing firm. This requires that
management at all levels be aware of the concept, role, uses and
cost and design implications of warranty.
The aim is to understand: the concept of warranty and its uses;
warranty policy alternatives; the consumer and manufacturer
perspectives with regard to warranties; the commercial and
technical aspects of warranty and their interaction; strategic
warranty management; methods for warranty cost prediction; and
warranty administration.
References and Further Readings:
Brennan J., Warranties: Planning, Analysis, and Implementation,
McGraw Hill, New York, 1994.
Factor Analysis
Factor analysis is a technique for data reduction; that is, it
explains the variation in a collection of continuous variables by a
smaller number of underlying dimensions (called factors). Common
factor analysis can also be used to form index numbers or factor
scores by using the correlation or covariance matrix. The main
problem with the factor analysis concept is that interpretation of
the results is very subjective.
Delphi Analysis
Delphi analysis is used in the decision-making process, in
particular in forecasting: several "experts" sit together and try to
reach a compromise on something about which they cannot agree.
Reference:
Delbecq, A., Group Techniques for Program Planning, Scott
Foresman, 1975.
Binomial Distribution
Application: Gives probability of exact number of successes in n
independent trials, when probability of success p on single trial is
a constant. Used frequently in quality control, reliability, survey
sampling, and other industrial problems.
Example: What is the probability of 7 or more "heads" in 10
tosses of a fair coin?
Know that the binomial distribution must satisfy the following five
requirements: each trial can have only two outcomes (or outcomes
that can be reduced to two categories, called pass and fail); there
must be a fixed number of trials; the outcome of each trial must
be independent; the probability of success must remain constant
from trial to trial; and the outcome of interest is the number of
successes.
Comments: Can sometimes be approximated by normal or by
Poisson distribution.
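The coin-toss example above can be checked with a couple of lines of Python (scipy is used here purely for convenience):

# P(7 or more heads in 10 tosses of a fair coin) -- illustrative sketch.
from scipy import stats

p_seven_or_more = stats.binom.sf(6, n=10, p=0.5)   # P(X >= 7) = P(X > 6)
print(round(p_seven_or_more, 4))                   # about 0.1719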
Poisson Distribution
Application: Gives probability of exactly x independent
occurrences during a given period of time if events take place
independently and at a constant rate. May also represent number
of occurrences over constant areas or volumes. Used frequently
in quality control, reliability, queuing theory, and so on.
Example: Used to represent distribution of number of defects in a
piece of material, customer arrivals, insurance claims, incoming
telephone calls, alpha particles emitted, and so on.
Comments: Frequently used as approximation to binomial
distribution.
Exponential Distribution
Application: Gives distribution of time between independent
events occurring at a constant rate. Equivalently, probability
distribution of life, presuming constant conditional failure (or
hazard) rate. Consequently, applicable in many, but not all
reliability situations.
Example: Distribution of time between arrival of particles at a
counter. Also life distribution of complex non redundant systems,
and usage life of some components - in particular, when these
are exposed to initial burn-in, and preventive maintenance
eliminates parts before wear-out.
Comments: Special case of both Weibull and gamma
distributions.
Uniform Distribution
Application: Gives probability that observation will occur within a
particular interval when probability of occurrence within that
interval is directly proportional to interval length.
Example: Used to generate random values.
Comments: Special case of beta distribution.
The density of the geometric mean of n independent Uniform(0,1)
random variables is:
f(x) = n x^(n-1) (log[1/x^n])^(n-1) / (n - 1)!.
zL = [U^L - (1 - U)^L] / L, where U is Uniform(0,1), is said to have
Tukey's symmetrical lambda distribution.
Student's t-Distributions
The t distributions were discovered in 1908 by William
Gosset who was a chemist and a statistician employed by the
Guinness brewing company. He considered himself a student still
learning statistics, so he signed his papers with the pseudonym
"Student"; or perhaps he used a pseudonym because of "trade
secrets" restrictions imposed by Guinness.
Note that there is not a single t distribution; it is a class of
distributions. When we speak of a specific t distribution, we have
to specify the degrees of freedom. The t density curves are
symmetric and bell-shaped like the normal distribution and have
their peak at 0. However, the spread is more than that of the
standard normal distribution. The larger the degrees of freedom,
the closer the t-density is to the normal density.
Critical Values for the t-Distribution
Annotated Review of Statistical Tools on the
Internet
Visit also the Web site Computational Tools and Demos on the
Internet
Introduction: Modern, web-based learning and computing
provides the means for fundamentally changing the way in which
statistical instruction is delivered to students. Multimedia learning
resources combined with CD-ROMs and workbooks attempt to
explore the essential concepts of a course by using the full
pedagogical power of multimedia. Many Web sites have nice
features such as interactive examples, animation, video,
narrative, and written text. These web sites are designed to
provide students with a "self-help" learning resource to
complement the traditional textbook.
In a few pilot studies, [Mann, B. (1997) Evaluation of
Presentation modalities in a hypermedia system, Computers &
Education, 28, 133-143. Ward M. and D. Newlands (1998) Use of
the Web in undergraduate teaching, Computers & Education, 31,
171-184.] compared the relative effectiveness of three versions
of hypermedia systems, namely, Text, Sound/Text, and Sound.
The results indicate that those working with Sound could focus
their attention on the critical information. Those working with the
Text and Sound/Text version however, did not learn as much and
stated their displeasure with reading so much text from the
screen. Based on this study, it is clear at least at this time that
such web-based innovations cannot serve as an adequate
substitute for face-to-face live instruction [See also Mcintyre D.,
and F. Wolff, An experiment with WWW interactive learning in
university education, Computers & Education, 31, 255-264,
1998].
Online learning education does for knowledge what just-in-time
delivery does for manufacturing: It delivers the right tools and
parts when you need them.
The Java applets are probably the most phenomenal way of
simplifying various concepts by way of interactive processes.
These applets help bring into life every concept from central limit
theorem to interactive random games and multimedia
applications.
The Flashlight Project develops survey items, interview plans,
cost analysis methods, and other procedures that institutions can
use to monitor the success of educational strategies that use
technology.
Read also the critical notice: Are we blessed with the emergence
of the WWW? Edited by B. Khan and R. Goodfellow, Computers and
Education, 30(1-2), 131-136, 1998.
The following compilation summarizes currently available public
domain web sites offering statistical instructional material. While
some sites may have been missed, I feel that this listing is fully
representative. I would welcome information regarding any
further sites for inclusion, E-mail.
Academic Assistance Access It is a free tutoring service designed
to offer assistance to your statistics questions.
Basic Definitions, by V. Easton and J. McColl, Contains glossary of
basic terms and concepts.
Basic principles of statistical analysis, by Bob Baker, Basics
concepts of statistical models, Mixed model, Choosing between
fixed and random effects, Estimating variances and covariance,
Estimating fixed effects, Predicting random effects, Inference
space, Conclusions, Some references.
Briefbook of Data Analysis, has many contributors. The most
comprehensive dictionary of statistics. Includes ANOVA, Analysis
of Variance, Attenuation, Average, Bayes Theorem, Bayesian
Statistics, Beta Distribution, Bias, Binomial Distribution, Bivariate
Normal Distribution, Bootstrap, Cauchy Distribution, Central Limit
Theorem, Bootstrap, Chi-square Distribution, Composite
Hypothesis, Confidence Level, Correlation Coefficient, Covariance,
Cramer-Rao Inequality, Cramer-Smirnov-Von Mises Test, Degrees
of Freedom, Discriminant Analysis, Estimator, Exponential
Distribution, F-Distribution, F-test, Factor Analysis, Fitting,
Geometric Mean, Goodness-of-fit Test, Histogram, Importance
Sampling, Jackknife, Kolmogorov Test, Kurtosis, Least Squares,
Likelihood, Linear Regression, Maximum Likelihood Method,
Mean, Median, Mode, Moment, Monte Carlo Methods, Multinomial
Distribution, Multivariate Normal, Distribution Normal
Distribution, Outlier, Poisson Distribution, Principal Component
Analysis, Probability, Probability Calculus, Random Numbers,
Random Variable, Regression Analysis, Residuals, Runs Test,
Sample Mean, Sample Variance, Sampling from a Probability
Density Function, Scatter Diagram, Significance of Test,
Skewness, Standard Deviation, Stratified Sampling, Student's t
Distribution, Student's test, Training Sample, Transformation of
Random Variables, Trimming, Truly Random Numbers, Uniform
Distribution, Validation Sample Variance, Weighted Mean, etc.,
References, and Index.
Calculus Applied to Probability and Statistics for Liberal Arts and
Business Majors, by Stefan Waner and Steven Costenoble,
contains: Continuous Random Variables and Histograms;
Probability Density Functions; Mean, Median, Variance and
Standard Deviation.
Computing Studio, by John Behrens, Each page is a data entry
form that will allow you to type data in and will write a page that
walks you through the steps of computing your statistic: Mean,
Median, Quartiles, Variance of a population, Sample variance for
estimating a population variance, Standard-deviation of a
population, Sample standard-deviation used to estimate a
population standard-deviation, Covariance for a sample, Pearson
Product-Moment Correlation Coefficient (r), Slope of a regression
line, Sums-of-squares for simple regression.
CTI Statistics, by Stuart Young, CTI Statistics is a statistical
resource center. Here you will find software reviews and articles,
a searchable guide to software for teaching, a diary of
forthcoming statistical events worldwide, a CBL software
developers' forum, mailing list information, contact addresses,
and links to a wealth of statistical resources worldwide.
Data and Story Library, It is an online library of datafiles and
stories that illustrate the use of basic statistics methods.
DAU Stat Refresher, has many contributors. Tutorial, Tests,
Probability, Random Variables, Expectations, Distributions, Data
Analysis, Linear Regression, Multiple Regression, Moving
Averages, Exponential Smoothing, Clustering Algorithms, etc.
Descriptive Statistics Computation, Enter a column of your data
so that the mean, standard deviation, etc. will be calculated.
Elementary Statistics Interactive, by Wlodzimierz Bryc,
Interactive exercises, including links to further reading materials,
includes on-line tests.
Elementary Statistics, by J. McDowell. Frequency distributions,
Statistical moments, Standard scores and the standard normal
distribution, Correlation and regression, Probability, Sampling
Theory, Inference: One Sample, Two Samples.
Evaluation of Intelligent Systems, by Paul Cohen (Editor-in-Chief), covers: Exploratory data analysis, Hypothesis testing,
Modeling, and Statistical terminology. It also serves as
community-building function.
First Bayes, by Tony O'Hagan, First Bayes is a teaching package
for elementary Bayesian Statistics.
Fisher's Exact Test, by Øyvind Langsrud, To categorical variables
with two levels.
Gallery of Statistics Jokes, by Gary Ramseyer, Collection of
statistical jokes.
Glossary of Statistical Terms, by D. Hoffman, Glossary of major
keywords and phrases in suggested learning order is provided.
Graphing Studio, Data entry forms to produce plots for two-dimensional and three-dimensional scatterplots.
HyperStat Online, by David Lane. It is an introductory-level
statistics book.
Interactive Statistics, Contains some nice Java applets: guessing
correlations, scatterplots, Data Applet, etc.
Interactive Statistics Page, by John Pezzullo, Web pages that
perform mostly needed statistical calculations. A complete
collection on: Calculators, Tables, Descriptives, Comparisons,
Cross-Tabs, Regression, Other Tests, Power&Size, Specialized,
Textbooks, Other Stats Pages.
Internet Glossary of Statistical Terms, by By H. Hoffman, The
contents are arranged in suggested learning order and
alphabetical order, from Alpha to Z score.
Internet Project, by Neil Weiss, Helps students understand
statistics by analyzing real data and interacting with graphical
demonstrations of statistical concepts.
Introduction to Descriptive Statistics, by Jay Hill, Provides
everyday's applications of Mode, Median, Mean, Central
Tendency, Variation, Range, Variance, and Standard Deviation.
Introduction to Quantitative Methods, by Gene Glass. A basic
statistics course in the College of Education at Arizona State
University.
Introductory Statistics Demonstrations, Topics such as Variance
and Standard Deviation, Z-Scores, Z-Scores and Probability,
Sampling Distributions, Standard Error, Standard Error and Z-score Hypothesis Testing, Confidence Intervals, and Power.
Introductory Statistics: Concepts, Models, and Applications, by
David Stockburger. It represents over twenty years of experience
in teaching the material contained therein by the author. The high
price of textbooks and a desire to customize course material for
his own needs caused him to write this material. It contains
projects, interactive exercises, animated examples of the use of
statistical packages, and inclusion of statistical packages.
The Introductory Statistics Course: A New Approach, by D.
Macnaughton. Students frequently view statistics as the worst
course taken in college. To address that problem, this paper
proposes five concepts for discussion at the beginning of an
introductory course: (1) entities, (2) properties of entities, (3) a
goal of science: to predict and control the values of properties of
entities, (4) relationships between properties of entities as a key
to prediction and control, and (5) statistical techniques for
studying relationships between properties of entities as a means
to prediction and control. It is argued that the proposed approach
gives students a lasting appreciation of the vital role of the field
of statistics in scientific research. Successful testing of the
approach in three courses is summarized.
Java Applets, by many contributors. Distributions (Histograms,
Normal Approximation to Binomial, Normal Density, The T
distribution, Area Under Normal Curves, Z Scores & the Normal
Distribution. Probability & Stochastic Processes (Binomial
Probabilities, Brownian Motion, Central Limit Theorem, A Gamma
Process, Let's Make a Deal Game. Statistics (Guide to basic stats
labs, ANOVA, Confidence Intervals, Regression, Spearman's rank
correlation, T-test, Simple Least-Squares Regression, and
Discriminant Analysis.
The Knowledge Base, by Bill Trochim, The Knowledge Base is an
online textbook for an introductory course in research methods.
Lies, Damn Lies, and Psychology, by David Howell, This is the
homepage for a course modeled after the Chance course.
Math Titles: Full List of Math Lesson Titles, by University of
Illinois, Lessons on Statistics and Probability topics among others.
Nonparametric Statistical Methods, by Anthony Rossini, almost all
widely used nonparametric tests are presented.
On-Line Statistics, by Ronny Richardson, contains the contents of
his lecture notes on: Descriptive Statistics, Probability, Random
Variables, The Normal Distribution, Create Your Own Normal
Table, Sampling and Sampling Distributions, Confidence
Intervals, Hypothesis Testing, Linear Regression Correlation Using
Excel.
Online Statistical Textbooks, by Haiko Lüpsen.
Power Analysis for ANOVA Designs, by Michael Friendly, It runs a
SAS program that calculates power or sample size needed to
attain a given power for one effect in a factorial ANOVA design.
The program is based on specifying Effect Size in terms of the
range of treatment means, and calculating the minimum power,
or maximum required sample size.
Practice Questions for Business Statistics, by Brian Schott, Over
800 statistics quiz questions for introduction to business
statistics.
Prentice Hall Statistics, This site contains full description of the
materials covers in the following books coauthored by Prof.
McClave: A First Course In Statistics, Statistics, Statistics For
Business And Economics, A First Course In Business Statistics.
Probability Lessons, Interactive probability lessons for problem solving and activities.
Probability Theory: The logic of Science, by E. Jaynes. Plausible
Reasoning, The Cox Theorems, Elementary Sampling Theory,
Elementary Hypothesis Testing, Queer Uses for Probability
Theory, Elementary Parameter Estimation, The Central Gaussian,
or Normal, Distribution, Sufficiency, Ancillarity, and All That,
Repetitive Experiments: Probability and Frequency, Physics of
``Random Experiments'', The Entropy Principle, Ignorance Priors
-- Transformation Groups, Decision Theory: Historical Survey,
Simple Applications of Decision Theory, Paradoxes of Probability
Theory, Orthodox Statistics: Historical Background, Principles and
Pathology of Orthodox Statistics, The A --Distribution and Rule of
Succession. Physical Measurements, Regression and Linear
Models, Estimation with Cauchy and t--Distributions, Time Series
Analysis and Auto regressive Models, Spectrum / Shape Analysis,
Model Comparison and Robustness, Image Reconstruction,
Nationalization Theory, Communication Theory, Optimal Antenna
and Filter Design, Statistical Mechanics, Conclusions Other
Approaches to Probability Theory, Formalities and Mathematical
Style, Convolutions and Cumulants, Circlet Integrals and
Generating Functions, The Binomial -- Gaussian Hierarchy of
Distributions, Fourier Analysis, Infinite Series, Matrix Analysis and
Computation, Computer Programs.
Probability and Statistics, by Beth Chance. Covers the
introductory materials supporting the Moore and McCabe textbook,
Introduction to the Practice of Statistics, W. H. Freeman, 1999.
Rice Virtual Lab in Statistics, by David Lane et al., An introductory
statistics course which uses Java script Monte Carlo.
Sampling distribution demo, by David Lane, Applet estimates and
plots the sampling distribution of various statistics given
population distribution, sample size, and statistic.
Selecting Statistics, Cornell University. Answer the questions
therein correctly, then Selecting Statistics leads you to an
appropriate statistical test for your data.
Simple Regression, Enter pairs of data so that a line can be fit to
the data.
Scatterplot, by John Behrens, Provides a two-dimensional
scatterplot.
Selecting Statistics, by Bill Trochim, An expert system for
statistical procedures selection.
Some experimental pages for teaching statistics, by Juha
Puranen, contains some - different methods for visualizing
statistical phenomena, such as Power and Box-Cox
transformations.
Statlets: Download Academic Version (Free), Contains Java
Applets for Plots, Summarize, One and two-Sample Analysis,
Analysis of Variance, Regression Analysis, Time Series Analysis,
Rates and Proportions, and Quality Control.
Statistical Analysis Tools, Part of Computation Tools of Hyperstat.
Statistical Demos and Monte Carlo, Provides demos for Sampling
Distribution Simulation, Normal Approximation to the Binomial
Distribution, and A "Small" Effect Size Can Make a Large
Difference.
Statistical Education Resource Kit, by Laura Simon, This web page
contains a collection of resources used by faculty in Penn State's
Department of Statistics in teaching a broad range of statistics
courses.
Statistical Instruction Internet Palette, For teaching and learning
statistics, with extensive computational capability.
Statistical Terms, by The Animated Software Company,
Definitions for terms via a standard alphabetical listing.
Statiscope, by Mikael Bonnier, Interactive environment (Java
applet) for summarizing data and descriptive statistical charts.
Statistical Calculators, Presided at UCLA, Material here includes:
Power Calculator, Statistical Tables, Regression and GLM
Calculator, Two Sample Test Calculator, Correlation and
Regression Calculator, and CDF/PDF Calculators.
Statistical Home Page, by David C. Howell, This is a Home Page
containing statistical material covered in the author's textbooks
(Statistical Methods for Psychology and Fundamental Statistics for
the Behavioral Sciences), but it will be useful to others not using
those books. It is always under construction.
Statistics Page, by Berrie, Movies to illustrate some statistical
concepts.
Statistical Procedures, by Phillip Ingram, Descriptions of various
statistical procedures applicable to the Earth Sciences: Data
Manipulation, One and Two Variable Measures, Time Series
Analysis, Analysis of Variance, Measures of Similarity, Multivariate Procedures, Multiple Regression, and Geostatistical
Analysis.
Statistical Tests, Contains Probability Distributions (Binomial,
Gaussian, Student-t, Chi-Square), One-Sample and Matched-Pairs tests, Two-Sample tests, Regression and correlation, and
Test for categorical data.
Statistical Tools, Pointers for demos on Binomial and Normal
distributions, Normal approximation, Sample distribution, Sample
mean, Confidence intervals, Correlation, Regression, Leverage
points and Chisquare.
Statistics, This server will perform some elementary statistical
tests on your data. Test included are Sign Test, McNemar's Test,
Wilcoxon Matched-Pairs Signed-Ranks Test, Student-t test for one
sample, Two-Sample tests, Median Test, Binomial proportions,
Wilcoxon Test, Student-t test for two samples, Multiple-Sample
tests, Friedman Test, Correlations, Rank Correlation coefficient,
Correlation coefficient, Comparing Correlation coefficients,
Categorical data (Chi-square tests), Chi-square test for known
distributions, Chi-square test for equality of distributions.
Statistics Homepage, by StatSoft Co., Complete coverage of
almost all topics
Statistics: The Study of Stability in Variation, Editor: Jan de
Leeuw. It has components which can be used on all levels of
statistics teaching. It is disguised as an introductory textbook,
perhaps, but many parts are completely unsuitable for
introductory teaching. Its contents are Introduction, Analysis of a
Single Variable, Analysis of a Pair of Variables, and Analysis of
Multi-variables.
Statistics Every Writer Should Know, by Robert Niles and Laurie
Niles. Treatment of elementary concepts.
Statistics Glossary, by V. Easton and J. McColl, Alphabetical index
of all major keywords and phrases
Statistics Network A Web-based resource for almost all statistical
kinds of information.
Statistics Online A good collection of links on: Statistics to Use,
Confidence Intervals, Hypothesis Testing, Probability
Distributions, One-Sample and Matched-Pairs Tests, Two-Sample
Tests, Correlations, Categorical Data, and Statistical Tables.
Statistics on the Web, by Clay Helberg, Just as the Web itself
seems to have unlimited resources, Statistics on the web must
have hundreds of sites listing such statistical areas as:
Professional Organizations, Institutes and Consulting Groups,
Educational Resources, Web courses, and others too numerous to
mention. One could literally shop all day finding the joys and
treasures of Statistics!
Statistics To Use, by T. Kirkman, Among others it contains
computations on: Mean, Standard Deviation, etc., Student's t-Tests, chi-square distribution test, contingency tables, Fisher
Exact Test, ANOVA, Ordinary Least Squares, Ordinary Least
Squares with Plot option, Beyond Ordinary Least Squares, and Fit
to data with errors in both coordinates.
Stat Refresher, This module is an interactive tutorial which gives
a comprehensive view of Probability and Statistics. This
interactive module covers basic probability, random variables,
moments, distributions, data analysis including regression,
moving averages, exponential smoothing, and clustering.
Tables, by William Knight, Tables for: Confidence Intervals for the
Median, Binomial Coefficients, Normal, T, Chi-Square, F, and
other distributions.
Two-Population T-test
SURFSTAT Australia, by Keith Dear. Summarizing and Presenting
Data, Producing Data, Variation and Probability, Statistical
Inference, Control Charts.
UCLA Statistics, by Jan de Leeuw, On-line introductory textbook
with datasets, Lispstat archive, datasets, and live on-line
calculators for most distributions and equations.
VassarStats, by Richard Lowry, On-line elementary statistical
computation.
Web Interface for Statistics Education, by Dale Berger, Sampling
Distribution of the Means, Central Limit Theorem, Introduction to
Hypothesis Testing, t-test tutorial. Collection of links for Online
Tutorials, Glossaries, Statistics Links, On-line Journals, Online
Discussions, Statistics Applets.
WebStat, by Webster West. Offers many interactive test
procedures, graphics, such as Summary Statistics, Z tests (one
and two sample) for population means, T tests (one and two
sample) for population means, a chi-square test for population
variance, a F test for comparing population variances, Regression,
Histograms, Stem and Leaf plots, Box plots, Dot plots, Parallel
Coordinate plots, Means plots, Scatter plots, QQ plots, and Time
Series Plots.
WWW Resources for Teaching Statistics, by Robin Lock.
Interesting and Useful Sites
Selected Reciprocal Web Sites
| ABCentral | Bulletin Board Libraries |Business Problem
Solving |Business Math |Casebook |Chance |CTI
Statistics |Cursos de estadística |Demos for Learning
Statistics |Electronic texts and statistical tables |Epidemiology
and Biostatistics |Financial and Economic Links | Hyperstat |Intro.
to Stat. |Java Applets |Lecturesonline |Lecture
summaries | Maths & Stats Links|
| Online Statistical Textbooks and Courses |Probability
Tutorial | Research Methods & Statistics Resources | Statistical
Demos and Calculations |Statistical Education Resource Kit|
|Statistical Resources |Statistical Resources on the
Web |Statistical tests |Statistical Training on the Web |Statistics
Education-I |Statistics Education-II |
| Statistics Network |Statistics on the Web |Statistics, Statistical
Computing, and Mathematics |Statoo |Stats
Links |st@tserv |StatSoft | StatsNet |
| StudyWeb | SurfStat |Using Excel |Virtual Library |WebEc | Web
Tutorial Links |Yahoo:Statistics|
More reciprocal sites may be found by clicking on the following
search engines:
GoTo| HotBot| InfoSeek| LookSmart| Lycos|
General References
| The MBA Page | What is OPRE? | Desk Reference| Another Desk
Reference | Spreadsheets | All Topics on the Web | Contacts to
Statisticians | Statistics Departments (by country)|
| ABCentral | Syllabits | World Lecture Hall | Others Selected
Links | Virtual Library
| Argus Clearinghouse | TILE.NET | CataList | Maths and
Computing Lists
Statistics References
| Careers in Statistics | Conferences | | Statistical List
Subscription | Statistics Mailing Lists | Edstat-L | Mailbase
Lists | Stat-L | Stats-Discuss | Stat Discussion
Group | StatsNet | List Servers|
| Math Forum Search|
| Statistics Journals | Books and Journal | Main Journals | Journal
Web Sites|
Statistical Societies & Organizations

American Statistical Association (ASA)

ASA D.C. Chapter

Applied Probability Trust

Bernoulli Society

Biomathematics and Statistics Scotland

Biometric Society

Center for Applied Probability at Columbia


Center for Applied Probability at Georgia Tech
Center for Statistical and Mathematical Computing

Classification Society of North America

CTI Statistics

Dublin Applied Probability Group

Institute of Mathematical Statistics

International Association for Statistical Computing

International Biometric Society

International Environmetric Society

International Society for Bayesian Analysis

International Statistical Institute

National Institute of Statistical Sciences

RAND Statistics Group

Royal Statistical Society

Social Statistics

Statistical Engineering Division

Statistical Society of Australia

Statistical Society of Canada
Statistics Resources
| Statistics Main Resources | Statistics and OPRE
Resources | Statistics Links | STATS | StatsNet | Resources | UK
Statistical Resources|
| Mathematics Internet Resources | Mathematical and
Quantitative Methods |Stat Index | StatServ | Study
WEB | Ordination Methods for Ecologists|
WWW Resources | StatLib: Statistics Library | Guide for
Statisticians|
| Stat Links | Use and Abuse of Statistics | Statistics Links|
| Statistical Links | Statistics Handouts | Statistics Related
Links | Statistics Resources |OnLine Text Books|
Probability Resources
|Probability Tutorial |Probability | Probability & Statistics |Theory
of Probability | Virtual Laboratories in Probability and Statistics
|Let's Make a Deal Game |Central Limit Theorem | The Probability
Web | Probability Abstracts
| Coin Flipping |Java Applets on Probability | Uncertainty in
AI |Normal Curve Area | Topics in Probability | PQRS Probability
Plot | The Birthday Problem|
Data and Data Analysis
|Histograms | Statistical Data Analysis | Exploring Data | Data
Mining |Books on Statistical Data Analysis|
| Evaluation of Intelligent Systems | AI and Statistics|
Statistical Software
| Statistical Software Providers | S-
PLUS | WebStat | QDStat | Statistical Calculators on
Web | MODSTAT | The AssiStat|
| Statistical Software | Mathematical and Statistical
Software | NCSS Statistical Software|
| Choosing a Statistical Analysis Package | Statistical Software
Review| Descriptive Statistics by Spreadsheet | Statistics with
Microsoft Excel|
Learning Statistics
| How to Study Statistics | Statistics Education | Web and
Statistical Education | Statistics & Decision
Sciences | Statistics | Statistical Education through Problem
Solving|
| Exam, tests samples | INFORMS Education and Students
Affairs | CHANCE Magazine | Chance Web Index|
| Statistics Education Bibliography | Teacher
Network | Computers in Teaching Statistics|
Glossary Collections
The following sites provide a wide range of keywords & phrases.
Visit them frequently to learn the language of statisticians.
|Data Analysis Briefbook | Glossary of Statistical Terms |Glossary
of Terms |Glossary of Statistics |Internet Glossary of Statistical
Terms |Lexicon|Selecting Statistics Glossary |Statistics
Glossary | SurfStat glossary|
Selected Topics
|ANOVA |Confidence Intervals |Regression
| Kolmogorov-Smirnov Test | Topics in Statistics-I | Topics in
Statistics-I | Statistical Topics | Resampling | Pattern
Recognition | Statistical Sites by Applications | Statistics and
Computing|
| Biostatistics | Biomathematics and Statistics | Introduction to
Biostatistics Bartlett Corrections|
| Statistical Planning | Regression Analysis | AI-Geostats | Total Quality | Analysis of Variance and Covariance|
| Significance Testing | Hypothesis Testing | Two-Tailed
Hypothesis Testing | Commentaries on Significance
Testing | Bayesian | Philosophy of Testing|
Questionnaire Design, Surveys Sampling and
Analysis
|Questionnaire Design and Statistical Data Analysis |Summary of
Survey Analysis Software |Sample Size in Surveys
Sampling |Survey Samplings|
|Multilevel Statistical Models | Write more effective survey
questions|
| Sampling In Research | Sampling, Questionnaire Distribution
and Interviewing | SRMSNET: An Electronic Bulletin Board for
Survey|
| Sampling and Surveying Handbook |Surveys Sampling
Routines |Survey Software |Multilevel Models Project|
Econometric and Forecasting
| Time Series Analysis for Official Statisticians | Time Series and
Forecasting | Business Forecasting | International Association of
Business Forecasting |Institute of Business Forecasting |Principles
of Forecasting|
| Financial Statistics | Econometric-Research | Econometric
Links | Economists | RFE: Resources for Economists | Business &
Economics Scout Reports|
| A Business Forecasting Course | A Forecasting Course | Time
Series Data Library | Journal of Forecasting|
| Economics and Teaching |Box-Jenkins Methodology |
Statistical Tables
The following Web sites provide critical values useful in statistical
testing and construction of confidence intervals. The results are
identical to those given in almost all textbooks. However, in most
cases they are more extensive (therefore more accurate).
|Normal Curve Area |Normal Calculator |Normal Probability
Calculation |Critical Values for the t-Distribution | Critical Values
for the F-Distribution |Critical Values for the Chi- square
Distribution|
A selection of:
Academic Info: Business, AOL: Science and Technology, Biz/ed: Business
and Economics, BUBL Catalogue, Business & Economics: Scout
Report, Business & Finance, Business & Industrial,
Business Nation, Dogpile: Statistics, HotBot Directory:
Statistics, IFORS, LookSmart: Statistics, LookSmart: Data &
Statistics, MathForum: Business,McGraw-Hill: Business Statistics, NEEDS:
The National Engineering Education Delivery System, Netscape:
Statistics, NetFirst,
SavvySearch Guide: Statistics, Small Business, Social Science Information
Gateway, WebEc, and the Yahoo
The Copy Right Statements: The fair use, according the 1996 Fair
Use Guidelines for Educational Multimedia, of materials presented
on this Web site is permitted for noncommercial and classroom
purposes.
This site may be mirrored, intact including these notices, on any
server with the public access, it may be linked to any other Web
pages.
Kindly e-mail me your comments, suggestions, and concerns.
Thank you.
Professor Hossein Arsham
Chapter 16
Hypothesis Testing and Probability Theory
Does Caffeine Make People More Alert?
The Experimental Design
Does the coffee I drink almost every morning really make me more alert? If all the students
drank a cup of coffee before class, would the time spent sleeping in class decrease? These
questions may be answered using experimental methodology and hypothesis testing
procedures.
Hypothesis Testing
The last part of the text is concerned with hypothesis testing, or procedures to make rational decisions about the
reality of effects. The purpose of hypothesis testing is perhaps best illustrated by an example.
To test the effect of caffeine on alertness in people, one experimental design would divide the
classroom students into two groups; one group receiving coffee with caffeine, the other
coffee without caffeine. The second group gets coffee without caffeine rather than nothing to
drink because the effect of caffeine is the effect of interest, rather than the effect of ingesting
liquids. The number of minutes that students sleep during that class would be recorded.
Suppose the group, which got coffee with caffeine, sleeps less on the average than the group
which drank coffee without caffeine. On the basis of this evidence, the researcher argues that
caffeine had the predicted effect.
A statistician, learning of the study, argues that such a conclusion is not warranted without
performing a hypothesis test. The reasoning for this argument goes as follows: Suppose that
caffeine really had no effect. Isn't it possible that the difference between the average alertness
of the two groups was due to chance? That is, the individuals who belonged to the caffeine
group had gotten a better night's sleep, were more interested in the class, etc., than the no
caffeine group? If the class was divided in a different manner the differences would
disappear.
The purpose of the hypothesis test is to make a rational decision between the hypotheses of
real effects and chance explanations. The scientist is never able to totally eliminate
the chance explanation, but may decide that the difference between the two groups is so large
that it makes the chance explanation unlikely. If this is the case, the decision would be made
that the effects are real. A hypothesis test specifies how large the differences must be in order
to make a decision that the effects are real.
At the conclusion of the experiment, then, one of two decisions will be made depending upon
the size of the differences between the caffeine and no caffeine groups. The decision will
either be that caffeine has an effect, making people more alert, or that chance factors (the
composition of the group) could explain the result. The purpose of the hypothesis test is to
eliminate false scientific conclusions as much as possible.
Definition and Purpose of Hypothesis Tests
Hypothesis tests are procedures for making rational decisions about the reality of effects.
Rational Decisions
Most decisions require that an individual select a single alternative from a number of possible
alternatives. The decision is made without knowing whether or not it is correct; that is, it is
based on incomplete information. For example, a person either takes or does not take
an umbrella to school based upon both the weather report and observation of outside
conditions. If it is not currently raining, this decision must be made with incomplete
information.
The concept of a decision by a rational man or woman is characterized by the use of a
procedure that insures that both the likelihood and the potential costs and benefits of all events
are incorporated into the decision-making process. The procedure must be stated in such a
fashion that another individual, using the same information, would make the same decision.
One is reminded of a Star Trek episode in which Captain Kirk is stranded on a planet without his
communicator and is unable to get back to the Enterprise. Spock has assumed command and is being
attacked by Klingons (who else?). Spock asks for and receives information about the location
of the enemy, but is unable to act because he does not have complete information. Captain
Kirk arrives at the last moment and saves the day because he can act on incomplete
information.
This story goes against the concept of rational man. Spock, being a rational man, would not
be immobilized by indecision. Instead, he would have selected the alternative which realized
the greatest expected benefit given the information available. If complete information were
required to make decisions, few decisions would be made by rational men and women. This
is obviously not the case. The script writer misunderstood Spock and rational man.
Effects
When a change in one thing is associated with a change in another, we have an effect. The
changes may be either quantitative or qualitative, with the hypothesis testing procedure
selected based upon the type of change observed. For example, if changes in sugar intake in a
diet are associated with activity level in children, we say an effect occurred. In another case,
if the distribution of political party preference (Republicans, Democrats, or Independents)
differs by sex (Male or Female), then an effect is present. Much of behavioral science is
directed toward discovering and understanding effects.
The effects discussed in the remainder of this text are measured using various statistics,
including differences between means, chi-square statistics computed from contingency
tables, and correlation coefficients.
General Principles
All hypothesis tests conform to similar principles and proceed with the same sequence of
events.
In almost all cases, the researcher wants to find statistically significant results. Failing to find
statistically significant results means that the research will probably never be published,
because few journals are willing to publish results that could be due to haphazard or chance
findings. If research is not published, it is generally not very useful.
In order to decide that there are real effects, a model of the world is created in which there
are no effects and the experiment is repeated an infinite number of times. The repetition is not
real, but rather a "thought experiment" or mathematical deduction. The sampling
distribution is used to create the model of the world when there are no effects and the study is
repeated an infinite number of times.
The results of the single real experiment or study are compared with the theoretical model of
no effects. If, given the model, the results are unlikely, then the model and the hypothesis of
no effects generating the model are rejected and the effects are accepted as real. If the results
could be explained by the model, the model must be retained and no decision can be made
about whether the effects were real or not.
Hypothesis testing is analogous to the geometrical method of indirect proof (proof by contradiction). That is, if
one wants to prove that A (the hypothesis) is true, one first assumes that it isn't true. If it is
shown that this assumption is logically impossible, then the original hypothesis is proven. In
the case of hypothesis testing the hypothesis may never be proven; rather, it is decided that
the model of no effects is unlikely enough that the opposite hypothesis, that of real effects,
must be true.
An analogous situation exists with respect to hypothesis testing in statistics. In hypothesis
testing one wants to show real effects of an experiment. By showing that the experimental
results were unlikely, given that there were no effects, one may decide that the effects are, in
fact, real. The hypothesis that there were no effects is called the null hypothesis. The symbol H0 is used to
abbreviate the Null Hypothesis in statistics. Note that, unlike geometry, we cannot prove the
effects are real, rather we may decide the effects are real.
For example, suppose the probability model (distribution) in the following figure described
the state of the world when there were no effects. In the case of Event A, the decision would
be that the model could explain the results and the null hypothesis may be true, because Event A
is fairly likely given that the model is true. On the other hand, if Event B occurred, the model
would be rejected because Event B is unlikely, given the model.
Figure: The probability of Event A is much higher than the probability of Event B.
The Model
The sampling distribution is a theoretical distribution of a sample statistic. It is used as a model of what would
happen if
1. the null hypothesis were true (there really were no effects), and
2. the experiment were repeated an infinite number of times.
Because of its importance in hypothesis testing, the sampling distribution will be discussed in
a separate chapter.
Probability
Probability theory essentially defines probabilities of simple events in algebraic terms and
then presents rules for combining the probabilities of simple events into probabilities of
complex events given that certain conditions are present (assumptions are met). As such,
probability theory is a mathematical model of uncertainty. It can never be "true" in an
absolute sense, but may be more or less useful, depending upon how closely it mirrors reality.
Probabilities in an abstract sense are relative frequencies based on infinite repetitions. The
probability of heads when a coin is flipped is the number of heads divided by the number of
tosses as the number of tosses approaches infinity. In a similar vein, the probability of rain
tonight is the proportion of times it rains given that conditions are identical to the conditions
right now and they happen an infinite number of times. In neither the case of the coin nor the
weather is it possible to "know" the exact probability of the event. Because of this Kyburg
and Smokler (1964), among others, have argued that all probabilities are subjective and
reflect a "degree of belief" about a relative frequency rather than a relative frequency.
Flipping a coin a large number of times is more intuitive than the exact weather conditions
repeating themselves over and over again. Maybe that is why most texts begin by discussing
coin tosses and drawing cards in an idealized game. The essential fact remains that it is
impossible to flip a coin an infinite number of times. The true probability of obtaining heads
or tails must always remain unknown. In a similar vein, it is impossible to manufacture a die
that will have an exact probability of 1/6 for each side, although if enough care is taken the
long-term results may be "close enough" that the casino will make money. The difficulty of
computing a truly random sequence of numbers to use in simulations of probability
experiments is well-established (Peterson, 1998).
The conceptualization of probabilities as unlimited relative frequencies has certain
implications for probabilities of events that fall on the extreme ends of the continuum,
however. The relative frequency of an impossible event must always remain at zero, no
matter how many times it is repeated. The probability of getting an "arm" when flipping a
coin must be zero, because although "heads", "tails", or an "edge" are possibilities, a coin has
no "arm". An "arm" will never appear no matter how many times I flip a coin; thus its
probability is zero.
In a like manner the probability of a certain event is one. The probability of a compound
event such as obtaining "heads", "tails", or an "edge" when flipping a coin is a certainty, as
one of these three outcomes must occur. No matter how many times a coin is flipped, one of
the outcomes of this compound event must occur each time. Because any number divided by
itself is one, the probability of a certain event is one.
The two extremes of zero and one provide the upper and lower limits to the values of
probabilities. All values between these extremes can never be known exactly.
In addition to defining the nature of probabilities, probability theory also describes rules
about how probabilities can be combined to produce probabilities of compound and
conditional events. A compound event is a combination of simple events joined with either
"and" or "or". For example, the statement "Both the quarterback remains healthy and all the
linemen pass this semester" is a compound event, called a joint event, employing the word
"and". In a similar vein, the statement "Either they all study very hard or they all get very
lucky" is a compound event with the word "or". A conditional statement employs the term
"given". For example, university football team will win the conference football championship
next season given that the quarterback remains healthy and all the linemen pass this semester.
The statement preceding the word "given" applies only when the condition following the
"given" is true.
Combining Probabilities of Independent Events
The probability of a compound event described by the word "and" is the product of the
simple events if the simple events are independent. To be independent, two events cannot
possibly influence each other. For example, as long as one is willing to assume that the events
of the quarterback remaining healthy and the linemen all passing are independent, then the
probability of winning the conference football championship can be calculated by
multiplying the probabilities of each of the separate events together. For example, if the
probability of the quarterback remaining healthy is .6 and the probability of all the linemen
passing this semester is .2, then the probability of winning the conference championship is .6
* .2 or .12. This relationship can be written in symbols as follows:
P ( A and B ) = P ( A ) * P ( B ) if A and B are independent events.
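As a quick check on the arithmetic, the football example above can be reproduced in a few lines of Python; the .6 and .2 are the probabilities given in the text.

# Multiplication rule for independent events, using the probabilities given above.
p_quarterback_healthy = 0.6      # P(A)
p_linemen_pass = 0.2             # P(B)

p_championship = p_quarterback_healthy * p_linemen_pass   # P(A and B)
print(round(p_championship, 2))  # 0.12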
Combining Probabilities Using "or"
If the compound event can be described by two or more events joined by the word "or", then
the probability of the compound event is the sum of the probabilities of the individual events
minus the probability of the joint event. For example, the probability of all the linemen
passing would be the sum of the probability of all studying very hard plus the probability of
all being very lucky, minus the probability of all studying very hard and all being very lucky.
For example, suppose that the probability of all studying very hard was .15, the probability of
all being very lucky was .0588, and the probability of all studying very hard and all being
very lucky was .0088. The probability of all passing would be .15 + .0588 - .0088 = .20. In
general the relationship can be written as follows:
P ( A or B ) = P ( A ) + P ( B ) - P ( A and B )
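The same numbers can be checked with a short Python sketch of the addition rule.

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B), using the numbers above.
p_study_hard = 0.15
p_lucky = 0.0588
p_study_and_lucky = 0.0088

p_all_pass = p_study_hard + p_lucky - p_study_and_lucky
print(round(p_all_pass, 2))      # 0.2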
Conditional Probabilities
A conditional probability is the probability of an event given another event is true. The
probability that the quarterback will remain healthy given that he stretches properly at
practice and before game time would be a conditional probability. By definition a conditional
probability is the probability of the joint event divided by the probability of the conditional
event. In the previous example, the probability that the quarterback will remain healthy given
that he stretches properly at practice and before game time would be the probability of the
quarterback both remaining healthy and stretching properly divided by the probability of
stretching properly. Suppose the probability of stretching properly is .8 and the probability of
both stretching properly and remaining healthy is .55. The conditional probability of
remaining healthy given that he stretched properly would be .55 / .8 = .6875. The "given" is
written in probability theory as a vertical line (|), such that the preceding could be written as:
P ( A | B ) = P ( A and B ) / P ( B )
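Again, the stretching example can be verified in Python.

# Conditional probability: P(A | B) = P(A and B) / P(B), using the numbers above.
p_stretch = 0.8                    # P(B)
p_healthy_and_stretch = 0.55       # P(A and B)

p_healthy_given_stretch = p_healthy_and_stretch / p_stretch
print(round(p_healthy_given_stretch, 4))   # 0.6875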
Conditional probabilities can be combined into a very useful formula called Bayes's Rule.
This equation describes how to modify a probability given information in the form of
conditional probabilities. The equation is presented in the following:
P ( A | B ) = ( P ( B | A ) * P ( A ) ) / ( P ( B | A ) * P ( A ) + P ( B | not A ) * P ( not A ) )
where A and B are any events whose probabilities are neither 0 nor 1.
Suppose that an instructor randomly picks a student from a class where males outnumber
females two to one. What is the probability that the selected student is a female? Given the
ratio of males to females, this probability could be set to 1/3 or .333. This probability is called
the prior probability and would be represented in the above equation as P(A). In a similar
manner, the probability of the student being a male, P(not A), would be 2/3 or .667. Suppose
additional information was provided about the selected student, that the shoe size of the
person selected was 7.5. Often it is possible to compute the conditional probability of B given
A or in this case, the probability of a size 7.5 given the person was a female. In a like manner,
the probability of B given not A can often be calculated; in this case the probability of a size
7.5 given the person was a male. Suppose the former probability is .8 and the latter is .1. The
likelihood of the person being a female given a shoe size of 7.5 can be calculated using
Bayes's Rule as follows:
P ( A | B ) = ( P ( B | A ) * P ( A ) ) / ( P ( B | A ) * P ( A ) + P ( B | not A ) * P ( not A ) )
= ( .8 * .333 ) / ( .8 * .333 + .1 * .667 )
= .2664 / .3331 = .7998
The value of P ( A | B ) is called a posterior probability and in this case the probability of the
student being a female given a shoe size of 7.5 is fairly high at .7998. The ability to
recompute probabilities based on data is the foundation of a branch of statistics called
Bayesian Statistics.
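The shoe-size calculation can be expressed as a small Python function (bayes_posterior is simply an illustrative helper name, not a standard routine). The result of .80 differs slightly from the .7998 above only because the exact prior of 1/3 is used rather than the rounded value .333.

def bayes_posterior(p_a, p_b_given_a, p_b_given_not_a):
    # P(A | B) = P(B | A) * P(A) / [ P(B | A) * P(A) + P(B | not A) * P(not A) ]
    numerator = p_b_given_a * p_a
    denominator = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return numerator / denominator

# Prior P(female) = 1/3, P(size 7.5 | female) = .8, P(size 7.5 | male) = .1
posterior = bayes_posterior(1 / 3, 0.8, 0.1)
print(round(posterior, 4))    # 0.8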
This set of rules barely scratches the surface when considering the possibilities of probability
models. The interested reader is pointed to any number of more thorough treatments of the
topic.
Including Cost in Making Decisions with Probabilities
Including cost as a factor in the equation can extend the usefulness of probabilities as an aid
in decision-making. This is the case in a branch of statistics called utility theory that includes
a concept called utility in the equation. Utility is the gain or loss experienced by a player
depending upon the outcome of the game and can be symbolized with a "U". Usually utility
is expressed in monetary units, although there is no requirement that it must be. The symbol
U(A) would be the utility of outcome A to the player of the game. A concept called expected
utility would be the result of playing the game an infinite number of times. In its simplest
form, expected utility is a sum of the products of probabilities and utilities:
Expected Utility = P ( A ) * U ( A ) + P ( not A ) * U ( not A )
Suppose someone was offered a chance to play a game with two dice. If the dice totaled
"6" or "8", the player would receive $70; otherwise he or she would pay $30. The utility to the
player is plus $70 for A and minus $30 for not A. The probability of a "6" or "8" is 10/36
= .2778, while the probability of some other total is .7222. Should the player consider the
game? Using expected utility analysis, the expected utility would be:
Expected Utility = ( .2778 * 70 ) + ( .7222 * (-30) ) = -2.22
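As a check on the arithmetic, the same calculation in Python:

# Expected utility of the dice game: win $70 on a total of 6 or 8, otherwise lose $30.
p_win = 10 / 36                     # probability of rolling a total of 6 or 8
p_lose = 1 - p_win
expected_utility = p_win * 70 + p_lose * (-30)
print(round(expected_utility, 2))   # about -2.22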
Since the expected utility is less than 0, indicating a loss over the long run, expected utility
theory would argue against playing the game. Again, this illustration just barely scratches
the surface of a very complex and interesting area of study and the reader is directed to other
sources for further study. In particular, the area of game theory holds a great deal of promise.
You should be aware that the preceding analysis of whether or not to play a given game
based on expected utility assumes that the dice are "fair", that is, each face is equally likely.
To the extent the fairness assumption is incorrect, for example using weighted dice, then the
theoretical analysis will also be incorrect. Going back to the original definition of
probabilities, that of a relative frequency given an infinite number of repetitions, it is never
possible to "know" the probability of any event exactly.
Using Probability Models in Science
Does this mean that all the preceding is useless? Absolutely not! It does mean, however, that
probability theory and probability models must be viewed within the larger framework of
model-building in science. The "laws" of probability are a formal language model of the
world that, like algebra and numbers, exist as symbols and relationships between symbols.
They have no meaning in and of themselves and belong in the circled portion of the model-building paradigm.
Figure: The model-building process - Transforming the model.
As with numbers and algebraic operators, the symbols within the language must be given
meaning before the models become useful. In this case "interpretation" implies that numbers
are assigned to probabilities based on rules. The circled part of the following figure illustrates
the portion of the model-building process that now becomes critical.
Figure: The model-building process - Creating the model.
Establishing Probabilities
There are a number of different ways to estimate probabilities. Each has advantages and
disadvantages and some have proven more useful than others. Just because a number can be
assigned to a given probability symbol, however, does not mean that the number is the "true"
probability.
Equal Likelihood
When there is no reason to believe that any outcome is more or less likely than any other
outcome, then the solution is to assign all outcomes an equal probability. For example, since
there is no reason to believe that heads is more likely than tails, a value of .5 is assigned to
each when a coin is flipped. In a similar manner, if there is no reason to believe that one card
is more likely to be picked than any other, then a probability of 1/52 or .0192 is assigned to
every card in a standard deck.
Note that this system does not work when there is reason to believe that one outcome is more
likely than another. For example, setting a probability of .5 that it will be snowing
outside in an hour is not reasonable. There are two alternatives: it will either be snowing or it
won't, but equal probabilities are not tenable because it is sunny and 60 degrees outside my
office right now and I have reason to believe that it will not be snowing in an hour.
Relative Frequency
The relative frequency of an event in the past can be used as an estimate of its probability.
For example, the probability of a student succeeding in a given graduate program could be
calculated by dividing the number of students actually finishing the program by the number
of students admitted in the past. Establishing probabilities in this fashion assumes that
conditions in the past will continue into the future, generally a fairly safe bet. The greater the
number of observations, the more stable the estimate based on relative frequency. For
example, the probability of a heads for a given coin could be calculated by dividing the
number of heads by the number of tosses. An estimate based on 10,000 tosses would be much
better than one based on 10 tosses.
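To illustrate why larger samples give more stable relative frequencies, the following Python sketch simulates coin tosses; the estimate from 10,000 tosses will generally be much closer to .5 than the estimate from 10 tosses.

import random

def estimate_heads_probability(n_tosses):
    # Relative frequency of heads in n_tosses simulated flips of a fair coin.
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

print(estimate_heads_probability(10))      # unstable; may easily be 0.3 or 0.8
print(estimate_heads_probability(10000))   # typically very close to 0.5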
The probability of snow outside in an hour could be calculated by dividing the number of
times in the past that it has snowed when the temperature an hour before was 60 degrees by
the number of times it has been 60 degrees. Since I don't have accurate records of such
events, I would have to rely on memory to estimate the relative frequency. Since memory
seems to work better for outstanding events, I am more likely to remember the few times it
did snow in contrast to the many times it did not.
Area Under Theoretical Models of Frequency Distributions
The problems with using relative frequency were discussed in some detail in Chapter 5,
"Frequency Distributions." If an estimate of the probability of females who wear size 7.5
shoes is needed, one could use the proportion of women wearing a size 7.5 in a sample of
women. The problem is that unless a very large sample of women's shoe sizes is taken, the
relative frequency of any one shoe size is unstable and inaccurate. A solution to this dilemma
is to construct a theoretical model of women's shoe sizes and then use the area under the
theoretical model between values of 7.25 and 7.75 as an estimate of the probability of a size
7.5 shoe size. This method of establishing probabilities has the advantage of requiring a much
smaller sample to estimate relatively stable probabilities. It has the disadvantage that
probability estimation is several steps removed from the relative frequency, requiring both
the selection of the model and the estimation of the parameters of the model. Fortunately,
selecting the correct model and estimating parameters of the models is a well-understood and
thoroughly studied topic in statistics.
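A rough sketch of the idea in Python, using the scipy library. The mean of 8 and standard deviation of 1.5 are purely illustrative assumptions for the example, not values estimated from any data set.

from scipy.stats import norm

# Illustrative (assumed) parameters for a normal model of women's shoe sizes.
mu, sigma = 8.0, 1.5

# Probability of a size 7.5 shoe, taken as the area under the model between 7.25 and 7.75.
p_size_7_5 = norm.cdf(7.75, loc=mu, scale=sigma) - norm.cdf(7.25, loc=mu, scale=sigma)
print(round(p_size_7_5, 3))   # roughly 0.125 under these assumed parameters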
Area under theoretical models of distributions is the method that classical hypothesis testing
employs to estimate probabilities. A major part of an intermediate course in mathematical
statistics is the theoretical justification of the models that are used in hypothesis testing.
Subjective Probabilities
A controversial method of estimating probabilities is to simply ask people to state their
degree of belief as a number between zero and one and then treat that number as a
probability. A slightly more sophisticated method is to ask what odds the person would be
willing to take in order to place a bet. Probabilities obtained in this manner are
called subjective probabilities. If someone was asked "Give me a number between zero and
one, where zero is impossible and one is certain, to describe the likelihood of Jane Student
finishing the graduate program," that number would be a subjective probability.
The greatest advantage of subjective probabilities is that they are intuitive and easy to
obtain. People use subjective probabilities all the time to make decisions. For example, my
decision about what to wear when I leave the house in the morning is partially based on what
I think the weather will be like an hour from now. A decision on whether or not to take an
umbrella is based partly on the subjective probability of rain. A decision to invest in a
particular company in the stock market is partly based on the subjective probability that the
company will increase in value in the future.
The greatest disadvantage of subjective probabilities is that people are notoriously bad at
estimating the likelihood of events, especially rare or unlikely events. Memory is selective.
Human memory is poorly structured to answer queries such as estimating the relative
frequency of snow an hour after the temperature was 60 degrees Fahrenheit and likely to be
influenced by significant, but rare, events. If asked to give a subjective probability of snow in
an hour, the resulting probability estimate would be a compound probability resulting from a
large number of conditional probabilities, such as the latest weather report, the time of year,
the current temperature, and intuitive feelings.
Inaccurate estimates of probabilities and their effect
Subjective probability estimates are influenced by emotion. In assessing the likelihood of
your favorite baseball team winning the pennant, feelings are likely to intervene and make the
estimate larger than reality would suggest. Bookmakers (bookies) everywhere bank on such
human behavior. In a similar manner, people are likely to assess the likelihood of
experimental methods to cure currently incurable diseases as much higher than they actually
are, especially when they have an incurable disease. The foundation of lotteries is an
overestimate of the probability of winning. Almost every winner in a casino is celebrated by
lights flashing and bells ringing, causing patrons to maintain a general overestimate of the
probability of winning.
People have a difficult time assessing risk and responding appropriately, especially when the
probabilities of the events are low. In the late 1980's people were canceling overseas travel
because of threats of terrorist attacks. Paulos (1988) estimates that the likelihood of being
killed by terrorists in any given year is one in 1,600,000 while the chances of dying in a car
crash in the same time frame is one in only 5,300. Yet people still refuse to use seat belts.
When people are asked to estimate the probability of some event, the event occurs, and then
the same people are asked what their original probabilities were, they almost inevitably
inflate them in the direction of the event. For example, suppose people were asked to give the
probability that a particular candidate would win an election, the candidate won, and then the
same people were asked to repeat the probability that they originally presented. In almost all
cases, the probability would be higher than the original probability. This well-established
phenomenon is called hindsight bias (Winman, Juslin, and Bjorkman, 1998).
Since most subjective probability estimates are compound probabilities, humans also have a
difficult time combining simple probabilities into compound probabilities. Some of
the difficulty has to do with a lack of understanding about independence and mutual
exclusivity necessary to multiply and add probabilities. If a couple has three children, all
boys, the probability of the next child being a boy is approximately .5, even though the
probability of having four boys is .5^4 or .0625. The correct probability is a conditional
probability of having a boy given that they already had three boys.
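The distinction can be made explicit in a couple of lines of Python:

p_boy = 0.5
p_four_boys = p_boy ** 4             # joint probability, before any children are born: 0.0625
p_next_boy_given_three = p_boy       # conditional probability; the three boys are already given: 0.5
print(p_four_boys, p_next_boy_given_three)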
Another difficulty with probabilities has to do with a misunderstanding about conditional
probability. When subjects were asked to rank potential occupations of a person described by
a former neighbor as "very shy and withdrawn, invariably helpful, but with little interest in
people, or in the world of reality. A meek and tidy soul, he has a need for order and structure,
and a passion for detail," Tversky and Kahneman (1974, p. 380) found that subjects inevitably
categorized him as a librarian rather than a farmer. People fail to take into account that the
base rate or prior probability of being a farmer is much higher than being a librarian.
Using this and other illustrations of systematic and predictable errors made by humans in
assessing probabilities, Tversky and Kahneman (1974) argue that reliance on subjective
probabilities to assign values to symbols used within probability theory will inevitably lead to
logical contradictions.
Using Probabilities
In the Long Run
The casinos in Las Vegas and around the world are testaments that probability models work
as advertised. Insurance companies seldom go broke. There is no question that probability
models work if care is used in their construction and the user has the ability to participate for
the long run. These models are so useful that Peter Bernstein (1996) has claimed (p. 1) "The
revolutionary idea that defines the boundary between modern times and the past is the
mastery of risk: the notion that the future is more than a whim of the gods and that men and
women are not passive before nature."
In hypothesis testing, probability models are used to control the proportion of times a
researcher claims to have found effects when in fact the results were due to chance or
haphazard circumstances. Because science as a whole is able to participate in the long
run, these models have been successfully applied with the result that only a small proportion
of published research is the result of chance, coincidence, or haphazard events.
In the Short Run
Most of the decisions that are made in real life are made without the ability to view the results
in the long run. An undergraduate student decides to apply to a given graduate school based
upon an assessment of the probability of a favorable outcome and the benefits of attending
that particular school. There is generally no opportunity to apply to the same program over
and over again and observe the results. Probability models have limited value in these
situations because of the difficulties in estimating probabilities with any kind of accuracy.
Personally, I use expected utility theory in justifying not playing the lottery or gambling in
casinos. If the expected value is less than zero, I don't play. That doesn't explain why I carry
insurance on my house and my health, other than the bank requires it for a mortgage and the
university provides it as part of my benefits.
It has been fairly well established that probability and utility theory are not accurate
models of how people actually make decisions. Harvey (1998) argues that people
use a variety of heuristics, or rules of thumb, to make decisions about the world. An
awareness and use of probability and utility theory have the potential benefit of making
people much better decision-makers and are worthy of further study.
Summary
Hypothesis tests are procedures for making rational decisions about the reality of effects. All
hypothesis tests proceed by measuring the size of an effect, or relationship between two
variables, by computing a statistic. A theoretical probability model or distribution of what
that statistic would look like given there were no effects is created using the sampling
distribution. The statistic that measures the size of the effect is compared to the model of no
effects. If the probability of the obtained value of the statistic is unlikely given the model, the
model of no effects is rejected and the alternative hypothesis that there are real effects is
accepted. If the model could explain the results, the model and the hypothesis that there are
no effects are retained, and no decision can be made about whether the effects are real.