Measures of Central Tendency
(a quick review)
You are already familiar with measures of central tendency used with single data
sets:
mean, median and mode.
Let's quickly refresh our memories on these methods of indicating the center of a
data set:
Mean (or average):
• The mean is the number found by adding all of the values in the data set and dividing by the total number of values in that set.
Median (middle):
• The median is the middle number in an ordered data set. The number of values that precede the median will be the same as the number of values that follow it.
To find the median (n is the number of values in the data set):
1. Arrange the values in the data set into increasing or decreasing order.
2. If n is odd, the number in the middle is the median.
3. If n is even, the median is the average of the two middle numbers.
Mode (most):
(least reliable indicator of the center of
the data set)
• is the value in the data set that
occurs most often. When in table
form, the mode is the value with the
highest frequency.
If there is no repeated number in the
set, there is no mode.
It is possible that a set has more than
one mode.
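These three measures can also be checked with a short program. The following is a minimal Python sketch using the standard library's statistics module; the data set is made up purely for illustration:

from statistics import mean, median, multimode

data = [4, 7, 7, 2, 9, 4, 7, 5]      # illustrative data set

print(mean(data))        # add all values and divide by how many there are -> 5.625
print(median(data))      # middle of the ordered set; average of the two middle values here -> 6.0
print(multimode(data))   # most frequently occurring value(s); a set can have more than one mode -> [7]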
See how to use your TI-83+/TI-84+ graphing calculator with mean, mode and median, and with grouped data (frequency tables).
It is possible to get a sense of a data set's distribution by examining a five
statistical summary, the (1) minimum, (2) maximum,
(3) median (or second quartile), (4) the first quartile, and (5) the third
quartile. Such information will show the extent to which the data is located
near the median or near the extremes.
Quartiles:
We know that the median of a set of data separates the data into two equal
parts. Data can be further separated into quartiles. Quartiles separate the
original set of data into four equal parts. Each of these parts contains one-fourth of the data.
Quartiles are percentiles that divide the data into fourths.
• The first quartile is the middle (the median) of the lower half of the data. One-fourth of the data lies below the first quartile and three-fourths lies above. (This is the 25th percentile.)
• The second quartile is another name for the median of the entire set of data: median of data set = second quartile of data set. (This is the 50th percentile.)
• The third quartile is the middle (the median) of the upper half of the data. Three-fourths of the data lies below the third quartile and one-fourth lies above. (This is the 75th percentile.)
A quartile is a number; it is not a range of values. A value can be described
as "above" or "below" the first quartile, but a value is never "in" the first
quartile.
Consider:
Check out this five statistical summary for a set of test scores:
minimum: 65
first quartile: 70
second quartile (median): 80
third quartile: 90
maximum: 100
While we do not know every test score, we do know that half of the scores are below 80 and half are above 80. We also know that half of the scores are between 70 and 90.
The difference between the third and first quartiles is called the interquartile
range, IQR.
For this example, the interquartile range is 90 - 70 = 20.
The interquartile range (IQR), also called the midspread or middle fifty, is the
range between the third and first quartiles and is considered a more stable
statistic than the total range. The IQR contains 50% of the data.
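The five statistical summary and the IQR can also be computed with a few lines of Python, using the "median of each half" approach described above; the data set below is invented for illustration:

from statistics import median

data = sorted([65, 70, 72, 78, 80, 82, 88, 90, 100])   # hypothetical ordered test scores (n = 9)

n = len(data)
lower_half = data[: n // 2]           # values below the median position
upper_half = data[(n + 1) // 2 :]     # values above the median position

q1 = median(lower_half)               # first quartile
q2 = median(data)                     # second quartile (the median)
q3 = median(upper_half)               # third quartile

print(min(data), q1, q2, q3, max(data))   # the five statistical summary
print(q3 - q1)                            # the interquartile range (IQR)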
Box and Whisker Plots:
A five statistical summary can be represented graphically as a box and
whisker plot. The first and third quartiles are at the ends of the box, the
median is indicated with a vertical line in the interior of the box, and the
maximum and minimum are at the ends of the whiskers.
See how to use your TI-83+/TI-84+ graphing calculator with box and whisker plots.
Box-and-whisker plots are helpful in interpreting the
distribution of data.
NOTE: You may see a box-and-whisker plot which contains an asterisk.
Sometimes there is ONE piece of
data that falls well outside the range
of the other values. This single
piece of data is called an outlier. If
the outlier is included in the
whisker, readers may think that there are values dispersed throughout the whole range from the first quartile to the outlier,
which is not true. To avoid this
misconception, an * is used to mark
this "out of the ordinary" value.
Example of working with grouped data:
A survey was taken in a biology class regarding the number of siblings of each student. The table shows
the class data with the frequency of responses. The mean of this data is 2.5. Find
the value of k in the table.
Siblings:    1   2   3   4   5
Frequency:   5   k   8   4   1
Solution: Set up for finding the average (mean), simplify, and solve.
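One way to set this up, using the table and the given mean of 2.5:

(1·5 + 2·k + 3·8 + 4·4 + 5·1) / (5 + k + 8 + 4 + 1) = 2.5
(50 + 2k) / (18 + k) = 2.5
50 + 2k = 45 + 2.5k
5 = 0.5k, so k = 10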
Measures of Dispersion
While knowing the mean value for a set of data may give us some information
about the set itself, many varying sets can have the same mean value. To
determine how the sets are different, we need more information. Another way of
examining single variable data is to look at how the data is spread out, or dispersed
about the mean.
We will discuss 4 ways of examining the dispersion of data.
The smaller the values from these methods, the more consistent the data.
1. Range:
The simplest of our methods for measuring dispersion
is range. Range is the difference between the largest value and the smallest value
in the data set. While being simple to compute, the range is often unreliable as a
measure of dispersion since it is based on only two values in the set.
A range of 50 (say, from a low value of 12 to a high value of 62) tells us very little about how the values are dispersed.
Are the values all clustered to one end, with the low value (12) or the high value (62) being an outlier?
Or are the values more evenly dispersed across the range?
Before discussing our next methods, let's establish some vocabulary:
Population form:
The population form is used when the data being analyzed includes the entire set of possible data (for example, all people living in the US). When using this form, divide by n, the number of values in the data set.
Sample form:
The sample form is used when the data is a random sample taken from the entire set of data (for example, Sam, Pete and Claire, who live in the US). When using this form, divide by n - 1. (It can be shown that dividing by n - 1 makes S², the sample variance, a better estimate of σ², the variance of the population from which the sample was taken.)
The population form should be used unless you know a random sample is being analyzed.
2. Mean Absolute Deviation (MAD):
The mean absolute deviation is the mean (average) of the absolute value of the
difference between the individual values in the data set and the mean. The method
tries to measure the average distance between the values in the data set and the
mean.
3. Variance:
To find the variance:
• subtract the mean from each of the values in the data set,
• square each result,
• add all of these squares,
• and divide by the number of values in the data set.
4. Standard Deviation:
Standard deviation is the square root of the variance. In symbols, for a data set with mean x̄ (sample) or μ (population), the formulas are:
population variance: σ² = Σ(x - μ)² / n,  population standard deviation: σ = √( Σ(x - μ)² / n )
sample variance: s² = Σ(x - x̄)² / (n - 1),  sample standard deviation: s = √( Σ(x - x̄)² / (n - 1) )
Mean absolute deviation, variance and standard deviation are ways to describe the difference
between the mean and the values in the data set without worrying about the signs of these
differences.
These values are usually computed using a calculator.
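For readers who prefer to double-check these measures with a short program, here is a minimal Python sketch using the standard library; it uses the data set from Example 2 below, and the MAD line simply codes the definition given above:

from statistics import mean, pvariance, pstdev, variance, stdev

data = [2, 5, 7, 9, 1, 3, 4, 2, 6, 7, 11, 5, 8, 2, 4]

m = mean(data)
mad = sum(abs(x - m) for x in data) / len(data)   # mean absolute deviation

print(round(mad, 1))     # 2.3
print(pvariance(data))   # population variance (divide by n)
print(pstdev(data))      # population standard deviation
print(variance(data))    # sample variance (divide by n - 1)
print(stdev(data))       # sample standard deviation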
Warning!!!
Be sure you know where to find "population" forms versus
"sample" forms on the calculator. If you are unsure, check out the information at
these links.
See how to use your TI-83+/TI-84+ graphing calculator with measures of dispersion, both for individual data and for grouped data.
Examples:
1. Find, to the nearest tenth, the standard deviation and variance of the
distribution:
Score:      100   200   300   400   500
Frequency:   15    21    19    24    17
Solution: Grab your graphing calculator. Enter the scores and frequencies in lists, choose 1-Var Stats, and enter as grouped data. The population variance is 17967.7 and the population standard deviation is 134.0.
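The same grouped-data result can be cross-checked with a short Python sketch (not part of the original lesson) by expanding the frequency table into a full list of scores:

from statistics import pvariance, pstdev

scores = [100, 200, 300, 400, 500]
freqs = [15, 21, 19, 24, 17]

# Expand the frequency table into the full list of 96 individual scores.
data = [s for s, f in zip(scores, freqs) for _ in range(f)]

print(round(pvariance(data), 1))   # population variance, about 17967.7
print(round(pstdev(data), 1))      # population standard deviation, about 134.0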
2. Find, to the nearest tenth, the mean absolute deviation for the set
{2, 5, 7, 9, 1, 3, 4, 2, 6, 7, 11, 5, 8, 2, 4}.
Solution: Enter the data in a list. Be sure to have the calculator first determine the mean. The mean absolute deviation is 2.3.
For more detailed information on using the graphing calculator, follow the links provided
above.
Created by Donna Roberts
Copyright 1998-2012 http://regentsprep.org
Oswego City School District Regents Exam Prep Center
Practice with Central Tendency and Dispersion
Choose the best answer to the following questions.
Grab your calculator.
1. The table displays the frequency of scores on a twenty point quiz. The mean of the quiz scores is 18. Find the value of k in the table.
   Score:      15  16  17  18  19  20
   Frequency:   2   4   7  13   k   5
   Choose:  8    11    12

2. The table displays the frequency of scores on a 10 point quiz. Find the median of the scores.
   Score:      5   6   7   8   9   10
   Frequency:  1   5   8  14  12    7
   Choose:  7    8    9

3. The table displays the number of uncles of each student in a class of Algebra 2. Find the mean, median and mode of the uncles per student for this data set. Express answers to the nearest hundredth.
   Number of uncles:  0   1   2   3   4   5
   Frequency:         2   5   4   6  10   8
   Choose (answers are stated in the order mean, median, mode):
   3.17, 4, 4     3.17, 3, 4     3.18, 4, 4

4. The average amount earned by 110 juniors for a week was $35, while during the same week 90 seniors averaged $50. What were the average earnings for that week for the combined group?
   Choose:  $41.75    $43.50    $47.55

5. For the data set {5, 4, 2, 5, 9, 3, 4, 5, 3, 1, 6, 7, 5, 8, 3, 7}, find the interquartile range.
   Choose:  3    3.5    6.5

6. Find, to the nearest tenth, the standard deviation of the distribution:
   Score:       1   2   3   4   5
   Frequency:  14  15  14  17  10
   Choose:  1.3    1.4    2.9

7. If all of the data in a set were multiplied by 8, the variance of the new data set would be changed by a factor of ____.
   Choose:  4    8    16    64

8. If the five numbers {3, 4, 7, x, y} have a mean of 5 and a standard deviation of ___, find x and y given that y > x.
   Choose:  x = 0, y = 1    x = 0, y = 4    x = 0, y = 6    x = 5, y = 6
Sampling (statistics)
From Wikipedia, the free encyclopedia
Not to be confused with Sample (statistics).
For computer simulation, see pseudo-random number sampling.
[Figure: A visual representation of the sampling process.]
In statistics, quality assurance, and survey methodology, sampling is concerned with the
selection of a subset of individuals from within a statistical population to estimate characteristics
of the whole population. Each observation measures one or more properties (such as weight,
location, color) of observable bodies distinguished as independent objects or individuals.
In survey sampling, weights can be applied to the data to adjust for the sample design,
particularly stratified sampling. Results from probability theory and statistical theory are employed
to guide practice. In business and medical research, sampling is widely used for gathering
information about a population.[1]
The sampling process comprises several stages:
• Defining the population of concern
• Specifying a sampling frame, a set of items or events possible to measure
• Specifying a sampling method for selecting items or events from the frame
• Determining the sample size
• Implementing the sampling plan
• Sampling and data collecting
• Data which can be selected
Population definition
Successful statistical practice is based on focused problem definition. In sampling, this includes
defining the population from which our sample is drawn. A population can be defined as including
all people or items with the characteristic one wishes to understand. Because there is very rarely
enough time or money to gather information from everyone or everything in a population, the
goal becomes finding a representative sample (or subset) of that population.
Sometimes what defines a population is obvious. For example, a manufacturer needs to decide
whether a batch of material from production is of high enough quality to be released to the
customer, or should be sentenced for scrap or rework due to poor quality. In this case, the batch
is the population.
Although the population of interest often consists of physical objects, sometimes we need to
sample over time, space, or some combination of these dimensions. For instance, an
investigation of supermarket staffing could examine checkout line length at various times, or a
study on endangered penguins might aim to understand their usage of various hunting grounds
over time. For the time dimension, the focus may be on periods or discrete occasions.
In other cases, our 'population' may be even less tangible. For example, Joseph Jagger studied
the behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased
wheel. In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the
wheel (i.e. the probability distribution of its results over infinitely many trials), while his 'sample'
was formed from observed results from that wheel. Similar considerations arise when taking
repeated measurements of some physical characteristic such as the electrical
conductivity of copper.
This situation often arises when we seek knowledge about the cause system of which
the observed population is an outcome. In such cases, sampling theory may treat the observed
population as a sample from a larger 'superpopulation'. For example, a researcher might study
the success rate of a new 'quit smoking' program on a test group of 100 patients, in order to
predict the effects of the program if it were made available nationwide. Here the superpopulation
is "everybody in the country, given access to this treatment" – a group which does not yet exist,
since the program isn't yet available to all.
Note also that the population from which the sample is drawn may not be the same as the
population about which we actually want information. Often there is large but not complete
overlap between these two groups due to frame issues etc. (see below). Sometimes they may be
entirely separate – for instance, we might study rats in order to get a better understanding of
human health, or we might study records from people born in 2008 in order to make predictions
about people born in 2009.
Time spent in making the sampled population and population of concern precise is often well
spent, because it raises many issues, ambiguities and questions that would otherwise have been
overlooked at this stage.
Sampling frame
Main article: Sampling frame
In the most straightforward case, such as the sentencing of a batch of material from production
(acceptance sampling by lots), it is possible to identify and measure every single item in the
population and to include any one of them in our sample. However, in the more general case this
is not possible. There is no way to identify all rats in the set of all rats. Where voting is not
compulsory, there is no way to identify which people will actually vote at a forthcoming election
(in advance of the election). These imprecise populations are not amenable to sampling in any of
the ways below and to which we could apply statistical theory.
As a remedy, we seek a sampling frame which has the property that we can identify every single
element and include any in our sample.[2][3][4][5] The most straightforward type of frame is a list of
elements of the population (preferably the entire population) with appropriate contact information.
For example, in an opinion poll, possible sampling frames include an electoral register and
a telephone directory.
Probability and nonprobability sampling
Probability sampling
A probability sample is a sample in which every unit in the population has a chance (greater
than zero) of being selected in the sample, and this probability can be accurately determined.
The combination of these traits makes it possible to produce unbiased estimates of population
totals, by weighting sampled units according to their probability of selection.
Example: We want to estimate the total income of adults living in a given street. We visit each
household in that street, identify all adults living there, and randomly select one adult from each
household. (For example, we can allocate each person a random number, generated from
a uniform distribution between 0 and 1, and select the person with the highest number in each
household). We then interview the selected person and find their income.
People living on their own are certain to be selected, so we simply add their income to our
estimate of the total. But a person living in a household of two adults has only a one-in-two
chance of selection. To reflect this, when we come to such a household, we would count the
selected person's income twice towards the total. (The person who is selected from that
household can be loosely viewed as also representing the person who isn't selected.)
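The weighting logic in this example amounts to multiplying each selected person's income by the inverse of their selection probability (their household size). A minimal Python sketch, with invented household data, looks like this:

# Estimate the street's total income by weighting each sampled adult
# by 1 / (probability of selection) = household size.
sampled = [
    (1, 30000),   # lives alone: selected with probability 1
    (2, 42000),   # one of two adults: selected with probability 1/2
    (3, 25000),   # one of three adults: selected with probability 1/3
]

estimated_total = sum(size * income for size, income in sampled)
print(estimated_total)   # 30000 + 2*42000 + 3*25000 = 189000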
In the above example, not everybody has the same probability of selection; what makes it a
probability sample is the fact that each person's probability is known. When every element in the
population does have the same probability of selection, this is known as an 'equal probability of
selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all
sampled units are given the same weight.
Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified
Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These
various ways of probability sampling have two things in common:
1. Every element has a known nonzero probability of being sampled, and
2. random selection is involved at some point.
Nonprobability sampling
Main article: Nonprobability sampling
Nonprobability sampling is any sampling method where some elements of the population
have no chance of selection (these are sometimes referred to as 'out of
coverage'/'undercovered'), or where the probability of selection can't be accurately determined. It
involves the selection of elements based on assumptions regarding the population of interest,
which forms the criteria for selection. Hence, because the selection of elements is nonrandom,
nonprobability sampling does not allow the estimation of sampling errors. These conditions give
rise to exclusion bias, placing limits on how much information a sample can provide about the
population. Information about the relationship between sample and population is limited, making
it difficult to extrapolate from the sample to the population.
Example: We visit every household in a given street, and interview the first person to answer the
door. In any household with more than one occupant, this is a nonprobability sample, because
some people are more likely to answer the door (e.g. an unemployed person who spends most of
their time at home is more likely to answer than an employed housemate who might be at work
when the interviewer calls) and it's not practical to calculate these probabilities.
Nonprobability sampling methods include convenience sampling, quota sampling and purposive
sampling. In addition, nonresponse effects may turn any probability design into a nonprobability
design if the characteristics of nonresponse are not well understood, since nonresponse
effectively modifies each element's probability of being sampled.
Sampling methods
Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination. Factors commonly influencing the choice between these designs include:
• Nature and quality of the frame
• Availability of auxiliary information about units on the frame
• Accuracy requirements, and the need to measure accuracy
• Whether detailed analysis of the sample is expected
• Cost/operational concerns
Simple random sampling
Main article: Simple random sampling
[Figure: A visual representation of selecting a simple random sample]
In a simple random sample (SRS) of a given size, all subsets of the frame of that size are given an
equal probability of selection. Furthermore, any given pair of elements has the same chance of selection as
any other such pair (and similarly for triples, and so on). This minimises bias and simplifies
analysis of results. In particular, the variance between individual results within the sample is a
good indicator of variance in the overall population, which makes it relatively easy to estimate the
accuracy of results.
SRS can be vulnerable to sampling error because the randomness of the selection may result in
a sample that doesn't reflect the makeup of the population. For instance, a simple random
sample of ten people from a given country will on average produce five men and five women, but
any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and
stratified techniques attempt to overcome this problem by "using information about the
population" to choose a more "representative" sample.
SRS may also be cumbersome and tedious when sampling from an unusually large target
population. In some cases, investigators are interested in "research questions specific" to
subgroups of the population. For example, researchers might be interested in examining whether
cognitive ability as a predictor of job performance is equally applicable across racial groups. SRS
cannot accommodate the needs of researchers in this situation because it does not provide
subsamples of the population. "Stratified sampling" addresses this weakness of SRS.
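As an illustrative sketch (not from the article), drawing a simple random sample from a frame takes one call to Python's standard library:

import random

frame = list(range(1, 1001))       # hypothetical sampling frame of 1000 numbered units

srs = random.sample(frame, k=10)   # without replacement; every subset of size 10 is equally likely
print(sorted(srs))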
Systematic sampling
Main article: Systematic sampling
[Figure: A visual representation of selecting a random sample using the systematic sampling technique]
Systematic sampling (also known as interval sampling) relies on arranging the study population
according to some ordering scheme and then selecting elements at regular intervals through that
ordered list. Systematic sampling involves a random start and then proceeds with the selection of
every kth element from then onwards. In this case, k = (population size/sample size). It is important
that the starting point is not automatically the first in the list, but is instead randomly chosen from
within the first to the kth element in the list. A simple example would be to select every 10th
name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a
skip of 10').
As long as the starting point is randomized, systematic sampling is a type of probability sampling.
It is easy to implement and the stratification induced can make it efficient, if the variable by which
the list is ordered is correlated with the variable of interest. 'Every 10th' sampling is especially
useful for efficient sampling from databases.
For example, suppose we wish to sample people from a long street that starts in a poor area
(house No. 1) and ends in an expensive district (house No. 1000). A simple random selection of
addresses from this street could easily end up with too many from the high end and too few from
the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th
street number along the street ensures that the sample is spread evenly along the length of the
street, representing all of these districts. (Note that if we always start at house #1 and end at
#991, the sample is slightly biased towards the low end; by randomly selecting the start between
#1 and #10, this bias is eliminated.)
However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is
present and the period is a multiple or factor of the interval used, the sample is especially likely to
be unrepresentative of the overall population, making the scheme less accurate than simple
random sampling.
For example, consider a street where the odd-numbered houses are all on the north (expensive)
side of the road, and the even-numbered houses are all on the south (cheap) side. Under the
sampling scheme given above, it is impossible to get a representative sample; either the houses
sampled will all be from the odd-numbered, expensive side, or they will all be from the even-numbered, cheap side, unless the researcher has previous knowledge of this bias and avoids it
by using a skip which ensures jumping between the two sides (any odd-numbered skip).
Another drawback of systematic sampling is that even in scenarios where it is more accurate
than SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two
examples of systematic sampling that are given above, much of the potential sampling error is
due to variation between neighbouring houses – but because this method never selects two
neighbouring houses, the sample will not give us any information on that variation.)
As described above, systematic sampling is an EPS method, because all elements have the
same probability of selection (in the example given, one in ten). It is not 'simple random sampling'
because different subsets of the same size have different selection probabilities – e.g. the set
{4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero
probability of selection.
Systematic sampling can also be adapted to a non-EPS approach; for an example, see
discussion of PPS samples below.
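A minimal sketch of the basic procedure, assuming a hypothetical ordered frame of 1000 addresses (illustrative code, not from the article):

import random

frame = list(range(1, 1001))    # hypothetical ordered list of 1000 addresses
sample_size = 100
k = len(frame) // sample_size   # sampling interval: every kth element

start = random.randint(1, k)    # random start within the first k elements
systematic_sample = frame[start - 1::k]

print(len(systematic_sample), systematic_sample[:5])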
Stratified sampling
Main article: Stratified sampling
[Figure: A visual representation of selecting a random sample using the stratified sampling technique]
Where the population embraces a number of distinct categories, the frame can be organized by
these categories into separate "strata." Each stratum is then sampled as an independent subpopulation, out of which individual elements can be randomly selected.[2] There are several
potential benefits to stratified sampling.
First, dividing the population into distinct, independent strata can enable researchers to draw
inferences about specific subgroups that may be lost in a more generalized random sample.
Second, utilizing a stratified sampling method can lead to more efficient statistical estimates
(provided that strata are selected based upon relevance to the criterion in question, instead of
availability of the samples). Even if a stratified sampling approach does not lead to increased
statistical efficiency, such a tactic will not result in less efficiency than would simple random
sampling, provided that each stratum is proportional to the group's size in the population.
Third, it is sometimes the case that data are more readily available for individual, pre-existing
strata within a population than for the overall population; in such cases, using a stratified
sampling approach may be more convenient than aggregating data across groups (though this
may potentially be at odds with the previously noted importance of utilizing criterion-relevant
strata).
Finally, since each stratum is treated as an independent population, different sampling
approaches can be applied to different strata, potentially enabling researchers to use the
approach best suited (or most cost-effective) for each identified subgroup within the population.
There are, however, some potential drawbacks to using stratified sampling. First, identifying
strata and implementing such an approach can increase the cost and complexity of sample
selection, as well as leading to increased complexity of population estimates. Second, when
examining multiple criteria, stratifying variables may be related to some, but not to others, further
complicating the design, and potentially reducing the utility of the strata. Finally, in some cases
(such as designs with a large number of strata, or those with a specified minimum sample size
per group), stratified sampling can potentially require a larger sample than would other methods
(although in most cases, the required sample size would be no larger than would be required for simple random sampling).
A stratified sampling approach is most effective when three conditions are met:
1. Variability within strata is minimized
2. Variability between strata is maximized
3. The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
Advantages over other sampling methods
1. Focuses on important subpopulations and ignores irrelevant ones.
2. Allows use of different sampling techniques for different subpopulations.
3. Improves the accuracy/efficiency of estimation.
4. Permits greater balancing of statistical power of tests of differences between strata by sampling equal numbers from strata varying widely in size.
Disadvantages
1. Requires selection of relevant stratification variables which can be difficult.
2. Is not useful when there are no homogeneous subgroups.
3. Can be expensive to implement.
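As an illustrative sketch of proportional allocation (the strata, sizes and sample size below are invented), each stratum is sampled independently and in proportion to its share of the population:

import random

# Hypothetical frame grouped into two strata, e.g. by region.
strata = {
    "urban": list(range(0, 800)),      # 800 units
    "rural": list(range(800, 1000)),   # 200 units
}
total = sum(len(units) for units in strata.values())
sample_size = 50

stratified_sample = {}
for name, units in strata.items():
    # Proportional allocation: the stratum's share of the sample matches its share of the population.
    n_h = round(sample_size * len(units) / total)
    stratified_sample[name] = random.sample(units, n_h)

print({name: len(s) for name, s in stratified_sample.items()})   # {'urban': 40, 'rural': 10}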
Poststratification
Stratification is sometimes introduced after the sampling phase in a process called
"poststratification".[2] This approach is typically implemented due to a lack of prior knowledge of
an appropriate stratifying variable or when the experimenter lacks the necessary information to
create a stratifying variable during the sampling phase. Although the method is susceptible to the
pitfalls of post hoc approaches, it can provide several benefits in the right situation.
Implementation usually follows a simple random sample. In addition to allowing for stratification
on an ancillary variable, poststratification can be used to implement weighting, which can
improve the precision of a sample's estimates.[2]
Oversampling
Choice-based sampling is one of the stratified sampling strategies. In choice-based
sampling,[6] the data are stratified on the target and a sample is taken from each stratum so that
the rare target class will be more represented in the sample. The model is then built on
this biased sample. The effects of the input variables on the target are often estimated with more
precision with the choice-based sample even when a smaller overall sample size is taken,
compared to a random sample. The results usually must be adjusted to correct for the
oversampling.
Probability-proportional-to-size sampling
In some cases the sample designer has access to an "auxiliary variable" or "size measure",
believed to be correlated to the variable of interest, for each element in the population. These
data can be used to improve accuracy in sample design. One option is to use the auxiliary
variable as a basis for stratification, as discussed above.
Another option is probability proportional to size ('PPS') sampling, in which the selection
probability for each element is set to be proportional to its size measure, up to a maximum of 1.
In a simple PPS design, these selection probabilities can then be used as the basis for Poisson
sampling. However, this has the drawback of variable sample size, and different portions of the
population may still be over- or under-represented due to chance variation in selections.
Systematic sampling theory can be used to create a probability proportionate to size sample.
This is done by treating each count within the size variable as a single sampling unit. Samples
are then identified by selecting at even intervals among these counts within the size variable.
This method is sometimes called PPS-sequential or monetary unit sampling in the case of audits
or forensic sampling.
Example: Suppose we have six schools with populations of 150, 180, 200, 220, 260, and 490
students respectively (total 1500 students), and we want to use student population as the basis
for a PPS sample of size three. To do this, we could allocate the first school numbers 1 to 150,
the second school 151 to 330 (= 150 + 180), the third school 331 to 530, and so on to the last
school (1011 to 1500). We then generate a random start between 1 and 500 (equal to 1500/3)
and count through the school populations by multiples of 500. If our random start was 137, we
would select the schools which have been allocated numbers 137, 637, and 1137, i.e. the first,
fourth, and sixth schools.
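The school example can be reproduced with a short systematic-PPS sketch (the school sizes come from the example above; the code itself is only illustrative):

import random

sizes = [150, 180, 200, 220, 260, 490]   # school populations from the example (total 1500)
n = 3                                     # number of schools to select
interval = sum(sizes) // n                # 500

# Cumulative totals: school i covers the numbers (cum[i-1], cum[i]].
cum = []
running = 0
for s in sizes:
    running += s
    cum.append(running)

start = random.randint(1, interval)       # the article's example uses 137
selection_numbers = [start + i * interval for i in range(n)]

selected = [next(i + 1 for i, c in enumerate(cum) if number <= c) for number in selection_numbers]
print(selection_numbers, selected)        # with start 137: [137, 637, 1137] -> schools [1, 4, 6]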
The PPS approach can improve accuracy for a given sample size by concentrating sample on
large elements that have the greatest impact on population estimates. PPS sampling is
commonly used for surveys of businesses, where element size varies greatly and auxiliary
information is often available—for instance, a survey attempting to measure the number of guest-nights spent in hotels might use each hotel's number of rooms as an auxiliary variable. In some
cases, an older measurement of the variable of interest can be used as an auxiliary variable
when attempting to produce more current estimates.[7]
Cluster sampling
[Figure: A visual representation of selecting a random sample using the cluster sampling technique]
Sometimes it is more cost-effective to select respondents in groups ('clusters'). Sampling is often
clustered by geography, or by time periods. (Nearly all samples are in some sense 'clustered' in
time – although this is rarely taken into account in the analysis.) For instance, if surveying
households within a city, we might choose to select 100 city blocks and then interview every
household within the selected blocks.
Clustering can reduce travel and administrative costs. In the example above, an interviewer can
make a single trip to visit several households in one block, rather than having to drive to a
different block for each household.
It also means that one does not need a sampling frame listing all elements in the target
population. Instead, clusters can be chosen from a cluster-level frame, with an element-level
frame created only for the selected clusters. In the example above, the sample only requires a
block-level city map for initial selections, and then a household-level map of the 100 selected
blocks, rather than a household-level map of the whole city.
Cluster sampling (also known as clustered sampling) generally increases the variability of sample
estimates above that of simple random sampling, depending on how the clusters differ between
themselves, as compared with the within-cluster variation. For this reason, cluster sampling
requires a larger sample than SRS to achieve the same level of accuracy – but cost savings from
clustering might still make this a cheaper option.
Cluster sampling is commonly implemented as multistage sampling. This is a complex form of
cluster sampling in which two or more levels of units are embedded one in the other. The first
stage consists of constructing the clusters that will be used to sample from. In the second stage,
a sample of primary units is randomly selected from each cluster (rather than using all units
contained in all selected clusters). In following stages, in each of those selected clusters,
additional samples of units are selected, and so on. All ultimate units (individuals, for instance)
selected at the last step of this procedure are then surveyed. This technique, thus, is essentially
the process of taking random subsamples of preceding random samples.
Multistage sampling can substantially reduce sampling costs, where the complete population list
would need to be constructed (before other sampling methods could be applied). By eliminating
the work involved in describing clusters that are not selected, multistage sampling can reduce the
large costs associated with traditional cluster sampling.[7] However, each sample may not be a full
representative of the whole population.
Quota sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as
in stratified sampling. Then judgement is used to select the subjects or units from each segment
based on a specified proportion. For example, an interviewer may be told to sample 200 females
and 300 males between the age of 45 and 60.
It is this second step which makes the technique one of non-probability sampling. In quota
sampling the selection of the sample is non-random. For example, interviewers might be tempted
to interview those who look most helpful. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element is its greatest weakness,
and quota versus probability has been a matter of controversy for several years.
Minimax sampling
In imbalanced datasets, where the sampling ratio does not follow the population statistics, one can resample the dataset in a conservative manner called minimax sampling.[8] Minimax sampling has its origin in the Anderson minimax ratio, whose value is proved to be 0.5: in a binary classification, the class-sample sizes should be chosen equally.[9] This ratio can be proved to be the minimax ratio only under the assumption of an LDA classifier with Gaussian distributions.[9] The notion of minimax sampling has recently been developed for a general class of classification rules, called class-wise smart classifiers. In this case, the sampling ratio of classes is selected so that the worst-case classifier error over all possible population statistics for class prior probabilities is minimized.
Accidental sampling
Accidental sampling (sometimes known as grab, convenience or opportunity sampling) is a
type of nonprobability sampling which involves the sample being drawn from that part of the
population which is close to hand. That is, a population is selected because it is readily available
and convenient. It may be through meeting the person or including a person in the sample when
one meets them or chosen by finding them through technological means such as the internet or
through phone. The researcher using such a sample cannot scientifically make generalizations
about the total population from this sample because it would not be representative enough. For
example, if the interviewer were to conduct such a survey at a shopping center early in the
morning on a given day, the people that he/she could interview would be limited to those present
there at that given time, which would not represent the views of other members of society in such
an area as would be captured if the survey were conducted at different times of day and several times per week.
This type of sampling is most useful for pilot testing. Several important considerations for
researchers using convenience samples include:
1. Are there controls within the research design or experiment which can serve to lessen
the impact of a non-random convenience sample, thereby ensuring the results will be
more representative of the population?
2. Is there good reason to believe that a particular convenience sample would or should
respond or behave differently than a random sample from the same population?
3. Is the question being asked by the research one that can adequately be answered using
a convenience sample?
In social science research, snowball sampling is a similar technique, where existing study
subjects are used to recruit more subjects into the sample. Some variants of snowball sampling,
such as respondent driven sampling, allow calculation of selection probabilities and are
probability sampling methods under certain conditions.
Line-intercept sampling
Line-intercept sampling is a method of sampling elements in a region whereby an element is
sampled if a chosen line segment, called a "transect", intersects the element.
Panel sampling
Panel sampling is the method of first selecting a group of participants through a random
sampling method and then asking that group for (potentially the same) information several times
over a period of time. Therefore, each participant is interviewed at two or more time points; each
period of data collection is called a "wave". The method was developed by sociologist Paul
Lazarsfeld in 1938 as a means of studying political campaigns.[10] This longitudinal sampling method allows estimates of changes in the population, for example with regard to chronic illness,
job stress, or weekly food expenditures. Panel sampling can also be used to inform
researchers about within-person health changes due to age or to help explain changes in
continuous dependent variables such as spousal interaction.[11] There have been several
proposed methods of analyzing panel data, including MANOVA, growth curves, and structural
equation modeling with lagged effects.
Snowball sampling
Snowball sampling involves finding a small group of initial respondents and using them to recruit
more respondents. It is particularly useful in cases where the population is hidden or difficult to
enumerate.
Theoretical sampling
Theoretical sampling[12] occurs when samples are selected on the basis of the results of the data collected so far, with a goal of developing a deeper understanding of the area or of developing theory.
Replacement of selected units
Sampling schemes may be without replacement ('WOR'—no element can be selected more than
once in the same sample) or with replacement ('WR'—an element may appear multiple times in
the one sample). For example, if we catch fish, measure them, and immediately return them to
the water before continuing with the sample, this is a WR design, because we might end up
catching and measuring the same fish more than once. However, if we do not return the fish to
the water (e.g., if we eat the fish), this becomes a WOR design.
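In Python's standard library the two designs correspond to two different functions; a quick illustrative sketch with a hypothetical pond of 20 fish:

import random

pond = ["fish_%d" % i for i in range(1, 21)]   # hypothetical population of 20 fish

wor = random.sample(pond, k=5)    # without replacement: no fish can appear twice
wr = random.choices(pond, k=5)    # with replacement: the same fish may be 'caught' again

print(wor)
print(wr)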
Sample size
Main article: Sample size
Formulas, tables, and power function charts are well-known approaches to determining sample size.
Steps for using sample size tables
1. Postulate the effect size of interest, α, and β.
2. Check sample size table:[13]
   1. Select the table corresponding to the selected α.
   2. Locate the row corresponding to the desired power.
   3. Locate the column corresponding to the estimated effect size.
   4. The intersection of the column and row is the minimum sample size required.
Sampling and data collection
Good data collection involves:
• Following the defined sampling process
• Keeping the data in time order
• Noting comments and other contextual events
• Recording non-responses
Applications of Sampling
Sampling enables the selection of right data points from within the larger data set to estimate the
characteristics of the whole population. For example, there are about 600 million tweets
produced every day. Is it necessary to look at all of them to determine the topics that are
discussed during the day? Is it necessary to look at all the tweets to determine the sentiment on
each of the topics? In manufacturing, different types of sensory data such as acoustics, vibration,
pressure, current, voltage and controller data are available at short time intervals. To predict
down-time it may not be necessary to look at all the data but a sample may be sufficient.
A theoretical formulation for sampling Twitter data has been developed.[14]
Errors in sample surveys
Main article: Sampling error
Survey results are typically subject to some error. Total errors can be classified into sampling
errors and non-sampling errors. The term "error" here includes systematic biases as well as
random errors.
Sampling errors and biases
Sampling errors and biases are induced by the sample design. They include:
1. Selection bias: When the true selection probabilities differ from those assumed in
calculating the results.
2. Random sampling error: Random variation in the results due to the elements in the
sample being selected at random.
Non-sampling error
Non-sampling errors are other errors which can impact the final survey estimates, caused by
problems in data collection, processing, or sample design. They include:
1. Over-coverage: Inclusion of data from outside of the population.
2. Under-coverage: Sampling frame does not include elements in the population.
3. Measurement error: e.g. when respondents misunderstand a question, or find it difficult
to answer.
4. Processing error: Mistakes in data coding.
5. Non-response: Failure to obtain complete data from all selected individuals.
After sampling, a review should be held of the exact process followed in sampling, rather than
that intended, in order to study any effects that any divergences might have on subsequent
analysis. A particular problem is that of non-response.
Two major types of non-response exist: unit nonresponse (referring to lack of completion of any
part of the survey) and item non-response (submission or participation in survey but failing to
complete one or more components/questions of the survey).[15][16] In survey sampling, many of the
individuals identified as part of the sample may be unwilling to participate, not have the time to
participate (opportunity cost),[17] or survey administrators may not have been able to contact
them. In this case, there is a risk of differences, between respondents and nonrespondents,
leading to biased estimates of population parameters. This is often addressed by improving
survey design, offering incentives, and conducting follow-up studies which make a repeated
attempt to contact the unresponsive and to characterize their similarities and differences with the
rest of the frame.[18] The effects can also be mitigated by weighting the data when population
benchmarks are available or by imputing data based on answers to other questions.
Nonresponse is particularly a problem in internet sampling. Reasons for this problem include
improperly designed surveys,[16] over-surveying (or survey fatigue),[11][19] and the fact that potential
participants hold multiple e-mail addresses, which they don't use anymore or don't check
regularly.
Survey weights
In many situations the sample fraction may be varied by stratum and data will have to be
weighted to correctly represent the population. Thus for example, a simple random sample of
individuals in the United Kingdom might include some in remote Scottish islands who would be
inordinately expensive to sample. A cheaper method would be to use a stratified sample with
urban and rural strata. The rural sample could be under-represented in the sample, but weighted
up appropriately in the analysis to compensate.
More generally, data should usually be weighted if the sample design does not give each
individual an equal chance of being selected. For instance, when households have equal
selection probabilities but one person is interviewed from within each household, this gives
people from large households a smaller chance of being interviewed. This can be accounted for
using survey weights. Similarly, households with more than one telephone line have a greater
chance of being selected in a random digit dialing sample, and weights can adjust for this.
Weights can also serve other purposes, such as helping to correct for non-response.
Methods of producing random samples
• Random number table
• Mathematical algorithms for pseudo-random number generators
• Physical randomization devices such as coins, playing cards or sophisticated devices such as ERNIE
History
Random sampling by using lots is an old idea, mentioned several times in the Bible. In 1786
Pierre Simon Laplace estimated the population of France by using a sample, along with ratio
estimator. He also computed probabilistic estimates of the error. These were not expressed as
modern confidence intervals but as the sample size that would be needed to achieve a particular
upper bound on the sampling error with probability 1000/1001. His estimates used Bayes'
theorem with a uniform prior probability and assumed that his sample was random. Alexander
Ivanovich Chuprov introduced sample surveys to Imperial Russia in the 1870s.[citation needed]
In the USA the 1936 Literary Digest prediction of a Republican win in the presidential
election went badly awry, due to severe bias [1]. More than two million people responded to the
study with their names obtained through magazine subscription lists and telephone directories. It
was not appreciated that these lists were heavily biased towards Republicans and the resulting
sample, though very large, was deeply flawed.[20][21]
See also
• Data collection
• Gy's sampling theory
• Horvitz–Thompson estimator
• Official statistics
• Ratio estimator
• Replication (statistics)
• Sampling (case studies)
• Sampling error
• Random-sampling mechanism
Notes
The textbook by Groves et alia provides an overview of survey methodology, including recent literature on questionnaire development (informed by cognitive psychology):
• Robert Groves, et alia. Survey methodology (2010). Second edition of the (2004) first edition. ISBN 0-471-48348-6.
The other books focus on the statistical theory of survey sampling and require some knowledge of basic statistics, as discussed in the following textbooks:
• David S. Moore and George P. McCabe (February 2005). Introduction to the practice of statistics (5th edition). W.H. Freeman & Company. ISBN 0-7167-6282-X.
• Freedman, David; Pisani, Robert; Purves, Roger (2007). Statistics (4th ed.). New York: Norton. ISBN 0-393-92972-8.
The elementary book by Scheaffer et alia uses quadratic equations from high-school algebra:
• Scheaffer, Richard L., William Mendenhal and R. Lyman Ott. Elementary survey sampling, Fifth Edition. Belmont: Duxbury Press, 1996.
More mathematical statistics is required for Lohr, for Särndal et alia, and for Cochran (classic):
• Cochran, William G. (1977). Sampling techniques (Third ed.). Wiley. ISBN 0-471-16240-X.
• Lohr, Sharon L. (1999). Sampling: Design and analysis. Duxbury. ISBN 0-534-35361-4.
• Särndal, Carl-Erik, and Swensson, Bengt, and Wretman, Jan (1992). Model assisted survey sampling. Springer-Verlag. ISBN 0-387-40620-4.
The historically important books by Deming and Kish remain valuable for insights for social scientists (particularly about the U.S. census and the Institute for Social Research at the University of Michigan):
• Deming, W. Edwards (1966). Some Theory of Sampling. Dover Publications. ISBN 0-486-64684-X. OCLC 166526.
• Kish, Leslie (1995). Survey Sampling. Wiley. ISBN 0-471-10949-5.
References
1. Salant, Priscilla, I. Dillman, and A. Don. How to conduct your own survey. No. 300.723 S3. 1994.
2. Robert M. Groves; et al. Survey methodology. ISBN 0470465468.
3. Lohr, Sharon L. Sampling: Design and analysis.
4. Särndal, Carl-Erik, and Swensson, Bengt, and Wretman, Jan. Model Assisted Survey Sampling.
5. Scheaffer, Richard L., William Mendenhal and R. Lyman Ott. Elementary survey sampling.
6. Scott, A.J.; Wild, C.J. (1986). "Fitting logistic models under case-control or choice-based sampling". Journal of the Royal Statistical Society, Series B 48: 170–182. JSTOR 2345712.
7. Lohr, Sharon L. Sampling: Design and Analysis; Särndal, Carl-Erik, and Swensson, Bengt, and Wretman, Jan. Model Assisted Survey Sampling.
8. Shahrokh Esfahani, Mohammad; Dougherty, Edward (2014). "Effect of separate sampling on classification accuracy". Bioinformatics 30 (2): 242–250. doi:10.1093/bioinformatics/btt662.
9. Anderson, Theodore (1951). "Classification by multivariate analysis". Psychometrika 16 (1): 31–50. doi:10.1007/bf02313425.
10. Lazarsfeld, P., & Fiske, M. (1938). The "panel" as a new tool for measuring opinion. The Public Opinion Quarterly, 2(4), 596–612.
11. Groves, et alia. Survey Methodology.
12. "Examples of sampling methods" (PDF).
13. Cohen, 1988.
14. Deepan Palguna, Vikas Joshi, Venkatesan Chakaravarthy, Ravi Kothari and L. V. Subramaniam (2015). Analysis of Sampling Algorithms for Twitter. International Joint Conference on Artificial Intelligence.
15. Berinsky, A. J. (2008). Survey non-response. In W. Donsbach & M. W. Traugott (Eds.), The SAGE handbook of public opinion research (pp. 309–321). Thousand Oaks, CA: Sage Publications.
16. Dillman, D. A., Eltinge, J. L., Groves, R. M., & Little, R. J. A. (2002). Survey nonresponse in design, data collection, and analysis. In R. M. Groves, D. A. Dillman, J. L. Eltinge, & R. J. A. Little (Eds.), Survey nonresponse (pp. 3–26). New York: John Wiley & Sons.
17. Dillman, D.A., Smyth, J.D., & Christian, L. M. (2009). Internet, mail, and mixed-mode surveys: The tailored design method. San Francisco: Jossey-Bass.
18. Vehovar, V., Batagelj, Z., Manfreda, K.L., & Zaletel, M. (2002). Nonresponse in web surveys. In R. M. Groves, D. A. Dillman, J. L. Eltinge, & R. J. A. Little (Eds.), Survey nonresponse (pp. 229–242). New York: John Wiley & Sons.
19. Porter, Whitcomb, Weitzer (2004). Multiple surveys of students and survey fatigue. In S. R. Porter (Ed.), Overcoming survey research problems: Vol. 121. New directions for institutional research (pp. 63–74). San Francisco, CA: Jossey-Bass.
20. David S. Moore and George P. McCabe. Introduction to the Practice of Statistics.
21. Freedman, David; Pisani, Robert; Purves, Roger. Statistics.
Standards
ISO
• ISO 2859 series
• ISO 3951 series
ASTM
• ASTM E105 Standard Practice for Probability Sampling Of Materials
• ASTM E122 Standard Practice for Calculating Sample Size to Estimate, With a Specified Tolerable Error, the Average for Characteristic of a Lot or Process
• ASTM E141 Standard Practice for Acceptance of Evidence Based on the Results of Probability Sampling
• ASTM E1402 Standard Terminology Relating to Sampling
• ASTM E1994 Standard Practice for Use of Process Oriented AOQL and LTPD Sampling Plans
• ASTM E2234 Standard Practice for Sampling a Stream of Product by Attributes Indexed by AQL
ANSI, ASQ
• ANSI/ASQ Z1.4
U.S. federal and military standards
• MIL-STD-105
• MIL-STD-1916
3. THE PRINCIPAL STEPS IN A SAMPLE SURVEY
As a preliminary to a discussion of the role that theory plays in a sample survey, it is useful to
describe briefly the steps involved in the planning and execution of a survey.
The principal steps in a survey are grouped somewhat arbitrarily under 11 headings.
3.1 Objectives of the survey
The first step in planning a sample survey is to clearly identify the general objectives of the survey.
Without a lucid statement of the objectives, it is easy in a complex survey to forget the objectives
when engrossed in the details of planning, and to make decisions that are at variance with the
objectives.
One of the principal choices is between estimating average values (population means) and total values.
Depending on this choice, the techniques for determining the optimal sample size and the estimators
are different.
A number of measures exist that have been used by various agencies to measure the economic
significance of fisheries to the regional economy. In addition, a number of performance indicators also
exist that can be used to assess the performance of fisheries management in achieving its economic
objectives (see chapter 1 and related annexes).
3.2 Population to be sampled
The word population is used to denote the aggregate from which the sample is chosen. The definition
of the population may present some problems in the fishing sector, as it should consider the complete
list of vessels and their physical and technical characteristics.
The population to be sampled (the sampled population) should coincide with the population about
which information is wanted (the target population). Sometimes, for reasons of practicability or
convenience, the sampled population is more restricted than the target population. If so, it should be
remembered that conclusions drawn from the sample apply to the sampled population. Judgement
about the extent to which these conclusions will also apply to the target population must depend on
other sources of information. Any supplementary information that can be gathered about the nature of
the differences between sampled and target population may be helpful.
For example, let us consider the Italian statistical sampling design for the estimation of “quantity and
average price of fishery products landed each calendar month in Italy by Community and EFTA
vessels” (Reg. CE n. 1382/91, modified by Reg. CE n. 2104/93). The aim of the survey is to estimate total
catches and average prices for individual species. Therefore, the sampling basis consists of the more
than 800 landing points spread over the 8 000 km of Italian coasts. It is not however feasible to
consider the list of the landing points as the list of elementary units. To overcome these difficulties, a
sampled population, distinct from the target population but including units in which the considered
phenomenon takes place, has been considered. In summary, the elementary units considered are the
landings of the vessels belonging to the sampled fleet. Thus, the list from which the sampling units
are extracted is constituted by all the vessels belonging to the Italian fishery fleet.
3.3 Data to be collected
It is well to verify that all the data are relevant to the purposes of the survey and that no essential data
are omitted. There is frequently a tendency to ask too many questions, some of which are never
subsequently analysed. An overlong questionnaire lowers the quality of the answers to important as
well as unimportant questions.
3.4 Degree of precision desired
The results of sample surveys are always subject to some uncertainty because only part of the
population has been measured and because of errors of measurement. This uncertainty can be
reduced by taking larger samples and by using superior instruments of measurement. But this usually
costs time and money. Consequently, the specification of the degree of precision wanted in the
results is an important step. This step is the responsibility of the person who is going to use the data.
It may present difficulties, since many administrators are unaccustomed to thinking in terms of the
amount of error that can be tolerated in estimates, consistent with making good decisions. The
statistician can often help at this stage.
3.5 The questionnaire and the choice of the data collectors
There may be a choice of measuring instrument and of method of approach to the population. The
survey may employ a self-administered questionnaire, an interviewer who reads a standard set of
questions with no discretion, or an interviewing process that allows much latitude in the form and
ordering of the questions. The approach may be by mail, by telephone, by personal visit, or by a
combination of the three. Much study has been made of interviewing methods and problems.
A major part of the preliminary work is the construction of record forms on which the questions and
answers are to be entered. With simple questionnaires, the answers can sometimes be pre-coded,
that is, entered in a manner in which they can be routinely transferred to mechanical equipment. In
fact, for the construction of good record forms, it is necessary to visualise the structure of the final
summary tables that will be used for drawing conclusions.
Information may be collected using a number of different survey methods. These include personal
interview, telephone interview or postal survey. The questionnaire design needs to vary based on the
approach taken.
Personal interviews involve visiting the individual from whom data are to be collected. The
interviewer controls the questionnaire, and fills in the required data. The questionnaire can be less
detailed in terms of explanatory information as the interviewer can be trained on its completion before
starting the interview process. This type of survey is best for long, complex surveys and it allows the
interviewer and fisher to agree a time convenient for both parties. It is particularly useful when the
respondent may have to go and find information such as accounts, log book records etc. The
personal interview approach also allows the interviewer to probe more fully if he/she feels that the
fisher has misunderstood a question, or information provided conflicts with other earlier statements.
Data collectors are usually external to the phenomenon that is being examined and, moreover, they
are often part of some public structure, in order to avoid possible influences due to personal interests.
However, on the basis of the experience acquired in this field by Irepa, it has been demonstrated
(Istat, Irepa 2000) that it is essential to have data collectors belonging to the fishery productive chain
in order to obtain correct and timely data. Therefore, data collectors should belong to the productive
or management fishery sectors.
During meetings on socio-economic indicators, the partners involved presented several questionnaires.
These questionnaires are designed to collect the information required to calculate the socio-economic
indicators, and some of them are reported in appendix C.
3.6 Selection of the sample design
There is a variety of plans by which the sample may be selected (simple random sample, stratified
random sample, two-stage sampling, etc.). For each plan that is considered, rough estimates of the
size of sample can be made from a knowledge of the degree of precision desired. The relative costs
and time involved for each plan are also compared before making a decision.
3.7 Sampling units
Sample units have to be drawn according to the sample design.
To draw sample units from the population, several methods can be used, depending on the type of
the chosen sample strategy:
 sample with equal probabilities
 sample with probabilities proportional to size (PPS).
In the first case, each unit of the population has the same probability of being included in the sample, while
in the case of a PPS sample each unit has a different inclusion probability, proportional to the
measure Pi = Xi / Xh, where i is a generic vessel, h is the stratum, Xi is a size parameter for vessel i
(for example its overall length) and Xh is the total of that size parameter over the stratum.
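For illustration only, here is a minimal Python sketch of PPS-style selection within a single stratum; the vessel identifiers and lengths are invented, the draw is with replacement for simplicity, and a real PPS design (e.g. systematic PPS) would also need to track joint inclusion probabilities.

    import random

    # Hypothetical stratum of vessels with their overall lengths (the size measure Xi).
    lengths = {"V01": 12.0, "V02": 18.5, "V03": 7.5, "V04": 24.0, "V05": 15.0}

    total = sum(lengths.values())                        # Xh, the stratum total of the size measure
    probs = {v: x / total for v, x in lengths.items()}   # Pi = Xi / Xh

    # Draw 2 vessels with probability proportional to size (with replacement, for simplicity).
    sample = random.choices(list(lengths), weights=list(lengths.values()), k=2)
    print(probs, sample)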
3.8 The pre-test
It has been found useful to try out the questionnaire and the field methods on a small scale. This
nearly always results in improvements in the questionnaire and may reveal other troubles that will be
serious on a large scale, for example, that the cost will be much greater than expected.
3.9 Organization of the field work
In a survey, many problems of business administration are met. The personnel must receive training
in the purpose of the survey and in the methods of measurement to be employed and must be
adequately supervised in their work.
A procedure for early checking of the quality of the returns is invaluable.
Plans must be made for handling non-response, that is, the failure of the enumerator to obtain
information from certain of the units in the sample.
3.10 Summary and analysis of the data
The first step is to edit the completed questionnaires, in the hope of amending recording errors, or at
least of deleting data that are obviously erroneous. The check on the elementary data to eliminate
non-sampling errors can be achieved by means of computer programmes implemented to correct the
erroneous values and to permit statistical data analysis. These programmes are mainly based on
graphical analysis of elementary data.
Thereafter, the computations that lead to the estimates are performed. Different methods of
estimation may be available for the same data.
In the presentation of results it is good practice to report the amount of error to be expected in the
most important estimates. One of the advantages of probability sampling is that such statements can
be made, although they have to be severely qualified if the amount of non-response is substantial.
3.11 Information gained for future surveys
The more information we have initially about a population, the easier it is to devise a sample that will
give accurate estimates. Any completed sample is potentially a guide to improved future sampling, in
the data that it supplies about the means, standard deviations, and nature of the variability of the
principal measurements and about the costs involved in getting the data. Sampling practice advances
more rapidly when provisions are made to assemble and record information of this type.
Figure 1: The principal steps in a sample survey
What is Sampling? What are its Characteristics,
Advantages and Disadvantages?
Introduction and Meaning
In research methodology, the practical formulation of the research is very important and should be
done carefully, with proper concentration and under good guidance.
During the practical formulation of the research, however, one tends to run into a large number of
problems. These problems generally relate to learning the features of the universe, or population, by
studying the characteristics of a specific part or portion of it, generally called the sample.
Sampling can therefore be defined as the method or technique of selecting a part or portion (the
sample) for study, with a view to drawing conclusions about the universe or population.
According to Mildred Parton, “Sampling method is the process or the method of drawing
a definite number of the individuals, cases or the observations from a particular
universe, selecting part of a total group for investigation.”
Basic Principles of Sampling
The theory of sampling is based on the following laws:
• Law of Statistical Regularity – This law comes from the mathematical theory of probability. According to King, "the Law of Statistical Regularity says that a moderately large number of items chosen at random from a large group are almost sure, on the average, to possess the features of the large group." According to this law, the units of the sample must be selected at random.
• Law of Inertia of Large Numbers – According to this law, other things being equal, the larger the size of the sample, the more accurate the results are likely to be.
Characteristics of the sampling technique
1. Much cheaper.
2. Saves time.
3. More reliable.
4. Very suitable for carrying out different surveys.
5. Scientific in nature.
Advantages of sampling
1. Very accurate.
2. Economical in nature.
3. Very reliable.
4. Highly suitable for different kinds of surveys.
5. Takes less time.
6. When the universe is very large, the sampling method is the only practical method for collecting the data.
Disadvantages of sampling
1. Inadequacy of the samples.
2. Chances of bias.
3. Problems of accuracy.
4. Difficulty of obtaining a representative sample.
5. Untrained manpower.
6. Absence of the informants.
7. Chances of committing errors in sampling.
This article was written by KJ Singh, an MBA graduate from a prestigious business school in India.
Types of Sampling
In applications:
Probability Sampling: Simple Random Sampling, Stratified Random Sampling,
Multi-Stage Sampling
 What is each and how is it done?
 How do we decide which to use?
 How do we analyze the results differently depending on the type of sampling?
Non-probability Sampling: Why don't we use non-probability sampling schemes?
Two reasons:
 We can't use the mathematics of probability to analyze the results.
 In general, we can't count on a non-probability sampling scheme to produce representative samples.
In mathematical statistics books (for courses that assume you have already taken
a probability course):
 Described as assumptions about random variables
 Sampling with replacement versus sampling without replacement
What are the main types of sampling and how is each done?
Simple Random Sampling: A simple random sample (SRS) of size n is produced
by a scheme which ensures that each subgroup of the population of size n has an
equal probability of being chosen as the sample.
Stratified Random Sampling: Divide the population into "strata". There can be
any number of these. Then choose a simple random sample from each stratum.
Combine those into the overall sample. That is a stratified random sample.
(Example: Church A has 600 women and 400 men as members. One way to get
a stratified random sample of size 30 is to take an SRS of 18 women from the 600
women and another SRS of 12 men from the 400 men.)
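A minimal Python sketch of this proportional allocation, using the church example above (the membership lists are hypothetical placeholders):

    import random

    # Hypothetical membership lists for Church A.
    women = ["W%d" % i for i in range(600)]
    men = ["M%d" % i for i in range(400)]

    n = 30
    total = len(women) + len(men)

    # Allocate the sample proportionally to stratum size, then take an SRS within each stratum.
    n_women = round(n * len(women) / total)   # 18
    n_men = n - n_women                       # 12
    sample = random.sample(women, n_women) + random.sample(men, n_men)
    print(n_women, n_men, len(sample))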
Multi-Stage Sampling: Sometimes the population is too large and scattered for it
to be practical to make a list of the entire population from which to draw a SRS.
For instance, when a polling organization samples US voters, they do not do a
SRS. Since voter lists are compiled by counties, they might first do a sample of the
counties and then sample within the selected counties. This illustrates two stages.
In some instances, they might use even more stages. At each stage, they might do a
stratified random sample on sex, race, income level, or any other useful variable on
which they could get information before sampling.
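A toy Python sketch of the two-stage idea just described; the counties and voter lists are invented, and a real design would also record the inclusion probabilities at each stage:

    import random

    # Hypothetical frame: voters grouped by county (the first-stage units).
    counties = {
        "Adams": ["A1", "A2", "A3", "A4"],
        "Baker": ["B1", "B2", "B3"],
        "Clark": ["C1", "C2", "C3", "C4", "C5"],
        "Dixon": ["D1", "D2", "D3"],
    }

    # Stage 1: simple random sample of counties.
    chosen_counties = random.sample(list(counties), 2)

    # Stage 2: simple random sample of voters within each chosen county.
    sample = [v for c in chosen_counties for v in random.sample(counties[c], 2)]
    print(chosen_counties, sample)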
How does one decide which type of sampling to use?
The formulas in almost all statistics books assume simple random sampling.
Unless you are willing to learn the more complex techniques to analyze the data
after it is collected, it is appropriate to use simple random sampling. To learn the
appropriate formulas for the more complex sampling schemes, look for a book or
course on sampling.
Stratified random sampling gives more precise information than simple random
sampling for a given sample size. So, if information on all members of the
population is available that divides them into strata that seem relevant, stratified
sampling will usually be used.
If the population is large and enough resources are available, usually one will use
multi-stage sampling. In such situations, usually stratified sampling will be done at
some stages.
How do we analyze the results differently depending on the type of sampling?
The main difference is in the computation of the estimates of the variance (or
standard deviation). An excellent book for self-study is A Sampler on Sampling, by
Williams, Wiley. In this, you see a rather small population and then a complete
derivation and description of the sampling distribution of the sample mean for a
particular small sample size. I believe that is accessible for any student who has
had an upper-division mathematical statistics course and for some strong students
who have had a freshman introductory statistics course. A very simple statement of
the conclusion is that the variance of the estimator is smaller if it came from a
stratified random sample than from a simple random sample of the same size. Since
small variance means more precise information from the sample, we see that this is
consistent with stratified random sampling giving better estimators for a given
sample size.
Non-probability sampling schemes
These include voluntary response sampling, judgement sampling, convenience
sampling, and maybe others.
In the early part of the 20th century, many important samples were done that
weren't based on probability sampling schemes. They led to some memorable
mistakes. Look in an introductory statistics text at the discussion of sampling for
some interesting examples. The introductory statistics books I usually teach from
are Basic Practice of Statistics by David Moore, Freeman, and Introduction to the
Practice of Statistics by Moore and McCabe, also from Freeman. A particularly
good book for a discussion of the problems of non-probability sampling
is Statistics by Freedman, Pisani, and Purves. The detail is fascinating. Or, ask a
statistics teacher to lunch and have them tell you the stories they tell in class. Most
of us like to talk about these! Someday when I have time, maybe I'll write some of
them here.
Mathematically, the important thing to recognize is that the discipline of statistics
is based on the mathematics of probability. That's about random variables. All of
our formulas in statistics are based on probabilities in sampling distributions of
estimators. To create a sampling distribution of an estimator for a sample size of
30, we must be able to consider all possible samples of size 30 and base our
analysis on how likely each individual result is.
In mathematical statistics books (for courses that assume you have already taken
a probability course) the part of the problem relating to the sampling is described
as assumptions about random variables.
Mathematical statistics texts almost always say to consider the X's (or Y's) to be
independent with a common distribution. How does this correspond to some
description of how to sample from a population? Answer: simple random
sampling with replacement.
Mary Parker.
Different Types of Sample
There are 5 different types of sample you should be able to define. You should
also understand when to use them, and what their advantages and disadvantages
are.
Simple Random Sample
Obtaining a genuine random sample is difficult. We usually use Random Number Tables, and use the following
procedure:
1. Number the population from 0 to n − 1
2. Pick a random place in the number table
3. Work in a random direction
4. Organise numbers into the required number of digits (e.g. if the size of the population is 80, use 2 digits)
5. Reject any numbers not applicable (in our example, numbers between 80 and 99)
6. Continue until the required number of samples has been collected
7. [If the sample is "without replacement", discard any repetitions of any number]
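A small Python sketch of this procedure, with a pseudo-random digit stream standing in for the random number table (population size 80, sample size 5, without replacement; all values are chosen for illustration):

    import random

    N, n, digits = 80, 5, 2      # population of 80, sample of 5, two-digit numbers
    stream = (random.randint(0, 9) for _ in iter(int, 1))   # stands in for the random number table

    sample = set()
    while len(sample) < n:
        number = int("".join(str(next(stream)) for _ in range(digits)))  # organise digits into groups
        if number < N:           # reject numbers not applicable (80 to 99)
            sample.add(number)   # a set discards repetitions ("without replacement")
    print(sorted(sample))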
Advantages:
The sample will be free from Bias (i.e. it's random!)
Disadvantages: Difficult to obtain
Due to its very randomness, "freak" results can sometimes be
obtained that are not representative of the population. In addition,
these freak results may be difficult to spot. Increasing the sample size
is the best way to eradicate this problem.
Systematic Sample
With this method, items are chosen from the population according to a fixed rule, e.g. every 10th house along a
street. This method should yield a more representative sample than the random sample (especially if the sample
size is small). It seeks to eliminate sources of bias, e.g. an inspector checking sweets on a conveyor belt might
unconsciously favour red sweets. However, a systematic method can also introduce bias, e.g. the period chosen
might coincide with the period of a faulty machine, thus yielding an unrepresentative number of faulty sweets.
Advantages:
Can eliminate other sources of bias
Disadvantages:
Can introduce bias where the pattern used for the samples coincides
with a pattern in the population.
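A minimal Python sketch of such a systematic sample (the population and the interval are placeholders):

    import random

    population = list(range(1, 101))   # e.g. 100 houses along a street
    k = 10                             # the fixed rule: every 10th house

    start = random.randrange(k)        # random starting point within the first interval
    sample = population[start::k]      # then take every k-th item
    print(start, sample)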
Stratified Sampling
The population is broken down into categories, and a random sample is taken of each category. The proportions
of the sample sizes are the same as the proportion of each category to the whole.
Advantages:
Yields more accurate results than simple random sampling
Can show different tendencies within each category (e.g. men and
women)
Disadvantages: Nothing major, hence it's used a lot
Quota Sampling
As with stratified samples, the population is broken down into different categories. However, the size of the
sample of each category does not reflect the population as a whole. This can be used where an unrepresentative
sample is desirable (e.g. you might want to interview more children than adults for a survey on computer
games), or where it would be too difficult to undertake a stratified sample.
Advantages:
Simpler to undertake than a stratified sample
Sometimes a deliberately biased sample is desirable
Disadvantages: Not a genuine random sample
Likely to yield a biased result
Cluster Sampling
Used when populations can be broken down into many different categories, or clusters (e.g. church parishes).
Rather than taking a sample from each cluster, a random selection of clusters is chosen to represent the whole.
Within each cluster, a random sample is taken.
Advantages:
Less expensive and time consuming than a fully random sample
Can show "regional" variations
Disadvantages: Not a genuine random sample
Likely to yield a biased result (especially if only a few clusters are
sampled)
Types of samples
The best sampling is probability sampling, because it increases the likelihood of
obtaining samples that are representative of the population.
Probability sampling (Representative samples)
Probability samples are selected in such a way as to be representative of the
population. They provide the most valid or credible results because they reflect the
characteristics of the population from which they are selected (e.g., residents of a
particular community, students at an elementary school, etc.). There are two types of
probability samples: random and stratified.
Random sample
The term random has a very precise meaning. Each individual in
the population of interest has an equal likelihood of
selection. This is a very strict meaning -- you can't just collect
responses on the street and have a random sample.
The assumption of an equal chance of selection means that sources such as a telephone
book or voter registration lists are not adequate for providing a random sample of a
community. In both these cases there will be a number of residents whose names are
not listed. Telephone surveys get around this problem by random-digit dialing -- but that
assumes that everyone in the population has a telephone. The key to random selection is
that there is no bias involved in the selection of the sample. Any variation between the
sample characteristics and the population characteristics is only a matter of chance.
Stratified sample
A stratified sample is a mini-reproduction of the population. Before
sampling, the population is divided into characteristics of
importance for the research. For example, by gender, social class,
education level, religion, etc. Then the population is randomly
sampled within each category or stratum. If 38% of the population
is college-educated, then 38% of the sample is randomly selected
from the college-educated population.
Stratified samples are as good as or better than random samples, but they require a
fairly detailed advance knowledge of the population characteristics, and therefore are
more difficult to construct.
Nonprobability samples (Non-representative samples)
As they are not truly representative, non-probability samples are less desirable than
probability samples. However, a researcher may not be able to obtain a random or
stratified sample, or it may be too expensive. A researcher may not care about
generalizing to a larger population. The validity of non-probability samples can be
increased by trying to approximate random selection, and by eliminating as many
sources of bias as possible.
Quota sample
The defining characteristic of a quota sample is that the
researcher deliberately sets the proportions of levels or strata
within the sample. This is generally done to ensure the inclusion of
a particular segment of the population. The proportions may or
may not differ dramatically from the actual proportions in the
population. The researcher sets a quota, independent of
population characteristics.
Example: A researcher is interested in the attitudes of members of different religions
towards the death penalty. In Iowa a random sample might miss Muslims (because there
are not many in that state). To be sure of their inclusion, a researcher could set a quota
of 3% Muslim for the sample. However, the sample will no longer be representative of
the actual proportions in the population. This may limit generalizing to the state
population. But the quota will guarantee that the views of Muslims are represented in the
survey.
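As a rough illustration, the following Python sketch fills researcher-set quotas from a simulated stream of willing respondents; the categories, weights and quota sizes are invented, following the 3% Muslim example above:

    import random

    # Simulated stream of willing respondents and their religion (not a probability sample).
    religions = ["Christian", "Muslim", "Other", "None"]
    weights = [0.80, 0.01, 0.02, 0.17]                    # made-up population proportions
    stream = (random.choices(religions, weights)[0] for _ in iter(int, 1))

    quota = {"Christian": 60, "Muslim": 3, "Other": 7, "None": 30}   # quotas for a sample of 100 (3% Muslim)
    sample = {r: 0 for r in quota}

    # Accept respondents only while their group's quota is unfilled.
    while sum(sample.values()) < sum(quota.values()):
        r = next(stream)
        if sample[r] < quota[r]:
            sample[r] += 1
    print(sample)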
Purposive sample
A purposive sample is a non-representative subset of some larger
population, and is constructed to serve a very specific need or
purpose. A researcher may have a specific group in mind, such as
high level business executives. It may not be possible to specify the
population -- they would not all be known, and access will be
difficult. The researcher will attempt to zero in on the target group,
interviewing whomever is available.
A subset of a purposive sample is a snowball sample -- so named because one picks up the
sample along the way, analogous to a snowball accumulating snow. A snowball sample is achieved
by asking a participant to suggest someone else who might be willing or appropriate for the study.
Snowball samples are particularly useful in hard-to-track populations, such as truants, drug users,
etc.
Convenience sample
A convenience sample is a matter of taking what you can get. It is
an accidental sample. Although selection may be unguided, it probably is
not random, using the correct definition of everyone in the population
having an equal chance of being selected. Volunteers would constitute a
convenience sample.
Non-probability samples are limited with regard to generalization. Because they do not
truly represent a population, we cannot make valid inferences about the larger group
from which they are drawn. Validity can be increased by approximating random selection
as much as possible, and making every attempt to avoid introducing bias into sample
selection.
Sampling error and nonsampling error
Posted on 4 September, 2014 by Dr Nic
The subject of statistics is rife with misleading terms. I have written about this before in such
posts as Teaching Statistical Language and It is so random. But the terms sampling error and non-sampling error win the Dr Nic prize for counter-intuitivity and confusion generation.
Confusion abounds
To start with, the word error implies that a mistake has been made, so the term sampling error
makes it sound as if we made a mistake while sampling. Well, this is wrong. And the term non-sampling error (why is this even a term?) sounds as if it is the error we make from not sampling.
And that is wrong too. However, these terms are used extensively in the NZ statistics curriculum,
so it is important that we clarify what they are about.
Fortunately the Glossary has some excellent explanations:
Sampling Error
“Sampling error is the error that arises in a data collection process as a result of taking a sample
from a population rather than using the whole population.
Sampling error is one of two reasons for the difference between an estimate of a population
parameter and the true, but unknown, value of the population parameter. The other reason is
non-sampling error. Even if a sampling process has no non-sampling errors then estimates from
different random samples (of the same size) will vary from sample to sample, and each estimate is
likely to be different from the true value of the population parameter.
The sampling error for a given sample is unknown but when the sampling is random, for some
estimates (for example, sample mean, sample proportion) theoretical methods may be used to
measure the extent of the variation caused by sampling error.”
Non-sampling error:
“Non-sampling error is the error that arises in a data collection process as a result of factors
other than taking a sample.
Non-sampling errors have the potential to cause bias in polls, surveys or samples.
There are many different types of non-sampling errors and the names used to describe them are
not consistent. Examples of non-sampling errors are generally more useful than using names to
describe them.”
And it proceeds to give some helpful examples.
These are great definitions, and I thought about turning them into a diagram, so here it is:
Table summarising types of error.
And there are now two videos to go with the diagram, to help explain sampling error and non-sampling error. Here is a link to the first:
Video about sampling error
One of my earliest posts, Sampling Error Isn’t, introduced the idea of using variation due to
sampling and other variation as a way to make sense of these ideas. The sampling video above is
based on this approach.
Students need lots of practice identifying potential sources of error in their own work, and in
critiquing reports. In addition I have found True/False questions surprisingly effective in
practising the correct use of the terms. Whatever engages the students for a time in consciously
deciding which term to use, is helpful in getting them to understand and be aware of the concept.
Then the odd terminology will cease to have its original confusing connotations.
Non-sampling error
In statistics, non-sampling error is a catch-all term for the deviations of estimates from their true
values that are not a function of the sample chosen, including various systematic
errors and random errors that are not due to sampling.[1] Non-sampling errors are much harder to
quantify than sampling errors.[2]
Non-sampling errors in survey estimates can arise from:[3]

Coverage errors, such as failure to accurately represent all population units in the sample, or
the inability to obtain information about all sample cases;

Response errors by respondents due for example to definitional differences,
misunderstandings, or deliberate misreporting;

Mistakes in recording the data or coding it to standard classifications;

Other errors of collection, nonresponse, processing, or imputation of values for missing or
inconsistent data.[3]
An excellent discussion of issues pertaining to non-sampling error can be found in several
sources such as Kalton (1983)[4] and Salant and Dillman (1995).[5]
Sampling and non-sampling errors
Beyond the conceptual differences, many kinds of error can help explain differences
in the output of the programs that generate data on income. They are often classified
into two broad types: sampling errors and non-sampling errors.
Sampling errors occur because inferences about the entire population are based on
information obtained from only a sample of that population. Because SLID and the
long-form Census are sample surveys, their estimates are subject to this type of
error. The coefficient of variation is a measure of the extent to which the estimate
could vary, if a different sample had been used. This measure gives an indication of
the confidence that can be placed in a particular estimate. This data quality measure
will be used later in this paper to help explain why some of SLID's estimates, which
are based on a smaller sample, might differ from those of the other programs
generating income data. While the Census is also subject to this type of error,
reliable estimates can be made for much smaller populations because the sampling
rate is much higher for the Census (20%)1.
Non-sampling errors can be further divided into coverage errors, measurement
errors (respondent, interviewer, questionnaire, collection method…), non-response
errors and processing errors. The coverage errors are generally not well measured
for income and are usually inferred from exercises of data confrontation such as
this. Section 3 will review the population exclusions and other known coverage
differences between the sources.
The issues of various collection methods or mixed response modes and the different
types of measurement errors that could arise will be approached in section 4.
Non-response can be an issue in the case of surveys. It is not always possible to
contact and convince household members to respond to a survey. Sometimes as
well, even if the household responded, there may not be valid responses to all
questions. In both cases adjustments are performed to the data but error may result
as the quality of the adjustments often depends on the non-respondents being
similar to the respondents. For the 2005 income year, SLID had a response rate of
73.3% and for the Census, it was close to 97%. Still for 2005, because of item non-response, all income components were imputed for 2.7% of SLID's respondents and
at least some components were imputed for another 23.5%2. In the case of the
Census, income was totally imputed for 9.3% and partially imputed for 29.3%.
In administrative data – in particular the personal tax returns – the filing rates for
specific populations may depend on a variety of factors (amount owed, financial
activity during the year, personal interest, requirement for eligibility to support
programs, etc.) and this could also result in differences in the estimates generated
by the programs producing income data.
The systems and procedures used to process the data in each of the programs are
different and may have design variations that impact the data in special ways. When
such discrepancies have been identified, they will be mentioned in section 5. Beyond
the design variations, most processing errors in these data sources are thought to be
detected and corrected before the release of data to the public. However due to the
complexity and to the yearly modifications of processing systems, some errors may
remain undetected and they are therefore quite difficult to quantify.
More detail on the quality and methods of individual statistical programs is
accessible through the Surveys and statistical programs by subject section on
Statistics Canada's website.
Notes
1. The sampling error from one-year estimates of individual income based on
the LAD would also be of a similar magnitude as its sampling rate is also one in five.
2. Data Quality in the 2005 Survey of Labour and Income Dynamics , C. Duddek,
Income Research Paper Series, Statistics Canada catalogue no. 75F0002-No.003,
May 2007.
6 Sampling and Non-sampling Errors
The statistical quality or reliability of a survey may obviously be influenced by the errors that for
various reasons affect the observations. Error components are commonly divided into two major
categories: Sampling and non-sampling errors. In sampling literature the terms "variable errors" and
"bias" are also frequently used, though having a precise meaning which is slightly different from the
former concepts. The total error of a survey statistic is labeled the mean square error, being the sum
of variable errors and all biases. In this section we will give a fairly general and brief description of
the most common error components related to household sample surveys, and discuss their
presence in and impacts on this particular survey. Secondly, we will go into more detail as to those
components which can be assessed numerically.
Error Components and their Presence in the Survey
(1) Sampling errors are related to the sample design itself and the estimators used, and may
be seen as a consequence of surveying only a random sample of, and not the complete,
population. Within the family of probability sample designs - that is designs enabling the
establishment of inclusion probabilities (random samples) - sampling errors can be estimated.
The most common measure for the sampling error is the variance of an estimate, or
derivatives thereof. The derivative mostly used is the standard error, which is simply the
square root of the variance.
The variance or the standard error does not tell us exactly how great the error is in each
particular case. It should rather be interpreted as a measure of uncertainty, i.e. how much the
estimate is likely to vary if repeatedly selected samples (with the same design and of the same
size) had been surveyed. The variance is discussed in more detail in section 6.2.
(2) Non-sampling errors form a "basket" comprising all errors which are not sampling errors.
These types of errors may induce systematic bias in the estimates, as opposed
to the random errors caused by sampling. The category may be further divided into
subgroups according to the various origins of the error components:
Imperfections in the sampling frame, i.e. when the population frame from which the sample
is selected does not comprise the complete population under study, or includes foreign
elements. Exclusion of certain groups of the population from the sampling frame is one
example. As described in the Gaza section, it was decided to exclude "outside localities"
from being surveyed for cost reasons. It was maintained that the exclusion would have
negligible effects on survey results.
Errors imposed by implementary deviations from the theoretical sample design and field
work procedures. Examples: non-response, "wrong" households selected or visited, "wrong"
persons interviewed, etc. Except for non-response, which will be further discussed
subsequently, there were some cases in the present survey in which the standard
instructions for "enumeration walks" had to be modified in order to make sampling feasible.
Any departure from the standard rules has been particularly considered within the context
of inclusion probabilities. None of the practical solutions adopted imply substantial
alterations of the theoretical probabilities described in the previous sections.
The field work procedures themselves may imply unforeseen systematic biases in the
sample selection. In the present survey one procedure has been given particular
consideration as a potential source of error: the practical modification of choosing road
crossing corners - instead of any randomly selected spot - as starting points for the
enumeration walks. This choice might impose systematic biases as to the kind of households
being sampled. However, numerous inspection trials in the field proved it highly unlikely that
such bias would occur. According to the field work instructions, the starting points
themselves were never to be included in the sample. Such inclusion would have implied a
systematic over-representation of road corner households, and thus may have caused biases
for certain variables. (Instead, road corner households may now be slightly under-represented in so far as they, as starting points, are excluded from the sample. Possible bias
induced by this under-representation is, however, negligible compared to the potential bias
accompanying the former alternative.)
Improper wording of questions, misquotations by the interviewer, misinterpretations and
other factors that may cause failure in obtaining the intended response. "Fake response"
(questions being answered by the interviewer himself/herself) may also be included in this
group of possible errors. Irregularities of this kind are generally difficult to detect. The best
ways of preventing them is to have well trained data collectors, to apply various verification
measures, and to introduce the internal control mechanisms by letting data collectors work
in pairs - possibly supplemented by the presence of the supervisor. A substantial part of the
training of supervisors and data collectors was devoted to such measures. Verification
interviews were carried out by the supervisors among a 10% randomly selected subsample.
No fake interviews were detected. However, a few additional re-interviews were carried out,
on suspicion of misunderstandings and incorrect responses.
Data processing errors include errors arising incidentally during the stages of response
recording, data entry and programming. In this survey the data entry programme used
included consistency controls wherever possible, aiming at correcting any logical
contradictions in the data. Furthermore, verification punch work was applied in order to
correct mis-entries not detected by the consistency control, implying that each and all
questionnaires have been punched twice.
Sampling Error - Variance of an Estimate
Generally, the prime objective of sample designing is to keep sampling error at the lowest level
possible (within a given budget). There is thus a unique theoretical correspondence between the
sampling strategy and the sampling error, which can be expressed mathematically by the variance of
the estimator applied. Unfortunately, design complexity very soon implies variance expressions to
be mathematically uncomfortable and sometimes practically "impossible" to handle. Therefore,
approximations are frequently applied in order to achieve interpretable expressions of the
theoretical variance itself, and even more to estimate it.
In real life practical shortcomings frequently challenge mathematical comfort. Absence of
sampling frames or other prior information forces one to use mathematically complex
strategies in order to find feasible solutions. The design of the present survey - stratified, 4-5
stage sampling with varying inclusion probabilities - is probably among the extremes in this
respect, implying that the variance of the estimator (5.2) will be of the utmost complexity - as
will be seen subsequently.
The (approximate) variance of the estimator (5.2) is in its simplest form:
The variances and covariances on the right hand side of (6.1) may be expressed in terms of
the stratum variances and covariances:
Proceeding one step further the stratum variance may be expressed as follows9:
where we have introduced the notation ps (k) = P1 (s, k). The ps (k, l) is the joint probability
of inclusion for PSU (s,k) and PSU (s,l), and
the variance of the PSU (s,k) unbiased
estimate
. The variance of
is obtained similarly by substituting x with N in the
above formula. The stratum covariance formula is somewhat more complicated and is not
expressed here.
The PSU (s,k) variance components in the latter formula have a structure similar to the
stratum one, as is realized by regarding the PSUs as separate "strata" and the cells as "PSUs".
Again, another variance component emerges for each of the cells, the structure of which is
similar to the preceding one. In order to arrive at the "ultimate" variance expression yet
another two or three similar stages have to be passed. It should be realized that the final
variance formula is extremely complicated, even if simplifying modifications and
approximations may reduce the complexities stemming from the 2nd - 5th sampling stages.
It should also be understood that attempts to estimate this variance properly and exhaustively
(unbiased or close to unbiased) would be beyond any realistic effort. Furthermore, for such
estimation to be accomplished certain preconditions have to be met. Some of these conditions
cannot, however, be satisfied (for instance: at least two PSUs have to be selected from each
stratum comprising more than one PSU). We thus have to apply a more simple method for
appraising the uncertainty of our estimates.
Any sampling strategy (sample selection approach and estimator) may be characterized by its
performance relative to a simple random sampling (SRS) design, applying the sample
average as the estimator for proportions. The design factor of a strategy is thus defined as the
fraction between the variances of the two estimators. If the design factor is, for instance, less
than 1, the strategy under consideration would be better than SRS. Usually, multi-stage
strategies are inferior to SRS, implying the design factor being greater than 1.
The design factor is usually determined empirically. Although there is no overwhelming
evidence in its favour, a factor of 1.5 is frequently used for stratified, multi-stage designs.
(The design factor may vary among survey variables). The rough approximate variance
estimator is thus:
var(p) ≈ 1.5 · p(1 − p) / nT
where p is the estimate produced by (5.2) and nT is the number of observations underlying
the estimate (the "100%"). Although this formula oversimplifies the variance, it still takes
care of some of the basic features of the real variance; the variance decreases by increasing
sample size (n), and tends to be larger for proportions around 50% than at the tails (0% or
100%).
The square root of the variance, √(var(p)), or briefly s, is called the standard error, and is
tabulated in table A.12 for various values of p and n.
Table A.12 Standard error estimates for proportions (s and p are specified as percentages).

Number of obs.   Estimated proportion (p %)
(n)              5/95   10/90   20/80   30/70   40/60    50
10                8.4    11.6    15.5    17.7    19.0   19.4
20                6.0     8.2    11.0    12.5    13.4   13.7
50                3.8     5.2     6.9     7.9     8.5    8.7
75                3.1     4.2     5.7     6.5     6.9    7.1
100               2.7     3.7     4.9     5.6     6.0    6.1
150               2.2     3.0     4.0     4.6     4.9    5.0
200               1.9     2.6     3.5     4.0     4.2    4.3
250               1.7     2.3     3.1     3.5     3.8    3.9
300               1.5     2.1     2.8     3.2     3.5    3.5
350               1.4     2.0     2.6     3.0     3.2    3.3
400               1.3     1.8     2.5     2.8     3.0    3.1
500               1.2     1.6     2.2     2.5     2.7    2.7
700               1.0     1.4     1.9     2.1     2.3    2.3
1000              0.8     1.2     1.5     1.8     1.9    1.9
1500              0.7     0.9     1.3     1.4     1.5    1.6
2000              0.6     0.8     1.1     1.3     1.3    1.4
2500              0.5     0.7     1.0     1.2     1.2    1.2
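A short Python sketch of this rough standard error calculation (design factor 1.5, as above); it reproduces the tabulated values, for example about 6.1% for p = 50% and n = 100:

    import math

    def rough_standard_error(p, n, design_factor=1.5):
        # Approximate standard error of an estimated proportion p from n observations,
        # inflating the simple random sampling variance by the design factor.
        return math.sqrt(design_factor * p * (1 - p) / n)

    for n in (10, 100, 1000):
        for p in (0.05, 0.20, 0.50):
            print(n, p, round(100 * rough_standard_error(p, n), 1))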
Confidence Intervals
The sample which has been surveyed is one specific outcome of an "infinite" number of
random selections which might have been done within the sample design. Other sample
selections would most certainly have yielded survey results slightly different from the present
ones. The survey estimates should thus not be interpreted as accurately as the figures
themselves indicate.
A confidence interval is a formal measure for assessing the variability of survey estimates
from such hypothetically repeated sample selections. The confidence interval is usually
derived from the survey estimate itself and its standard error:
Confidence interval: [p − c·s, p + c·s], where c is a constant which must be determined by
the choice of a confidence coefficient, fixing the probability of the interval including the true,
but unknown, population proportion for which p is an estimate. For instance, c=1 corresponds
to a confidence probability of 67%, i.e. one will expect that 67 out of 100 intervals will
include the true proportion if repeated surveys are carried out. In most situations, however, a
chance of one out of three to arrive at a wrong conclusion is not considered satisfactory.
Usually, confidence coefficients of 90% or 95% are preferred, 95% corresponding to
approximately c=2. Although our assessment as to the location of the true population
proportion thus becomes less uncertain, the assessment itself, however, becomes less precise
as the length of the interval increases.
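A minimal Python sketch of such an interval, combining the rough standard error above with c = 2 (roughly 95% confidence); the estimate and sample size are placeholders:

    import math

    p, n, c = 0.40, 500, 2.0                 # estimated proportion, observations, confidence constant
    s = math.sqrt(1.5 * p * (1 - p) / n)     # rough standard error with design factor 1.5

    lower, upper = p - c * s, p + c * s
    print("interval: %.1f%% to %.1f%%" % (100 * lower, 100 * upper))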
Comparisons between groups
Comparing the occurrence of an attribute between different sub-groups of the population is
probably the most frequently used method for making inference from survey data. For
illustration of the problems involved in such comparisons, let us consider two separate sub-groups
for which the estimated proportions sharing the attribute are p̂1 and p̂2, respectively,
while the unknown true proportions are denoted p1 and p2. The corresponding standard error
estimates are s1 and s2. The problem of inference is thus to evaluate the significance of the
difference between the two sub-group estimates: Can the observed difference be caused by
sampling error alone, or is it so great that there must be more substantive reasons for it?
We will assume that the estimate p̂1 is the larger of the two proportions observed. Our
problem of judgement will thus be equivalent to testing the following hypothesis:
Hypothesis: p1 = p2
Alternative: p1 > p2
In case the test rejects the hypothesis we will accept the alternative as a "significant" statement, and
thus conclude that the observed difference between the two estimates is too great to be caused by
randomness alone. However, as is the true nature of statistical inference, one can (almost) never
draw absolutely certain conclusions. The uncertainty of the test is indicated by the choice of a
"significance level", which is the probability of making a wrong decision by rejecting
a true hypothesis. This probability should obviously be as small as possible. Usually it is set at 2.5% or
5% - depending on the risk or loss involved in drawing wrong conclusions.
The test implies that the hypothesis is rejected if p̂1 − p̂2 > c · √(s1² + s2²),
where the constant c depends on the choice of significance level:
Significance level    c-value
2.5%                  2.0
5.0%                  1.6
10.0%                 1.3
As is seen, the test criteria comprise the two standard error estimates and thus imply some
calculation. It is also seen that smaller significance levels imply the requirement of larger observed
differences between sub-groups in order to arrive at significant conclusions. One should be aware
that the non-rejection of a hypothesis leaves one with no conclusions at all, rather than the
acceptance of the hypothesis itself.
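A small Python sketch of this comparison, reusing the rough standard errors from above; all figures are invented for illustration:

    import math

    def rough_se(p, n, design_factor=1.5):
        # Rough standard error of an estimated proportion (design factor 1.5, as above).
        return math.sqrt(design_factor * p * (1 - p) / n)

    # Hypothetical sub-group estimates; p1_hat is taken to be the larger proportion.
    p1_hat, n1 = 0.55, 400
    p2_hat, n2 = 0.48, 350
    c = 1.6                                  # 5% significance level

    s1, s2 = rough_se(p1_hat, n1), rough_se(p2_hat, n2)
    if p1_hat - p2_hat > c * math.sqrt(s1 ** 2 + s2 ** 2):
        print("reject the hypothesis: the difference is significant")
    else:
        print("the hypothesis is not rejected: no conclusion")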
Non-response
Non-response occurs when one fails to obtain an interview with a properly pre-selected
individual (unit non-response). The most frequent reasons for this kind of non-response are
refusals and absence ("not-at-homes"). Item non-response occurs when a single question is
left unanswered.
Non-response is generally the most important single source of bias in surveys. Most exposed
to non-response bias are variables related to the very phenomenon of being a (frequent) "not-at-homer" or not (example: cinema attendance). In Western societies non-response rates of
15-30% are normal.
Various measures have been undertaken to keep non-response at the lowest level possible.
Most of all confidence-building has been of concern, implying contacts with local community
representatives have been made in order to enlist their support and approval. Furthermore,
many hours have been spent explaining the scope of the survey to respondents and anyone
else wanting to know, assuring them that the survey makers would neither impose taxes on people
nor demolish their homes, nor - equally important for the reliability of the survey - bring direct
material aid.
Furthermore, up to 4 call-backs were applied if selected respondents were not at home.
Usually the data collectors were able to get an appointment for a subsequent visit at the first
attempt, so that only one revisit was required in most cases. Unit non-response thus
comprises refusals and those not being at home at all after four attempts.
Table A.13 shows the net number of respondents and non-responses in each of the three parts
of the survey. The initial sizes of the various samples are deduced from the table by adding
responses and non-responses. For the household and RSI samples, the total size was 2,518
units, while the female sample size was 1,247. It is seen from the bottom line that the non-response rates are remarkably small compared to the "normal" magnitudes of 10 - 20% in
similar surveys. Consequently, there should be fairly good evidence for maintaining that the
effects of non-response in this survey are insignificant.
Table A.13 Number of (net) respondents and non-respondents in the three parts of the survey

                     Households            RSIs               Women
Region              Resp.  Non-resp.   Resp.  Non-resp.   Resp.  Non-resp.
Gaza                  970      8         968     10         482      4
West Bank           1,023     16       1,004     35         502     14
Arab Jerusalem        486     15         478     23         240      5
Total               2,479     39       2,450     68       1,224     23
Non-response rate    1.5%               2.7%               1.8%
Statistical Language - Census and Sample
Census and Sample
How do we study a population?
A population may be studied using one of two approaches: taking a census, or selecting a sample.
It is important to note that whether a census or a sample is used, both provide information that can be used to draw conclusions about the whole population.
What is a census (complete enumeration)?
A census is a study of every unit, everyone or everything, in a population. It is known as a complete enumeration, which means a complete count.
What is a sample (partial enumeration)?
A sample is a subset of units in a population, selected to represent all units in a population of interest. It is a partial enumeration because it is a count of only part of the population.
Information from the sampled units is used to estimate the characteristics for the entire population of interest.
When to use a census or a sample?
Once a population has been identified, a decision needs to be made about whether taking a census or selecting a sample will be the more suitable option. There are advantages and disadvantages to using a census or sample to study a population:
Pros of a CENSUS
• provides a true measure of the population (no sampling error)
• benchmark data may be obtained for future studies
• detailed information about small sub-groups within the population is more likely to be available

Pros of a SAMPLE
• costs would generally be lower than for a census
• results may be available in less time
• if good sampling techniques are used, the results can be very representative of the actual population

Cons of a CENSUS
• may be difficult to enumerate all units of the population within the available time
• higher costs, both in staff and monetary terms, than for a sample
• generally takes longer to collect, process, and release data than from a sample

Cons of a SAMPLE
• data may not be representative of the total population, particularly where the sample size is small
• often not suitable for producing benchmark data
• as data are collected from a subset of units and inferences made about the whole population, the data are subject to 'sampling' error
• decreased number of units will reduce the detailed information available about sub-groups within a population
How are samples selected?
A sample must be robust in its design and large enough to provide a reliable representation of the whole population. Aspects to be considered when designing a sample include the level of accuracy required, cost, and timing. Sampling can be random or non-random.
In a random (or probability) sample each unit in the population has a chance of being selected, and this probability can be accurately determined. Probability or random sampling includes, but is not limited to, simple random sampling, systematic sampling, and stratified sampling; a short code sketch of these three designs appears after this list. Random sampling makes it possible to produce reliable population estimates from the data obtained from the units included in the sample.
Simple random sample: All members of the sample are chosen at random and have the same chance of being in the sample. A lottery draw is a good example: the numbers are randomly generated from a defined range of numbers (i.e. 1 through to 45) with each number having an equal chance of being selected.
Systematic random sample: The first member of the sample is chosen at random, then the other members of the sample are taken at regular intervals (i.e. every nth unit).
Stratified random sample: Relevant subgroups within the population are identified and random samples are selected from within each stratum.
In a non-random (or non-probability) sample some units of the population have no chance of selection, the selection is non-random, or the probability of selection cannot be accurately determined. In this method the sampling error cannot be estimated, making it difficult to infer population estimates from the sample. Non-random sampling includes convenience sampling, purposive sampling, quota sampling, and volunteer sampling:
Convenience sampling: Units are chosen based on their ease of access;
Purposive sampling: The sample is chosen based on what the researcher thinks is appropriate for the study;
Quota sampling: The researcher can select units as they choose, as long as they reach a defined quota; and
Volunteer sampling: Participants volunteer to be a part of the survey (a common method used for internet-based opinion surveys).
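The following minimal Python sketch illustrates the three random designs just described on a toy population of 100 numbered units; the population, sample sizes, and strata are illustrative assumptions, not part of the ABS material.

```python
import random

population = list(range(1, 101))            # toy population of 100 numbered units

# Simple random sample: every unit has the same chance of selection.
simple = random.sample(population, 10)

# Systematic random sample: random start, then every k-th unit.
k = len(population) // 10                   # sampling interval
start = random.randrange(k)
systematic = population[start::k]

# Stratified random sample: identify strata, then sample randomly within each.
strata = {"low": [u for u in population if u <= 50],
          "high": [u for u in population if u > 50]}
stratified = [u for units in strata.values() for u in random.sample(units, 5)]

print(simple)
print(systematic)
print(stratified)
```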
[Flowchart: Collecting data about a population - census and sample]
Further information:
ABS:
1299.0 - An Introduction to Sample Surveys: A User's Guide
External links:
Sample Size calculator
Basic Survey Design: Samples and Censuses
Difference Between Census and Sampling
Posted on January 31, 2011 by Nedha Last updated on: May 28, 2015
Census vs Sampling
Census and sampling are two methods of collecting data between which certain
differences can be identified. Before we move forward to enumerate the differences
between census and sampling, it is better to understand what these two techniques of
generating information mean. A census can simply be defined as a periodic collection
of information from the entire population. Conducting a census can be very
time-consuming and costly. However, the advantage is that it allows the researcher to
gain accurate information. On the other hand, sampling is when the researcher selects
a sample from the population and gathers information. This is less time-consuming,
but the reliability of the information gained is doubtful. Through this article let us
examine the differences between a census and sampling.
What is a Census?
Census refers to a periodic collection of information from the entire population. It is a
time-consuming affair as it involves counting all heads and generating information
about them. For better governance, every government requires specific data and
information about the populace to make programs and policies that match the needs
and requirements of the population. A census allows the government to gain such
information.
What is Sampling?
There are times when a government cannot wait for the next Census and needs to
gather current information about the population. This is when a different technique of
collecting information that is less elaborate and cheaper than Census is employed.
This is called Sampling. This method of collecting information requires generating a
sample that is representative of the entire population.
When using a sample for data collection the researcher can use various methods of
sampling. Simple random sampling, stratified sampling, the snowball method, and
non-random sampling are some of the most commonly used sampling methods.
There are stark differences between census and sampling, though both serve the
purpose of providing data and information about a population. However accurately a
sample from a population may be generated, there will always be a margin of error,
whereas in the case of a census the entire population is taken into account and as
such it is most accurate. Data obtained from both census and sampling are extremely
important for a government for various purposes such as planning developmental
programs and policies for weaker sections of the society.
What is the Difference Between Census and
Sampling?
Definitions of Census and Sampling:
Census: Census refers to a periodic collection of information about the populace
from the entire population.
Sampling: Sampling is a method of collecting information from a sample that is
representative of the entire population.
Characteristics of Census and Sampling:
Reliability:
Census: Data from the census is reliable and accurate.
Sampling: There is a margin of error in data obtained from sampling.
Time:
Census: Census is very time-consuming.
Sampling: Sampling is quick.
Cost:
Census: Census is very expensive.
Sampling: Sampling is inexpensive.
Convenience:
Census: Census is not very convenient as the researcher has to allocate a lot of
effort in collecting data.
Sampling: Sampling is the most convenient method of obtaining data about the
population.
Image Courtesy:
1. “Volkstelling 1925 Census“. [Public Domain] via Wikimedia Commons
2. "Simple random sampling" by Dan Kernler [CC BY-SA 4.0] via Wikimedia Commons
4. Enumerations versus Samples
Sixteen U.S. Marshals and 650 assistants conducted the first U.S. census in 1791. They counted
some 3.9 million individuals, although as then-Secretary of State, Thomas Jefferson, reported to
President George Washington, the official number understated the actual population by at least
2.5 percent (Roberts, 1994). By 1960, when the U.S. population had reached 179 million, it was
no longer practical to have a census taker visit every household. The Census Bureau then began
to distribute questionnaires by mail. Of the 116 million households to which questionnaires were
sent in 2000, 72 percent responded by mail. A mostly-temporary staff of over 800,000 was
needed to visit the remaining households, and to produce the final count of 281,421,906. Using
statistically reliable estimates produced from exhaustive follow-up surveys, the Bureau's
permanent staff determined that the final count was accurate to within 1.6 percent of the actual
number (although the count was less accurate for young and minority residents than it was for
older and white residents). It was the largest and most accurate census to that time.
(Interestingly, Congress insists that the original enumeration or "head count" be used as the
official population count, even though the estimate calculated from samples by Census Bureau
statisticians is demonstrably more accurate.)
The mail-in response rate for the 2010 census was also 72 percent. As with most of the 20th
century censuses the official 2010 census count, by state, had to be delivered to the Office of the
President by December 31 of the census year. Then within one week of the opening of the next
session of the Congress, the President reported to the House of Representatives the
apportionment population counts and the number of Representatives to which each state was
entitled.
In 1791, census takers asked relatively few questions. They wanted to know the numbers of free
persons, slaves, and free males over age 16, as well as the sex and race of each individual. (You
can view replicas of historical census survey forms here.) As the U.S. population has
grown, and as its economy and government have expanded, the amount and variety of data
collected has expanded accordingly. In the 2000 census, all 116 million U.S. households were
asked six population questions (names, telephone numbers, sex, age and date of birth, Hispanic
origin, and race), and one housing question (whether the residence is owned or rented). In
addition, a statistical sample of one in six households received a "long form" that asked 46 more
questions, including detailed housing characteristics, expenses, citizenship, military service,
health problems, employment status, place of work, commuting, and income. From the sampled
data, the Census Bureau produced estimated data on all these variables for the entire
population.
In the parlance of the Census Bureau, data associated with questions asked of all households
are called 100% data, and data estimated from samples are called sample data. Both types of data
are available aggregated by various enumeration areas, including census block, block group,
tract, place, county, and state (see the illustration below). Through 2000, the Census Bureau
distributed the 100% data in a package called the "Summary File 1" (SF1) and the sample data
as "Summary File 3" (SF3). In 2005, the Bureau launched a new project called American
Community Survey that surveys a representative sample of households on an ongoing basis.
Every month, one household out of every 480 in each county or equivalent area receives a
survey similar to the old "long form." Annual or semi-annual estimates produced from American
Community Survey samples replaced the SF3 data product in 2010.
To protect respondents' confidentiality, as well as to make the data most useful to legislators, the
Census Bureau aggregates the data it collects from household surveys to several different types
of geographic areas. SF1 data, for instance, are reported at the block or tract level. There were
about 8.5 million census blocks in 2000. By definition, census blocks are bounded on all sides by
streets, streams, or political boundaries. Census tracts are larger areas that have between 2,500
and 8,000 residents. When first delineated, tracts were relatively homogeneous with respect to
population characteristics, economic status, and living conditions. A typical census tract consists
of about five or six sub-areas called block groups. As the name implies, block groups are
composed of several census blocks. American Community Survey estimates, like the SF3 data
that preceded them, are reported at the block group level or higher.
Figure 3.4.1 Relationships among the various census geographies (U.S. Census Bureau, American FactFinder, 2005, http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml; an updated source for the diagram can be found at https://www.census.gov/geo/reference/hierarchy.html).
Try This!
Acquiring U.S. Census Data via the World Wide Web
The purpose of this practice activity is to guide you through the process of finding and acquiring
2000 census data from the U.S. Census Bureau via the Web. Your objective is to look up
the total population of each county in your home state (or an adopted state of the U.S.).
1. Go to the U.S. Census Bureau site at http://www.census.gov.
2. At the Census Bureau home page, hover your mouse cursor over the Data tab, then
over Data Tools and App and select American FactFinder. American FactFinder is the
Census Bureau's primary medium for distributing census data to the public.
3. Expand the ADVANCED SEARCH list, and click on the SHOW ME ALL button. Take note
of the three numbered steps featured on the page you are taken to. That’s what we are about
to do in this exercise.
4. Click the Topics search option box. In the Select Topics overlay window expand the People list.
Next expand the Basic Count/Estimate list. Then choose Population Total. Note that a
Population Total entry is placed in the Your Selections box in the upper left, and it disappears
from the Basic Count/Estimate list.
Close the Select Topics window.
The list of datasets in the resulting Search Results window is for the entire United States. We
want to narrow the search to county-level data for your home or adopted state.
5. Click the Geographies search options box. In the Select Geographies overlay window that
opens make sure the List tab is selected. Under Select a geographic type:, click County - 050.
Next, select the entry for your state from the Select a state list, and then, from the Select one or
more geographic areas.... list, select All counties within <your state> .
Last, click ADD TO YOUR SELECTIONS. This will place your All Counties… choice in
the Your Selections box.
Close the Select Geographies window.
6. The list of datasets in the Search Results window now pertains to the counties in your state.
Take a few moments to review the datasets that are listed. Note that there are SF1, SF2,
ACS (American Community Survey), etc., datasets, and that if you page through the list far
enough you will see that data from past years is listed. We are going to focus our effort on
the 2010 SF1 100% Data.
7. Given that our goal is to find the population of the counties in your home state, can you
determine which dataset we should look at?
There is a TOTAL POPULATION entry for 2010. Find it, and make certain you have located
the 2010 SF1 100% Data dataset. (You can use the Refine your search results: slot above the
dataset list to help narrow the search.)
Check the box for it, and then click View.
In the new results window that opens, you should be able to find the population of the
counties of your chosen state.
Note the row of Actions:, which includes Print and Download buttons.
I encourage you to experiment some with the American FactFinder site. Start slow, and just click
the BACK TO ADVANCED SEARCH button, un-check the TOTAL POPULATION dataset and
choose a different dataset to investigate. Registered students will need to answer a couple of
quiz questions based on using this site.
Pay attention to what is in the Your Selections window. You can easily remove entries by clicking
the red circle with the white X.
On the SEARCH page, with nothing in the Your Selections box, you might try typing “QT” or
“GCT” in the Search for: slot. QT stands for Quick Tables which are preformatted tables that
show several related themes for one or more geographic areas. GCT stands for Geographic
Comparison Tables which are the most convenient way to compare data collected for all the
counties, places, or congressional districts in a state, or all the census tracts in a county.
Probability distribution
From Wikipedia, the free encyclopedia
In probability and statistics, a probability distribution assigns a probability to each measurable
subset of the possible outcomes of a random experiment, survey, or procedure of statistical
inference. Examples are found in experiments whose sample space is non-numerical, where the
distribution would be a categorical distribution; experiments whose sample space is encoded by
discrete random variables, where the distribution can be specified by a probability mass function;
and experiments with sample spaces encoded by continuous random variables, where the
distribution can be specified by a probability density function. More complex experiments, such
as those involving stochastic processes defined in continuous time, may demand the use of more
general probability measures.
In applied probability, a probability distribution can be specified in a number of different ways,
often chosen for mathematical convenience:
• by supplying a valid probability mass function or probability density function
• by supplying a valid cumulative distribution function or survival function
• by supplying a valid hazard function
• by supplying a valid characteristic function
• by supplying a rule for constructing a new random variable from other random variables whose joint probability distribution is known.
A probability distribution can either be univariate or multivariate. A univariate distribution gives
the probabilities of a single random variable taking on various alternative values; a multivariate
distribution (a joint probability distribution) gives the probabilities of a random vector—a set of
two or more random variables—taking on various combinations of values. Important and
commonly encountered univariate probability distributions include the binomial distribution,
the hypergeometric distribution, and the normal distribution. The multivariate normal
distribution is a commonly encountered multivariate distribution.
Introduction
The probability mass function (pmf) p(S) specifies the probability distribution for the sum S of counts from
two dice. For example, the figure shows that p(11) = 1/18. The pmf allows the computation of probabilities
of events such as P(S > 9) = 1/12 + 1/18 + 1/36 = 1/6, and all other probabilities in the distribution.
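As a quick check of the dice example in the caption above, here is a small Python sketch (mine, not part of the article) that tabulates the pmf of the sum of two fair dice and reproduces p(11) = 1/18 and P(S > 9) = 1/6:

```python
from collections import Counter
from fractions import Fraction

# pmf of the sum S of two fair dice: count the outcomes, divide by 36.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in counts.items()}

print(pmf[11])                                  # 1/18
print(sum(p for s, p in pmf.items() if s > 9))  # 1/6
```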
To define probability distributions for the simplest cases, one needs to distinguish
between discrete and continuous random variables. In the discrete case, one can easily assign
a probability to each possible value: for example, when throwing a fair die, each of the six
values 1 to 6 has the probability 1/6. In contrast, when a random variable takes values from a
continuum then, typically, probabilities can be nonzero only if they refer to intervals: in quality
control one might demand that the probability of a "500 g" package containing between 490 g
and 510 g should be no less than 98%.
The probability density function (pdf) of the normal distribution, also called Gaussian or "bell curve", the
most important continuous random distribution. As notated on the figure, the probabilities of intervals of
values correspond to the area under the curve.
If the random variable is real-valued (or more generally, if a total order is defined for its possible
values), the cumulative distribution function (CDF) gives the probability that the random variable
is no larger than a given value; in the real-valued case, the CDF is the integral of the probability
density function (pdf) provided that this function exists.
Terminology
As probability theory is used in quite diverse applications, terminology is not uniform and
sometimes confusing. The following terms are used for non-cumulative probability distribution
functions:
• Probability mass, Probability mass function, p.m.f.: for discrete random variables.
• Categorical distribution: for discrete random variables with a finite set of values.
• Probability density, Probability density function, p.d.f.: most often reserved for continuous random variables.
The following terms are somewhat ambiguous as they can refer to non-cumulative or cumulative
distributions, depending on authors' preferences:
• Probability distribution function: continuous or discrete, non-cumulative or cumulative.
• Probability function: even more ambiguous, can mean any of the above or other things.
Finally,
• Probability distribution: sometimes the same as probability distribution function, but usually refers to the more complete assignment of probabilities to all measurable subsets of outcomes, not just to specific outcomes or ranges of outcomes.
Basic terms
• Mode: for a discrete random variable, the value with highest probability (the location at which the probability mass function has its peak); for a continuous random variable, the location at which the probability density function has its peak.
• Support: the smallest closed set whose complement has probability zero.
• Head: the range of values where the pmf or pdf is relatively high.
• Tail: the complement of the head within the support; the large set of values where the pmf or pdf is relatively low.
• Expected value or mean: the weighted average of the possible values, using their probabilities as their weights; or the continuous analog thereof.
• Median: the value such that the set of values less than the median has a probability of one-half.
• Variance: the second moment of the pmf or pdf about the mean; an important measure of the dispersion of the distribution.
• Standard deviation: the square root of the variance, and hence another measure of dispersion.
• Symmetry: a property of some distributions in which the portion of the distribution to the left of a specific value is a mirror image of the portion to its right.
• Skewness: a measure of the extent to which a pmf or pdf "leans" to one side of its mean.
Cumulative distribution function
Because a probability distribution Pr on the real line is determined by the probability of
a scalar random variable X being in a half-open interval (−∞, x], the probability distribution is
completely characterized by its cumulative distribution function:
F(x) = Pr[X ≤ x]   for all x.
Discrete probability distribution
See also: Probability mass function and Categorical distribution
[Figures: the probability mass function of a discrete probability distribution, where the probabilities of the singletons {1}, {3}, and {7} are respectively 0.2, 0.5, and 0.3, and a set not containing any of these points has probability zero; plus the cdf of a discrete probability distribution, of a continuous probability distribution, and of a distribution which has both a continuous part and a discrete part.]
A discrete probability distribution is a probability distribution characterized by a probability
mass function. Thus, the distribution of a random variable X is discrete, and X is called
a discrete random variable, if
Σ_u Pr(X = u) = 1
as u runs through the set of all possible values of X. Hence, a random variable can
assume only a finite or countably infinite number of values; that is, it is a discrete variable.
For the number of potential values to be countably infinite, even though their probabilities
sum to 1, the probabilities have to decline to zero fast enough: for example, if
Pr(X = n) = 1/2^n for n = 1, 2, ..., we have the sum of probabilities 1/2 + 1/4 + 1/8 + ... = 1.
Well-known discrete probability distributions used in statistical modeling include
the Poisson distribution, the Bernoulli distribution, the binomial distribution, the geometric
distribution, and the negative binomial distribution. Additionally, the discrete uniform
distribution is commonly used in computer programs that make equal-probability random
selections between a number of choices.
Measure theoretic formulation
A measurable function X: Ω → A between a probability space (Ω, F, P) and a measurable space (A, 𝒜) is called a discrete random variable provided its image X(Ω) is a countable set and the pre-images of singleton sets are measurable, i.e., X⁻¹({a}) ∈ F for all a ∈ A. The latter requirement induces a probability mass function f_X: A → [0, 1] via f_X(a) = P(X⁻¹({a})). Since the pre-images of disjoint sets are disjoint,
Σ_{a ∈ A} f_X(a) = P(∪_{a ∈ A} X⁻¹({a})) = P(Ω) = 1.
This recovers the definition given above.
Cumulative density
Equivalently to the above, a discrete random variable can be defined as a random
variable whose cumulative distribution function (cdf) increases only by jump
discontinuities—that is, its cdf increases only where it "jumps" to a higher value, and
is constant between those jumps. The points where jumps occur are precisely the
values which the random variable may take.
Delta-function representation
Consequently, a discrete probability distribution is often represented as a
generalized probability density function involving Dirac delta functions, which
substantially unifies the treatment of continuous and discrete distributions. This is
especially useful when dealing with probability distributions involving both a
continuous and a discrete part.
Indicator-function representation
For a discrete random variable X, let u0, u1, ... be the values it can take with non-zero
probability. Denote
Ω_i = X⁻¹(u_i) = {ω : X(ω) = u_i}, for i = 0, 1, 2, ...
These are disjoint sets, and by the summation formula above
Pr(∪_i Ω_i) = Σ_i Pr(X = u_i) = 1.
It follows that the probability that X takes any value except for u0, u1, ... is
zero, and thus one can write X as
X(ω) = Σ_i u_i 1_{Ω_i}(ω)
except on a set of probability zero, where 1_A is the indicator
function of A. This may serve as an alternative definition of discrete
random variables.
Continuous probability distribution
See also: Probability density function
A continuous probability distribution is a probability distribution that
has a cumulative distribution function that is continuous. Most often they
are generated by having a probability density function. Mathematicians
call distributions with probability density functions absolutely
continuous, since their cumulative distribution function is absolutely
continuous with respect to the Lebesgue measure λ. If the distribution
of X is continuous, then X is called a continuous random variable.
There are many examples of continuous probability
distributions: normal, uniform, chi-squared, and others.
Intuitively, a continuous random variable is the one which can take
a continuous range of values, as opposed to a discrete distribution,
where the set of possible values for the random variable is at
most countable. While for a discrete distribution
an event with probability zero is impossible (e.g., rolling 3½ on a
standard die is impossible, and has probability zero), this is not so in the
case of a continuous random variable. For example, if one measures
the width of an oak leaf, the result of 3½ cm is possible; however, it has
probability zero because uncountably many other potential values exist
even between 3 cm and 4 cm. Each of these individual outcomes has
probability zero, yet the probability that the outcome will fall into
the interval (3 cm, 4 cm) is nonzero. This apparent paradox is resolved
by the fact that the probability that X attains some value within
an infinite set, such as an interval, cannot be found by naively adding the
probabilities for individual values. Formally, each value has
an infinitesimally small probability, which statistically is equivalent to
zero.
Formally, if X is a continuous random variable, then it has a probability
density function f(x), and therefore its probability of falling into a given
interval, say [a, b], is given by the integral
Pr[a ≤ X ≤ b] = ∫_{a}^{b} f(x) dx
In particular, the probability for X to take any single value a (that
is, a ≤ X ≤ a) is zero, because an integral with coinciding upper and
lower limits is always equal to zero.
The definition states that a continuous probability distribution must
possess a density, or equivalently, its cumulative distribution
function be absolutely continuous. This requirement is stronger than
simple continuity of the cumulative distribution function, and there is
a special class of distributions, singular distributions, which are
neither continuous nor discrete nor a mixture of those. An example
is given by the Cantor distribution. Such singular distributions
however are never encountered in practice.
Note on terminology: some authors use the term "continuous
distribution" to denote the distribution with continuous cumulative
distribution function. Thus, their definition includes both the
(absolutely) continuous and singular distributions.
By one convention, a probability distribution is called continuous if its cumulative distribution
function F(x) = Pr[X ≤ x] is continuous and, therefore, the probability measure of singletons is zero:
Pr[X = x] = 0 for all x.
Another convention reserves the term continuous probability
distribution for absolutely continuous distributions. These
distributions can be characterized by a probability density function:
a non-negative Lebesgue integrable function f defined on the real
numbers such that
F(x) = Pr[X ≤ x] = ∫_{−∞}^{x} f(t) dt
Discrete distributions and some continuous distributions (like
the Cantor distribution) do not admit such a density.
Some properties
• The probability distribution of the sum of two independent random variables is the convolution of each of their distributions.
• Probability distributions are not a vector space (they are not closed under linear combinations, as these do not preserve non-negativity or total integral 1) but they are closed under convex combination, thus forming a convex subset of the space of functions (or measures).
Kolmogorov definition
Main articles: Probability space and Probability measure
In the measure-theoretic formalization of probability theory,
a random variable is defined as a measurable function X from
a probability space (Ω, F, P) to a measurable space (𝒳, 𝒜).
A probability distribution of X is the pushforward measure X*P of X, which is a probability measure
on (𝒳, 𝒜) satisfying X*P = P X⁻¹.
Random number generation
Main article: Pseudo-random number sampling
A frequent problem in statistical simulations (the Monte Carlo
method) is the generation of pseudo-random numbers that are
distributed in a given way. Most algorithms are based on
a pseudorandom number generator that produces
numbers X that are uniformly distributed in the interval [0,1).
These random variates X are then transformed via some
algorithm to create a new random variate having the required
probability distribution.
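A minimal sketch of that transformation idea, using inverse-transform sampling to turn uniform [0, 1) variates into exponentially distributed ones; the choice of the exponential distribution and of the rate parameter is an illustrative assumption:

```python
import math
import random

def exponential_variate(rate=1.0):
    """Inverse-transform sampling: if U is uniform on [0, 1), then
    -ln(1 - U) / rate follows an exponential distribution with the given rate."""
    u = random.random()
    return -math.log(1.0 - u) / rate

samples = [exponential_variate(rate=2.0) for _ in range(10000)]
print(sum(samples) / len(samples))   # should be close to 1 / rate = 0.5
```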
Applications
The concept of the probability distribution and the random
variables which they describe underlies the mathematical
discipline of probability theory, and the science of statistics.
There is spread or variability in almost any value that can be
measured in a population (e.g. height of people, durability of a
metal, sales growth, traffic flow, etc.); almost all measurements
are made with some intrinsic error; in physics many processes
are described probabilistically, from the kinetic properties of
gases to the quantum mechanical description of fundamental
particles. For these and many other reasons,
simple numbers are often inadequate for describing a quantity,
while probability distributions are often more appropriate.
As a more specific example of an application, the cache
language models and other statistical language models used
in natural language processing to assign probabilities to the
occurrence of particular words and word sequences do so by
means of probability distributions.
Common probability distributions
Main article: List of probability distributions
The following is a list of some of the most common probability
distributions, grouped by the type of process that they are
related to. For a more complete list, see list of probability
distributions, which groups by the nature of the outcome being
considered (discrete, continuous, multivariate, etc.)
Note also that all of the univariate distributions below are singly
peaked; that is, it is assumed that the values cluster around a
single point. In practice, actually observed quantities may
cluster around multiple values. Such quantities can be modeled
using a mixture distribution.
Related to real-valued quantities that grow linearly (e.g. errors, offsets)
• Normal distribution (Gaussian distribution), for a single such quantity; the most common continuous distribution
Related to positive real-valued quantities that grow exponentially (e.g. prices, incomes, populations)
• Log-normal distribution, for a single such quantity whose log is normally distributed
• Pareto distribution, for a single such quantity whose log is exponentially distributed; the prototypical power law distribution
Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region
• Discrete uniform distribution, for a finite set of values (e.g. the outcome of a fair die)
• Continuous uniform distribution, for continuously distributed values
Related to Bernoulli trials (yes/no events, with a given probability)
• Basic distributions:
  • Bernoulli distribution, for the outcome of a single Bernoulli trial (e.g. success/failure, yes/no)
  • Binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of independent occurrences
  • Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
  • Geometric distribution, for binomial-type observations but where the quantity of interest is the number of failures before the first success; a special case of the negative binomial distribution
• Related to sampling schemes over a finite population:
  • Hypergeometric distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, using sampling without replacement
  • Beta-binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, sampling using a Polya urn scheme (in some sense, the "opposite" of sampling without replacement)
Related to categorical outcomes (events with K possible outcomes, with a given probability for each outcome)
• Categorical distribution, for a single categorical outcome (e.g. yes/no/maybe in a survey); a generalization of the Bernoulli distribution
• Multinomial distribution, for the number of each type of categorical outcome, given a fixed number of total outcomes; a generalization of the binomial distribution
• Multivariate hypergeometric distribution, similar to the multinomial distribution, but using sampling without replacement; a generalization of the hypergeometric distribution
Related to events in a Poisson process (events that occur independently with a given rate)
• Poisson distribution, for the number of occurrences of a Poisson-type event in a given period of time
• Exponential distribution, for the time before the next Poisson-type event occurs
• Gamma distribution, for the time before the next k Poisson-type events occur
Related to the absolute values of vectors with normally distributed components
• Rayleigh distribution, for the distribution of vector magnitudes with Gaussian distributed orthogonal components. Rayleigh distributions are found in RF signals with Gaussian real and imaginary components.
• Rice distribution, a generalization of the Rayleigh distributions for where there is a stationary background signal component. Found in Rician fading of radio signals due to multipath propagation and in MR images with noise corruption on non-zero NMR signals.
Related to normally distributed quantities operated with sum of squares (for hypothesis testing)
• Chi-squared distribution, the distribution of a sum of squared standard normal variables; useful e.g. for inference regarding the sample variance of normally distributed samples (see chi-squared test)
• Student's t distribution, the distribution of the ratio of a standard normal variable and the square root of a scaled chi squared variable; useful for inference regarding the mean of normally distributed samples with unknown variance (see Student's t-test)
• F-distribution, the distribution of the ratio of two scaled chi squared variables; useful e.g. for inferences that involve comparing variances or involving R-squared (the squared correlation coefficient)
Useful as conjugate prior distributions in Bayesian inference
Main article: Conjugate prior
• Beta distribution, for a single probability (real number between 0 and 1); conjugate to the Bernoulli distribution and binomial distribution
• Gamma distribution, for a non-negative scaling parameter; conjugate to the rate parameter of a Poisson distribution or exponential distribution, the precision (inverse variance) of a normal distribution, etc.
• Dirichlet distribution, for a vector of probabilities that must sum to 1; conjugate to the categorical distribution and multinomial distribution; generalization of the beta distribution
• Wishart distribution, for a symmetric non-negative definite matrix; conjugate to the inverse of the covariance matrix of a multivariate normal distribution; generalization of the gamma distribution
See also
• Copula (statistics)
• Empirical probability
• Histogram
• Joint probability distribution
• Likelihood function
• List of statistical topics
• Kirkwood approximation
• Moment-generating function
• Quasiprobability distribution
• Riemann–Stieltjes integral application to probability theory
5.6.2 - "Greater than" Probabilities
Sometimes we want to know the probability that a variable has a value greater than some value.
For instance, we might want to know the probability that a randomly selected vehicle speed is
greater than 73 mph, written P(X > 73).
Previously we found P(X < 73) = 0.9452. The general rule for a "greater than" situation is
P(X > x) = 1 − P(X ≤ x).
Thus, P(X > 73) = 1 − 0.9452 = 0.0548. The probability that a randomly selected vehicle will be
going 73 mph or greater is 0.0548.
If we did not know P(X ≤ 73) we could compute this probability by constructing a probability
distribution in Minitab Express or Minitab.
In Minitab Express:
1. Open Minitab Express without any data.
2. From the menu bar, select Statistics > Probability Distributions > Distribution Plot.
3. Click Display Probability.
4. For Distribution, select Normal (this is the default). In this scenario, our mean is 65 and our standard deviation is 5.
5. Under Shade the area corresponding to the following: select A specified x value and Right tail. The X value is 73.
The result is the following output which shows us that 0.0547993 of the distribution is greater
than 73 mph.
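If Minitab is not available, the same probability can be checked with a short Python/SciPy sketch; the mean of 65 mph and standard deviation of 5 mph come from the example above, while the use of scipy.stats is my own choice rather than part of the course:

```python
from scipy.stats import norm

speed = norm(loc=65, scale=5)     # X ~ N(65, 5)
p_greater = 1 - speed.cdf(73)     # P(X > 73) = 1 - P(X <= 73)
print(round(p_greater, 4))        # approximately 0.0548
```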
5.6.3 - "In between" Probabilities
Suppose we want to know the probability that a normal random variable is within a specified interval.
For instance, suppose we want to know the probability that a randomly selected vehicle is travelling
between 60 and 73 mph.
We could compute the probability that the speed is less than 73 mph and the probability that the
speed is less than 60 mph and subtract the two. In other words:
P(60 < X < 73) = P(X < 73) − P(X < 60)
Or, we could use statistical software to find this range:
In Minitab Express:
1. Open Minitab Express without any data.
2. From the menu bar, select Statistics > Probability Distributions > Distribution Plot.
3. Click Display Probability.
4. For Distribution, select Normal (this is the default). In this scenario, our mean is 65 and our standard deviation is 5.
5. Under Shade the area corresponding to the following: select A specified x value and Middle. The X value 1 is 60 and X value 2 is 73.
The result is the following output which shows us that 0.786545 of the distribution is between 60
mph and 73 mph.
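The same "in between" probability can be verified with a brief SciPy sketch (again an illustrative alternative to Minitab, not part of the original lesson):

```python
from scipy.stats import norm

speed = norm(loc=65, scale=5)               # X ~ N(65, 5)
p_between = speed.cdf(73) - speed.cdf(60)   # P(60 < X < 73)
print(round(p_between, 4))                  # approximately 0.7865
```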
5.6.4 - Finding Percentiles
Percentile: Proportion of values below a given value
For example, if your test score is in the 88th percentile, then you scored better than 88% of test
takers.
We may wish to know the value of a variable that is a specified percentile. For example, what
speed is the 99.99th percentile of speeds at the highway location in our earlier example? Recall,
the mean vehicle speed is 65 mph with a standard deviation of 5 mph.
To calculate percentiles in Minitab Express:
1. Open Minitab Express without data.
2. On a PC: From the menu bar, select Statistics > Probability Distributions > CDF/PDF > Inverse (ICDF). On a Mac: From the menu bar, select Statistics > Probability Distributions > Inverse Cumulative Distribution Function.
3. Form of input is A single value.
4. Value is .9999.
5. Distribution is Normal.
6. Our mean is 65 and our standard deviation is 5.
7. Under Output, select Display a table of inverse cumulative probabilities.
8. Click OK.
The result should be the following output:
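As a cross-check on the Minitab result, the inverse cumulative distribution function (percent point function) in SciPy gives the percentile directly; this sketch is illustrative and not part of the course output:

```python
from scipy.stats import norm

# 99.99th percentile of vehicle speeds, X ~ N(65, 5)
print(round(norm.ppf(0.9999, loc=65, scale=5), 1))   # approximately 83.6 mph
```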
Video Review
5.8 - Review of Finding the Proportion
Under the Normal Curve
Video Review: Working with Continuous Random Variables
Finding the Score for a Given Proportion
This video walks through one example. A group of instructors have decided to assign grades on a
curve. Given the mean and standard deviation of their students' scores, they want to know what
point ranges are associated with which grades. Minitab Express is used.
On Your Own
Practice finding the proportion of observations under the normal curve. Each question can be
answered using either Minitab Express or the z table. Work through each example then click the
icon to view the solution and compare your answers.
HINT: Drawing the normal curve and shading in the region you are looking for is often helpful.
1. What proportion of the standard normal curve is less than a z score of 1.64?
2. What proportion of the standard normal curve falls above a z score of 1.33?
3. What proportion of the standard normal curve falls between a z score of -.50 and a z
score of +.50?
4. At one private school, a minimum IQ score of 125 is necessary to be considered for
admission. IQ scores have a mean of 100 and standard deviation of 15. Given this information,
what proportion of children are eligible for consideration for admission to this school?
5. ACT scores have a mean of 18 and a standard deviation of 6. What proportion of test
takers score between a 20 and 26?
6. A men’s clothing company is doing research on the height of adult American men in
order to inform the sizing of the clothing that they offer. The height of males in the United States
is normally distributed with a mean of 175 cm and a standard deviation of 15 cm. Men who are
more than 30 cm different (shorter or taller) from the mean are classified by the apparel company
as special cases because they do not fit in their regular length clothing. Given this information,
what proportion of men would be classified as special cases?
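For readers working in Python rather than Minitab Express or a z table, here is a sketch of how question 4 above could be checked (norm.sf is simply 1 minus the cdf; the call is an illustrative aid, not the official solution):

```python
from scipy.stats import norm

# Question 4: proportion of children with IQ of at least 125, where IQ ~ N(100, 15)
print(round(norm.sf(125, loc=100, scale=15), 4))   # approximately 0.0478
```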
Statistical Distributions (e.g. Normal,
Poisson, Binomial) and their uses
Statistics: Distributions
Summary
Normal distribution describes continuous data which have a symmetric
distribution, with a characteristic 'bell' shape.
Binomial distribution describes the distribution of binary data from a finite
sample. Thus it gives the probability of getting r events out of n trials.
Poisson distribution describes the distribution of binary data from an
infinite sample. Thus it gives the probability of getting r events in a
population.
The Normal Distribution
It is often the case with medical data that the histogram of a continuous
variable obtained from a single measurement on different subjects will
have a characteristic `bell-shaped' distribution known as a Normal
distribution. One such example is the histogram of the birth weight (in
kilograms) of the 3,226 new born babies shown in Figure 1.
Figure 1: Distribution of birth weight in 3,226 newborn babies (data from O'Cathain et al 2002)
To distinguish the use of the same word in normal range and Normal
distribution we have used a lower and upper case convention throughout
this book.
The histogram of the sample data is an estimate of the population
distribution of birth weights in new born babies. This population
distribution can be estimated by the superimposed smooth `bell-shaped'
curve or `Normal' distribution shown. We presume that if we were able to
look at the entire population of new born babies then the distribution of
birth weight would have exactly the Normal shape. We often infer, from a
sample whose histogram has the approximate Normal shape, that the
population will have exactly, or as near as makes no practical difference,
that Normal shape.
The Normal distribution is completely described by two parameters μ and
σ, where μ represents the population mean or centre of the distribution
and σ the population standard deviation. Populations with small values of
the standard deviation σ have a distribution concentrated close to the
centre μ; those with large standard deviation have a distribution widely
spread along the measurement axis. One mathematical property of the
Normal distribution is that exactly 95% of the distribution lies between
μ - (1.96 x σ) and μ + (1.96 x σ)
Changing the multiplier 1.96 to 2.58, exactly 99% of the Normal
distribution lies in the corresponding interval.
In practice the two parameters of the Normal distribution, μ and σ, must
be estimated from the sample data. For this purpose a random sample
from the population is first taken. The sample mean x̄ and the sample
standard deviation, SD(x) = s, are then calculated. If a sample is taken
from such a Normal distribution, and provided the sample is not too
small, then approximately 95% of the sample will be covered by
x̄ − [1.96 × SD(x)] to x̄ + [1.96 × SD(x)]
This is calculated by merely replacing the population parameters μ and σ
by the sample estimates x̄ and s in the previous expression.
In appropriate circumstances this interval may estimate the reference
interval for a particular laboratory test which is then used for diagnostic
purposes.
We can use the fact that our sample birth weight data appear Normally
distributed to calculate a reference range. We have already mentioned
that about 95% of the observations (from a Normal distribution) lie within
+/-1.96SDs of the mean. So a reference range for our sample of babies
is:
3.39 − [1.96 × 0.55] to 3.39 + [1.96 × 0.55]
2.31 kg to 4.47 kg
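A small Python sketch of this reference-range calculation, using the sample mean of 3.39 kg and standard deviation of 0.55 kg quoted above (the function name is my own):

```python
def normal_reference_range(mean, sd, multiplier=1.96):
    """Approximate 95% reference range assuming Normally distributed data."""
    return mean - multiplier * sd, mean + multiplier * sd

low, high = normal_reference_range(3.39, 0.55)
print(f"{low:.2f} kg to {high:.2f} kg")   # 2.31 kg to 4.47 kg
```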
A baby's weight at birth is strongly associated with mortality risk during
the first year and, to a lesser degree, with developmental problems in
childhood and the risk of various diseases in adulthood. If the data are
not Normally distributed then we can base the normal reference range on
the observed percentiles of the sample. i.e. 95% of the observed data lie
between the 2.5 and 97.5 percentiles. So a percentile-based reference
range for our sample is:
2.19kg to 4.43kg.
Most reference ranges are based on samples larger than 3500 people.
Over many years, and millions of births, the WHO has come up with a
normal birth weight range for new born babies. These ranges
represent results that are acceptable in newborn babies and actually
cover the middle 80% of the population distribution, i.e. the 10th and 90th
centiles. Low birth weight babies are usually defined (by the WHO) as
weighing less than 2500 g (the 10th centile) regardless of gestational age,
and large birth weight babies are defined as weighing above 4000 g
(the 90th centile). Hence the normal birth weight range is around 2.5 kg
to 4 kg. For our sample data, the 10th to 90th centile range was similar,
2.75 to 4.03 kg.
The Binomial Distribution
If a group of patients is given a new drug for the relief of a particular
condition, then the proportion p being successfully treated can be
regarded as estimating the population treatment success rate π.
The sample proportion p is analogous to the sample mean x̄, in that if we
score zero for those s patients who fail on treatment, and unity for those r
who succeed, then p = r/n, where n = r + s is the total number of patients
treated. Thus p also represents a mean.
Data which can take only a 0 or 1 response, such as treatment failure or
treatment success, follow the binomial distribution provided the
underlying population response rate π does not change. The binomial
probabilities are calculated from
Prob(R responses out of n) = [n! / (R!(n − R)!)] × π^R × (1 − π)^(n − R)
for successive values of R from 0 through to n. In the above n! is read as
n factorial and R! as R factorial. For R = 4, R! = 4 × 3 × 2 × 1 = 24. Both 0! and
1! are taken as equal to unity. The shaded area marked in Figure 2
corresponds to the above expression for the binomial distribution
calculated for each of R = 8, 9, ..., 20 and then added. This area totals
0.1018. So the probability of eight or more responses out of 20 is 0.1018.
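This tail probability can be reproduced with a short SciPy sketch; n = 20 and π = 0.25 are taken from the example, while scipy.stats.binom is my own choice of tool:

```python
from scipy.stats import binom

n, pi = 20, 0.25
# P(R >= 8) = 1 - P(R <= 7) for R ~ Binomial(n = 20, pi = 0.25)
print(round(1 - binom.cdf(7, n, pi), 4))   # approximately 0.1018
```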
For a fixed sample size n the shape of the binomial distribution depends
only on π. Suppose n = 20 patients are to be treated, and it is known that
on average a quarter, or π = 0.25, will respond to this particular treatment.
The number of responses actually observed can only take integer values
between 0 (no responses) and 20 (all respond). The binomial distribution
for this case is illustrated in Figure 2.
The distribution is not symmetric; it has a maximum at five responses, and
the height of the blocks corresponds to the probability of obtaining the
particular number of responses from the 20 patients yet to be treated. It
should be noted that the expected value for r, the number of successes
yet to be observed if we treated n patients, is nπ. The potential variation
about this expectation is expressed by the corresponding standard
deviation
SE(r) = √[nπ(1 − π)]
Figure 2 also shows the Normal distribution arranged to have μ = nπ = 5
and σ = √[nπ(1 − π)] = 1.94, superimposed on to a binomial distribution with
π = 0.25 and n = 20. The Normal distribution describes fairly precisely the
binomial distribution in this case.
If n is small, however, or π is close to 0 or 1, the disparity between the
Normal and binomial distributions with the same mean and standard
deviation, similar to those illustrated in Figure 2, increases and the
Normal distribution can no longer be used to approximate the binomial
distribution. In such cases the probabilities generated by the binomial
distribution itself must be used.
It is also only in situations in which reasonable agreement exists between
the distributions that we would use the confidence interval expression
given previously. For technical reasons, the expression given for a
confidence interval for π is an approximation. The approximation will
usually be quite good provided p is not too close to 0 or 1, situations in
which either almost none or nearly all of the patients respond to
treatment. The approximation improves with increasing sample size n.
Figure 2: Binomial distribution for n = 20 with π = 0.25 and the Normal approximation
The Poisson Distribution
The Poisson distribution is used to describe discrete quantitative data
such as counts in which the population size n is large, the probability of
an individual event π is small, but the expected number of events, nπ, is
moderate (say five or more). Typical examples are the number of deaths
in a town from a particular disease per day, or the number of admissions
to a particular hospital.
Example
Wight et al (2004) looked at the variation in cadaveric heart beating
organ donor rates in the UK. They found that there were 1330 organ
donors, aged 15-69, across the UK for the two years 1999 and 2000
combined. Heart-beating donors are patients who are seriously ill in an
intensive care unit (ICU) and are placed on a ventilator.
Now it is clear that the distribution of number of donors takes integer
values only, thus the distribution is similar in this respect to the binomial.
However, there is no theoretical limit to the number of organ donors that
could happen on a particular day. Here the population is the UK
population aged 15-69, over two years, which is over 82 million people,
so in this case each member can be thought to have a very small
probability of actually suffering an event, in this case being admitted to a
hospital ICU and placed on a ventilator with a life threatening condition.
The mean number of organ donors per day over the two year period is
calculated as
r = 1330/(2 × 365) = 1330/730 = 1.82 organ donations per day.
It should be noted that the expression for the mean is similar to that for
the sample mean x̄, except here multiple data values are common; and so instead of writing
each as a distinct figure in the numerator they are first grouped and
counted. For data arising from a Poisson distribution the standard error,
that is the standard deviation of r, is estimated by SE(r) = √(r/n), where
n is the total number of days. Provided the organ donation rate is not too
low, a 95% confidence interval for the underlying (true) organ donation
rate λ can be calculated by
r-1.96×SE(r) to r+1.96× SE(r).
In the above example r=1.82, SE(r)=√(1.82/730)=0.05, and therefore
the 95% confidence interval for λ is 1.72 to 1.92 organ donations per
day. Exact confidence intervals can be calculated as described by Altman
et al. (2000).
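The rate and its confidence interval can be reproduced with a few lines of Python (a sketch using only the figures quoted above):

from math import sqrt

donors, days = 1330, 730          # 1330 donors over the two years 1999 and 2000
r = donors / days                 # mean donations per day
se = sqrt(r / days)               # SE(r) = sqrt(r / n)
print(round(r, 2), round(se, 2))                          # 1.82 0.05
print(round(r - 1.96 * se, 2), round(r + 1.96 * se, 2))   # 1.72 1.92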
The Poisson probabilities are calculated from
Prob(R responses) = e^(−λ) λ^R / R!
for successive values of R from 0 to infinity. Here e is the exponential
constant 2.7182…, and λ is the population rate which is estimated by r in
the example above.
Example
Suppose that before the study of Wight et al. (2004) was conducted it
was expected that the number of organ donations per day was
approximately two. Then assuming λ = 2, we would anticipate the
probability of 0 organ donations to be e^(−2) 2^0/0! = e^(−2) = 0.1353. (Remember
that 2^0 and 0! are both equal to 1.) The probability of one organ donation
would be e^(−2) 2^1/1! = 2e^(−2) = 0.2707. Similarly the probability of two organ
donations per day is e^(−2) 2^2/2! = 2e^(−2) = 0.2707; and so on to give for three
donations 0.1804, four donations 0.0902, five donations 0.0361, six
donations 0.0120, etc. If the study is then to be conducted over 2 years
(730 days), each of these probabilities is multiplied by 730 to give the
expected number of days during which 0, 1, 2, 3, etc. donations will
occur. These expectations are 98.8, 197.6, 197.6, 131.7, 65.8, 26.3, 8.8 days.
A comparison can then be made between what is expected and what is
actually observed.
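The expected-days calculation can likewise be checked with a short script (a sketch of the same arithmetic, assuming λ = 2 and 730 days as above):

from math import exp, factorial

lam, days = 2.0, 730

def poisson_prob(k, lam):
    # Prob(R = k) = e^(-lambda) * lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

for k in range(7):
    p = poisson_prob(k, lam)
    print(k, round(p, 4), round(p * days, 1))
# prints 0.1353 and 98.8 for k=0, 0.2707 and 197.6 for k=1, and so on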
References
Altman D.G., Machin D., Bryant T.N., & Gardner M.J. Statistics with
Confidence: Confidence intervals and statistical guidelines (2nd Edition).
London: British Medical Journal, 2000.
Campbell M.J., & Machin D. Medical Statistics: A Commonsense Approach.
Chichester: Wiley, 1999.
O'Cathain A., Walters S.J., Nicholl J.P., Thomas K.J., & Kirkham M. Use of
evidence based leaflets to promote informed choice in maternity care:
randomised controlled trial in everyday practice. British Medical Journal 2002;
324: 643-646.
Melchart D., Streng A., Hoppe A., Brinkhaus B., Witt C., et al. Acupuncture in
patients with tension-type headache: randomised controlled trial. British Medical Journal
2005; 331: 376-382.
Wight J., Jakubovic M., Walters S., Maheswaran R., White P., & Lennon V.
Variation in cadaveric organ donor rates in the UK. Nephrology Dialysis
Transplantation 2004; 19(4): 963-968.
Binomial, Poisson and Gaussian distributions
Binomial distribution
The binomial distribution applies when there are two possible outcomes. You know the
probability of obtaining either outcome (traditionally called "success" and "failure") and want to
know the chance of obtaining a certain number of successes in a certain number of trials.
Poisson distribution
The Poisson distribution applies when you are counting the number of objects in a certain
volume or the number of events in a certain time period. You know the average number of
counts, and wish to know the chance of actually observing various numbers of objects or
events.
Gaussian distribution
The Gaussian distribution applies when the outcome is expressed as a number that can have
a fractional value. If there are numerous reasons why any particular measurement is different
than the mean, the distribution of measurements will tend to follow a Gaussian bell-shaped
distribution. If you know the mean and SD of this distribution, you can compute the fraction of
the population that is greater (or less) than any particular value.
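If you prefer to compute these quantities yourself rather than use an online calculator, the following sketch shows one way to do it (it assumes SciPy is installed; the parameter values are arbitrary examples, not taken from any calculator above):

from scipy.stats import binom, poisson, norm

print(binom.pmf(3, n=10, p=0.2))   # binomial: P(exactly 3 successes in 10 trials)
print(poisson.pmf(2, mu=4.5))      # Poisson: P(exactly 2 events when 4.5 are expected)
print(1 - norm.cdf(1.0))           # Gaussian: fraction lying more than 1 SD above the mean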
Normal Distribution, Binomial Distribution, Poisson Distribution
1. Binomial Distribution and Applications
2. Binomial Probability Distribution. Is the binomial distribution a continuous
distribution? Why? Notation: X ~ B(n,p). There are 4 conditions that need to be satisfied for a
binomial experiment: 1. There is a fixed number of n trials carried out. 2. The outcome of
a given trial is either a "success" or "failure". 3. The probability of success (p) remains
constant from trial to trial. 4. The trials are independent; the outcome of a trial is not
affected by the outcome of any other trial.
3. Comparison between binomial and normal distributions
4. Binomial Distribution. If X ~ B(n, p), then
P(X = r) = [n! / (r!(n−r)!)] p^r (1−p)^(n−r),   r = 0, 1, ..., n,
where r = number of successes in n trials, p = probability of success,
n! = n(n−1)(n−2)...1, and 0! = 1! = 1.
5. Exam Question: Ten percent of computer parts produced by a certain supplier are
defective. What is the probability that a sample of 10 parts contains more than 3
defective ones?
6. Solution: Method 1 (Using the Binomial Formula):
7. Method 2 (Using the Binomial Table):
8. From the table of the binomial distribution:
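Since the solution slides above only point to a table lookup, here is a quick computational check of the exam question (a sketch, not part of the original slides): X ~ B(10, 0.1) and P(X > 3) = 1 − P(X ≤ 3).

from math import comb

n, p = 10, 0.1
p_at_most_3 = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(4))
print(round(1 - p_at_most_3, 4))   # about 0.0128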
9. Example 2: If X is binomially distributed with 6 trials and a probability of success equal
to ¼ at each attempt, what is the probability of a) exactly 4 successes, b) at least one
success?
10. Example 3: Jeremy sells a magazine which is produced in order to raise money for
homeless people. The probability of making a sale is, independently, 0.50 for each
person he approaches. Given that he approaches 12 people, find the probability that he
will make: (a) 2 or fewer sales; (b) exactly 4 sales; (c) more than 5 sales.
11. Normal Distribution
12. Normal Distribution: In general, when we gather data, we expect to see a particular
pattern to the data, called a normal distribution. A normal distribution is one where the
data are distributed symmetrically around the mean, which when plotted as a histogram will result
in a bell curve, also known as a Gaussian distribution.
13. Thus, values tend towards the mean – the closer a value is to the mean, the more
often you'll see it; and the numbers of values on either side of the mean at any particular
distance are equal, or in symmetry.
15. Z-score: with the mean and standard deviation of a set of scores which are normally
distributed, we can standardize each "raw" score, x, by converting it into a z-score by
using the following formula on each individual score: z = (x − μ)/σ
16. Example 1: a) Find the z-score corresponding to a raw score of 132 from a normal
distribution with mean 100 and standard deviation 15. b) A z-score of 1.7 was found from
an observation coming from a normal distribution with mean 14 and standard deviation 3.
Find the raw score. Solution: a) We compute z = (132 − 100)/15 = 2.133. b) We
have 1.7 = (x − 14)/3. To solve this we just multiply both sides by the denominator 3:
(1.7)(3) = x − 14, so 5.1 = x − 14 and x = 19.1.
17. Example 2: Find a) P(z < 2.37), b) P(z > 1.82). Solution: a) We use the table. Notice the
picture on the table has a shaded region corresponding to the area to the left of (below) a
z-score. This is exactly what we want. Hence P(z < 2.37) = .9911. b) In this case, we want
the area to the right of 1.82. This is not what is given in the table. We can use the identity
P(z > 1.82) = 1 − P(z < 1.82). Reading the table gives P(z < 1.82) = .9656. Our answer is
P(z > 1.82) = 1 − .9656 = .0344.
18. Example 3: Find P(−1.18 < z < 2.1). Solution: Once again, the table does not exactly
handle this type of area. However, the area between −1.18 and 2.1 is equal to the area to
the left of 2.1 minus the area to the left of −1.18. That is, P(−1.18 < z < 2.1) = P(z < 2.1) − P(z < −1.18).
To find P(z < 2.1) we rewrite it as P(z < 2.10) and use the table to get P(z < 2.10) = .9821.
The table also tells us that P(z < −1.18) = .1190. Now subtract to get
P(−1.18 < z < 2.1) = .9821 − .1190 = .8631.
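The three worked z-score examples can be verified with the Python standard library (a sketch; statistics.NormalDist requires Python 3.8+):

from statistics import NormalDist

z = NormalDist()                              # standard Normal: mean 0, SD 1
print(round((132 - 100) / 15, 3))             # Example 1a: z = 2.133
print(round(z.cdf(2.37), 4))                  # Example 2a: P(z < 2.37) = 0.9911
print(round(1 - z.cdf(1.82), 4))              # Example 2b: P(z > 1.82) = 0.0344
print(round(z.cdf(2.10) - z.cdf(-1.18), 4))   # Example 3: 0.8631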
19. Poisson Distribution
20. Definitions: a discrete probability distribution for the count of events that occur
randomly in a given time; a discrete frequency distribution which gives the probability of
a number of independent events occurring in a fixed time.
21. The Poisson distribution uses only one formula:
P(X = x) = e^(−λ) λ^x / x!
where X = the number of events, λ = the mean number of events per interval, and e is
the constant, Euler's number (e = 2.71828...).
22. Example: Births in a hospital occur randomly at an average rate of 1.8 births per
hour. What is the probability of observing 4 births in a given hour at the hospital?
Assume X = No. of births in a given hour: i) events occur randomly, ii) mean rate λ = 1.8.
Using the Poisson formula, we can simply calculate the probability: P(X = 4) = (e^(−1.8))(1.8^4)/(4!). Ans: 0.0723.
23. If the probability of an item failing is 0.001, what is the probability of 3 failing out of
a population of 2000? λ = n × p = 2000 × 0.001 = 2. Hence, use the Poisson formula with X =
3: P(X = 3) = (e^(−2))(2^3)/(3!). Ans: 0.1804.
24. Example: A small life insurance company has determined that on the average it
receives 6 death claims per day. Find the probability that the company receives at least
seven death claims on a randomly selected day.
25. Analysis method: 1st, analyse the given data. 2nd, label the values of x and λ. At
least seven claims means X must be ≥ 7, but the values run up to infinity.
Hence, apply the probability rule P(X ≥ 7) = 1 − P(X ≤ 6), where P(X ≤ 6)
means that the value of x runs over 0, 1, 2, 3, 4, 5, 6. Total these probabilities using the
Poisson formula, then subtract the total from 1. Ans = 0.3938.
26. Example: The number of traffic accidents that occur on a particular stretch of road
during a month follows a Poisson distribution with a mean of 9.4. Find the probability that
fewer than two accidents will occur on this stretch of road during a randomly selected
month. P(X < 2) = P(X = 0) + P(X = 1). Ans: 0.000860.
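All four Poisson slide examples reduce to the same formula, so they can be checked with one helper function (a sketch of the arithmetic only):

from math import exp, factorial

def pois(k, lam):
    # P(X = k) = e^(-lambda) * lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

print(round(pois(4, 1.8), 4))                           # births: 0.0723
print(round(pois(3, 2000 * 0.001), 4))                  # item failures: 0.1804
print(round(1 - sum(pois(k, 6) for k in range(7)), 4))  # claims, P(X >= 7): about 0.394
print(round(pois(0, 9.4) + pois(1, 9.4), 6))            # accidents, P(X < 2): 0.00086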
Business Statistics:
Revealing Facts From Figures
URL for this site is:
http://ubmail.ubalt.edu/~harsham/Business-stat/opre504.htm
Professor Hossein Arsham
Introduction
Towards Statistical Thinking For Decision Making Under
Uncertainties
The Birth of Statistics
What is Business Statistics
Belief, Opinion, and Fact
Kinds of Lies: Lies, Damned Lies and Statistics
Probability for Statistical Inference
Different Schools of Thought in Inferential Statistics
Bayesian, Frequentist, and Classical Methods
Probability, Chance, Likelihood, and Odds
How to Assign Probabilities
General Laws of Probability
Mutually Exclusive versus Independent Events
Entropy Measure
Applications of and Conditions for Using Statistical Tables
Relationships Among Distributions and Unification of Statistical
Tables
Normal Distribution
Binomial Distribution
Poisson Distribution
Exponential Distribution
Uniform Distribution
Student's t-Distributions
Topics in Business Statistics
Greek Letters Commonly Used in Statistics
Type of Data and Levels of Measurement
Sampling Methods
Number of Class Intervals in a Histogram
How to Construct a Box Plot
Outlier Removal
Statistical Summaries
Representative of a Sample: Measures of Central Tendency
Selecting Among the Mean, Median, and Mode
Quality of a Sample: Measures of Dispersion
Guess a Distribution to Fit Your Data: Skewness & Kurtosis
A Numerical Example & Discussions
What Is So Important About the Normal Distributions
What Is a Sampling Distribution
What Is Central Limit Theorem
What Is "Degrees of Freedom"
Parameters' Estimation and Quality of a 'Good' Estimate
Procedures for Statistical Decision Making
Statistics with Confidence and Determining Sample Size
Hypothesis Testing: Rejecting a Claim
The Classical Approach to the Test of Hypotheses
The Meaning and Interpretation of P-values (what the data say)
Blending the Classical and the P-value Based Approaches in Test
of Hypotheses
Conditions Under Which Most Statistical Testings Apply
Homogeneous Population (Don't mix apples and oranges)
Test for Randomness: The Runs Test
Lilliefors Test for Normality
Statistical Tests for Equality of Populations Characteristics
Two-Population Independent Means (T-test)
Two Dependent Means (T-test for paired data sets)
More Than Two Independent Means (ANOVA)
More Than Two Dependent Means (ANOVA)
Power of a Test
Parametric vs. Non-Parametric vs. Distribution-free Tests
Chi-square Tests
Bonferroni Method
Goodness-of-fit Test for Discrete Random Variables
When We Should Pool Variance Estimates
Resampling Techniques: Jackknifing, and Bootstrapping
What is a Linear Least Squares Model
Pearson's and Spearman's Correlations
How to Compare Two Correlations Coefficients
Independence vs. Correlated
Correlation, and Level of Significance
Regression Analysis: Planning, Development, and Maintenance
Predicting Market Response
Warranties: Statistical Planning and Analysis
Factor Analysis
Interesting and Useful Sites (topical category)
Selected Reciprocal Web Sites
Review of Statistical Tools on the Internet
General References
Statistical Societies & Organizations
Statistics References
Statistics Resources
Statistical Data Analysis
Probability Resources
Data and Data Analysis
Computational Probability and Statistics Resources
Questionnaire Design, Surveys Sampling and Analysis
Statistical Software
Learning Statistics
Econometric and Forecasting
Selected Topics
Glossary Collections Sites
Statistical Tables
Introduction
This Web site is a course in statistics appreciation, i.e. a course to
acquire a feel for the statistical way of thinking. It is an
introductory course in statistics designed to provide you with
the basic concepts and methods of statistical analysis for
processes and products. Materials in this Web site are
tailored to meet your needs in business decision making. It
promotes thinking statistically. The cardinal objective for this
Web site is to increase the extent to which statistical
thinking is embedded in management thinking for decision
making under uncertainties. It is already an accepted fact
that "Statistical thinking will one day be as necessary for
efficient citizenship as the ability to read and write." So, let's
be ahead of our time.
To be competitive, businesses must design quality into
products and processes. Further, they must facilitate a
process of never-ending improvement at all stages of
manufacturing. A strategy employing statistical methods,
particularly statistically designed experiments, produces
processes that provide high yield and products that seldom
fail. Moreover, it facilitates development of robust products
that are insensitive to changes in the environment and
internal component variation. Carefully planned statistical
studies remove hindrances to high quality and productivity
at every stage of production, saving time and money. It is
well recognized that quality must be engineered into
products as early as possible in the design process. One
must know how to use carefully planned, cost-effective
experiments to improve, optimize and make robust products
and processes.
Business Statistics is a science assisting you to make
business decisions under uncertainties based on some
numerical and measurable scales. The decision making process
must be based on data, not on personal opinion or
belief.
Knowledge is more than knowing something technical.
Knowledge needs wisdom, and wisdom comes with age and
experience. Wisdom is about knowing how something
technical can be best used to meet the needs of the
decision-maker. Wisdom, for example, creates statistical
software that is useful, rather than technically brilliant.
The Devil is in the Deviations: Variation is an inevitability
in life! Every process has variation. Every measurement.
Every sample! Managers need to understand variation for
two key reasons: first, so that they can lead others to apply
statistical thinking in day-to-day activities; and second, so that
they can apply the concept for the purpose of continuous
improvement. This course will provide you with hands-on
experience to promote the use of statistical thinking and
techniques to apply them to make educated decisions
whenever you encounter variation in business data. You will
learn techniques to intelligently assess and manage the risks
inherent in decision-making. Therefore, remember that:
Just like weather, if you cannot control something,
you should learn how to measure and analyze, in
order to predict it, effectively.
If you have taken statistics before, and have a feeling of
inability to grasp concepts, it is largely due to your former
non-statistician instructors teaching statistics. Their
deficiencies lead students to develop phobias for the sweet
science of statistics. In this respect, the following remark
is made by Professor Herman Chernoff, in Statistical
Science, Vol. 11, No. 4, 335-350, 1996:
"Since everybody in the world thinks he can
teach statistics even though he does not know
any, I shall put myself in the position of teaching
biology even though I do not know any"
Plugging numbers into the formulas and crunching them has
no value by itself. You should continue to put effort
into the concepts and concentrate on interpreting the
results.
Even when you solve a small-sized problem by hand, I would
like you to use the available computer software and Web-based computation to do the dirty work for you.
You must be able to read off the logical secret in any
formula, not memorize it. For example, in computing
the variance, consider its formula. Instead of memorizing it,
you should start with some whys:
i. Why do we square the deviations from the mean?
Because if we add up all the deviations we always get zero. So,
to get away from this problem, we square the deviations.
Why not raise them to the power of four (three will not work)?
Since squaring does the trick, why should we make life more
complicated than it is? Notice also that squaring
magnifies the deviations; therefore it works to our
advantage for measuring the quality of the data.
ii. Why is there a summation notation in the formula?
To add up the squared deviation of each data point to
compute the total sum of squared deviations.
iii. Why do we divide the sum of squares by n-1?
The amount of deviation should also reflect how large the
sample is, so we must bring in the sample size. That is, in
general, larger sample sizes have a larger sum of squared
deviations from the mean. Okay, but why n-1 and not n? The
reason is that when you divide by n-1 the sample
variance provides, on average, a much closer estimate of the population variance
than when you divide by n. You will note that for
large sample sizes n (say over 30) it really does not matter
whether you divide by n or n-1. The results are almost the
same and acceptable. The factor n-1 is the so-called
"degrees of freedom".
This was just an example to show you how to
question the formulas rather than memorize them. In fact,
when you try to understand the formulas you do not need to
remember them; they become part of your brain's
connectivity. Clear thinking is always more important than
the ability to do a lot of arithmetic.
When you look at a statistical formula, the formula should
talk to you, as when a musician looks at a piece of musical
notes he/she hears the music. How do you become a statistician
who is also a musician?
The objectives for this course are to learn statistical
thinking; to emphasize more data and concepts, less theory
and fewer recipes; and finally to foster active learning using,
e.g., the useful and interesting Web-sites.
Some Topics in Business Statistics
Greek Letters Commonly Used as Statistical Notations
We use Greek letters in statistics and other scientific areas
to honor the ancient Greek philosophers who invented
science (such as Socrates, the inventor of dialectic
reasoning).
Greek Letters Commonly Used as Statistical Notations
alpha  beta  chi-square  delta  mu  nu  pi  rho  sigma  tau  theta
α      β     χ²          δ      μ   ν   π   ρ    σ      τ    θ
Note: Chi-square, χ², is not the square
of anything; its name is simply "Chi-square" (read: ki-square). "Ki"
by itself does not exist in statistics. I'm glad that you're overcoming
all the confusions that exist in learning statistics.
The Birth of Statistics
The original idea of "statistics" was the collection of
information about and for the "State".
The birth of statistics occurred in the mid-17th century. A
commoner named John Graunt, who was a native of
London, began reviewing a weekly church publication issued
by the local parish clerk that listed the number of births,
christenings, and deaths in each parish. These so-called Bills
of Mortality also listed the causes of death. Graunt, who was
a shopkeeper, organized these data in the form we call
descriptive statistics, which was published as Natural and
Political Observations Made upon the Bills of Mortality.
Shortly thereafter, he was elected as a member of the Royal
Society. Thus, statistics has to borrow some concepts from
sociology, such as the concept of "Population". It has been
argued that since statistics usually involves the study of
human behavior, it cannot claim the precision of the
physical sciences.
Probability has a much longer history. It originated from the
study of games of chance and gambling during the sixteenth
century. Probability theory was a branch of mathematics
studied by Blaise Pascal and Pierre de Fermat in the
seventeenth century. Currently, in the 21st century,
probabilistic modeling is used to control the flow of traffic
through a highway system, a telephone interchange, or a
computer processor; find the genetic makeup of individuals
or populations; quality control; insurance; investment; and
other sectors of business and industry.
New and ever-growing diverse fields of human activity are
using statistics; however, it seems that this field itself
remains obscure to the public. Professor Bradley Efron
expressed this fact nicely:
During the 20th Century statistical thinking and
methodology have become the scientific framework for
literally dozens of fields including education,
agriculture, economics, biology, and medicine, and
with increasing influence recently on the hard sciences
such as astronomy, geology, and physics. In other
words, we have grown from a small obscure field into a
big obscure field.
For the history of probability, and history of statistics,
visit History of Statistics Material. I also recommend the
following books.
Further Readings:
Daston L., Classical Probability in the Enlightenment,
Princeton University Press, 1988.
The book points out that early Enlightenment thinkers could
not face uncertainty. A mechanistic, deterministic machine,
was the Enlightenment view of the world.
Gillies D., Philosophical Theories of Probability, Routledge,
2000. Covers the classical, logical, subjective, frequency,
and propensity views.
Hacking I., The Emergence of Probability, Cambridge
University Press, London, 1975.
A philosophical study of early ideas about probability,
induction and statistical inference.
Peters W., Counting for Something: Statistical Principles and
Personalities, Springer, New York, 1987.
It teaches the principles of applied economic and social
statistics in a historical context. Featured topics include
public opinion polls, industrial quality control, factor
analysis, Bayesian methods, program evaluation, nonparametric and robust methods, and exploratory data
analysis.
Porter T., The Rise of Statistical Thinking, 1820-1900,
Princeton University Press, 1986.
The author states that statistics has become known in the
twentieth century as the mathematical tool for analyzing
experimental and observational data. Enshrined by public
policy as the only reliable basis for judgments as the efficacy
of medical procedures or the safety of chemicals, and
adopted by business for such uses as industrial quality
control, it is evidently among the products of science whose
influence on public and private life has been most pervasive.
Statistical analysis has also come to be seen in many
scientific disciplines as indispensable for drawing reliable
conclusions from empirical results. This new field of
mathematics found so extensive a domain of applications.
Stigler S., The History of Statistics: The Measurement of
Uncertainty Before 1900, U. of Chicago Press, 1990.
It covers the people, ideas, and events underlying the birth
and development of early statistics.
Tankard J., The Statistical Pioneers, Schenkman Books, New
York, 1984.
This work provides the detailed lives and times of theorists
whose work continues to shape much of the modern
statistics.
What is Business Statistics?
In this diverse world of ours, no two things are exactly the
same. A statistician is interested in both
the differences and the similarities, i.e. both patterns and
departures.
The actuarial tables published by insurance companies
reflect their statistical analysis of the average life
expectancy of men and women at any given age. From
these numbers, the insurance companies then calculate the
appropriate premiums for a particular individual to purchase
a given amount of insurance.
Exploratory analysis of data makes use of numerical and
graphical techniques to study patterns and departures from
patterns. The widely used descriptive statistical techniques
are: Frequency Distribution Histograms; Box & Whisker and
Spread plots; Normal plots; Cochrane (odds ratio) plots;
Scattergrams and Error Bar plots; Ladder, Agreement and
Survival plots; Residual, ROC and diagnostic plots; and
Population pyramid. Graphical modeling is a collection of
powerful and practical techniques for simplifying and
describing inter-relationships between many variables,
based on the remarkable correspondence between the
statistical concept of conditional independence and the
graph-theoretic concept of separation.
The controversial "Million Man March" on Washington in
1995 demonstrated that the size of a rally can have important
political consequences. March organizers steadfastly
maintained that the official attendance estimates offered by the
U.S. Park Service (300,000) were too low. Were they?
In examining distributions of data, you should be able to
detect important characteristics, such as shape, location,
variability, and unusual values. From careful observations of
patterns in data, you can generate conjectures about
relationships among variables. The notion of how one
variable may be associated with another permeates almost
all of statistics, from simple comparisons of proportions
through linear regression. The difference between
association and causation must accompany this conceptual
development.
Data must be collected according to a well-developed plan if
valid information on a conjecture is to be obtained. The plan
must identify important variables related to the conjecture
and specify how they are to be measured. From the data
collection plan, a statistical model can be formulated from
which inferences can be drawn.
Statistical models are currently used in various fields of
business and science. However, the terminology differs from
field to field. For example, the fitting of models to data,
called calibration, history matching, and data assimilation,
are all synonymous with parameter estimation.
Know that data are only crude information and not
knowledge by themselves. The sequence from data to
knowledge is: from Data to Information, from
Information to Facts, and finally, from Facts to
Knowledge. Data becomes information when it becomes
relevant to your decision problem. Information becomes fact
when the data can support it. Fact becomes knowledge
when it is used in the successful completion of decision
process. The following figure illustrates the statistical
thinking process based on data in constructing statistical
models for decision making under uncertainties.
That's why we need Business Statistics. Statistics arose
from the need to place knowledge on a systematic evidence
base. This required a study of the laws of probability, the
development of measures of data properties and
relationships, and so on.
The main objective of Business Statistics is to make
inference (prediction, making decisions) about certain
characteristics of a population based on information
contained in a random sample from the entire population, as
depicted below:
Business Statistics is the science of 'good' decision making
in the face of uncertainty and is used in many disciplines
such as financial analysis, econometrics, auditing,
production and operations including services improvement,
and marketing research. It provides knowledge and skills to
interpret and use statistical techniques in a variety of
business applications. A typical Business Statistics course is
intended for business majors, and covers statistical study,
descriptive statistics (collection, description, analysis, and
summary of data), probability, and the binomial and normal
distributions, test of hypotheses and confidence intervals,
linear regression, and correlation.
The following discussion refers to the above chart. Statistics
is a science of making decisions with respect to the
characteristics of a group of persons or objects on the basis
of numerical information obtained from a randomly selected
sample of the group.
At the planning stage of a statistical investigation the
question of sample size (n) is critical. This course provides a
practical introduction to sample size determination in the
context of some commonly used significance tests.
Population: A population is any entire collection of people,
animals, plants or things from which we may collect data. It
is the entire group we are interested in, which we wish to
describe or draw conclusions about. In the above figure, the
life of the light bulbs manufactured, say, by GE is the
population of concern.
Statistical Experiment
In order to make any generalization about a population, a
random sample from the entire population, that is meant to
be representative of the population, is often studied. For
each population there are many possible samples. A sample
statistic gives information about a corresponding population
parameter. For example, the sample mean for a set of data
would give information about the overall population
mean .
It is important that the investigator carefully and completely
defines the population before collecting the sample,
including a description of the members to be included.
Example: The population for a study of infant health might
be all children born in the U.S.A. in the 1980's. The sample
might be all babies born on 7th May in any of the years.
An experiment is any process or study which results in the
collection of data, the outcome of which is unknown. In
statistics, the term is usually restricted to situations in which
the researcher has control over some of the conditions
under which the experiment takes place.
Example: Before introducing a new drug treatment to
reduce high blood pressure, the manufacturer carries out an
experiment to compare the effectiveness of the new drug
with that of one currently prescribed. Newly diagnosed
subjects are recruited from a group of local general
practices. Half of them are chosen at random to receive the
new drug, the remainder receive the present one. So, the
researcher has control over the type of subject recruited and
the way in which they are allocated to treatment.
Experimental (or Sampling) Unit: A unit is a person, animal,
plant or thing which is actually studied by a researcher; the
basic objects upon which the study or experiment is carried
out. For example, a person; a monkey; a sample of soil; a
pot of seedlings; a postcode area; a doctor's practice.
Design of experiments is a key tool for increasing the rate
of acquiring new knowledge–knowledge that in turn can be
used to gain competitive advantage, shorten the product
development cycle, and produce new products and
processes which will meet and exceed your customer's
expectations.
The major task of statistics is to study the characteristics of
populations whether these populations are people, objects,
or collections of information. For two major reasons, it is
often impossible to study an entire population:
The process would be too expensive or time consuming.
The process would be destructive.
In either case, we would resort to looking at a sample
chosen from the population and trying to infer information
about the entire population by only examining the smaller
sample. Very often the numbers which interest us most
about the population are the mean μ and standard
deviation σ. Any number -- like the mean or standard
deviation -- which is calculated from an entire population is
called a Parameter. If the very same numbers are derived
only from the data of a sample, then the resulting numbers
are called Statistics. Frequently, parameters are represented
by Greek letters and statistics by Latin letters (as shown in
the above Figure). The step function in this figure is
the Empirical Distribution Function (EDF), known also
as Ogive, which is used to graph cumulative frequency. An
EDF is constructed by placing a point corresponding to
the middle point of each class at a height equal to the
cumulative frequency of the class. EDF represents the
distribution function Fx.
Parameter
A parameter is a value, usually unknown (and therefore has
to be estimated), used to represent a certain population
characteristic. For example, the population mean is a
parameter that is often used to indicate the average value of
a quantity.
Within a population, a parameter is a fixed value which does
not vary. Each sample drawn from the population has its
own value of any statistic that is used to estimate this
parameter. For example, the mean of the data in a sample
is used to give information about the overall mean in the
population from which that sample was drawn.
Statistic: A statistic is a quantity that is calculated from a
sample of data. It is used to give information about
unknown values in the corresponding population. For
example, the average of the data in a sample is used to give
information about the overall average in the population from
which that sample was drawn.
It is possible to draw more than one sample from the same
population and the value of a statistic will in general vary
from sample to sample. For example, the average value in a
sample is a statistic. The average values in more than one
sample, drawn from the same population, will not
necessarily be equal.
Statistics are often assigned Roman letters (e.g. x̄ and s),
whereas the equivalent unknown values in the population
(parameters) are assigned Greek letters (e.g. µ, σ).
The word estimate means to esteem, that is giving a value
to something. A statistical estimate is an indication of the
value of an unknown quantity based on observed data.
More formally, an estimate is the particular value of an
estimator that is obtained from a particular sample of data
and used to indicate the value of a parameter.
Example: Suppose the manager of a shop wanted to
know μ, the mean expenditure of customers in her shop in
the last year. She could calculate the average expenditure of
the hundreds (or perhaps thousands) of customers who
bought goods in her shop, that is, the population mean μ.
Instead she could use an estimate of this population
mean μ by calculating the mean of a representative sample
of customers. If this value was found to be $25, then $25
would be her estimate.
There are two broad subdivisions of statistics: Descriptive
statistics and Inferential statistics.
The principal descriptive quantity derived from sample data
is the mean (x̄), which is the arithmetic average of the
sample data. It serves as the most reliable single measure
of the value of a typical member of the sample. If the
sample contains a few values that are so large or so small
that they have an exaggerated effect on the value of the
mean, the sample is more accurately represented by the
median -- the value where half the sample values fall below
and half above.
The quantities most commonly used to measure the
dispersion of the values about their mean are the variance
s² and its square root, the standard deviation s. The
variance is calculated by determining the mean, subtracting
it from each of the sample values (yielding the deviation of
the samples), and then averaging the squares of these
deviations. The mean and standard deviation of the sample
are used as estimates of the corresponding characteristics of
the entire group from which the sample was drawn. They do
not, in general, completely describe the distribution (Fx) of
values within either the sample or the parent group; indeed,
different distributions may have the same mean and
standard deviation. They do, however, provide a complete
description of the Normal Distribution, in which positive and
negative deviations from the mean are equally common and
small deviations are much more common than large ones.
For a normally distributed set of values, a graph showing
the dependence of the frequency of the deviations upon
their magnitudes is a bell-shaped curve. About 68 percent of
the values will differ from the mean by less than the
standard deviation, and almost 100 percent will differ by
less than three times the standard deviation.
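Those percentages are easy to confirm numerically (a sketch using the standard library's NormalDist, Python 3.8+):

from statistics import NormalDist

z = NormalDist()                         # standard Normal distribution
print(round(z.cdf(1) - z.cdf(-1), 3))    # within 1 SD of the mean: about 0.683
print(round(z.cdf(3) - z.cdf(-3), 4))    # within 3 SD of the mean: about 0.9973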
Statistical inference refers to extending your knowledge
obtained from a random sample from the entire population
to the whole population. This is known in mathematics
as Inductive Reasoning. That is, knowledge of the whole
from a particular. Its main application is in hypotheses
testing about a given population.
Inferential statistics is concerned with making inferences
from samples about the populations from which they have
been drawn. In other words, if we find a difference between
two samples, we would like to know, is this a "real"
difference (i.e., is it present in the population) or just a
"chance" difference (i.e. it could just be the result of random
sampling error). That's what tests of statistical significance
are all about.
Statistical inference guides the selection of appropriate
statistical models. Models and data interact in statistical
work. Models are used to draw conclusions from data, while
the data are allowed to criticize, and even falsify the model
through inferential and diagnostic methods. Inference from
data can be thought of as the process of selecting a
reasonable model, including a statement in probability
language of how confident one can be about the selection.
Inferences made in statistics are of two types. The first
is estimation, which involves the determination, with a
possible error due to sampling, of the unknown value of a
population characteristic, such as the proportion having a
specific attribute or the average value μ of some numerical
measurement. To express the accuracy of the estimates of
population characteristics, one must also compute the
"standard errors" of the estimates; these are margins that
determine the possible errors arising from the fact that the
estimates are based on random samples from the entire
population and not on a complete population census. The
second type of inference is hypothesis testing. It involves
the definitions of a "hypothesis" as one set of possible
population values and an "alternative," a different set. There
are many statistical procedures for determining, on the
basis of a sample, whether the true population characteristic
belongs to the set of values in the hypothesis or the
alternative.
The statistical inference is grounded in probability, idealized
concepts of the group under study, called the population,
and the sample. The statistician may view the population as
a set of balls from which the sample is selected at random,
that is, in such a way that each ball has the same chance as
every other one for inclusion in the sample.
Notice that to be able to estimate the population
parameters, the sample size n must be greater than one.
For example, with a sample size of one the variation (s²)
within the sample is 0/1 = 0. An estimate for the variation
(σ²) within the population would be 0/0, which is an
indeterminate quantity, meaning impossible. For working
with zero correctly, visit the Web site The Zero Saga &
Confusions With Numbers.
Probability is the tool used for anticipating what the
distribution of data should look like under a given model.
Random phenomena are not haphazard: they display an
order that emerges only in the long run and is described by
a distribution. The mathematical description of variation is
central to statistics. The probability required for statistical
inference is not primarily axiomatic or combinatorial, but is
oriented toward describing data distributions.
Statistics is a tool that enables us to impose order on the
disorganized cacophony of the real world of modern society.
The business world has grown both in size and competition.
Corporations must undertake risky ventures, hence the
growth in popularity of and need for business statistics.
Business statistics has grown out of the art of constructing
charts and tables! It is a science of basing decisions on
numerical data in the face of uncertainty.
Business statistics is a scientific approach to decision making
under risk. In practicing business statistics, we search for an
insight, not the solution. Our search is for the one solution
that meets all the business's needs with the lowest level of
risk. Business statistics can take a normal business situation
and with the proper data gathering, analysis, and re-search
for a solution, turn it into an opportunity.
While business statistics cannot replace the knowledge and
experience of the decision maker, it is a valuable tool that
the manager can employ to assist in the decision making
process in order to reduce the inherent risk.
Business Statistics provides justifiable answers to the
following concerns for every consumer and producer:
1. What is your or your customer's Expectation of the
product/service you buy or that you sell? That is, what
is a good estimate for μ?
2. Given the information about your or your customer's
expectation, what is the Quality of the product/service
you buy or you sell? That is, what is a good estimate
for σ?
3. Given the information about your or your customer's
expectation, and the quality of the product/service you
buy or you sell, does the product/service Compare with
other existing similar types? That is, comparing
several μ's.
Visit also the following Web sites:
What is Statistics?
How to Study Statistics
Decision Analysis
Kinds of Lies: Lies, Damned Lies and Statistics
"There are three kinds of lies -- lies, damned lies, and
statistics." quoted in Mark Twain's autobiography.
It is already an accepted fact that "Statistical thinking will
one day be as necessary for efficient citizenship as the
ability to read and write."
The following are some examples of how statistics could be
misused in advertising, which can be described as the
science of arresting human unintelligence long enough to
get money from it. The founder of Revlon said, "In the factory
we make cosmetics; in the store we sell hope."
In most cases, the deception of advertising is achieved by
omission:
1. The Incredible Expansion Toyota: "How can it be
that an automobile that's a mere nine inches longer on
the outside gives you over two feet more room on the
inside? Maybe it's the new math!" Toyota Camry Ad.
Where is the fallacy in this statement? Taking volume
as length! For example: 3×6×4 = 72 cubic feet,
3×6×4.75 = 85.5 cubic feet. It could be even more
than 2 feet!
2. Pepsi Cola Ad.: " In recent side-by-side blind taste
tests, nationwide, more people preferred Pepsi over
Coca-Cola".
The questions are: Was it just some of the taste tests,
and what was the sample size? It does not say "In all
recent…"
3. Correlation? Consortium of Electric Companies Ad.
"96% of streets in the US are under-lit and, moreover,
88% of crimes take place on under-lit streets".
4. Dependent or Independent Events? "If the
probability of someone carrying a bomb on a plane is
.001, then the chance of two people carrying a bomb is
.000001. Therefore, I should start carrying a bomb on
every flight."
5. Paperboard Packaging Council's
concerns: "University studies show paper milk cartons
give you more vitamins to the gallon."
How was the experiment designed? The research was
sponsored by the council! Paperboard sales are
declining!
6. All the vitamins or just one? "You'd have to eat four
bowls of Raisin Bran to get the vitamin nutrition in one
bowl of Total".
7. Six Times as Safe: "Last year 35 people drowned in
boating accidents. Only 5 were wearing life jackets.
The rest were not. Always wear life jacket when
boating".
What percentage of boaters wear life jackets?
Conditional probability.
8. A Tax Accountant Firm Ad.: "One of our officers
would accompany you in the case of Audit".
This sounds like a unique selling proposition, but it
conceals the fact that the statement is a US Law.
9. Dunkin Donuts Ad.: "Free 3 muffins when you buy
three at the regular 1/2 dozen price."
References and Further Readings:
200% of Nothing, by A. Dewdney, John Wiley, New York,
1993. Based on his articles about math abuse in Scientific
American, Dewdney lists the many ways we are manipulated
with fancy mathematical footwork and faulty thinking in
print ads, the news, company reports and product labels. He
shows how to detect the full range of math abuses and
defend against them.
The Informed Citizen: Argument and Analysis for Today, by
W. Schindley, Harcourt Brace, 1996. This rhetoric/reader
explores the study and practice of writing argumentative
prose. The focus is on exploring current issues in
communities, from the classroom to cyberspace. The
"interacting in communities" theme and the high-interest
readings engage students, while helping them develop
informed opinions, effective arguments, and polished
writing.
Visit also the Web site: Glossary of Mathematical Mistakes.
Belief, Opinion, and Fact
The letters in your course number: OPRE 504, stand for
OPerations RE-search. OPRE is a science of making
decisions (based on some numerical and measurable
scales) by searching, and re-searching for a solution. I refer
you to What Is OR/MS? for a deeper understanding of what
OPRE is all about. Decision making under uncertainty must
be based on facts not on personal opinion nor on belief.
Belief, Opinion, and Fact

                  Belief          Opinion            Fact
Self says         I'm right       This is my view    This is a fact
Says to others    You're wrong    That is yours      I can prove it to you
Sensible decisions are always based on facts. We should not
confuse facts with beliefs or opinions. Beliefs are defined as
someone's own understanding or needs. In belief, "I am"
always right and "you" are wrong. There is nothing that can
be done to convince the person that what they believe in is
wrong. Opinions are slightly less extreme than beliefs. An
opinion means that a person has certain views that they
think are right. They also know that others are entitled to
their own opinions. People respect other's opinions and in
turn expect the same. Contrary to beliefs and opinions are
facts. Facts are the basis of decisions. A fact is something
that is right, and one can prove it to be true based on
evidence and logical arguments.
Examples for belief, opinion, and facts can be found in
religion, economics, and econometrics, respectively.
With respect to belief, Henri Poincaré said "Doubt everything
or believe everything: these are two equally convenient
strategies. With either we dispense with the need to think."
How to Assign Probabilities?
Probability is an instrument to measure the likelihood of the
occurrence of an event. There are three major approaches
of assigning probabilities as follows:
1. Classical Approach: Classical probability is predicated
on the condition that the outcomes of an experiment
are equally likely to happen. The classical probability
utilizes the idea that the lack of knowledge implies that
all possibilities are equally likely. The classical
probability is applied when the events have the same
chance of occurring (called equally likely events), and
the set of events are mutually exclusive and
collectively exhaustive. The classical probability is
defined as:
P(X) = Number of favorable outcomes / Total number
of possible outcomes
2. Relative Frequency Approach: Relative probability is
based on accumulated historical or experimental data.
Frequency-based probability is defined as:
P(X) = Number of times an event occurred / Total
number of opportunities for the event to occur.
Note that relative probability is based on the ideas that
what has happened in the past will hold.
3. Subjective Approach: The subjective probability is
based on personal judgment and experience. For
example, medical doctors sometimes assign subjective
probability to the length of life expectancy for a person
who has cancer.
General Laws of Probability
1. General Law of Addition: When two or more events
will happen at the same time, and the events are
not mutually exclusive, then:
P(X or Y) = P(X) + P(Y) - P(X and Y)
2. Special Law of Addition: When two or more events
will happen at the same time, and the
events are mutually exclusive, then:
P(X or Y) = P(X) + P(Y)
3. General Law of Multiplication: When two or more
events will happen at the same time, and the
events are dependent, then the general rule of
multiplicative law is used to find the joint probability:
P(X and Y) = P(X) . P(Y|X),
where P(Y|X) is a conditional probability.
4. Special Law of Multiplication: When two or more
events will happen at the same time, and the
events are independent, then the special rule of
multiplication law is used to find the joint probability:
P(X and Y) = P(X) . P(Y)
5. Conditional Probability Law: A conditional
probability is denoted by P(X|Y). This phrase is read:
the probability that X will occur given that Y is known
to have occurred.
Conditional probabilities are based on knowledge of
one of the variables. The conditional probability of an
event, such as X, occurring given that another event,
such as Y, has occurred is expressed as:
P(X|Y) = P(X and Y) / P(Y)
provided P(Y) is not zero. Note that when using the
conditional law of probability, you always divide the
joint probability by the probability of the event after
the word given. Thus, to get P(X given Y), you divide
the joint probability of X and Y by the unconditional
probability of Y. In other words, the above equation is
used to find the conditional probability for any
two dependent events.
A special case of the Bayes Theorem is:
P(X|Y) = P(Y|X). P(X) / P(Y)
If two events, such as X and Y,
are independent then:
P(X|Y) = P(X),
and
P(Y|X) = P(Y)
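A small sketch with a single fair die makes the laws above concrete (the events "even" and "greater than 3" are made-up examples, not from the text):

outcomes = set(range(1, 7))   # a fair six-sided die
X = {2, 4, 6}                 # event X: an even number
Y = {4, 5, 6}                 # event Y: a number greater than 3

def P(event):
    return len(event) / len(outcomes)

# General law of addition: P(X or Y) = P(X) + P(Y) - P(X and Y)
print(P(X | Y), P(X) + P(Y) - P(X & Y))       # both give 0.666...

# Conditional probability: P(X | Y) = P(X and Y) / P(Y)
print(P(X & Y) / P(Y))                        # 2/3

# Bayes: P(X | Y) = P(Y | X) * P(X) / P(Y)
print((P(X & Y) / P(X)) * P(X) / P(Y))        # the same 2/3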
Mutually Exclusive versus Independent Events
Mutually Exclusive (ME): Event A and B are M.E if both
cannot occur simultaneously. That is, P[A and B] = 0.
Independency (Ind.): Events A and B are independent if
having the information that B already occurred does not
change the probability that A will occur. That is P[A given B
occurred] = P[A].
If two events are ME they are also Dependent: P(A given B)
= P[A and B]/P[B], and since P[A and B] = 0 (by ME), then
P[A given B] = 0. Similarly,
if two events are Independent then they are also not ME.
If two events are Dependent then they may or may not be
ME.
If two events are not ME, then they may or may not be
Independent.
The following Figure contains all possibilities. The notations
used in this table are as follows: X means does not imply,
question mark ? means it may or may not imply, while
the check mark means it implies.
Bernstein was the first to discover that (probabilistic)
pairwise independence and mutual independence for a
collection of events A1,..., An are different notions.
Different Schools of Thought in Inferential Statistics
There are a few different schools of thought in statistics.
They are introduced sequentially in time, by necessity.
The Birth Process of a New School of Thought
The process of devising a new school of thought in any field
has always taken a natural path. Birth of new schools of
thought in statistics is not an exception. The birth process is
outlined below:
Given an already established school, one must work within
the defined framework.
A crisis appears, i.e., some inconsistencies in the framework
result from its own laws.
Response behavior:
1. Reluctance to consider the crisis.
2. Try to accommodate and explain the crisis within the
existing framework.
3. Conversion of some well-known scientists attracts
followers in the new school.
The following Figure illustrates the three major schools of
thought; namely, the Classical (attributed to Laplace),
Relative Frequency (attributed to Fisher), and Bayesian
(attributed to Savage). The arrows in this figure represent
some of the main criticisms among Objective, Frequentist,
and Subjective schools of thought. To which school do you
belong? Read the conclusion in this figure.
Bayesian, Frequentist, and Classical Methods
The problem with the Classical Approach is that what
constitutes an outcome is not objectively determined. One
person's simple event is another person's compound event.
One researcher may ask, of a newly discovered planet,
"what is the probability that life exists on the new planet?"
while another may ask "what is the probability that carbon-based life exists on it?"
Bruno de Finetti, in the introduction to his two-volume
treatise on Bayesian ideas, clearly states that "Probabilities
Do not Exist". By this he means that probabilities are not
located in coins or dice; they are not characteristics of
things like mass, density, etc.
Some Bayesian approaches consider probability theory as an
extension of deductive logic to handle uncertainty. It
purports to deduce from first principles the uniquely correct
way of representing your beliefs about the state of things,
and updating them in the light of the evidence. The laws of
probability have the same status as the laws of logic. This
Bayesian approach is explicitly "subjective" in the sense that
it deals with the plausibility which a rational agent ought to
attach to the propositions she considers, "given her current
state of knowledge and experience." By contrast, at least
some non-Bayesian approaches consider probabilities as
"objective" attributes of things (or situations) which are
really out there (availability of data).
A Bayesian and a classical statistician analyzing the same
data will generally reach the same conclusion. However, the
Bayesian is better able to quantify the true uncertainty in his
analysis, particularly when substantial prior information is
available. Bayesians are willing to assign probability
distribution function(s) to the population's parameter(s)
while frequentists are not.
From a scientist's perspective, there are good grounds to
reject Bayesian reasoning. The problem is that Bayesian
reasoning deals not with objective, but subjective
probabilities. The result is that any reasoning using a
Bayesian approach cannot be publicly checked -- something
that makes it, in effect, worthless to science, like non-replicable experiments.
Bayesian perspectives often shed a helpful light on classical
procedures. It is necessary to go into a Bayesian framework
to give confidence intervals the probabilistic interpretation
which practitioners often want to place on them. This insight
is helpful in drawing attention to the point that another prior
distribution would lead to a different interval.
A Bayesian may cheat by basing the prior distribution on the
data; a Frequentist can base the hypothesis to be tested on
the data. For example, the role of a protocol in clinical trials
is to prevent this from happening by requiring the
hypothesis to be specified before the data are collected. In
the same way, a Bayesian could be obliged to specify the
prior in a public protocol before beginning a study. In a
collective scientific study, this would be somewhat more
complex than for Frequentist hypotheses because priors
must be personal for coherence to hold.
A suitable quantity that has been proposed to measure inferential uncertainty, i.e., to handle the a priori unexpected, is the likelihood function itself.
If you perform a series of identical random experiments (e.g., coin tosses), the probability distribution that maximizes the likelihood of the outcome you observed is the empirical distribution, i.e., the distribution proportional to the observed frequencies.
This has the direct interpretation of telling how (relatively)
well each possible explanation (model), whether obtained
from the data or not, predicts the observed data. If the data
happen to be extreme ("atypical") in some way, so that the
likelihood points to a poor set of models, this will soon be
picked up in the next rounds of scientific investigation by the
scientific community. No long run frequency guarantee nor
personal opinions are required.
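As a minimal added illustration of this point (not from the original text), the following Python snippet evaluates the likelihood of an observed coin-toss outcome over a grid of candidate values of p and shows that it peaks at the observed relative frequency:

# Likelihood of observing 7 heads in 10 independent tosses, as a function of p.
heads, n = 7, 10

def likelihood(p):
    # Binomial likelihood, up to a constant factor that does not depend on p.
    return (p ** heads) * ((1 - p) ** (n - heads))

grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
print(best)        # 0.7, the observed relative frequency of heads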
There is a sense in which the Bayesian approach is oriented
toward making decisions and the frequentist hypothesis
testing approach is oriented toward science. For example,
there may not be enough evidence to show scientifically that
agent X is harmful to human beings, but one may be
justified in deciding to avoid it in one's diet.
Since the probability (or the distribution of possible
probabilities) is continuous, the probability that the
probability is any specific point estimate is really zero. This
means that in a vacuum of information, we can make no
guess about the probability. Even if we have information, we
can really only guess at a range for the probability.
Further Readings:
Lad F., Operational Subjective Statistical Methods, Wiley,
1996. Presents a systematic treatment of subjectivist
methods along with a good discussion of the historical and
philosophical backgrounds of the major approaches to
probability and statistics.
Plato, Jan von, Creating Modern Probability, Cambridge
University Press, 1994. This book provides a historical point
of view on subjectivist and objectivist probability school of
thoughts.
Weatherson B., Begging the question and
Bayesians, Studies in History and Philosophy of Science,
30(4), 687-697, 1999.
Zimmerman H., Fuzzy Set Theory, Kluwer Academic
Publishers, 1991. Fuzzy logic approaches to probability
(based on L.A. Zadeh and his followers) present a difference
between "possibility theory" and probability theory.
For more information, visit the Web sites Bayesian Inference for the Physical Sciences, Bayesians vs. Non-Bayesians, Society for Bayesian Analysis, Probability Theory As Extended Logic, and Bayesians worldwide.
Type of Data and Levels of Measurement
Information can be collected in statistics using qualitative or
quantitative data.
Qualitative data, such as the eye color of a group of individuals, are not computable by arithmetic relations. They are labels that indicate the category or class to which an individual, object, or process belongs. They are called categorical variables.
Quantitative data sets consist of measures that take
numerical values for which descriptions such as means and
standard deviations are meaningful. They can be put into an
order and further divided into two groups: discrete data or
continuous data. Discrete data are countable data, for
example, the number of defective items produced during a
day's production. Continuous data arise when the parameters (variables) are measurable and are expressed on a continuous scale; for example, the height of a person.
The first activity in statistics is to measure or count.
Measurement/counting theory is concerned with the
connection between data and reality. A set of data is a
representation (i.e., a model) of reality based on numerical and measurable scales. Data are called "primary
type" data if the analyst has been involved in collecting the
data relevant to his/her investigation. Otherwise, it is called
"secondary type" data.
Data come in the forms of Nominal, Ordinal, Interval and
Ratio (remember the French word NOIR for color black).
Data can be either continuous or discrete.
Levels of Measurement
_________________________________________
                        Nominal   Ordinal   Interval/Ratio
Ranking?                no        yes       yes
Numerical difference?   no        no        yes
Zero and unit of measurement are arbitrary in the Interval
scale. While the unit of measurement is arbitrary in Ratio
scale, its zero point is a natural attribute. The categorical
variable is measured on an ordinal or nominal scale.
Measurement theory is concerned with the connection
between data and reality. Both statistical theory and
measurement theory are necessary to make inferences
about reality.
Since statisticians live for precision, they prefer
Interval/Ratio levels of measurement.
Visit the Web site Measurement theory: Frequently Asked
Questions
Number of Class Intervals in a Histogram
Before we can construct our frequency distribution we must
determine how many classes we should use. This is purely
arbitrary, but too few classes or too many classes will not
provide as clear a picture as can be obtained with some
more nearly optimum number. An empirical relationship, known as Sturges' Rule, may be used as a guide for determining the number of classes k:
k = the smallest integer greater than or equal to 1 + 3.322 Log(n)
where k is the number of classes, Log is in base 10, and n is the total number of numerical values in the data set.
Therefore, the class width is:
(highest value - lowest value) / (1 + 3.322 Log(n))
where n is the total number of items in the data set.
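A small Python sketch of this guideline (added for illustration; the function names are ours) is:

import math

def sturges_classes(n):
    """Number of classes suggested by Sturges' Rule for a sample of size n."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(data):
    """Suggested class width: the range divided by the Sturges class count."""
    return (max(data) - min(data)) / sturges_classes(len(data))

print(sturges_classes(100))    # 8 classes for n = 100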
To have an "optimum" you need some measure of quality -presumably in this case, the "best" way to display whatever
information is available in the data. The sample size
contributes to this; so the usual guidelines are to use
between 5 and 15 classes, with more classes possible if you
have a larger sample. You should take into account a
preference for tidy class widths, preferably a multiple of 5 or
10, because this makes it easier to understand.
Beyond this it becomes a matter of judgement. Try out a
range of class widths, and choose the one that works best.
(This assumes you have a computer and can generate
alternative histograms fairly readily.)
There are often management issues that come into play as
well. For example, if your data is to be compared to similar
data -- such as prior studies, or from other countries -- you
are restricted to the intervals used therein.
If the histogram is very skewed, then unequal classes
should be considered. Use narrow classes where the class
frequencies are high, wide classes where they are low.
The following approaches are common:
Let n be the sample size, then the number of class intervals
could be
MIN { sqrt(n), 10 Log(n) }.
The Log is the logarithm in base 10. Thus for 200
observations you would use 14 intervals but for 2000 you
would use 33.
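A quick check of this rule (an added Python sketch) reproduces the interval counts quoted above:

import math

def num_intervals(n):
    # MIN { sqrt(n), 10 log10(n) }, rounded down to a whole number of intervals.
    return int(min(math.sqrt(n), 10 * math.log10(n)))

print(num_intervals(200))     # 14
print(num_intervals(2000))    # 33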
Alternatively,
1. Find the range (highest value - lowest value).
2. Divide the range by a reasonable interval size: 2, 3, 5,
10 or a multiple of 10.
3. Aim for no fewer than 5 intervals and no more than 15.
Visit also the Web site Histogram Applet, and Histogram
Generator
How to Construct a BoxPlot
A BoxPlot is a graphical display that has many
characteristics. It includes the presence of possible outliers.
It illustrates the range of data. It shows a measure of
dispersion such as the upper quartile, lower quartile and
interquartile range (IQR) of the data set as well as the
median as a measure of central location which is useful for
comparing sets of data. It also gives an indication of the
symmetry or skewness of the distribution. The main reason
for the popularity of boxplots is that they offer a lot of
information in a compact way.
Steps to Construct a BoxPlot:
1. Horizontal lines are drawn at the median and at the
upper and lower quartiles. These horizontal lines are
joined by vertical lines to produce the box.
2. A vertical line is drawn up from the upper quartile to the most extreme data point that is within a distance of 1.5 (IQR) of the upper quartile. A similarly defined vertical line is drawn down from the lower quartile.
3. Each data point beyond the end of the vertical lines is marked with an asterisk (*).
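The quantities needed for the plot can be computed as in the following added Python sketch (it uses Python's statistics.quantiles, whose quartile definition may differ slightly from your textbook's):

import statistics

def boxplot_summary(data):
    """Return quartiles, whisker ends and outliers for a boxplot."""
    xs = sorted(data)
    q1, q2, q3 = statistics.quantiles(xs, n=4)        # lower quartile, median, upper quartile
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme data points inside the fences.
    lower_whisker = min(x for x in xs if x >= lo_fence)
    upper_whisker = max(x for x in xs if x <= hi_fence)
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return q1, q2, q3, lower_whisker, upper_whisker, outliers

print(boxplot_summary([1, 2, 4, 7, 10, 12, 35]))      # 35 is flagged as an outlier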
Probability, Chance, Likelihood, and Odds
"Probability" has an exact technical meaning -- well, in fact
it has several, and there is still debate as to which term
ought to be used. However, for most events for which the probability is easily computed, e.g., the rolling of a die, the probability of getting a four [::], almost all agree on the actual value (1/6), if not the philosophical interpretation. A
probability is always a number between 0 [not "quite" the
same thing as impossibility: it is possible that "if" a coin
were flipped infinitely many times, it would never show
"tails", but the probability of an infinite run of heads is 0]
and 1 [again, not "quite" the same thing as certainty but
close enough].
The word "chance" or "chances" is often used as an
approximate synonym of "probability", either for variety or
to save syllables. It would be better practice to leave
"chance" for informal use, and say "probability" if that is
what is meant.
In cases where the probability of an observation is described
by a parametric model, the "likelihood" of a parameter value
given the data is defined to be the probability of the data
given the parameter. One occasionally sees "likely" and
"likelihood", however, these terms are used casually as
synonyms for "probable" and "probability".
"Odds" is a probabilistic concept related to probability. It is
the ratio of the probability (p) of an event to the probability
(1-p) that it does not happen: p/(1-p). It is often expressed
as a ratio, often of whole numbers; e.g., "odds" of 1 to 5 in
the die example above, but for technical purposes the
division may be carried out to yield a positive real number
(here 0.2). The logarithm of the odds ratio is useful for
technical purposes, as it maps the range of probabilities
onto the (extended) real numbers in a way that preserves
symmetry between the probability that an event occurs and
the probability that it does not occur.
Odds are a ratio of nonevents to events. If the event rate
for a disease is 0.1 (10 per cent), its nonevent rate is 0.9
and therefore its odds are 9:1. Note that this is not the
same expression as the inverse of event rate.
Another way to compare probabilities and odds is using
"part-whole thinking" with a binary (dichotomous) split in a
group. A probability is often a ratio of a part to a whole;
e.g., the ratio of the part [those who survived 5 years after
being diagnosed with a disease] to the whole [those who
were diagnosed with the disease]. Odds are often a ratio of
a part to a part; e.g., the odds against dying are the ratio of
the part that succeeded [those who survived 5 years after
being diagnosed with a disease] to the part that 'failed'
[those who did not survive 5 years after being diagnosed
with a disease].
Obviously, probability and odds are intimately related: Odds
= p / (1-p). Note that probability is always between zero
and one, whereas odds range from zero to infinity.
Aside from their value in betting, odds allow one to specify a
small probability (near zero) or a large probability (near
one) using large whole numbers (1,000 to 1 or a million to
one). Odds magnify small probabilities (or large
probabilities) so as to make the relative differences visible.
Consider two probabilities: 0.01 and 0.005. They are both
small. An untrained observer might not realize that one is
twice as much as the other. But if expressed as odds (99 to
1 versus 199 to 1) it may be easier to compare the two
situations by focusing on large whole numbers (199 versus
99) rather than on small ratios or fractions.
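A short Python sketch of these probability-odds conversions (an added illustration) is:

def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)

def probability(odds_value):
    """Probability corresponding to given odds in favour."""
    return odds_value / (1 + odds_value)

print(odds(1 / 6))                 # 0.2, i.e. odds of 1 to 5 for rolling a four
print(odds(0.1))                   # about 0.111, i.e. 1 to 9 in favour (9 to 1 against)
print(odds(0.01), odds(0.005))     # roughly 1/99 and 1/199, as in the text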
Visit also the Web site Counting and Combinatorial
What Is "Degrees of Freedom"
Recall that in estimating the population's variance, we used
(n-1) rather than n, in the denominator. The factor (n-1) is
called "degrees of freedom."
Estimation of the Population Variance: Variance in a
population is defined as the average of squared deviations
from the population mean. If we draw a random sample of n
cases from a population where the mean is known, we can
estimate the population variance in an intuitive way. We
sum the deviations of scores from the population mean and
divide this sum by n. This estimate is based on n
independent pieces of information and we have n degrees of
freedom. Each of the n observations, including the last one,
is unconstrained ('free' to vary).
When we do not know the population mean, we can still
estimate the population variance, but now we compute
deviations around the sample mean. This introduces an
important constraint because the sum of the deviations
around the sample mean is known to be zero. If we know
the value for the first (n-1) deviations, the last one is
known. There are only n-1 independent pieces of
information in this estimate of variance.
If you study a system with n parameters xi, i = 1,..., n, you can represent it in an n-dimensional space. Any point of this space represents a potential state of your system. If your n parameters could vary independently, then your system would be fully described in an n-dimensional hyper-volume. Now, imagine you have one constraint between the parameters (an equation relating your n parameters); then your system would be described by an (n-1)-dimensional hyper-surface. For example, in three-dimensional space, a linear relationship defines a plane, which is 2-dimensional.
In statistics, your n parameters are your n data. To evaluate
variance, you first need to infer the mean E(X). So when
you evaluate the variance, you've got one constraint on your
system (which is the expression of the mean), and it only
remains (n-1) degrees of freedom to your system.
Therefore, we divide the sum of squared deviations by n-1
rather than by n when we have sample data. On average,
deviations around the sample mean are smaller than
deviations around the population mean. This is because our
sample mean is always in the middle of our sample scores;
in fact the minimum possible sum of squared deviations for
any sample of numbers is around the mean for that sample
of numbers. Thus, if we sum the squared deviations from
the sample mean and divide by n, we have an
underestimate of the variance in the population (which is
based on deviations around the population mean).
If we divide the sum of squared deviations by n-1 instead of
n, our estimate is a bit larger, and it can be shown that this
adjustment gives us an unbiased estimate of the population
variance. However, for large n, say over 30, it does not make much of a difference whether we divide by n or by n - 1.
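A small simulation (an added sketch, not part of the original text) makes the bias visible: dividing the sum of squared deviations by n systematically underestimates the population variance, while dividing by n - 1 does not.

import random

random.seed(1)
TRUE_VAR = 4.0                       # population: Normal with standard deviation 2
n, trials = 5, 20000
biased_sum = unbiased_sum = 0.0

for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)    # sum of squared deviations
    biased_sum += ss / n                         # divide by n
    unbiased_sum += ss / (n - 1)                 # divide by n - 1

print(round(biased_sum / trials, 2))     # noticeably below 4.0 (about 3.2)
print(round(unbiased_sum / trials, 2))   # close to 4.0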
Degrees of Freedom in ANOVA: You will see the key phrase "degrees of freedom" also appearing in Analysis of Variance (ANOVA) tables. If I tell you about 4 numbers, but
don't say what they are, the average could be anything. I
have 4 degrees of freedom in the data set. If I tell you 3 of
those numbers, and the average, you can guess the fourth
number. The data set, given the average, has 3 degrees of
freedom. If I tell you the average and the standard deviation of the numbers, I have given you 2 pieces of information, reducing the degrees of freedom from 4 to 2. You only need to know 2 of the numbers' values to deduce the other 2.
In an ANOVA table, degree of freedom (df) is the divisor in
SS/df which will result in an unbiased estimate of the
variance of a population.
df = N - k, where N is the sample size, and k is a small
number, equal to the number of "constraints", the number
of "bits of information" already "used up". Degree of
freedom is an additive quantity; total amounts of it can be
"partitioned" into various components.
For example, suppose we have a sample of size 13 and
calculate its mean, and then the deviations from the mean,
only 12 of the deviations are free to vary: once one has
found 12 of the deviations, the thirteenth one is determined.
Therefore, if one is estimating a population variance from a
sample, k = 1.
In bivariate correlation or regression situations, k = 2: the
calculation of the sample means of each variable "uses up"
two bits of information, leaving N - 2 independent bits of
information.
In a one-way analysis of variance (ANOVA) with g groups,
there are three ways of using the data to estimate the
population variance. If all the data are pooled, the
conventional SST/(n-1) would provide an estimate of the
population variance.
If the treatment groups are considered separately, the
sample means can also be considered as estimates of the
population mean, and thus SSb/(g - 1) can be used as an
estimate. The remaining ("within-group", "error") variance
can be estimated from SSw/(n - g). This example
demonstrates the partitioning of df: df total = n - 1 =
df(between) + df(within) = (g - 1) + (n - g).
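A brief Python sketch (added for illustration, with made-up group data) shows this partition of the sums of squares and of their degrees of freedom:

# Three made-up treatment groups (g = 3, n = 9 observations in total).
groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]
data = [x for grp in groups for x in grp]
n, g = len(data), len(groups)
grand_mean = sum(data) / n

sst = sum((x - grand_mean) ** 2 for x in data)                              # total SS
ssb = sum(len(grp) * (sum(grp) / len(grp) - grand_mean) ** 2 for grp in groups)
ssw = sum((x - sum(grp) / len(grp)) ** 2 for grp in groups for x in grp)

print(round(sst, 2), round(ssb + ssw, 2))    # SST equals SSb + SSw
print(n - 1, (g - 1) + (n - g))              # df total = df between + df within
print(ssb / (g - 1), ssw / (n - g))          # the two variance estimates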
Therefore, the simple 'working definition' of df is 'sample size minus the number of estimated parameters'. A fuller answer would have to explain why there are situations in which the degrees of freedom is not an integer. After all is said, the best explanation is mathematical: we use df to obtain an unbiased estimate.
In summary, the concept of degrees of freedom is used for
the following two different purposes:
- Parameter(s) of certain distributions, such as the F- and t-distributions, are called degrees of freedom. Therefore, degrees of freedom could be positive non-integer number(s).
- Degrees of freedom is used to obtain unbiased estimates for the population parameters.
Outlier Removal
Because of the potentially large variance, outliers could be
the outcome of sampling. It's perfectly correct to have such
an observation that legitimately belongs to the study group
by definition. Lognormally distributed data (such as
international exchange rate), for instance, will frequently
exhibit such values.
Therefore, you must be very careful and cautious: before
declaring an observation "an outlier," find out why and how
such observation occurred. It could even be an error at the
data entering stage.
First, construct the BoxPlot of your data. Form the Q1, Q2, and Q3 points, which divide the sample into four equally sized groups (Q2 = median). Let IQR = Q3 - Q1. Outliers are defined as those points outside the values Q3 + k*IQR and Q1 - k*IQR. For most cases one sets k = 1.5.
Another alternative is the following algorithm
a) Compute the mean and standard deviation (sigma) of the whole sample.
b) Define a set of limits off the mean: mean + k sigma and mean - k sigma. (Allow the user to enter k; a typical value for k is 2.)
c) Remove all sample values outside the limits.
Now, iterate N times through the algorithm, each time
replacing the sample set with the reduced samples after
applying step (c).
Usually we need to iterate through this algorithm 4 times.
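A Python sketch of this iterative trimming procedure (an added illustration; the function name and the early-exit rule are ours, and k is the user-chosen multiplier mentioned above) is:

import statistics

def trim_outliers(sample, k=2.0, passes=4):
    """Repeatedly drop values further than k standard deviations from the mean."""
    for _ in range(passes):
        mean = statistics.mean(sample)
        sd = statistics.stdev(sample)
        kept = [x for x in sample if mean - k * sd <= x <= mean + k * sd]
        if len(kept) == len(sample):    # nothing removed, stop early
            break
        sample = kept
    return sample

data = [9, 10, 11, 10, 9, 10, 11, 95]   # 95 is a suspicious value
print(trim_outliers(data))              # the extreme value is removed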
As mentioned earlier, a common "standard" is any observation falling beyond 1.5 interquartile ranges (1.5 IQRs) above the third quartile or below the first quartile. The following SPSS program helps you in determining the outliers.
$SPSS/OUTPUT=LIER.OUT
TITLE
'DETERMINING IF OUTLIERS EXIST'
DATA LIST
FREE FILE='A' / X1
VAR LABELS
X1 'INPUT DATA'
LIST CASE
CASE=10/VARIABLES=X1/
CONDESCRIPTIVE
X1(ZX1)
LIST CASE
CASE=10/VARIABLES=X1,ZX1/
SORT CASES BY ZX1(A)
LIST CASE
CASE=10/VARIABLES=X1,ZX1/
FINISH
Statistical Summaries
Representative of a Sample: Measures of Central
Tendency Summaries
How do you describe the "average" or "typical" piece of
information in a set of data? Different procedures are used
to summarize the most representative information
depending on the type of question asked and the nature of
the data being summarized.
Measures of location give information about the location of
the central tendency within a group of numbers. The
measures of location presented in this unit for ungrouped
(raw) data are the mean, the median, and the mode.
Mean: The arithmetic mean (or the average or simple
mean) is computed by summing all numbers in an array of
numbers (xi) and then dividing by the number of
observations (n) in the array.
The mean uses all of the observations, and each observation
affects the mean. Even though the mean is sensitive to
extreme values, i.e., extremely large or small data can
cause the mean to be pulled toward the extreme data, it is
still the most widely used measure of location. This is due to
the fact that the mean has valuable mathematical properties
that make it convenient for use with inferential statistical
analysis. For example, the sum of the deviations of the
numbers in a set of data from the mean is zero, and the
sum of the squared deviations of the numbers in a set of
data from the mean is the minimum value.
Weighted Mean: In some cases, the data in the sample or
population should not be weighted equally, rather each
value should be weighted according to its importance.
Median: The median is the middle value in
an ordered array of observations. If there is an even
number of observations in the array, the median is
the average of the two middle numbers. If there is an odd
number of data in the array, the median is
the middle number.
The median is often used to summarize the distribution of
an outcome. If the distribution is skewed, the median and
the IQR may be better than other measures to indicate
where the observed data are concentrated.
Generally, the median provides a better measure of location
than the mean when there are some extremely large or
small observations; i.e., when the data are skewed to the
right or to the left. For this reason, median income is used
as the measure of location for the U.S. household income.
Note that if the median is less than the mean, the data set
is skewed to the right. If the median is greater than the
mean, the data set is skewed to the left.
Mode: The mode is the most frequently occurring value in a
set of observations. Why use the mode? The classic example
is the shirt/shoe manufacturer who wants to decide what
sizes to introduce. Data may have two modes. In this case,
we say the data are bimodal, and sets of observations with
more than two modes are referred to as multimodal. Note
that the mode does not have important mathematical
properties for future use. Also, the mode is not a helpful
measure of location, because there can be more than one
mode or even no mode.
Whenever, more than one mode exist, then the population
from which the sample came is a mixture of more than one
population. Almost all standard statistical analyses assume
that the population is homogeneous, meaning that its
density is unimodal.
Notice that Excel is a very limited statistical package. For example, it displays only one mode, the first one found, which can be misleading. However, you may find out if there are others by inspection, as follows:
Create a frequency distribution, invoke the menu sequence:
Tools, Data analysis, Frequency and follow instructions on
the screen. You will see the frequency distribution and then
find the mode visually. Unfortunately, Excel does not draw a
Stem and Leaf diagram. Other commercial off-the-shelf packages, such as SAS and SPSS, display a Stem and Leaf diagram, which is a frequency distribution of a given data set.
Quartiles & Percentiles: Quantiles are values that
separate a ranked data set into four equal classes. Whereas
percentiles are values that separate a ranked the data into
100 equal classes. The widely used quartiles are the 25th,
50th, and 75th percentiles.
Selecting Among the Mean, Median, and Mode
It is a common mistake to specify the wrong index for central tendency.
The first consideration is the type of data: if the variable is categorical, the mode is the single measure that best describes the data.
The second consideration in selecting the index is to ask
whether the total of all observations is of any interest. If the
answer is yes, then the mean is the proper index of central
tendency.
If the total is of no interest, then depending on whether the
histogram is symmetric or skewed one must use either
mean or median, respectively.
In all cases the histogram must be unimodal.
Suppose that four people want to get together to play
poker. They live on 1st Street, 3rd Street, 7th Street, and
15th Street. They want to select a house that involves the
minimum amount of driving for all parties concerned.
Let's suppose that they decide to minimize the absolute
amount of driving. If they met at 1st Street, the amount of
driving would be 0 + 2 + 6 + 14 = 22 blocks. If they met at
3rd Street, the amount of driving would be 2 + 0+ 4 + 12 =
18 blocks. If they met at 7th Street, 6 + 4 + 0 + 8 = 18
blocks. Finally, at 15th Street, 14 + 12 + 8 + 0 = 34 blocks.
So the two houses that would minimize the amount of
driving would be 3rd or 7th Street. Actually, if they wanted a
neutral site, any place on 4th, 5th, or 6th Street would also
work.
Note that any value between 3 and 7 could be defined as
the median of 1, 3, 7, and 15. So the median is the value
that minimizes the absolute distance to the data points.
Now the person at 15th Street is upset at always having to do more driving. So the group agrees to consider a different rule. They decide to minimize the square of the distance driven. This is the least squares principle. By squaring, we
give more weight to a single very long commute than to a
bunch of shorter commutes. With this rule, the 7th Street
house (36 + 16 + 0 + 64 = 116 square blocks) is preferred
to the 3rd Street house (4 + 0 + 16 + 144 = 164 square
blocks). If you consider any location, and not just the houses themselves, then the location at 6.5 (between 6th and 7th Street) minimizes the sum of the squared distances driven.
Find the value of x that minimizes
(1 - x)² + (3 - x)² + (7 - x)² + (15 - x)².
The value that minimizes the sum of squared values is 6.5
which is also equal to the arithmetic mean of 1, 3, 7, and
15. With calculus, it's easy to show that this holds in
general.
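Both claims are easy to verify numerically; the following Python sketch (an added illustration) evaluates the two criteria over a grid of candidate locations:

streets = [1, 3, 7, 15]
candidates = [i / 10 for i in range(0, 201)]     # candidate locations 0.0 to 20.0

def abs_cost(x):
    # Total number of blocks driven if everyone meets at location x.
    return sum(abs(s - x) for s in streets)

def sq_cost(x):
    # Least-squares criterion: sum of squared driving distances.
    return sum((s - x) ** 2 for s in streets)

print(min(candidates, key=abs_cost))   # any value in [3, 7] works; 3.0 is returned first
print(min(candidates, key=sq_cost))    # 6.5, the arithmetic mean of 1, 3, 7, 15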
For moderately asymmetrical distributions the mode, median, and mean approximately satisfy the formula: mode = 3(median) - 2(mean).
Consider a small sample of scores with an even number of
cases, for example, 1, 2, 4, 7, 10, and 12. The median is
5.5, the midpoint of the interval between the scores of 4
and 7.
As we discussed above, it is true that the median is a point around which the sum of absolute deviations is minimized. In this example the sum of absolute deviations is 22.
However, it is not a unique point. Any point in the 4 to 7
region will have the same value of 22 for the sum of the
absolute deviations.
Indeed, medians are tricky. The 50%-50% (above-below) description is not quite correct. For example, the set 1, 1, 1, 1, 1, 1, 8 has no point with half the data strictly below it and half strictly above it. The convention says the median is 1; however, about 14% of the data lie strictly above it, while 100% of the data are greater than or equal to the median. This generalizes to other percentiles.
We will make use of this idea in regression analysis. In an
analogous argument, the regression line is a unique line
which minimizes the sum of the squared deviations from it.
There is no unique line which minimizes the sum of the
absolute deviations from it.
Quality of a Sample: Measures of Dispersion
Average by itself is not a good indication of quality. You
need to know the variance to make any educated
assessment. We are reminded of the dilemma of the six-foot
tall statistician who drowned in a stream that had an
average depth of three feet.
These are statistical procedures for describing the nature
and extent of differences among the information in the
distribution. A measure of variability is generally reported
with a measure of central tendency.
Statistical measures of variation are numerical values that
indicate the variability inherent in a set of data
measurements. Note that a small value for a measure of
dispersion indicates that the data are concentrated around
the mean; therefore, the mean is a good representative of
the data set. On the other hand, a large measure of
dispersion indicates that the mean is not a good
representative of the data set. Also, measures of dispersion
can be used when we want to compare the distributions of
two or more sets of data. Quality of a data set is measured
by its variability: Larger variability indicates lower
quality. That is why high variation makes the manager very
worried. Your job, as a statistician is to measure the
variation, and if it is too high and unacceptable, then it is
the job of the technical staff, such as engineers, to fix the
process.
The decision situations with flat uncertainty have the
largest risk. For simplicity, consider the case when there are
only two outcomes one with probability of p. Then, the
variation in the outcomes is p(1-p). This variation is the
largest if we set p = 50%. That is, equal chance for each
outcome. In such a case, the quality of information is at its
lowest level. Remember, quality of information and
variation are inversely related. The larger the variation in
the data, the lower the quality of the data (i.e.,
information). Remember that the Devil is in the
Deviations.
The four most common measures of variation are
the range, variance, standard deviation,
and coefficient of variation.
Range: The range of a set of observations is the absolute
value of the difference between the largest and smallest
values in the set. It measures the size of the smallest
contiguous interval of real numbers that encompasses all of
the data values. It is not useful when extreme values are
present. It is based solely on two values, not on the entire
data set. In addition, it cannot be defined for open-ended
distributions such as Normal distribution.
The Normal distribution does not have a finite range. A student might argue: "since the tails of the normal density function never touch the x-axis, for an observation to contribute to forming such a curve, very large positive and negative values must exist." Indeed, such remote values are always possible, but increasingly improbable. This captures the asymptotic behavior of the normal density very well.
Variance: An important measure of variability is variance.
Variance is the average of the squared deviations of each
observation in the set from the arithmetic mean of all of
observations.
Variance = Σ(xi - x̄)² / (n - 1),   n ≥ 2.
The variance is a measure of spread or dispersion among
values in a data set. Therefore, the greater the variance, the
lower the quality.
The variance is not expressed in the same units as the
observations. In other words, the variance is hard to
understand because the deviations from the mean are
squared, making it too large for logical explanation. This
problem can be solved by working with the square root of
the variance, which is called the standard deviation.
Standard Deviation: Both variance and standard deviation
provide the same information; one can always be
obtained from the other. In other words, the process of
computing a standard deviation always involves computing a
variance. Since standard deviation is the square root of the
variance, it is always expressed in the same units as the
raw data:
For a large data set (more than 30, say), approximately 68% of the data will fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% (almost 100%) fall within three standard deviations (S) of the mean.
Standard Error: The standard error is a statistic indicating the accuracy of an estimate. That is, it tells us how different the estimate (such as the sample mean x̄) is likely to be from the population parameter (such as the population mean μ). It is therefore the standard deviation of the sampling distribution of the estimator, e.g., of the sample means x̄.
Coefficient of Variation: The Coefficient of Variation (CV) is the relative deviation with respect to size, i.e., the standard deviation expressed relative to the mean:
CV = S / x̄ (often multiplied by 100 to give a percentage)
CV is independent of the unit of measurement. In estimation
of a parameter when CV is less than say 10%, the estimate
is assumed acceptable. The inverse of CV; namely 1/CV is
called the Signal-to-noise Ratio.
The coefficient of variation is used to represent the
relationship of the standard deviation to the mean, telling
how much representative the mean is of the numbers from
which it came. It expresses the standard deviation as a
percentage of the mean; i.e., it reflects the variation in a
distribution relative to the mean.
Z Score: how many standard deviations a given point (i.e.
observations) is above or below the mean. In other words,
a Z score represents the number of standard deviations an
observation (x) is above or below the mean. The larger the
Z value, the further away a value will be from the mean.
Note that values beyond three standard deviations are very
unlikely. Note that if a Z score is negative, the observation
(x) is below the mean. If the Z score is positive, the
observation (x) is above the mean. The Z score is found as:
Z = (x - mean of X) / standard deviation of X
The Z score is a measure of the number of standard
deviations that an observation is above or below the mean.
Since the standard deviation is never negative, a positive Z
score indicates that the observation is above the mean, a
negative Z score indicates that the observation is below the
mean. Note that Z is a dimensionless value, and is therefore
a useful measure by which to compare data values from two
different populations even those measured by different
units.
Z-Transformation: Applying the formula z = (X - μ) / σ will always produce a transformed variable with a mean
of zero and a standard deviation of one. However, the shape
of the distribution will not be affected by the transformation.
If X is not normal then the transformed distribution will not
be normal either. In the following SPSS command variable x
is transformed to zx.
descriptives variables=x(zx)
You have heard the terms z value, z test, z transformation,
and z score. Do all of these terms mean the same thing?
Certainly not:
The z value refers to the critical value (a point on the horizontal axis) of the Normal(0, 1) density function, for a given area to the left of that z-value.
The z test refers to the procedures for testing the equality of the mean(s) of one (or two) population(s).
z score of a given observation x in a sample of size n, is
simply (x - average of the sample) divided by the standard
deviation of the sample.
The z transformation of a set of observations of size n is
simply (each observation - average of all observation)
divided by the standard deviation among all observations.
The aim is to produce a transformed data set with a mean of
zero and a standard deviation of one. This makes the
transformed set dimensionless and manageable with respect
to its magnitudes. It is also used in comparing several data sets measured using different scales of measurement.
Pearson coined the term "standard deviation" sometime
near 1900. The idea of using squared deviations goes back
to Laplace in the early 1800's.
Finally, notice again that transforming raw scores to z scores does NOT normalize the data.
Guess a Distribution to Fit Your Data: Skewness &
Kurtosis
A pair of statistical measures skewness and kurtosis is a
measuring tool which is used in selecting a distribution(s) to
fit your data. To make an inference with respect to the
population distribution, you may first compute skewness and
kurtosis from your random sample from the entire
population. Then, locating a point with these coordinates on
some widely used Skewness-Kurtosis Charts (available
from your instructor upon request), guess a couple of
possible distributions to fit your data. Finally, you might use
the goodness-of-fit test to rigorously come up with the best
candidate fitting your data. Removing outliers improves both
skewness and kurtosis.
Skewness: Skewness is a measure of the degree to which
the sample population deviates from symmetry with the
mean at the center.
Skewness = Σ(xi - x̄)³ / [ (n - 1) S³ ],   n ≥ 2.
Skewness will take on a value of zero when the distribution
is a symmetrical curve. A positive value indicates the
observations are clustered more to the left of the mean with
most of the extreme values to the right of the mean. A
negative skewness indicates clustering to the right. In this
case we have: Mean ≤ Median ≤ Mode. The reverse order holds for observations with positive skewness.
Kurtosis: Kurtosis is a measure of the relative peakedness
of the curve defined by the distribution of the observations.
Kurtosis = Σ(xi - x̄)⁴ / [ (n - 1) S⁴ ],   n ≥ 2.
Standard normal distribution has kurtosis of +3. A kurtosis
larger than 3 indicates the distribution is more peaked than
the standard normal distribution.
Coefficient of Excess Kurtosis = Kurtosis - 3.
A less than 3 kurtosis value means that the distribution is
flatter than the standard normal distribution.
Skewness and kurtosis can be used to check for normality via the Jarque-Bera test. For large n, under the normality condition the quantity
n [ Skewness² / 6 + (Kurtosis - 3)² / 24 ]
follows a chi-square distribution with d.f. = 2.
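As an added sketch (assuming skewness and kurtosis have already been computed as defined above), the Jarque-Bera statistic can be coded as:

def jarque_bera(skewness, kurtosis, n):
    """Jarque-Bera statistic; compare with a chi-square with 2 d.f. (5% cut-off 5.99)."""
    return n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)

# Example: a large sample whose shape is close to normal gives a small statistic.
print(jarque_bera(0.1, 3.2, 500))    # about 1.67, well below 5.99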
Further Reading:
Tabachnick B., and L. Fidell, Using Multivariate Statistics,
HarperCollins, 1996. Has a good discussion on applications
and significance tests for skewness and kurtosis.
Numerical Example & Discussions
A Numerical Example: Given the following, small (n = 4)
data set, compute the descriptive statistics: x1 = 1, x2 = 2,
x3 = 3, and x4 = 6.
i     xi    (xi - x̄)    (xi - x̄)²    (xi - x̄)³    (xi - x̄)⁴
1     1     -2           4             -8            16
2     2     -1           1             -1            1
3     3      0           0              0             0
4     6      3           9             27            81
Sum   12     0           14            18            98
The mean is 12 / 4 = 3, the variance is s² = 14 / 3 = 4.67, the standard deviation is s = (14/3)^0.5 = 2.16, the skewness is 18 / [3 (2.16)³] = 0.5952, and finally, the kurtosis is 98 / [3 (2.16)⁴] = 1.5.
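The small Python sketch below (added for checking; it follows the formulas used in this section, which divide by n - 1 rather than n) reproduces these figures:

data = [1, 2, 3, 6]
n = len(data)
mean = sum(data) / n                               # 3.0
m2 = sum((x - mean) ** 2 for x in data)            # 14
m3 = sum((x - mean) ** 3 for x in data)            # 18
m4 = sum((x - mean) ** 4 for x in data)            # 98

variance = m2 / (n - 1)                            # 4.67
s = variance ** 0.5                                # 2.16
skewness = m3 / ((n - 1) * s ** 3)                 # 0.595
kurtosis = m4 / ((n - 1) * s ** 4)                 # 1.5
print(mean, round(variance, 2), round(s, 2), round(skewness, 3), round(kurtosis, 2))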
A Short Discussion
Deviations about the mean of a distribution is the basis
for most of the statistical tests we will learn. Since we are
measuring how much a set of scores is dispersed about the
mean , we are measuring variability. We can calculate
the deviations about the mean and express it as
variance 2or standard deviation . It is very important to
have a firm grasp of this concept because it will be a
central concept throughout your statistics course.
Both variance 2 and standard deviation  measure
variability within a distribution. Standard deviation  is a
number that indicates how much on average each of the
values in the distribution deviates from the mean (or
center) of the distribution. Keep in mind that
variance 2 measures the same thing as standard
deviation  (dispersion of scores in a distribution).
Variance 2, however, is the average squared deviations
about the mean. Thus, variance 2 is the square of the
standard deviation .
The expected value and variance of x̄ are μ and σ²/n, respectively.
The expected value and variance of S² are σ² and 2σ⁴ / (n - 1), respectively.
x̄ and S² are the best estimators for μ and σ². They are Unbiased (you may update your estimate); Efficient (they have the smallest variation among other estimators); Consistent (increasing sample size provides a better estimate); and Sufficient (you do not need to have the whole data set; what you need are Σxi and Σxi² for the estimations). Note also that the above variance of S² is justified only in the case where the population distribution tends to be normal; otherwise one may use bootstrapping techniques.
In general, it is believed that the pattern of mode, median, and mean goes from lower to higher in positively skewed data sets, and follows just the opposite pattern in negatively skewed data sets. However, for example, the following 23 numbers have mean = 2.87 and median = 3, yet the data are positively skewed:
4 2 7 6 4 3 5 3 1 3 1 2 4 3 1 2 1 1 5 2 2 3 1
and the following 10 numbers have mean = median = mode = 4, but the data set is left skewed:
1 2 3 4 4 4 5 5 6 6
Note also that most commercial software do not correctly compute skewness and kurtosis. There is no easy way to
determine confidence intervals about a computed skewness
or kurtosis value from a small to medium sample. The
literature gives tables based on asymptotic methods for
sample sets larger than 100 for normal distributions only.
You may have noticed that using the above numerical
example on some computer packages such as SPSS, the
skewness and the kurtosis are different from what we have
computed. For example, the SPSS output for the skewness
is 1.190. However, for a large sample size n, the results are
identical.
Reference and Further Readings:
David H., Early Sample Measures of Variability, Statistical
Science, 13, 1998, 368-377. This article provides a good
historical accounts of statistical measures.
Groeneveld R., A class of quantile measures for
kurtosis, The American Statistician, 325, Nov. 1998.
Hosking J., Moments or L moments? An example comparing two measures of distributional shape, The American Statistician, Vol. 46, 186-189, 1992.
Parameters' Estimation and Quality of a 'Good'
Estimate
Estimation is the process by which sample data are used to
indicate the value of an unknown quantity in a population.
Results of estimation can be expressed as a single value,
known as a point estimate; or a range of values, known as a
confidence interval.
Whenever we use point estimation, we calculate the margin of error associated with that point estimate. For example, for the estimation of the population mean μ, the margin of error is calculated as: ± 1.96 SE(x̄).
In newspapers and television reports on public opinion
polls, the margin of error is the margin of "sampling error".
There are many nonsampling errors that can and do affect
the accuracy of polls. Here we talk about sampling error.
Because subgroups have larger sampling error than the whole sample, one must include the following statement:
error include but are not limited to, individuals refusing to
participate in the interview and inability to connect with the
selected number. Every feasible effort is made to obtain a
response and reduce the error, but the reader (or the
viewer) should be aware that some error is inherent in all
research."
To estimate means to esteem (to give value to). An
estimator is any quantity calculated from the sample data
which is used to give information about an unknown
quantity in the population. For example, the sample mean is
an estimator of the population mean .
Estimators of population parameters are sometimes distinguished from the true value by using the symbol 'hat'. For example, the true population standard deviation σ is estimated by σ̂, the sample-based estimate of the population standard deviation.
Example: The usual estimator of the population mean is x̄ = Σxi / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample. If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.
A "Good" estimator is the one which provides an estimate
with the following qualities:
Unbiasedness: An estimate is said to be an unbiased
estimate of a given parameter when the expected value of the estimator can be shown to be equal to the parameter
being estimated. For example, the mean of a sample is an
unbiased estimate of the mean of the population from which
the sample was drawn. Unbiasedness is a good quality for
an estimate since in such a case, using weighted average of
several estimates provides a better estimate than each one
of those estimates. Therefore, unbiasedness allows us to
upgrade our estimates. For example, if your estimates of the population mean µ are, say, 10 and 11.2 from two independent samples of sizes 20 and 30 respectively, then the estimate of the population mean µ based on both samples is the weighted average [20 (10) + 30 (11.2)] / (20 + 30) = 10.72.
Consistency: The standard deviation of an estimate is
called the standard error of that estimate. A larger standard error means more error in your estimate. It is a
commonly used index of the error entailed in estimating a
population parameter based on the information in a random
sample of size n from the entire population.
An estimator is said to be "consistent" if increasing the
sample size produces an estimate with smaller standard
error. Therefore, your estimate is "consistent" with the
sample size. That is, spending more money (to obtain a
larger sample) produces a better estimate.
Efficiency: An efficient estimate is the one which has the
smallest standard error among all other estimators of equal
size.
Sufficiency: A sufficient estimator based on a statistic
contains all the information which is present in the raw data.
For example, the sum of your data (together with the sample size) is sufficient to estimate the mean of the population. You don't have to know the data set itself. This saves a lot of money if the data have to be transmitted over a telecommunication network: simply send out the total and the sample size.
A sufficient statistic t for a parameter θ is a function of the sample data x1,...,xn which contains all the information in the sample about the parameter θ. More formally, sufficiency is defined in terms of the likelihood function for θ. For a sufficient statistic t, the likelihood L(x1,...,xn | θ) can be written as
g(t | θ) * k(x1,...,xn)
Since the second term does not depend on θ, t is said to be a sufficient statistic for θ.
Another way of stating this for the usual problems is that
one could construct a random process starting from the
sufficient statistic, which will have exactly the same
distribution as the full sample for all states of nature.
To illustrate, let the observations be independent Bernoulli
trials with the same probability of success. Suppose that
there are n trials, and that person A observes which
observations are successes, and person B only finds out the
number of successes. Then if B places these successes at
random points without replication, the probability that B will
now get any given set of successes is exactly the same as
the probability that A will see that set, no matter what the
true probability of success happens to be.
The widely used estimator of the population mean µ is x̄ = Σxi / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample; it has all of the above properties. Therefore, it is a "good" estimator.
If you want an estimate of central tendency as a parameter
for a test or for comparison, then small sample sizes are
unlikely to yield any stable estimate. The mean is sensible in
a symmetrical distribution, as a measure of central
tendency, but, e.g., with ten cases you will not be able to
judge whether you have a symmetrical distribution.
However, the mean estimate is useful if you are trying to
estimate a population sum, or some other function of
the expected value of the distribution. Would the median be
a better measure? In some distributions (e.g., shirt size) the
mode may be better. Box-plot will indicate outliers in the
data set. If there are outliers, median is better than mean
as a measure of the central tendency.
If you have a yes/no question you probably want to
calculate a proportion p of yeses (or noes). Under simple
random sampling, the variance of p is p(1-p)/n, ignoring the
finite population correction. Now a 95% confidence interval is p ± 1.96 [p(1-p)/n]^(1/2). A conservative interval can be calculated assuming p(1-p) takes its maximum value, which it does when p = 1/2. Replace 1.96 by 2, put p = 1/2, and you have a 95% confidence interval of p ± 1/n^(1/2). This approximation
works well as long as p is not too close to 0 or 1. This useful
approximation allows you to calculate approximate 95%
confidence intervals.
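A minimal Python sketch of these calculations (added here for illustration; the counts in the example call are made up) is:

import math

def proportion_ci(yeses, n, z=1.96):
    """Approximate 95% confidence interval for a proportion under simple random sampling."""
    p = yeses / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def conservative_margin(n):
    """Conservative margin of error: 1/sqrt(n), obtained with z = 2 and p = 1/2."""
    return 1 / math.sqrt(n)

print(proportion_ci(520, 1000))      # roughly (0.489, 0.551)
print(conservative_margin(1000))     # about 0.032, i.e. +/- 3.2 percentage points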
Conditions Under Which Most Statistical Testing Apply
Don't just learn formulas and number-crunching: learn
about the conditions under which statistical testing
procedures apply. The following conditions are common to
almost all tests:
1. homogeneous population (see if there are more than
one mode)
2. sample must be random (to test this, perform the Runs
Test).
3. In addition to requirement No. 1, each population has
a normal distribution (perform Test for Normality)
4. homogeneity of variances. Variation in each population
is almost the same as in the others.
For 2 populations use the F-test. For 3 or more
populations, there is a practical rule known as the
"Rule of 2". In this rule one divides the highest
variance of a sample to the lowest variance of the
other sample. Given that the sample sizes are almost
the same, and the value of this division is less than 2,
then, the variations of the populations are almost the
same.
Notice: This important condition in analysis of
variance (ANOVA and the t-test for mean differences)
is commonly tested by the Levene or its modified test
known as the Brown-Forsythe test. Unfortunately, both
tests rely on the homogeneity of variances assumption!
These assumptions are crucial, not for the
method/computation, but for the testing using the resultant
statistic. Otherwise, we can do, for example, ANOVA and
regression without any assumptions, and the numbers come
out the same -- simple computations give us least-square
fits, partitions of variance, regression coefficients, and so
on. It is only for the testing that we need certain assumptions about independence and the homogeneous distribution of the error terms, known as residuals.
Homogeneous Population
Homogeneous Population: A homogeneous population is a
statistical population which has a unique mode. To
determine if a given population is homogeneous or not,
construct the histogram of a random sample from the entire
population. If there is more than one mode, then you have a mixture of populations. Know that to perform any statistical
testing, you need to make sure you are dealing with
homogeneous population.
Test for Randomness: The Runs Test
A "run" is a maximal subsequence of like elements.
Consider the following sequence (D for Defective, N for nondefective items) out of a production line: DDDNNDNDNDDD.
Number of runs is R = 7, with n1 = 8, and n2 = 4 which are
number of D's and N's (whichever).
A sequence is random if it is neither "over-mixed" nor "under-mixed". An example of an over-mixed sequence is DDDNDNDNDNDD, with R = 9, while an under-mixed one looks like DDDDDDDDNNNN, with R = 2. Thus the above sequence seems to be random.
The Runs Test, which is also known as the Wald-Wolfowitz Test, is designed to test the randomness of a given sample at the 100(1 - α)% confidence level. To conduct a runs test on a sample, perform the following steps:
Step 1: compute the mean of the sample.
Step 2: going through the sample sequence, replace any observation with + or - depending on whether it is above or below the mean. Discard any ties.
Step 3: compute R, n1, and n2.
Step 4: compute the expected mean and variance of R, as follows:
μ = 1 + 2 n1 n2 / (n1 + n2)
σ² = 2 n1 n2 (2 n1 n2 - n1 - n2) / [ (n1 + n2)² (n1 + n2 - 1) ]
Step 5: Compute z = (R - μ) / σ.
Step 6: Conclusion:
If z ≥ Zα, then there might be cyclic, seasonal behavior (over-mixing).
If z ≤ -Zα, then there might be a trend (under-mixing).
If z ≤ -Zα or z ≥ Zα, reject the randomness.
Note: This test is valid for cases for which both n1 and
n2 are large, say greater than 10. For small sample sizes
special tables must be used.
The SPSS command for the runs test:
NPAR TEST RUNS(MEAN) X (the name of the variable).
For example, suppose for a given sample of size 50, we have R = 24, n1 = 14 and n2 = 36. Test for randomness at α = 0.05.
Plugging these into the above formulas, we have μ = 1 + 2(14)(36)/50 = 21.16, σ = 2.81, and z = (24 - 21.16)/2.81 ≈ 1.01. From the Z-table, Zα = 1.645. Since -1.645 < z < 1.645, we cannot reject the randomness of the sample at this level.
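A short Python sketch of Steps 4 and 5 (an added illustration mirroring the formulas above) is:

import math

def runs_test_z(R, n1, n2):
    """z statistic of the Wald-Wolfowitz runs test for a given run count and group sizes."""
    n = n1 + n2
    mu = 1 + 2 * n1 * n2 / n
    var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1))
    return (R - mu) / math.sqrt(var)

print(round(runs_test_z(24, 14, 36), 2))   # about 1.01, inside (-1.645, 1.645)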
Visit the Web site Test for Randomness
Lilliefors Test for Normality
The following SPSS program computes the Kolmogorov-Smirnov-Lilliefors statistic, called LS. It can easily be converted and run on any other platform.
$SPSS/OUTPUT=L.OUT
TITLE
'K-S LILLIEFORS TEST FOR NORMALITY'
DATA LIST
FREE FILE='L.DAT'/X
VAR LABELS
X 'SAMPLE VALUES'
LIST CASE
CASE=20/VARIABLES=ALL
CONDESCRIPTIVE X(ZX)
LIST CASE CASE=20/VARIABLES=X ZX/
SORT CASES BY ZX(A)
RANK VARIABLES=ZX/RFRACTION INTO CRANK/TIES=HIGH
COMPUTE Y=CDFNORM(ZX)
COMPUTE SPROB=CRANK
COMPUTE DA=Y-SPROB
COMPUTE DB=Y-LAG(SPROB,1)
COMPUTE DAABS=ABS(DA)
COMPUTE DBABS=ABS(DB)
COMPUTE LS=MAX(DAABS,DBABS)
LIST VARIABLES=X,ZX,Y,SPROB,DA,DB
LIST VARIABLES=LS
SORT CASES BY LS(D)
LIST CASES CASE=1/VARIABLES=LS
FINISH
The output is the statistic LS, which should be compared
with the following critical values after setting a significance
level  (as a function of the sample size n).
Critical Values for the Lilliefors Test
Significance Level      Critical Value
α = 0.15                0.775 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
α = 0.10                0.819 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
α = 0.05                0.895 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
α = 0.025               0.995 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
A normal probability plot will also help you detect a systematic departure from normality, which shows up as a curve. In SAS, do a PROC UNIVARIATE NORMAL PLOT. The Bera-Jarque test, which is widely used by econometricians, might also be applicable.
Further Reading
Statistical inference by normal probability paper, by T.
Takahashi, Computers & Industrial Engineering, Vol. 37, Iss.
1 - 2, pp 121-124, 1999.
Bonferroni Method
One may combine several t-tests by using the Bonferroni
method. It works reasonably well when there are only a few
tests, but as the number of comparisons increases above 8,
the value of 't' required to conclude that a difference exists
becomes much larger than it really needs to be and the
method becomes over conservative.
One way to make the Bonferroni t test less conservative is
to use the estimate of the population variance computed
from within the groups in the analysis of variance.
t = ( x̄1 - x̄2 ) / ( VW/n1 + VW/n2 )^(1/2)
where VW is the population variance computed from within the groups.
Chi-square Tests
The Chi-square is a distribution, as is the Normal and
others. The Normal (or Gaussian or bell-shaped) often
occurs naturally in real life. When we know the mean and
variance of a Normal then it allows us to find probabilities.
So if, for example, you knew some things about the average
height of women in the nation (including the fact that heights are distributed normally), you could measure all the
women in your extended family, find the average height,
and determine a probability associated with your result; if
the probability of getting your result, given your knowledge
of women nationwide, is high, then your family's female
height cannot be said to be different from average. If that
probability is low, then your result is rare (given the
knowledge about women nationwide), and you can say your
family is different. You've just completed a test of the
hypothesis that the average height of women in your family
is different from the overall average.
There are other (similar) tests where finding that probability
means NOT using the Normal distribution. One of these is a Chi-square test. For instance, if you tested the variance of your
family's female heights (which is analogous to your previous
test of the mean), you can't assume that the normal
distribution is appropriate to use. This should make sense,
since the Normal is bell-shaped, and variances have a lower
limit of zero. So, while a variance could be any huge
number, it gets bounded on the low side by zero. If you
were to test whether the variance of heights in your family
is different from the nation, a Chi-square test happens to be
appropriate, given our original above conditions. The
formula and procedure is in your textbook.
Crosstables: The variance is not the only thing for which you use a Chi-square test. Often it is used to test the relationship between two categorical variables, or the independence of two variables, such as cigarette smoking
and drug use. If you were to survey 1000 people on whether
or not they smoke and whether or not they use drugs, you
will get one of four answers: (no,no) (no,yes) (yes,no)
(yes,yes).
By compiling the number of people in each category, you
can ultimately test whether drug usage is independent of
cigarette smoking by using the Chi-square distribution (this
is approximate, but works well). Again, the methodology for
this is in your textbook. The degrees of freedom is equal to
(number of rows-1)(number of columns -1). That is, these
many figures needed to fill in the entire body of the
crosstable, the rest will be determined by using the rows
and columns sum figures.
Don't forget the conditions for the validity of the Chi-square
test, which require expected values greater than 5 in 80% or
more of the cells. Otherwise, one could use an "exact" test,
using either a permutation or resampling approach. Both SPSS
and SAS are capable of doing this test.
For a 2-by-2 table, you should use the Yates correction to
the chi-square. The Chi-square distribution is used as an
approximation to the binomial distribution; by applying a
continuity correction we get a better approximation to the
binomial distribution for the purposes of calculating tail
probabilities.
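A minimal sketch of the independence test in Python, assuming SciPy and a made-up 2x2 crosstable; scipy.stats.chi2_contingency applies the Yates continuity correction to 2x2 tables by default:

import numpy as np
from scipy import stats

# rows: smoker yes/no, columns: drug use yes/no (made-up counts)
table = np.array([[80, 170],
                  [40, 710]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=True)
print(chi2, p, dof)
print("all expected counts > 5:", (expected > 5).all())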
Use a relative risk measure such as the risk ratio or odds
ratio. In the 2-by-2 table with cells:
a   b
c   d
the most usual measures are:
Rate difference: a/(a+c) - b/(b+d)
Rate ratio: (a/(a+c)) / (b/(b+d))
Odds ratio: ad/bc
The rate difference and rate ratio are appropriate when you
are contrasting two groups, whose sizes (a+c and b+d) are
given. The odds ratio is for when the issue is association
rather than difference. Confidence interval methods are
available for all of these, though not as widely available in
software as they should be. If the hypothesis test is highly
significant, the confidence interval will be well away from
the null hypothesis value (0 for the rate difference, 1 for the
rate ratio or odds ratio).
The risk ratio is the ratio of the proportion (a/(a+b)) to the
proportion (c/(c+d)):
RR = (a / (a + b)) / (c / (c + d))
RR is thus a measure of how much larger the proportion in
the first row is compared to the second row. It ranges from
0 to infinity, with RR < 1.00 indicating a 'negative' association
[a/(a+b) < c/(c+d)], RR = 1.00 indicating no association [a/(a+b)
= c/(c+d)], and RR > 1.00 indicating a 'positive' association
[a/(a+b) > c/(c+d)]. The further from 1.00, the stronger the
association. Most stats packages will calculate the RR and
confidence intervals for you. A related measure is the odds
ratio (or cross product ratio) which is (a/b)/(c/d).
You could also look at the φ (phi) statistic, which is:
φ = (χ²/N)½
where χ² is the Pearson chi-square and N is the sample
size. This statistic ranges between 0 and 1 and can be
interpreted like the correlation coefficient.
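A short sketch, with hypothetical cell counts, of the measures just listed; the chi-square used for φ here is the uncorrected Pearson statistic for a 2x2 table:

import math

a, b, c, d = 80, 170, 40, 710          # cells of the 2x2 table, as laid out above
n = a + b + c + d

rate_difference = a / (a + c) - b / (b + d)
rate_ratio = (a / (a + c)) / (b / (b + d))
odds_ratio = (a * d) / (b * c)
risk_ratio = (a / (a + b)) / (c / (c + d))

chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
phi = math.sqrt(chi2 / n)              # interpretable like a correlation coefficient

print(rate_difference, rate_ratio, odds_ratio, risk_ratio, phi)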
Visit Critical Values for the Chi-square Distribution.
Visit also the Web sites Exact Unconditional Tests, and
Statistical tests.
Reference:
Fleiss J., Statistical Methods for Rates and Proportions,
Wiley, 1981.
Goodness-of-fit Test for Discrete Random Variables
There are other tests which use the Chi-square, such
as the goodness-of-fit test for discrete random variables. Again,
don't forget the conditions for the validity of the Chi-square test,
which require expected values greater than 5 in 80% or more
of the cells. Chi-square here is a statistical test that
measures "goodness-of-fit"; in other words, it measures
how much the observed or actual frequencies differ from the
expected or predicted frequencies. Using a Chi-square table
will enable you to discover how significant the difference is.
A null hypothesis in the context of the Chi-square test is the
model that you use to calculate your expected or predicted
values. If the value you get from calculating the Chi-square
statistic is sufficiently high (as compared to the values in the
Chi-square table) it tells you that your null hypothesis is
probably wrong.
Let Y1, Y2, ..., Yn be a set of independent and identically
distributed random variables. Assume that the probability
distribution of the Yi's has the density function fo(y). We can
divide the set of all possible values of Yi, i ∈ {1, 2, ..., n},
into m non-overlapping intervals D1, D2, ..., Dm. Define the
probability values p1, p2, ..., pm as:
p1 = P(Yi ∈ D1)
p2 = P(Yi ∈ D2)
:
:
pm = P(Yi ∈ Dm)
Since the union of the mutually exclusive intervals D1, D2,
..., Dm is the set of all possible values for the Yi's,
(p1 + p2 + ... + pm) = 1. Define the set of discrete random
variables X1, X2, ..., Xm, where
X1 = number of Yi's whose value ∈ D1
X2 = number of Yi's whose value ∈ D2
:
:
Xm = number of Yi's whose value ∈ Dm
and (X1 + X2 + ... + Xm) = n. Then the set of discrete
random variables X1, X2, ..., Xm will have a multinomial
probability distribution with parameters n and the set of
probabilities {p1, p2, ..., pm}. If the intervals D1, D2, ...,
Dm are chosen such that npi ≥ 5 for i = 1, 2, ..., m, then
C = Σ (Xi - npi)² / npi, where the sum is over i = 1, 2, ..., m,
is distributed as χ² with m-1 degrees of freedom.
For the goodness-of-fit sample test, we formulate the null
and alternative hypotheses as
Ho: fY(y) = fo(y)
H1: fY(y) ≠ fo(y)
At the α level of significance, Ho will be rejected in favor of
H1 if
C = Σ (Xi - npi)² / npi is greater than the critical value of the
χ² distribution with m-1 degrees of freedom.
However, it is possible that in a goodness-of-fit test, one or
more of the parameters of fo(y) are unknown. Then the
probability values p1, p2, ..., pm will have to be estimated by
assuming that Ho is true and calculating their estimated
values from the sample data. That is, another set of
probability values p'1, p'2, ..., p'm will need to be computed
so that the values (np'1, np'2, ..., np'm) are the estimated
expected values of the multinomial random variable (X1, X2,
..., Xm). In this case, the random variable C will still have a
chi-square distribution, but its degrees of freedom will be
reduced. In particular, if the density function fo(y)
has r unknown parameters,
C = Σ (Xi - np'i)² / np'i is distributed as χ² with m-1-r
degrees of freedom.
For this goodness-of-fit test, we formulate the null and
alternative hypotheses as
Ho: fY(y) = fo(y)
H1: fY(y) ≠ fo(y)
At the α level of significance, Ho will be rejected in favor of
H1 if C is greater than the critical value of the χ² distribution
with m-1-r degrees of freedom.
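As an illustration, here is a minimal Python sketch (using SciPy, with made-up die-roll counts) of the statistic C compared against the chi-square distribution; if r parameters of fo(y) had been estimated from the data, the ddof argument would reduce the degrees of freedom accordingly:

import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 25, 21, 18])     # made-up counts for a six-sided die
expected = np.full(6, observed.sum() / 6)         # fair-die model: n * p_i = 20 each

C, p_value = stats.chisquare(observed, expected)  # C = sum (X_i - n p_i)^2 / (n p_i)
print(C, p_value)                                 # compared with chi-square on m-1 = 5 df

# With r estimated parameters: stats.chisquare(observed, expected, ddof=r)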
Using chi-square in a 2x2 table requires Yates's
correction. One first subtracts 0.5 from the absolute
differences between observed and expected frequencies for
each cell before squaring, dividing by the
expected frequency, and summing. The formula for the chi-square
value in a 2x2 table can be derived from the Normal
Theory comparison of the two proportions in the table, using
the total incidence to produce the standard errors. The
rationale of the correction is a better equivalence between the
area under the normal curve and the probabilities obtained from
the discrete frequencies. In other words, the simplest
correction is to move the cut-off point for the continuous
distribution from the observed value of the discrete
distribution to midway between that and the next value in
the direction of the null hypothesis expectation. Therefore,
the correction essentially applies only to 1-df tests, where
the "square root" of the chi-square looks like a "normal/t-test"
statistic and where a direction can be attached to the 0.5
addition.
For more, visit the Web sites Chi-Square Lesson, and Exact
Unconditional Tests.
Statistics with Confidence
In practice, a confidence interval is used to express the
uncertainty in a quantity being estimated. There is
uncertainty because inferences are based on a random
sample of finite size from the entire population or process of
interest. To judge the statistical procedure we can ask what
would happen if we were to repeat the same study, over and
over, getting different data (and thus different confidence
intervals) each time.
In most studies investigators are usually interested in
determining the size of difference of a measured outcome
between groups, rather than a simple indication of whether
or not it is statistically significant. Confidence intervals
present a range of values, on the basis of the sample data,
in which the population value for such a difference may lie.
Know that a confidence interval computed from one sample
will be different from a confidence interval computed from
another sample.
Understand the relationship between sample size and width
of confidence interval.
Know that sometimes the computed confidence interval
does not contain the true mean value (that is, it is
incorrect) and understand how this coverage rate is related
to confidence level.
Just a word of interpretive caution. Let's say you compute a
95% confidence interval for a mean μ. The way to interpret
this is to imagine an infinite number of samples from the
same population; 95% of the computed intervals will contain
the population mean μ. However, it is wrong to state, "I am
95% confident that the population mean falls within the
interval."
Again, the usual definition of a 95% confidence interval is an
interval constructed by a process such that the interval will
contain the true value 95% of the time. This means that
"95%" is a property of the process, not the interval.
Is the probability of occurrence of the population mean
greater at the confidence interval's center and lowest at the
boundaries? Does the probability of occurrence of the
population mean in a confidence interval vary in a
measurable way from the center to the boundaries? In a
general sense, normality is assumed, and then the interval
between the CI limits is represented by a bell-shaped t
distribution. The expectation (E) of another value is highest
at the calculated mean value and decreases as the values
approach the CI limits.
An approximation for the single-measurement tolerance
interval is √n times the confidence interval of the mean.
Determining sample size: At the planning stage of a
statistical investigation the question of sample size (n) is
critical. The above figure also provides a practical guide to
sample size determination in the context of statistical
estimations and statistical significance tests.
The confidence level of conclusions drawn from a set of data
depends on the size of data set. The larger the sample, the
higher is the associated confidence. However, larger
samples also require more effort and resources. Thus, your
goal must be to find the smallest sample size that will
provide the desirable confidence. In the above figure,
formulas are presented for determining the sample size
required to achieve a given level of accuracy and
confidence.
In estimating the sample size, when the standard deviation
is not known, one may use one-fourth of the range, for samples
of size over 30, as a "good" estimate of the standard deviation.
It is good practice to compare the result with IQR/1.349.
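A minimal sketch of the usual sample-size formula for estimating a mean, n = (z·σ/E)², with σ roughly estimated as range/4 as suggested above (SciPy is assumed for the normal quantile; the function name and figures are illustrative):

import math
from scipy import stats

def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    # n = (z * sigma / E)^2, rounded up
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

sigma_guess = 40 / 4            # range/4 rule of thumb, for a made-up range of 40
print(sample_size_for_mean(sigma_guess, margin_of_error=2))   # about 97 observations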
A Note on Multiple Comparison via the Individual
Intervals: Notice that, if the confidence intervals from two
samples do not overlap, there is a statistically significant
difference, say at 5%. However, the converse is not true:
two confidence intervals can overlap quite a lot, yet there may
still be a significant difference between them. One should examine
the confidence interval for the difference explicitly. Even if
the C.I.'s are overlapping, it is hard to find the exact overall
confidence level. However, the sum of the individual error
levels can serve as an upper limit on the overall error level.
This is evident from the fact that P(A or B) ≤ P(A) + P(B).
Further Reading
Hahn G. and W. Meeker, Statistical Intervals: A Guide for
Practitioners, Wiley, 1991.
Also visit the Web sites Confidence Interval
Applet, statpage.
Entropy Measure
Inequality coefficients used in sociology, economy,
biostatistics, ecology, physics, image analysis and
information processing are analyzed in order to shed light on
economic disparity worldwide. The variability of categorical
data is measured by the entropy function:
E = - Σ pi ln(pi)
where the sum is over all the categories and pi is the relative
frequency of the ith category. It is interesting to note that
this quantity is maximized when all the pi's are equal.
For an r×c contingency table it is
E = Σ pij ln(pij) - Σ (Σj pij) ln(Σj pij) - Σ (Σi pij) ln(Σi pij)
where the first sum is over all i and j, and the marginal sums
(over j, and over i) give the row and column totals, respectively.
Another measure is the Kullback-Leibler distance (related to
information theory):
Σ (Pi - Qi) log(Pi/Qi) = Σ Pi log(Pi/Qi) + Σ Qi log(Qi/Pi),
or the variation distance
Σ | Pi - Qi | / 2,
where Pi and Qi are the probabilities for the i-th category for
the two populations.
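A small Python sketch of these quantities for two hypothetical category distributions (NumPy only; the function names are illustrative):

import numpy as np

def entropy(p):
    # E = -sum p_i ln p_i; maximized when all p_i are equal
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                               # treat 0 * ln 0 as 0
    return -np.sum(p * np.log(p))

def symmetric_kl(p, q):
    # sum (P_i - Q_i) ln(P_i / Q_i), the symmetrized Kullback-Leibler distance
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) * np.log(p / q))

def variation_distance(p, q):
    return 0.5 * np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))

P = [0.2, 0.3, 0.5]
Q = [0.25, 0.25, 0.5]
print(entropy(P), symmetric_kl(P, Q), variation_distance(P, Q))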
For more on entropy visit the Web sites Entropy on
WWW, Entropy and Inequality Measures, and Biodiversity.
What Is Central Limit Theorem?
The central limit theorem (CLT) is a "limit" that is "central"
to statistical practice. For practical purposes, the main idea
of the CLT is that the average (center of data) of a sample
of observations drawn from some population is
approximately distributed as a normal distribution if certain
conditions are met. In theoretical statistics there are several
versions of the central limit theorem depending on how
these conditions are specified. These are concerned with the
types of conditions made about the distribution of the parent
population (population from which the sample is drawn) and
the actual sampling procedure.
One of the simplest versions of the theorem says that if we
take a random sample of size n from the entire population,
then the sample mean, which is a random variable defined
by Σ xi / n, has a histogram which converges to a normal
distribution shape if n is large enough (say, more than 30).
Equivalently, the sampling distribution of the sample mean
approaches a normal distribution as the sample size increases.
In applications of the central limit theorem to practical
problems in statistical inference, however, statisticians are
more interested in how closely the approximate distribution
of the sample mean follows a normal distribution for finite
sample sizes, than the limiting distribution itself. Sufficiently
close agreement with a normal distribution allows
statisticians to use normal theory for making inferences
about population parameters (such as the mean ) using the
sample mean, irrespective of the actual form of the parent
population.
It can be shown that, if the parent population has
mean μ and finite standard deviation σ, then the sample
mean distribution has the same mean μ but with smaller
standard deviation, namely σ divided by n½.
You know by now that, whatever the parent population is,
the standardized variable will have a distribution with a
mean = 0 and standard deviation =1 under random
sampling. Moreover, if the parent population is normal, then
z is distributed exactly as a standard normal variable. The
central limit theorem states the remarkable result that, even
when the parent population is non-normal, the standardized
variable is approximately normal if the sample size is large
enough. It is generally not possible to state conditions under
which the approximation given by the central limit theorem
works and what sample sizes are needed before the
approximation becomes good enough. As a general
guideline, statisticians have used the prescription that if the
parent distribution is symmetric and relatively short-tailed,
then the sample mean reaches approximate normality for
smaller samples than if the parent population is skewed or
long-tailed.
Under certain conditions, in large samples, the sampling
distribution of the sample mean can be approximated by a
normal distribution. The sample size needed for the
approximation to be adequate depends strongly on the
shape of the parent distribution. Symmetry (or lack thereof)
is particularly important.
For a symmetric parent distribution, even if very different
from the shape of a normal distribution, an adequate
approximation can be obtained with small samples (e.g., 10
or 12 for the uniform distribution). For symmetric, short-tailed
parent distributions, the sample mean reaches
approximate normality for smaller samples than if the
parent population is skewed and long-tailed. In some
extreme cases (e.g., a binomial with p very close to 0 or 1),
sample sizes far exceeding the typical guidelines (e.g., 30 or 60)
are needed for an adequate approximation. For some distributions
without first and second moments (e.g., Cauchy), the
central limit theorem does not hold.
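A quick simulation sketch (NumPy/SciPy) of this behavior for a skewed parent population: averages of n = 30 exponential observations have roughly the right mean and standard deviation, and much less skewness than the parent.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 30, 5000

parent = rng.exponential(scale=1.0, size=100_000)         # skewed parent, mean 1, sd 1
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(sample_means.mean())            # close to 1
print(sample_means.std(ddof=1))       # close to 1/sqrt(30), about 0.18
print(stats.skew(parent))             # about 2 for the exponential parent
print(stats.skew(sample_means))       # much closer to 0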
For some distributions, extremely large (impractical)
samples would be required to approach a normal
distribution. In manufacturing, for example, when defects
occur at a rate of less than 100 parts per million, using a
Beta distribution yields an honest CI for the total number of
defects in the population.
Review also Central Limit Theorem Applet, Sampling
Distribution Simulation, and CLT.
What Is a Sampling Distribution?
The sampling distribution describes probabilities associated
with a statistic when a random sample is drawn from the
entire population.
The sampling distribution is the probability distribution or
probability density function of the statistic.
Derivation of the sampling distribution is the first step in
calculating a confidence interval or carrying out a hypothesis
test for a parameter.
Example: Suppose that x1, ..., xn are a simple random
sample from a normally distributed population with expected
value μ and known variance σ². Then the sample mean, a
statistic used to give information about the population
parameter μ, is normally distributed with expected value μ and
variance σ²/n.
The main idea of statistical inference is to take a random
sample from the entire population and then to use the
information from the sample to make inferences about
particular population characteristics such as the
mean μ (a measure of central tendency), the standard
deviation σ (a measure of spread), or the proportion of units in
the population that have a certain characteristic. Sampling
saves money, time, and effort. Additionally, a sample can, in
some cases, provide as much or more accuracy than a
corresponding study that would attempt to investigate an
entire population: careful collection of data from a sample
will often provide better information than a less careful
study that tries to look at everything.
One must also study the behavior of the mean of sample
values from different specified populations. Because a
sample examines only part of a population, the sample
mean will not exactly equal the corresponding mean of the
population, μ. Thus, an important consideration for those
planning and interpreting sampling results is the degree to
which sample estimates, such as the sample mean, will
agree with the corresponding population characteristic.
In practice, only one sample is usually taken (in some cases
a small "pilot sample" is used to test the data-gathering
mechanisms and to get preliminary information for planning
the main sampling scheme). However, for purposes of
understanding the degree to which sample means will agree
with the corresponding population mean μ, it is useful to
consider what would happen if 10, or 50, or 100 separate
sampling studies, of the same type, were conducted. How
consistent would the results be across these different
studies? If we could see that the results from each of the
samples would be nearly the same (and nearly correct!),
then we would have confidence in the single sample that will
actually be used. On the other hand, seeing that answers
from the repeated samples were too variable for the needed
accuracy would suggest that a different sampling plan
(perhaps with a larger sample size) should be used.
A sampling distribution is used to describe the distribution of
outcomes that one would observe from replication of a
particular sampling plan.
Know that estimates computed from one sample will be
different from estimates that would be computed from
another sample.
Understand that estimates are expected to differ from the
population characteristics (parameters) that we are trying to
estimate, but that the properties of sampling distributions
allow us to quantify, based on probability, how they will
differ.
Understand that different statistics have different sampling
distributions with distribution shape depending on (a) the
specific statistic, (b) the sample size, and (c) the parent
distribution.
Understand the relationship between sample size and the
distribution of sample estimates.
Understand that the variability in a sampling distribution can
be reduced by increasing the sample size.
See that in large samples, many sampling distributions can
be approximated with a normal distribution.
To learn more, visit the Web sites Sample, and Sampling
Distribution Applet
Applications of and Conditions for Using Statistical
Tables
Some widely used applications of the popular statistical
tables can be categorized as follows:
Z - Table: Tests concerning µ for one or two populations
based on their large-size random sample(s) (say, n ≥ 30, to
invoke the Central Limit Theorem).
Tests concerning proportions, with large random sample
size n (say, n ≥ 50, to invoke a convergence theorem).
Conditions for using this table: Test for randomness of
the data is needed before using this table. Test for normality
of the sample distribution is also needed if the sample size is
small or it may not be possible to invoke the Central Limit
Theorem.
T - Table: Tests concerning µ for one or two populations
based on small random sample size(s).
Tests concerning regression coefficients (slope, and
intercepts), df = n - 2.
Notes: As you know by now, in tests of hypotheses
concerning μ, and in constructing a confidence interval for it,
we start with σ known, since the critical value (and the
p-value) from the Z-Table can be used. Considering
the more realistic situation where we don't know σ, the
T-Table is used. In both cases we need to verify the normality
of the population's distribution; however, if the sample size
n is very large, we can in fact switch back to the Z-Table by
virtue of the central limit theorem. For a perfectly normal
population, the t-distribution corrects for any errors
introduced by estimating σ with s when doing inference.
Note also that, in hypothesis testing concerning the
parameter of binomial and Poisson distributions for large
sample sizes, the standard deviation is known under the null
hypothesis. That's why you may use the normal
approximations to both of these distributions.
Conditions for using this table: Test for randomness of
the data is needed before using this table. Test for normality
of the sample distribution is also needed if the sample size is
small or it may not be possible to invoke the Central Limit
Theorem.
Chi-Square - Table: Tests concerning σ² for one population
based on a random sample from the entire population.
Contingency tables (tests for independence of categorical
data).
Goodness-of-fit test for discrete random variables.
Conditions for using this table: Tests for randomness of
the data and normality of the sample distribution are
needed before using this table.
F - Table: ANOVA: Tests concerning µ for three or more
populations based on their random samples.
Tests concerning σ² for two populations based on their
random samples.
Overall assessment in regression analysis using the F-value.
Conditions for using this table: Tests for randomness of
the data and normality of the sample distribution are
needed before using this table for ANOVA. Same conditions
must be satisfied for the residuals in regression analysis.
The following chart summarizes the application of statistical
tables with respect to tests of hypotheses and construction of
confidence intervals for the mean μ and variance σ² in one,
or a comparison of two or more, populations.
Further Reading:
Kagan. A., What students can learn from tables of basic
distributions, Int. Journal of Mathematical Education in
Science & Technology, 30(6), 1999.
Statistical Tables on the Web:
The following Web sites provide critical values useful in
statistical testing and construction of confidence intervals.
The results are identical to those given in statistic textbooks.
However, in most cases they are more extensive (therefore
more accurate).
Normal Curve Area
Normal Calculator
Normal Probability Calculation
Critical Values for the t-Distribution
Critical Values for the F-Distribution
Critical Values for the Chi- square Distribution
Read also
Kanji G., 100 Statistical Tests, Sage Publisher, 1995.
Relationships Among Distributions and Unification of Statistical
Tables
Particular attention must be paid to a first course in
statistics. When I first began studying statistics, it bothered
me that there were different tables for different tests. It
took me a while to learn that this is not as haphazard as it
appeared. Binomial, Normal, Chi-square, t, and F
distributions that you will learn about are actually closely
connected.
A problem with elementary statistical textbooks is that they
usually do not provide information of this kind, nor the
conceptual links that would permit a useful understanding of
the principles involved. If you want to
understand the connections between statistical concepts, then
you should practice making these connections. Learning
statistics by doing lends itself to active rather than passive
learning. Statistics is a highly interrelated set of concepts,
and to be successful at it, you must learn to make these
links conscious in your mind.
Students often ask: Why are T-table values with d.f. = 1 so
much larger than those for other d.f. values? Some tables
are limited; what should I do when the sample size is too
large? How can I become familiar with the tables and their
differences? Is there any integration among the tables?
Are there connections between tests of hypotheses and
confidence intervals under different scenarios, for example
testing with respect to one, two, or more than two populations?
And so on.
Further Reading:
Kagan. A., What students can learn from tables of basic
distributions, Int. Journal of Mathematical Education in
Science & Technology, 30(6), 1999.
The following two Figures demonstrate useful relationships
among distributions and a unification of statistical tables:
Unification of Common Statistical Tables, needs Acrobat to
view
Relationship Among Commonly Used Distributions in
Testing, needs Acrobat to view
Normal Distribution
Up to this point we have been concerned with how empirical
scores are distributed and how best to describe the
distribution. We have discussed several different measures,
but the mean will be the measure that we use to describe
the center of the distribution and the standard
deviation will be the measure we use to describe the
spread of the distribution. Knowing these two facts gives us
ample information to make statements about the probability
of observing a certain value within that distribution. If I
know, for example, that the average I.Q. score is 100 with a
standard deviation of σ = 20, then I know that someone
with an I.Q. of 140 is very smart. I know this because 140
deviates from the mean by twice the average amount of
the rest of the scores in the distribution. Thus, it is unlikely
to see a score as extreme as 140, because most of the I.Q.
scores are clustered around 100 and on average deviate
only 20 points from the mean μ.
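For example, a one-line check of that statement with SciPy (taking the I.Q. scores to be normal with mean 100 and standard deviation 20):

from scipy import stats

p = stats.norm.sf(140, loc=100, scale=20)   # P(I.Q. >= 140), i.e., 2 sd above the mean
print(round(p, 4))                          # about 0.023, so such scores are rare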
Many applications arise from the central limit theorem (the
average of n observations approaches a normal distribution,
irrespective of the form of the original distribution, under quite
general conditions). Consequently, it is an appropriate model for
many, but not all, physical phenomena: the distribution of
physical measurements on living organisms, intelligence test
scores, product dimensions, average temperatures, and so on.
Know that the Normal distribution satisfies seven
requirements: the graph is a bell-shaped curve; the mean,
median and mode are equal and located at the center of the
distribution; it has only one mode; it is symmetric about the mean;
it is continuous; it never touches the x-axis; and the area under
the curve equals one.
Many methods of statistical analysis presume normal
distribution.
Normal Curve Area.
What Is So Important About the Normal
Distributions?
Normal Distribution (also called Gaussian) curves, which
have a bell-shaped appearance (the distribution is sometimes
even referred to as the "bell-shaped curve"), are very important
in statistical analysis. In any normal distribution,
observations are distributed symmetrically around the
mean: 68% of all values under the curve lie within one
standard deviation of the mean and 95% lie within two
standard deviations.
There are many reasons for their popularity. The following
are the most important reasons for its applicability:
1. One reason the normal distribution is important is that
a wide variety of naturally occurring random
variables such as heights and weights of all creatures
are distributed evenly around a central value, average,
or norm (hence, the name normal distribution).
Although the distributions are only approximately
normal, they are usually quite close.
When there are many factors
influencing a random outcome, the underlying
distribution is approximately normal.
For example, the height of a tree is determined by the
"sum" of such factors as rain, soil quality, sunshine,
disease, etc.
As Francis Galton wrote in 1889, "Whenever a large
sample of chaotic elements are taken in hand and
marshaled in the order of their magnitude, an
unsuspected and most beautiful form of regularity
proves to have been latent all along."
Visit the Web sites Quincunx (with 5 influencing
factors), Central Limit Theorem (with 8
influencing factors), or BallDrop for demos.
2. Almost all statistical tables are limited by the size of
their parameters. However, when these parameters
are large enough one may use normal distribution for
calculating the critical values for these tables.
Visit Relationship Among Statistical Tables and Their
Applications (pdf version).
3. If the mean and standard deviation of a normal
distribution are known, it is easy to convert back and
forth from raw scores to percentiles.
4. It is characterized by two independent parameters:
mean and standard deviation. Therefore many
effective transformations can be applied to convert
almost any shaped distribution into a normal one.
5. The most important reason for popularity of normal
distribution is the Central Limit Theorem (CLT). The
distribution of the sample averages of a large number
of independent random variables will be approximately
normal regardless of the distributions of the individual
random variables. Visit also the Web sites Central Limit
Theorem Applet, Sampling Distribution Simulation,
and CLT, for some demos.
6. The other reason the normal distributions are so
important is that the normality condition is required by
almost all kinds of parametric statistical tests. The
CLT is a useful tool when you are dealing with a
population with an unknown distribution. Often, you may
analyze the mean (or the sum) of a sample of size n.
For example, instead of analyzing the weights of
individual items, you may analyze batches of size n,
that is, packages each containing n items.
What is a Linear Least Squares Model?
Many problems in analyzing data involve describing how
variables are related. The simplest of all models describing
the relationship between two variables is a linear, or
straight-line, model. Linear regression is always linear in the
coefficients being estimated, not necessarily linear in the
variables.
The simplest method of drawing a linear model is to "eyeball"
a line through the data on a plot, but a more elegant
and conventional method is that of least squares, which
finds the line minimizing the sum of squared vertical distances
between the observed points and the fitted line. Realize that
fitting the "best" line by eye is difficult, especially when there
is a lot of residual variability in the data.
Know that there is a simple connection between the
numerical coefficients in the regression equation and the
slope and intercept of regression line.
Know that a single summary statistic like a correlation
coefficient does not tell the whole story. A scatterplot is an
essential complement to examining the relationship between
the two variables.
Again, the regression line is a group of estimates for the
variable plotted on the Y-axis. It has the form y = a + mx,
where m is the slope of the line. The slope is the rise over run:
if a line goes up 2 for each 1 it goes over, then its slope is 2.
Formulas:
x̄ = Σ x(i)/n, the mean of the x values.
ȳ = Σ y(i)/n, the mean of the y values.
Sxx = Σ (x(i) - x̄)² = Σ x(i)² - [Σ x(i)]²/n
Syy = Σ (y(i) - ȳ)² = Σ y(i)² - [Σ y(i)]²/n
Sxy = Σ (x(i) - x̄)(y(i) - ȳ) = Σ x(i)·y(i) - [Σ x(i)][Σ y(i)]/n
Slope: m = Sxy / Sxx
Intercept: b = ȳ - m·x̄
The least squares regression line is:
y-predicted = yhat = mx + b
The regression line goes through a mean-mean point. That
is the point at the mean of the x values and the mean of the
y values. If you drew lines from the mean-mean point out to
each of the data points on the scatter plot, each of the lines
that you drew would have a slope. The regression slope is
the weighted mean of those slopes, where the weights are
the runs squared.
If you put in each x, the regression line spits out an
estimate for each y. Each estimate makes an error. Some
errors are positive and some are negative. The sum of
squares of the errors plus the sum of squares of the
estimates adds up to the sum of squares of Y. The regression
line is the line that minimizes the variance of the errors
(the mean error is zero, so this means that it minimizes the
sum of the squared errors).
The reason for finding the best line is so that you can make
reasonable predictions of what y will be if x is known (not
vice versa).
r2 is the variance of the estimates divided by the variance of
Y. r is ± the square root of r2. r is the size of the slope of
the regression line, in terms of standard deviations. In other
words, it is the slope if we use the standardized X and Y. It
is how many standard deviations of Y you would go up,
when you go one standard deviation of X to the right.
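A compact sketch (NumPy, with made-up data) of the formulas above, computing the slope, intercept, and r² directly:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # made-up data
y = np.array([2.1, 4.3, 6.2, 8.0, 9.8])

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

m = Sxy / Sxx                     # slope
b = y.mean() - m * x.mean()       # intercept: the line passes through (x-bar, y-bar)

yhat = m * x + b
r2 = np.sum((yhat - y.mean()) ** 2) / Syy     # variance of the estimates / variance of Y
print(m, b, r2)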
Visit also the Web sites Simple Regression, Linear
Regression, Putting Points
Coefficient of Determination
Another measure of the closeness of the points to the
regression line is the Coefficient of Determination.
r2 = Syhat yhat / Syy
which is the amount of the squared deviation explained by
the points on the least squares regression line.
When you have regression equations based on theory, you
should compare:
1. R squared, that is, the percentage of variance [in
fact, sum of squares] in Y accounted for by the variance in X
captured by the model.
2. When you want to compare models of different size
(different numbers of independent variables (p) and/or
different sample sizes n) you must use the Adjusted
R-Squared, because the usual R-Squared tends to grow
with the number of independent variables:
R²a = 1 - (n - 1)(1 - R²)/(n - p - 1)
3. prediction error or standard error
4. trends in error, 'observed-predicted' as a function of
control variables such as time. Systematic trends are
not uncommon
5. extrapolations to interesting extreme conditions of
theoretical significance
6. t-stats on individual parameters
7. values of the parameters and their consistency with the
theoretical (content) underpinnings.
8. The F(df1, df2) value for overall assessment, where df1
(numerator degrees of freedom) is the number of
linearly independent predictors in the assumed model
minus the number of linearly independent predictors in
the restricted model (i.e., the number of linearly
independent restrictions imposed on the assumed
model), and df2 (denominator degrees of freedom) is
the number of observations minus the number of
linearly independent predictors in the assumed model.
Homoscedasticity and Heteroscedasticity: Homoscedasticity
(homo = same, skedasis = scattering) is a word used to
describe the distribution of data points around the line of
best fit. The opposite term is heteroscedasticity. Briefly,
homoscedasticity means that data points are distributed
equally about the line of best fit; that is, constancy of
variances for/over all the levels of the factors.
Heteroscedasticity means that the data points cluster or
clump above and below the line in a non-equal pattern.
You should find a discussion of these terms in any decent
statistics text that deals with least squares regression.
See, e.g., Testing Research Hypotheses with the GLM,
by McNeil, Newman and Kelly, 1996, pages 174-176.
Finally, in statistics for business there exists an opinion that
with more than 4 parameters one can fit an elephant, so that
if one attempts to fit a curve that depends on many
parameters the result should not be regarded as very
reliable.
If m1 and m2 are the slopes of the two regressions of y on x
and of x on y, respectively, then R² = m1·m2.
Logistic regression: Standard logistic regression is a method
for modeling binary data (e.g., does a person smoke or not,
does a person survive a disease or not). Polytomous
(multinomial) logistic regression models more than two
options (e.g., does a person take the bus, drive a car, or
take the subway; does an office use WordPerfect, Word, or
another package).
Test for equality of two slopes: Let m1 represent the
regression coefficient for explanatory variable X1 in sample
1 with size n1. Let m2 represent the regression coefficient
for X1 in sample 2 with size n2. Let S1 and S2 represent the
associated standard error estimates. Then, the quantity
(m1 - m2) / SQRT(S1² + S2²)
has the t distribution with df = n1 + n2 - 4.
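A brief sketch of this test in Python, with hypothetical slopes, standard errors, and sample sizes:

from scipy import stats

m1, s1, n1 = 2.10, 0.35, 40    # slope, standard error, sample size (sample 1; made up)
m2, s2, n2 = 1.25, 0.40, 35    # slope, standard error, sample size (sample 2; made up)

t = (m1 - m2) / (s1 ** 2 + s2 ** 2) ** 0.5
df = n1 + n2 - 4
p = 2 * stats.t.sf(abs(t), df)         # two-sided p-value
print(t, df, p)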
Regression when both X and Y are in error: Simple linear
least-squares regression has among its conditions that the
data for the independent (X) variables are known without
error. In fact, the estimated results are conditioned on
whatever errors happened to be present in the independent
dataset. When the X-data have an error associated with
them, the result is to bias the slope downwards. A procedure
known as Deming regression can handle this problem quite
well; biased slope estimates (due to error in X) can be
avoided by using Deming regression.
Reference:
Cook and Weisberg, An Introduction to Regression Graphics,
Wiley, 1994
Regression Analysis: Planning, Development, and
Maintenance
I – Planning:
1. Define the problem, select response, suggest variables
2. Are the proposed variables fundamental to the
problem, and are they measurable? Can one
get a complete set of observations at the same time?
Ordinary regression analysis does not assume that the
independent variables are measured without error;
however, the results are conditioned on whatever errors
happened to be present in the independent dataset.
3. Is the problem potentially solvable?
4. Correlation matrix and first regression runs (for a
subset of the data).
Find the basic statistics and the correlation matrix.
How difficult might this problem be?
Compute the Variance Inflation Factor, VIF = 1/(1 - rij),
i, j = 1, 2, 3, ..., i ≠ j. For moderate VIF, say between 2
and 8, you might be able to come up with a 'good'
model.
Inspect the rij's; one or two must be large. If all are small,
perhaps the ranges of the X variables are too small.
5. Establish goals, prepare a budget and a timetable.
a - The final equation should have R² = 0.8 (say).
b - Coefficient of Variation of, say, less than 0.10.
c - The number of predictors should not exceed p (say, 3);
for example, for p = 3 we need at least 30 points.
d - All estimated coefficients must be significant at α =
0.05 (say).
e - No pattern in the residuals.
6. Are goals and budget acceptable?
II – Development of the Model:
1. Collect data, plot, try models, check the quality of the
data, check the assumptions.
2. Consult experts for criticism.
Plot new variables and examine the same fitted model.
Transformed predictor variables may also be used.
3. Are the goals met?
Have you found "the best" model?
III – Validation and Maintenance of the Model:
1. Are the parameters stable over the sample space?
2. Is there lack of fit?
Are the coefficients reasonable?
Are any obvious variables missing?
Is the equation usable for control or for prediction?
3. Maintenance of the Model: one needs a control chart to
check the model periodically by statistical techniques.
Predicting Market Response
As applied researchers in business and economics, faced
with the task of predicting market response, we seldom
know the functional form of the response. Perhaps market
response is a nonlinear monotonic, or even a non-monotonic
function of explanatory variables. Perhaps it is determined
by interactions of explanatory variables. Interaction is
logically independent of its components.
When we try to represent complex market relationships
within the context of a linear model, using appropriate
transformations of explanatory and response variables, we
learn how hard the work of statistics can be. Finding
reasonable models is a challenge, and justifying our choice
of models to our peers can be even more of a challenge.
Alternative specifications abound.
Modern regression methods, such as generalized additive
models, multivariate adaptive regression splines, and
regression trees, have one clear advantage: They can be
used without specifying a functional form in advance. These
data-adaptive, computer-intensive methods offer a more
flexible approach to modeling than traditional statistical
methods. How well do modern regression methods perform in
predicting market response? Some perform quite well, based
on the results of simulation studies.
How to Compare Two Correlation Coefficients?
The statistical test for Ho: ρ1 = ρ2 is the following.
Compute
t = (z1 - z2) / [ 1/(n1-3) + 1/(n2-3) ]½,    n1, n2 > 3,
where
z1 = 0.5 ln( (1+r1)/(1-r1) ),
z2 = 0.5 ln( (1+r2)/(1-r2) ),
n1 = the sample size associated with r1, and n2 = the sample
size associated with r2.
The distribution of the statistic t is approximately N(0,1).
So, you should reject Ho if |t| > 1.96 at the 95% confidence
level.
r is (positive) scale and (any) shift invariant. That is, ax + c
and by + d have the same r as x and y, for any positive a and
b.
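A short sketch of this test in Python (SciPy; the correlations and sample sizes are hypothetical):

import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    # Fisher z test for Ho: rho1 = rho2, two independent samples
    z1 = 0.5 * np.log((1 + r1) / (1 - r1))
    z2 = 0.5 * np.log((1 + r2) / (1 - r2))
    t = (z1 - z2) / np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * stats.norm.sf(abs(t))      # two-sided p-value from N(0,1)
    return t, p

print(compare_correlations(r1=0.60, n1=50, r2=0.35, n2=60))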
Procedures for Statistical Decision Making
The two most widely used measuring tools and decision
procedures in statistical decision making, are Classical and
Bayesian Approaches.
Classical Approach: the classical probability of finding this
sample statistic, or any statistic more unlikely, assuming
the null hypothesis is true. A small p-value is taken as sufficient
evidence to reject the null hypothesis and to accept the
alternative.
As indicated in the above Figure, type-I error occurs when
based on your data you reject the null hypothesis when in
fact it is true. The probability of a type I error is the level of
significance of the test of hypothesis, and is denoted by .
A type II error occurs when you do not reject the null
hypothesis when it is in fact false. The probability of a
type-II error is denoted by β. The quantity 1 - β is known
as the Power of a Test. The type-II error can be evaluated for
any specific alternative hypothesis stated in the form "Not
Equal to" as a competing hypothesis.
Bayesian Approach: Difference in expected gain (loss)
associated with taking various actions each having an
associated gain (loss) and a given Bayesian statistical
significance. This is standard Min/Max decision theory using
Bayesian strength of belief assessments in the truth of the
alternate hypothesis. One would choose the action which
minimizes expected loss or maximizes expected gain (the
risk function).
Hypothesis Testing: Rejecting a Claim
To perform a hypothesis testing, one must be very specific
about the test one wishes to perform. The null hypothesis
must be clearly stated, and the data must be collected in a
repeatable manner. Usually, the sampling design will involve
random, stratified random, or regular distribution of study
plots. If there is any subjectivity, the results are technically
not valid. All of the analyses, including the sample size,
significance level, time, and budget, must be
planned in advance, or else the user runs the risk of "data
diving".
Hypothesis testing is mathematical proof by contradiction.
For example, for a Student's t test comparing 2 groups, we
assume that the two groups come from the same population
(same means, standard deviations, and, in general, same
distributions). Then we try hard to show that this
assumption is false. Rejecting H0 means either that H0 is false, or
that a rare event (one with probability α) has occurred.
The real question in statistics is not whether a null hypothesis
is correct, but whether it is close enough to be used as an
approximation.
Selecting Statistics
In most statistical tests concerning μ, we start by assuming
that the σ² and higher moments (skewness, kurtosis) are equal.
Then we hypothesize that the μ's are equal; this is the null
hypothesis. The "null" suggests no difference between group
means, or no relationship between quantitative variables, and so on.
Then we test with a calculated t-value. For simplicity, suppose we
have a 2 sided test. If the calculated t is close to 0, we say good,
as we expected. If the calculated t is far from 0, we say, "the
chance of getting this value of t, given my assumption of equal
populations, is so small that I will not believe the assumption. We
will say that the populations are not equal, specifically the means
are not equal."
Sketch a normal distribution with mean μ1 - μ2 and
standard deviation s. If the null hypothesis is true, then the mean
is 0. We calculate the 't' value, as per the equation. We look up a
"critical" value of t. The probability of calculating a t value more
extreme (+ or -) than this, given that the null hypothesis is
true, is equal to or less than the α risk we used in pulling the
critical value from the table. Mark the calculated t and the critical
t (both sides) on the sketch of the distribution. Now, if the
calculated t is more extreme than the critical value, we say, "the
chance of getting this t, by sheer chance, when the null hypothesis
is true, is so small that I would rather say the null hypothesis is
false, and accept the alternative, that the means are not equal."
When the calculated value is less extreme than the critical value,
we say, "I could get this value of t by sheer chance, often enough
that I will not write home about it. I cannot detect a difference in
the means of the two groups at the α significance level."
In this test we need (among others) the condition that the
population variances are equal (i.e., the treatment impacts central
tendency but not variability). However, this test is robust to
violations of that condition if the n's are large and almost the same
size. A counterexample would be to try a t-test between (11, 12,
13) and (20, 30, 40). The pooled and unpooled tests both give t
statistics of 3.10, but the degrees of freedom are different: 4
(pooled) or about 2 (unpooled). Consequently the pooled test
gives p = .036 and the unpooled p = .088. We could go down to
n = 2 and get something still more extreme.
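The counterexample above is easy to reproduce with SciPy, which lets you switch between the pooled and unpooled (Welch) versions of the test:

from scipy import stats

g1, g2 = [11, 12, 13], [20, 30, 40]

print(stats.ttest_ind(g1, g2, equal_var=True))    # pooled:   |t| about 3.10, df = 4, p about .036
print(stats.ttest_ind(g1, g2, equal_var=False))   # unpooled: |t| about 3.10, df about 2, p about .088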
The Classical Approach to the Test of Hypotheses
In this treatment there are two parties: one party (or person)
sets out the null hypothesis (the claim), an alternative hypothesis
is proposed by the other party, and a significance level α and a
sample size n are agreed upon by both parties. The second step
is to compute the relevant statistic based on the null hypothesis
and the random sample of size n. Finally, one determines the
critical region (i.e., rejection region). The conclusion based on this
approach is as follows:
If the computed statistic falls within the rejection region,
then Reject the null hypothesis; otherwise, Do Not Reject the null
hypothesis (the claim).
You may ask: How does one determine the critical value (such as
the z-value) for the rejection interval, for one- and two-tailed
hypotheses? What is the rule?
First you have to choose a significance level α. Knowing that the
null hypothesis is always in "equality" form, the alternative
hypothesis has one of three possible forms: "greater-than",
"less-than", or "not equal to". The first two forms correspond to
one-tail hypotheses while the last one corresponds to a two-tail
hypothesis.
If your alternative is in the form of "greater-than", then z is
the value that gives you an area equal to α in the right tail of
the distribution.
If your alternative is in the form of "less-than", then z is the
value that gives you an area equal to α in the left tail of the
distribution.
If your alternative is in the form of "not equal to", then there
are two z values, one positive and the other negative.
The positive z is the value that gives you an α/2 area in
the right tail of the distribution, while the negative z is the
value that gives you an α/2 area in the left tail of the
distribution.
This is a general rule, and to implement this process in
determining the critical value, for any test of hypothesis, you
must first master reading the statistical tables well, because, as
you see, not all tables in your textbook are presented in a same
format.
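For the Z-table, this rule is easy to check with SciPy's normal quantile function (α = 0.05 shown):

from scipy import stats

alpha = 0.05
z_greater = stats.norm.ppf(1 - alpha)        # "greater-than" alternative:  1.645
z_less    = stats.norm.ppf(alpha)            # "less-than" alternative:    -1.645
z_two     = stats.norm.ppf(1 - alpha / 2)    # "not equal to": plus/minus 1.960
print(z_greater, z_less, z_two)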
The Meaning and Interpretation of P-values (what the data say?)
The p-value, which directly depends on a given sample, attempts
to provide a measure of the strength of the results of a test for
the null hypotheses, in contrast to a simple reject or do not reject
in the classical approach to the test of hypotheses. If the null
hypothesis is true and the chance of random variation is the only
reason for sample differences, then the p-value is a quantitative
measure to feed into the decision making process as evidence.
The following table provides a reasonable interpretation of p-values:
P-value                 Interpretation
P ≤ 0.01                very strong evidence against H0
0.01 < P ≤ 0.05         moderate evidence against H0
0.05 < P ≤ 0.10         suggestive evidence against H0
0.10 < P                little or no real evidence against H0
This interpretation is widely accepted, and many scientific
journals routinely publish papers using such an interpretation for
the result of test of hypothesis.
For the fixed sample size, when the number of realizations is
decided in advance, the distribution of p is uniform (assuming the
null hypothesis). We would express this as P(p ≤ x) = x. That
means the criterion of p ≤ 0.05 achieves α of 0.05.
Understand that the distribution of p-values under null hypothesis
H0 is uniform, and thus does not depend on a particular form of
the statistical test. In a statistical hypothesis test, the P value is
the probability of observing a test statistic at least as extreme as
the value actually observed, assuming that the null hypothesis is
true. The value of p is defined with respect to a distribution.
Therefore, we could call it "model-distributional hypothesis"
rather than "the null hypothesis".
In short, it simply means that if the null had been true, the p
value is the probability against the null in that case. The p-value
is determined by the observed value, however, this makes it
difficult to even state the inverse of p.
Reference:
Arsham H., Kuiper's P-value as a Measuring Tool and Decision
Procedure for the Goodness-of-fit Test, Journal of Applied
Statistics, Vol. 15, No.3, 131-135, 1988.
Blending the Classical and the P-value Based Approaches
in Test of Hypotheses
A p-value is a measure of how much evidence you have against
the null hypothesis. Notice that the null hypothesis is always in "="
form, and does not contain any form of inequality. The smaller
the p-value, the more evidence you have. In this setting the
p-value is based on the null hypothesis and has nothing to do with
the alternative hypothesis and therefore with the rejection region. In
recent years, some authors have tried to use a mixture of the classical
approach (which is based on the critical value obtained from a given α
and on the computed statistic) and the p-value approach.
This is a blend of two different schools of thought. In this setting,
some textbooks compare the p-value with the significance level to
make a decision on a given test of hypothesis. The larger the p-value is
when compared with α (in a one-sided alternative hypothesis,
and α/2 for two-sided alternative hypotheses), the less evidence
we have for rejecting the null hypothesis. In such a comparison, if
the p-value is less than some threshold (usually 0.05, sometimes
a bit larger like 0.1 or a bit smaller like 0.01) then you reject the
null hypothesis. The following paragraphs deal with such a
combined approach.
Use of the P-value and α: In this setting, we must also consider the
alternative hypothesis in drawing the rejection interval (region).
There is only one p-value to compare with α (or α/2). Know that,
for any test of hypothesis, there is only one p-value. The
following outlines the computation of the p-value and the decision
process involved in a given test of hypothesis:
1. P-value for one-sided alternative hypotheses: The p-value is
defined as the area in the right tail of the distribution if the
rejection region is on the right tail; if the rejection region is
on the left tail, then the p-value is the area in the left tail.
2. P-value for two-sided alternative hypotheses: If the
alternative hypothesis is two-sided (that is, the rejection
regions are on both the left and the right tails), then the
p-value is the area in the right or the left tail of the
distribution, depending on whether the computed statistic is
closer to the right rejection region or the left rejection region.
For symmetric densities (such as t) the left and right tail
p-values are the same. However, for non-symmetric densities
(such as Chi-square), use the smaller of the two (this
makes the test more conservative). Notice that, for two-sided
alternative hypotheses, the p-value is never
greater than 0.5.
3. After finding the p-value as defined here, you compare it
with a preset α value for one-sided tests, and with α/2 for
two-sided tests. The larger the p-value is when compared
with α (in a one-sided alternative hypothesis, and α/2 for the
two-sided alternative hypotheses), the less evidence we have for
rejecting the null hypothesis.
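A small sketch of these steps for a t statistic (SciPy; the computed statistic and degrees of freedom are hypothetical), following the convention above of comparing the tail p-value with α/2 for a two-sided test:

from scipy import stats

t_calc, df, alpha = 2.30, 15, 0.05        # made-up computed statistic and df

p_right = stats.t.sf(t_calc, df)          # one-sided, rejection region on the right
p_left  = stats.t.cdf(t_calc, df)         # one-sided, rejection region on the left
p_tail  = stats.t.sf(abs(t_calc), df)     # tail area nearer the rejection region

print(p_right < alpha)                    # one-sided decision
print(p_tail < alpha / 2)                 # two-sided decision (compare with alpha/2)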
To avoid looking-up the p-values from the limited statistical
tables given in your textbook, most professional statistical
packages such as SPSS provide the two-tail p-value. Based on
where the rejection region is, you must find out what p-value to
use.
Unfortunately, some textbooks have many misleading statements
about the p-value and its applications; for example, in many textbooks
you find the authors doubling the p-value to compare it
with α when dealing with the two-sided test of hypotheses.
One wonders how they do it in the case when "their" p-value
exceeds 0.5. Notice that, while it is correct to compare the p-value
with α for one-sided tests of hypotheses, for two-sided hypotheses
one must compare the p-value with α/2,
NOT α with 2 times the p-value, as unfortunately some textbooks
advise. While the decision is the same, there is a clear
distinction here and an important difference which the careful
reader will note.
When Should We Pool Variance Estimates?
Variance estimates should be pooled only if there is a good
reason for doing so, and then (depending on that reason) the
conclusions might have to be made explicitly conditional on the
validity of the equal-variance model. There are several different
good reasons for pooling:
(a) to get a single stable estimate from several relatively small
samples, where variance fluctuations seem not to be systematic;
or
(b) for convenience, when all the variance estimates are near
enough to equality; or
(c) when there is no choice but to model variance (as in simple
linear regression with no replicated X values), and deviations
from the constant-variance model do not seem systematic; or
(d) when group sizes are large and nearly equal, so that there is
essentially no difference between the pooled and unpooled
estimates of standard errors of pairwise contrasts, and degrees of
freedom are nearly asymptotic.
Note that this last rationale can fall apart for contrasts other than
pairwise ones. One is not really pooling variance in case (d),
rather one is merely taking a shortcut in the computation of
standard errors of pairwise contrasts.
If you calculate the test without the assumption, you have to
determine the degrees of freedom (or let the statistics package
do it). The formula works in such a way that df will be less if the
larger sample variance is in the group with the smaller number of
observations. This is the case in which the two tests will differ
considerably. A study of the formula for the df is most
enlightening and one must understand the correspondence
between the unfortunate design (having the most observations in
the group with little variance) and the low df and accompanying
large t-value.
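A quick sketch of the Welch-Satterthwaite approximation usually used for that df (the sample figures are made up), showing how putting most of the observations in the low-variance group drives the df down:

def welch_df(s1, n1, s2, n2):
    # approximate df for the unpooled two-sample t test
    # (s1, s2 are the sample standard deviations)
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# most observations in the group with little variance -> low df
print(welch_df(s1=1.0, n1=30, s2=10.0, n2=5))   # about 4, far below (30-1)+(5-1) = 33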
Example: When doing t tests for differences in means of
populations (a classic independent samples case):
1. Use the standard error formula for differences in means that
does not make any assumption about equality of population
variances [i.e., (VAR1/n1 + VAR2/n2)½].
2. Use the "regular" way to calculate df in a t test: (n1 - 1) + (n2 - 1), for n1, n2 ≥ 2.
3. If the total N is less than 50, one sample is half the size of the other (or less), and the smaller sample has a standard deviation at least twice as large as the other sample's, then replace #2 with the formula for adjusting the df value. Otherwise, don't worry about the problem of the actual α level being much different from the one you have set.
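For concreteness, here is a rough Python/SciPy sketch (hypothetical samples, not the author's recipe) comparing the pooled test with the unequal-variance (Welch) test; SciPy's equal_var switch applies the adjusted-df formula automatically.

from scipy import stats

group1 = [12.1, 14.3, 11.8, 13.5, 12.9]                  # hypothetical small sample
group2 = [10.2, 15.8, 9.1, 16.4, 8.7, 14.9, 11.3, 17.2]  # hypothetical, more variable sample
pooled = stats.ttest_ind(group1, group2, equal_var=True)   # assumes equal population variances
welch = stats.ttest_ind(group1, group2, equal_var=False)   # unpooled SE with adjusted df
print("pooled:", pooled)
print("Welch :", welch)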
In the Statistics With Confidence section we are concerned with the construction of confidence intervals, where the equality-of-variances condition is an important issue.
Visit also the Web sites Statistics, Statistical tests.
Remember that in the t tests for differences in means there is a condition of equal population variances that must be examined. One way to test for possible differences in variances is to do an F test. However, the F test is very sensitive to violations of the normality condition; i.e., if the populations appear not to be normal, then the F test will tend to over-reject the null hypothesis of no difference in population variances.
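For concreteness, here is a small Python sketch (hypothetical data, not the author's code) of the variance-ratio F test; because of its sensitivity to non-normality, treat it only as a rough check.

import numpy as np
from scipy import stats

x = np.array([12.1, 14.3, 11.8, 13.5, 12.9])      # hypothetical sample 1
y = np.array([10.2, 15.8, 9.1, 16.4, 8.7, 14.9])  # hypothetical sample 2
f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)    # ratio of the sample variances
dfn, dfd = len(x) - 1, len(y) - 1
p_two_sided = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
print(f_stat, p_two_sided)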
SPSS program for T-test, Two-Population Independent
Means:
$SPSS/OUTPUT=CH2DRUG.OUT
TITLE
' T-TEST, TWO INDEPENDENT MEANS '
DATA LIST
FREE FILE='A.IN'/drug walk
VAR LABELS
DRUG 'DRUG OR PLACEBO'
WALK 'DIFFERENCE IN TWO WALKS'
VALUE LABELS
DRUG 1 'DRUG' 2 'PLACEBO'
T-TEST GROUPS=DRUG(1,2)/VARIABLES=WALK
NPAR TESTS
M-W=WALK BY DRUG(1,2)/
NPAR TESTS
K-S=WALK BY DRUG(1,2)/
NPAR TESTS
K-W=WALK BY DRUG(1,2)/
SAMPLE 10 FROM 20
CONDESCRIPTIVES DRUG(ZDRUG),WALK(ZWALK)
LIST CASE
CASE =10/VARIABLES=DRUG,ZDRUG,WALK,ZWALK
FINISH
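For readers without SPSS, a roughly equivalent analysis can be sketched in Python/SciPy as follows; the WALK differences for the drug and placebo groups are hypothetical values, not the data read by the SPSS job.

from scipy import stats

drug = [1.2, 0.8, 1.5, 0.9, 1.1]        # hypothetical WALK differences, drug group
placebo = [0.4, 0.6, 0.2, 0.7, 0.5]     # hypothetical WALK differences, placebo group
print(stats.ttest_ind(drug, placebo, equal_var=False))  # two independent means
print(stats.mannwhitneyu(drug, placebo))                # Mann-Whitney (M-W)
print(stats.ks_2samp(drug, placebo))                    # Kolmogorov-Smirnov (K-S)
print(stats.kruskal(drug, placebo))                     # Kruskal-Wallis (K-W)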
SPSS program for T-test, Two-Population Dependent
Means:
$ SPSS/OUTPUT=A.OUT
TITLE
' T-TEST, 2 DEPENDENT MEANS'
FILE HANDLE
MC/NAME='A.IN'
DATA LIST
FILE=MC/YEAR1,YEAR2,(F4.1,1X,F4.1)
VAR LABELS
YEAR1 'AVERAGE LENGTH OF STAY IN YEAR 1'
YEAR2 'AVERAGE LENGTH OF STAY IN YEAR 2'
LIST CASE
CASE=11/VARIABLES=ALL/
T-TEST PAIRS=YEAR1 YEAR2
NONPAR COR YEAR1,YEAR2
NPAR TESTS WILCOXON=YEAR1,YEAR2/
NPAR TESTS SIGN=YEAR1,YEAR2/
NPAR TESTS KENDALL=YEAR1,YEAR2/
FINISH
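Again for readers without SPSS, a comparable paired analysis can be sketched in Python/SciPy; the year-1 and year-2 lengths of stay below are hypothetical.

from scipy import stats

year1 = [5.2, 6.1, 4.8, 7.0, 5.5, 6.3]   # hypothetical average length of stay, year 1
year2 = [4.9, 5.8, 4.5, 6.4, 5.6, 5.9]   # hypothetical average length of stay, year 2
print(stats.ttest_rel(year1, year2))     # t-test for two dependent means
print(stats.wilcoxon(year1, year2))      # Wilcoxon matched-pairs signed-ranks test
print(stats.spearmanr(year1, year2))     # nonparametric (rank) correlation
print(stats.kendalltau(year1, year2))    # Kendall's tau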
Visit also the Web site Statistical tests.
Analysis of Variance (ANOVA)
The tests we have learned up to this point allow us to test
hypotheses that examine the difference between only two means.
Analysis of Variance or ANOVA will allow us to test the difference
between 2 or more means. ANOVA does this by examining the
ratio of variability between two conditions and variability within
each condition. For example, say we give a drug that we believe
will improve memory to a group of people and give a placebo to
another group of people. We might measure memory
performance by the number of words recalled from a list we ask
everyone to memorize. A t-test would compare the likelihood of
observing the difference in the mean number of words recalled
for each group. An ANOVA test, on the other hand, would
compare the variability that we observe between the two
conditions to the variability observed within each condition. Recall
that we measure variability as the sum of the squared differences of each score from the mean. When we actually calculate an ANOVA we will use a short-cut formula.
Thus, when the variability that we predict (between the two
groups) is much greater than the variability we don't predict
(within each group), then we will conclude that our treatments
produce different results.
An Illustrative Numerical Example for ANOVA
ANOVA is introduced here in its simplest form by a numerical illustration.
Example: Consider the following small, integer-valued random samples (kept small to save space) from three different populations.
With the null hypothesis H0: µ1 = µ2 = µ3 and the alternative Ha: at least two of the means are not equal, at the significance level α = 0.05 the critical value from the F-table is F(0.05; 2, 12) = 3.89.
        Sample 1   Sample 2   Sample 3
           2          3          5
           3          4          5
           1          3          5
           3          5          3
           1          0          2
SUM       10         15         20
Mean       2          3          4
Demonstrate that SST = SSB + SSW.
Computation of the sample SST: With the grand mean = 3, first take the difference between each observation and the grand mean, and then square it for each data point.
        Sample 1   Sample 2   Sample 3
           1          0          4
           0          1          4
           4          0          4
           0          4          0
           4          9          1
SUM        9         14         13
Therefore SST=36 with d.f = 15-1 = 14
Computation of sample SSB:
Second, let all the data in each sample have the same value as
the mean in that sample. This removes any variation WITHIN.
Compute SS differences from the grand mean.
        Sample 1   Sample 2   Sample 3
           1          0          1
           1          0          1
           1          0          1
           1          0          1
           1          0          1
SUM        5          0          5
Therefore SSB = 10, with d.f = 3-1 = 2
Computation of sample SSW:
Third, compute the SS difference within each sample using their
own sample means. This provides SS deviation WITHIN all
samples.
        Sample 1   Sample 2   Sample 3
           0          0          1
           1          1          1
           1          0          1
           1          4          1
           1          9          4
SUM        4         14          8
SSW = 26 with d.f = 3(5-1) = 12
The results are: SST = SSB + SSW, and d.f.(SST) = d.f.(SSB) + d.f.(SSW), as expected.
Now, construct the ANOVA table for this numerical example by
plugging the results of your computation in the ANOVA Table.
The ANOVA Table

Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Samples              10                  2                5             2.30
Within Samples               26                 12                2.17
Total                        36                 14
Conclusion: Since the F-statistic 2.30 is less than the critical value F(0.05; 2, 12) = 3.89, there is not enough evidence to reject the null hypothesis H0.
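The worked example can be checked with a one-way ANOVA routine; the following Python/SciPy sketch (not part of the original text) uses the three samples from the table above.

from scipy import stats

sample1 = [2, 3, 1, 3, 1]
sample2 = [3, 4, 3, 5, 0]
sample3 = [5, 5, 5, 3, 2]
f_stat, p_value = stats.f_oneway(sample1, sample2, sample3)
print(f_stat, p_value)   # F is about 2.31, below the critical value 3.89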
Logic Behind ANOVA: First, let us try to explain the logic and then illustrate it with a simple example. In performing the ANOVA test, we are trying to determine whether a certain number of population means are equal. To do that, we measure the differences among the sample means and compare them to the variability within the sample observations. That is why the test statistic is the ratio of the between-sample variation (MST) and the within-sample variation (MSE). If this ratio is close to 1, there is evidence that the population means are equal.
Here's a hypothetical example: many people believe that men get paid more in the business world than women, simply because they are male. To justify or reject such a claim, you could look at the variation within each group (one group being women's salaries and the other being men's salaries) and compare that to the variation between the means of randomly selected samples of each population. If the variation within the women's salaries is much larger than the variation between the men's and women's mean salaries, one could argue that, because the variation within the women's group is so large, this may not be a gender-related problem.
Now, getting back to our numerical example, we notice that:
given the test conclusion and the ANOVA test's conditions, we
may conclude that these three populations are in fact the same
population. Therefore, the ANOVA technique could be used as a
measuring tool and statistical routine for quality control as
described below using our numerical example.
Construction of the Control Chart for the Sample Means: Under the null hypothesis the ANOVA concludes that µ1 = µ2 = µ3; that is, we have a "hypothetical parent population." The question is, what is its variance? The estimated variance is 36 / 14 = 2.57. Thus, the estimated standard deviation is 1.60, and the estimated standard deviation for the means is 1.6/√5 = 0.71.
Under the conditions of ANOVA, we can construct a control chart with the warning limits = 3 ± 2(0.71) and the action limits = 3 ± 3(0.71). The following figure depicts the control chart.
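The control-limit arithmetic can be reproduced with a few lines of Python (a sketch, not the original figure):

import math

variance = 36 / 14                 # estimated variance of the hypothetical parent population
sd = math.sqrt(variance)           # about 1.60
se_mean = sd / math.sqrt(5)        # about 0.71, for samples of size 5
warning_limits = (3 - 2 * se_mean, 3 + 2 * se_mean)
action_limits = (3 - 3 * se_mean, 3 + 3 * se_mean)
print(warning_limits, action_limits)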
Visit also the Web site Statistical tests.
Bartlett's Test: The Analysis of Variance requires certain
conditions be met if the statistical tests are to be valid. One of
the conditions we make is that the errors (residuals) all come
from the same normal distribution. Thus we have to test not only
for normality, but we must also test homogeneity of the
variances. We can do this by subdividing the data into
appropriate groups, computing the variances in each of the
groups and testing that they are consistent with being sampled
from a Normal distribution. The statistical test for homogeneity of
variance is due to Bartlett; it is a modification of the Neyman-Pearson likelihood ratio test.
Bartlett's Test of Homogeneity of Variances for r Independent
Samples is a test to check for equal variances between
independent samples of data. The subgroup sizes do not have to be equal. This test assumes that each sample was randomly and independently drawn from a normal population.
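Bartlett's test is available in most packages; the following Python/SciPy sketch applies it to three hypothetical independent samples of unequal size (illustrative data, not from the text).

from scipy import stats

g1 = [25, 17, 8, 18, 5, 10, 21]          # hypothetical sample 1
g2 = [25, 21, 10, 13, 22]                # hypothetical sample 2
g3 = [20, 17, 14, 6, 25, 19, 16, 16]     # hypothetical sample 3
stat, p_value = stats.bartlett(g1, g2, g3)
print(stat, p_value)    # a small p-value suggests unequal variances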
SPSS program for ANOVA: More Than Two Independent
Means:
$SPSS/OUTPUT=4-1.OUT1
TITLE
'ANALYSIS OF VARIANCE - 1st ITERATION'
DATA LIST
FREE FILE='A.IN'/GP Y
ONEWAY Y BY GP(1,5)/RANGES=DUNCAN
/STATISTICS DESCRIPTIVES HOMOGENEITY
STATISTICS 1
MANOVA Y BY GP(1,5)/PRINT=HOMOGENEITY(BARTLETT)/
NPAR TESTS K-W Y BY GP(1,5)/
FINISH
ANOVA, like the two-population t-test, can go wrong when the equality-of-variances condition is not met.
Homogeneity of Variance: Checking the equality of variances. For 3 or more populations, there is a practical rule known as the "Rule of 2". According to this rule, divide the highest sample variance by the lowest sample variance. If the sample sizes are almost the same and the value of this ratio is less than 2, then the variations of the populations can be treated as almost the same.
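A minimal sketch of this check in Python (a hypothetical helper and hypothetical samples, not part of the original text):

import numpy as np

def rule_of_two(*samples):
    variances = [np.var(s, ddof=1) for s in samples]
    return max(variances) / min(variances) < 2   # True: variances may be treated as equal

print(rule_of_two([24, 20, 22, 25, 21], [14, 8, 11, 16, 9], [30, 12, 25, 7, 19]))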
Example: Consider the following three random samples from three populations, P1, P2, and P3.
P1    P2    P3
25    25    20
17    21    17
 8    10    14
18    13     6
 5    22    25
10    25    19
21    15    16
24    23    16
12    14     6
16    13     6
The summary statistics and the ANOVA table are computed to be:
Variable     N     Mean    St.Dev    SE Mean
P1          10    16.90      7.87       2.49
P2          10    19.80      3.52       1.11
P3          10    11.50      3.81       1.20

Analysis of Variance
Source     DF        SS       MS       F    p-value
Factor      2     79.40    39.70    4.38      0.023
Error      27    244.90     9.07
Total      29    324.30
With F = 4.38 and a p-value of 0.023, we reject the null hypothesis at α = 0.05. This is not good news, however, since ANOVA, like the two-sample t-test, can go wrong when the equality-of-variances condition is not met, and here the Rule of 2 is violated (the largest sample standard deviation, 7.87, is more than twice the smallest, 3.52).
Visit also the Web site Statistical tests.
SPSS program for ANOVA: More Than Two Independent
Means:
$SPSS/OUTPUT=A.OUT
TITLE
'ANALYSIS OF VARIANCE - 1st ITERATION'
DATA LIST
FREE FILE='A.IN'/GP Y
ONEWAY Y BY GP(1,5)/RANGES=DUNCAN
STATISTICS 1
MANOVA Y BY GP(1,5)/PRINT=HOMOGENEITY(BARTLETT)/
NPAR TESTS K-W Y BY GP(1,5)/
FINISH
Chi-Square Test: Dependency
$SPSS/OUTPUT=A.OUT
TITLE
'PROBLEM 4.2 CHI SQUARE; TABLE 4.18'
DATA LIST
FREE FILE='A.IN'/FREQ SAMPLE NOM
WEIGHT BY FREQ
VARIABLE LABELS
SAMPLE
'SAMPLE 1 TO 4'
NOM
'LESS OR MORE THAN 8'
VALUE LABELS
SAMPLE 1 'SAMPLE1' 2 'SAMPLE2' 3 'SAMPLE3' 4 'SAMPLE4'/
NOM
1 'LESS THAN 8' 2 'GT/EQ TO 8'/
CROSSTABS TABLES=NOM BY SAMPLE/
STATISTIC 1
FINISH
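For readers without SPSS, a chi-square test of independence can be sketched in Python/SciPy as follows; the 2-by-4 table of counts is hypothetical, not the data read by the SPSS job.

import numpy as np
from scipy import stats

table = np.array([[12, 15, 9, 14],     # LESS THAN 8, samples 1 to 4 (hypothetical counts)
                  [18, 10, 21, 16]])   # GT/EQ TO 8, samples 1 to 4 (hypothetical counts)
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)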
Non-parametric ANOVA:
$SPSS/OUTPUT=A.OUT
DATA LIST
FREE FILE='A.IN'/GP Y
NPAR TESTS K-W Y BY GP(1,4)
FINISH
Power of a Test
Power of a test is the probability of correctly rejecting a false null
hypothesis. This probability is inversely related to the probability
of making a Type II error. Recall that we choose the probability of
making a Type I error when we set . If we decrease the
probability of making a Type I error we increase the probability of
making a Type II error. Therefore, there are basically two possible errors when conducting a statistical analysis, Types I and II:
• Type I error - the risk (i.e., probability) of rejecting the null hypothesis when it is in fact true
• Type II error - the risk of not rejecting the null hypothesis when it is in fact false
Power and Alpha ()
Thus, the probability of correctly retaining a true null has the
same relationship to Type I errors as the probability of correctly
rejecting an untrue null does to Type II error. Yet, as I mentioned
if we decrease the odds of making one type of error we increase
the odds of making the other type of error. What is the
relationship between Type I and Type II errors? For a fixed
sample size, decreasing one type of error increases the size of
the other one.
Power and the True Difference Between Population Means
Anytime we test whether a sample differs from a population, or whether two samples come from two separate populations, there is the condition that each of the populations we are comparing has its own mean and standard deviation (even if we do not know them). The distance between the two population means will affect the power of our test.
Power as a Function of Sample Size and Variance σ²: Anything that affects the extent to which the two distributions share common values will increase β (the likelihood of making a Type II error).
Four factors influence power:
• effect size (for example, the difference between the means)
• standard error σ
• significance level α
• number of observations, or the sample size n
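To see how these four factors enter a power calculation, here is a sketch for an assumed one-sided z-test setting (not a formula from the text): power is the probability of rejecting H0 when the stated effect is real.

from scipy import stats

def power_one_sided_z(effect, sigma, alpha, n):
    z_crit = stats.norm.ppf(1 - alpha)                        # critical value at level alpha
    return stats.norm.sf(z_crit - effect * n**0.5 / sigma)    # P(reject | true effect size)

print(power_one_sided_z(effect=0.5, sigma=1.0, alpha=0.05, n=30))   # hypothetical inputs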
A Numerical Example: The following Figure provides an
illustrative numerical example:
Not rejecting the null hypothesis when it is false is defined as a Type II error, and its probability is denoted by β. In the above Figure this region lies to the left of the critical value. In the configuration shown in the Figure, β falls to the left of the critical value (and below the statistic's density under the alternative hypothesis Ha). β is also defined as the probability of incorrectly not rejecting a false null hypothesis, also called a miss. Related to the value of β is the power of a test. The power is defined as the probability of rejecting the null hypothesis given that a specific alternative is true, and is computed as (1 - β).
A Short Discussion: Consider testing a simple null versus a simple alternative. In the Neyman-Pearson setup, an upper bound α is set for the probability of a Type I error, and then it is desirable to find tests with a low probability β of a Type II error given this constraint. The usual justification is that "we are more concerned about a Type I error, so we set an upper limit on the α we can tolerate." I have seen this sort of reasoning in elementary texts and also in some advanced ones. It does not seem to make much sense: when the sample size is large, for most standard tests, the ratio β/α tends to 0. If we care more about Type I error than Type II error, why should this concern dissipate with increasing sample size?
This is indeed a drawback of the classical theory of testing
statistical hypotheses. A second drawback is that the choice lies
between only two test decisions: reject the null or accept the null.
It is worth considering approaches that overcome these
deficiencies. This can be done, for example, by the concept of profile tests at a 'level' a. Neither the Type I nor the Type II error rate is considered separately; rather, the decision is based on their ratio. For example, we accept the alternative hypothesis Ha and reject the null H0 if an event is observed which is at least a-times more probable under Ha than under H0. Conversely, we accept H0 and reject Ha if an event is observed which is at least a-times more probable under H0 than under Ha. This is a symmetric concept which is formulated within the classical approach. Furthermore, more than two decisions can also be formulated.
Visit also, the Web site Sample Size Calculations
Parametric vs. Non-Parametric vs. Distribution-free Tests
One must use a statistical technique called nonparametric if it
satisfies at least one of the following five types of criteria:
1. The data entering the analysis are enumerative - that is,
count data representing the number of observations in each
category or cross-category.
2. The data are measured and/or analyzed using a nominal
scale of measurement.
3. The data are measured and/or analyzed using an ordinal
scale of measurement.
4. The inference does not concern a parameter in the
population distribution - as, for example, the hypothesis that
a time-ordered set of observations exhibits a random
pattern.
5. The probability distribution of the statistic upon which the
analysis is based is not dependent upon specific information
or assumptions about the population(s) from which the
sample(s) are drawn, but only on general assumptions, such
as a continuous and/or symmetric population distribution.
By this definition, the distinction of nonparametric is accorded
either because of the level of measurement used or required for
the analysis, as in types 1 through 3; the type of inference, as in
type 4, or the generality of the assumptions made about the
population distribution, as in type 5.
For example, one may use the Mann-Whitney rank test as a nonparametric alternative to Student's t-test when one does not have normally distributed data.
Mann-Whitney: To be used with two independent groups
(analogous to the independent groups t-test)
Wilcoxon: To be used with two related (i.e., matched or repeated)
groups (analogous to the related samples t-test)
Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-factor between-subjects ANOVA)
Friedman: To be used with two or more related groups
(analogous to the single-factor within-subjects ANOVA)
Non-parametric vs. Distribution-free Tests:
Non-parametric tests are those used when some specific
conditions for the ordinary tests are violated.
Distribution-free tests are those for which the procedure is valid for all the different shapes of the population distribution.
For example, the chi-square test concerning the variance of a
given population is parametric since this test requires that the
population distribution be normal. The chi-square test of
independence does not assume normality, or even that the data
are numerical. The Kolmogorov-Smirnov goodness-of-fit test is a distribution-free test which can be applied to test any distribution.
Pearson's and Spearman's Correlations
There are measures that describe the degree to which two variables are linearly related. For the majority of these measures, the correlation is expressed as a coefficient that ranges from 1.00, indicating a perfect linear relationship such that knowing the value of one variable allows perfect prediction of the value of the related variable, down to 0.00, indicating no predictability by a linear model. Negative values indicate that when the value of one variable is high, the other is low (and vice versa), and positive values indicate that when the value of one variable is high, so is the other (and vice versa). Correlation has an interpretation similar to that of the derivative you learned in your calculus course (a deterministic setting).
Pearson's product-moment correlation is an index of the linear relationship between two variables.
Formulas:
x̄ = Σxi / n, the mean of the x values.
ȳ = Σyi / n, the mean of the y values.
Sxx = Σ(xi - x̄)² = Σxi² - (Σxi)²/n
Syy = Σ(yi - ȳ)² = Σyi² - (Σyi)²/n
Sxy = Σ(xi - x̄)(yi - ȳ) = Σxi·yi - (Σxi)(Σyi)/n
The Pearson correlation is r = Sxy / (Sxx·Syy)^0.5
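As a quick illustration (hypothetical data, not from the text), the following Python sketch computes r directly from these formulas and checks it against SciPy:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical x values
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])      # hypothetical y values
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
r = sxy / (sxx * syy) ** 0.5
print(r, stats.pearsonr(x, y)[0])            # the two values agree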
If there is a positive relationship, an individual who has a score on variable x that is above the mean of variable x is likely to have a score on variable y that is above the mean of variable y, and vice versa. A negative relationship would pair an x score above the mean of x with a y score below the mean of y. The correlation coefficient is a measure of the relationship between variables and an index of the proportion of individual differences in one variable that can be associated with the individual differences in another variable. In essence, the product-moment correlation coefficient is the mean of the cross-products of standardized scores. If you have three values of r of .40, .60, and .80, you cannot say that the difference between r = .40 and r = .60 is the same as the difference between r = .60 and r = .80, or that r = .80 is twice as large as r = .40, because the scale of values for the correlation coefficient is not interval or ratio, but ordinal. Therefore, all you can say is that, for example, a correlation coefficient of +.80 indicates a high positive linear relationship and a correlation coefficient of +.40 indicates a somewhat lower positive linear relationship. The correlation can tell us how much of the total variance of one variable can be associated with the variance of another variable: the square of the correlation coefficient equals the proportion of the total variance in Y that can be associated with the variance in X.
However, in engineering/manufacturing/development, an r of 0.7 is often considered weak, and +0.9 is desirable. When the correlation coefficient is around +0.9, it is time to make a prediction and run confirmation trial(s). Note that the correlation coefficient usually measures only linear association. If the data form a symmetric quadratic hump, a linear correlation of x and y will produce an r of 0! So one must be careful and look at the data.
The Spearman rank-order correlation coefficient is used as a nonparametric version of Pearson's. It is expressed as:
rs = 1 - (6Σd²) / [n(n² - 1)],
where d is the difference in ranks between each X and Y pair.
The Spearman correlation coefficient can be algebraically derived from the Pearson correlation formula by making use of sums of series. The Pearson formula contains the expressions Σx(i), Σy(i), Σx(i)², and Σy(i)².
In the Spearman case, the x(i)'s and y(i)'s are ranks, and so the sums of the ranks, and the sums of the squared ranks, are entirely determined by the number of cases (when there are no ties):
Σi = N(N + 1)/2,   Σi² = N(N + 1)(2N + 1)/6.
The Spearman formula is then equal to:
[12P - 3N(N + 1)²] / [N(N² - 1)],
where P is the sum of the products of each pair of ranks, Σx(i)y(i). This reduces to:
rs = 1 - (6Σd²) / [n(n² - 1)],
where d is the difference in ranks between each x(i) and y(i) pair.
An important consequence of this is that if you enter ranks into a
Pearson formula, you get precisely the same numerical value as
that obtained by entering the ranks into the Spearman formula.
This comes as a bit of a shock to those who like to adopt
simplistic slogans such as "Pearson is for interval data, Spearman
is for ranked data". Spearman doesn't work too well if there are
lots of tied ranks. That's because the formula for calculating the
sums of squared ranks no longer holds true. If one has lots of
tied ranks, use the Pearson formula.
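The point about ranks can be verified directly; the following Python/SciPy sketch (hypothetical data with no ties) shows that feeding ranks into the Pearson formula reproduces the Spearman coefficient.

from scipy import stats

x = [3.1, 1.2, 5.4, 2.2, 4.8]     # hypothetical data, no ties
y = [10.0, 4.0, 12.0, 6.0, 9.0]
rho_spearman = stats.spearmanr(x, y)[0]
r_on_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
print(rho_spearman, r_on_ranks)   # identical up to floating-point error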
Visit also the Web sites: Correlation Pearsons r, Spearman's Rank
Correlation
Independence vs. Correlated
In the sense that it is used in statistics, i.e., as an assumption in applying a statistical test, a random sample from the entire population provides a set of random variables X1, ..., Xn that are identically distributed and mutually independent (mutual independence is stronger than pairwise independence). The
random variables are mutually independent if their joint
distribution is equal to the product of their marginal distributions.
In the case of joint normality, independence is equivalent to zero
correlation but not in general. Independence will imply zero
correlation (if the random variables have second moments) but
not conversely. Note that not all random variables have a first moment, let alone a second moment, and hence there may not be a correlation coefficient.
However if the correlation coefficient of two random variables
(theoretical) is not zero then the random variables are not
independent.
Correlation and Level of Significance
It is intuitive that with very few data points, a high correlation may not be statistically significant. You may see statements such as "the correlation between x and y is significant at the α = .005 level" and "the correlation is significant at the α = .05 level." The question is how these numbers are determined.
For simple correlation, you can look at the test as a test on r². For a simple correlation, the formula for F, where F is the square of the t-statistic, becomes
F = (n - 2) r² / (1 - r²), n ≥ 2.
As you may see, this is monotonic in r² and in n. If the degrees of freedom (n - 2) is large, then the F-test is very closely approximated by the chi-square, so that a value of 3.84 is what is needed to reach the α = 5% level. The cutoff value of F changes little enough that the same value, 3.84, gives a pretty good estimate even when n is small. You can look up an F-table or chi-square table to see the cutoff values needed for other α levels.
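For example, the significance of an observed correlation can be checked with this F statistic; the r and n below are hypothetical values.

from scipy import stats

r, n = 0.45, 30                          # hypothetical correlation and sample size
F = (n - 2) * r**2 / (1 - r**2)          # F = (n - 2) r^2 / (1 - r^2)
p_value = stats.f.sf(F, 1, n - 2)        # upper tail of F with (1, n - 2) df
print(F, p_value)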
Resampling Techniques: Jackknifing and Bootstrapping
These are statistical inference techniques that do not require distributional assumptions about the statistics involved. These modern nonparametric methods use large amounts of computation to explore the empirical variability of a statistic, rather than making a priori assumptions about this variability, as is done in the traditional parametric t- and z-tests. Monte Carlo simulation allows the evaluation of the behavior of a statistic when its mathematical analysis is intractable. Bootstrapping and jackknifing allow inferences to be made from a sample when traditional parametric inference fails. These techniques are especially useful for dealing with statistical problems such as small sample sizes, statistics with no well-developed distributional theory, and violations of parametric inference conditions. Both are computer intensive. Bootstrapping involves taking repeated samples, with replacement, from the original sample.
Jackknifing involves systematically doing n steps, of omitting 1
case from a sample at a time, or, more generally, n/k steps of
omitting k cases; computations that compare "included" vs.
"omitted" can be used (especially) to reduce the bias of
estimation.
Bootstrapping means you take repeated samples from a sample and then make statements about the population. Bootstrapping entails sampling with replacement from a sample. Both techniques have applications in reducing bias in estimation.
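A minimal bootstrap sketch in Python (hypothetical data, not from the text): resample the sample with replacement many times and examine the empirical variability of the mean.

import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])   # hypothetical sample
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean() for _ in range(5000)]
print(np.std(boot_means, ddof=1))               # bootstrap estimate of the SE of the mean
print(np.percentile(boot_means, [2.5, 97.5]))   # simple percentile confidence interval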
Resampling -- including the bootstrap, permutation, and other non-parametric tests -- is a method for hypothesis tests, confidence limits, and other applied problems in statistics and probability. It involves no formulas or tables, and the resampling procedure is essentially the same for all tests.
Following the first publication of the general technique (and the bootstrap) in 1969 by Julian Simon and its subsequent independent development by Bradley Efron, resampling has become an alternative approach for testing hypotheses.
There are other findings, "The bootstrap started out as a good
notion in that it presented, in theory, an elegant statistical
procedure that was free of distributional conditions.
Unfortunately, it doesn't work very well, and the attempts to
modify it make it more complicated and more confusing than the
parametric procedures that it was meant to replace."
For the pros and cons of the bootstrap, read:
Young G., Bootstrap: More than a Stab in the Dark?, Statistical Science, 9, 382-395, 1994.
Visit also the Web sites Resampling and Bootstrapping with SAS.
Sampling Methods
From the food you eat to the TV you watch, from political
elections to school board actions, much of your life is regulated
by the results of sample surveys. In the information age of today
and tomorrow, it is increasingly important that sample survey
design and analysis be understood by many so as to produce
good data for decision making and to recognize questionable data
when it arises. Relevant topics are: Simple Random Sampling,
Stratified Random Sampling, Cluster Sampling, Systematic
Sampling, Ratio and Regression Estimation, Estimating a
Population Size, Sampling a Continuum of Time, Area or Volume,
Questionnaire Design, Errors in Surveys.
A sample is a group of units selected from a larger group (the
population). By studying the sample it is hoped to draw valid
conclusions about the larger group.
A sample is generally selected for study because the population is
too large to study in its entirety. The sample should be
representative of the general population. This is often best
achieved by random sampling. Also, before collecting the sample,
it is important that the researcher carefully and completely
defines the population, including a description of the members to
be included.
Random sampling of size n from a population of size N: an unbiased estimate of the variance of x̄ is Var(x̄) = S²(1 - n/N)/n, where n/N is the sampling fraction. For a sampling fraction of less than 10%, the finite population correction factor (N - n)/(N - 1) is almost 1. The total T is estimated by N·x̄; its variance is N²Var(x̄).
For 0-1 (binary) variables, the variance of p̄ is estimated by S² = p̄(1 - p̄)(1 - n/N)/(n - 1).
For the ratio r = Σxi / Σyi = x̄/ȳ, the variance of r is
[(N - n)(r²S²x + S²y - 2r Cov(x, y))] / [n(N - 1)].
Stratified Sampling: The stratified sample mean is x̄s = Σ Wt·x̄t over t = 1, 2, ..., L (strata), where x̄t = Σ Xit/nt.
Its variance is: Σ W²t (Nt - nt) S²t / [nt(Nt - 1)].
The population total T is estimated by N·x̄s; its variance is Σ N²t (Nt - nt) S²t / [nt(Nt - 1)].
Since a survey usually measures several attributes for each population member, it is impossible to find an allocation that is simultaneously optimal for each of those variables. Therefore, in such a case we use the popular method of proportional allocation, which uses the same sampling fraction in each stratum. This yields the optimal allocation when the variations within the strata are all the same.
Determination of the sample size n for binary data: n is the smallest integer greater than or equal to
[t² N p(1 - p)] / [t² p(1 - p) + ε²(N - 1)],
with N being the total number of cases, n the sample size, ε the acceptable error, t the value taken from the t distribution corresponding to a certain confidence level, and p the probability of an event.
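A small Python sketch of this sample-size formula, with hypothetical inputs (ε written as eps):

import math

def sample_size(N, p, t, eps):
    # smallest integer >= [t^2 N p(1-p)] / [t^2 p(1-p) + eps^2 (N-1)]
    return math.ceil((t**2 * N * p * (1 - p)) / (t**2 * p * (1 - p) + eps**2 * (N - 1)))

print(sample_size(N=5000, p=0.5, t=1.96, eps=0.05))   # hypothetical population and error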
Cross-Sectional Sampling: A cross-sectional study is the observation of a defined population at a single point in time or over a single time interval. Exposure and outcome are determined simultaneously.
For more information on sampling methods, visit the Web sites :
Sampling
Sampling In Research
Sampling, Questionnaire Distribution and Interviewing
SRMSNET: An Electronic Bulletin Board for Survey Researchers
Sampling and Surveying Handbook
Warranties: Statistical Planning and Analysis
In today's marketplace, a warranty has become an increasingly important component of a product package, and most consumer and industrial products are sold with a warranty. The warranty serves many purposes. It provides protection for both buyer and manufacturer. For a manufacturer, a warranty also serves to communicate information about product quality and, as such, may be used as a very effective marketing tool.
Warranty decisions involve both technical and commercial
considerations. Because of the possible financial consequences of
these decisions, effective warranty management is critical for the
financial success of a manufacturing firm. This requires that
management at all levels be aware of the concept, role, uses and
cost and design implications of warranty.
The aim is to understand: the concept of warranty and its uses; warranty policy alternatives; the consumer/manufacturer perspectives with regard to warranties; the commercial/technical aspects of warranty and their interaction; strategic warranty management; methods for warranty cost prediction; and warranty administration.
References and Further Readings:
Brennan J., Warranties: Planning, Analysis, and Implementation,
McGraw Hill, New York, 1994.
Factor Analysis
Factor analysis is a technique for data reduction, that is, for explaining the variation in a collection of continuous variables by a smaller number of underlying dimensions (called factors). Common factor analysis can also be used to form index numbers or factor scores by using the correlation or covariance matrix. The main problem with the factor analysis concept is that it is very subjective in the interpretation of the results.
Delphi Analysis
Delphi Analysis is used in decision making process, in particular in
forecasting. Several "experts" sit together and try to compromise
on something they cannot agree on.
Reference:
Delbecq, A., Group Techniques for Program Planning, Scott
Foresman, 1975.
Binomial Distribution
Application: Gives probability of exact number of successes in n
independent trials, when probability of success p on single trial is
a constant. Used frequently in quality control, reliability, survey
sampling, and other industrial problems.
Example: What is the probability of 7 or more "heads" in 10
tosses of a fair coin?
Know that the binomial distribution must satisfy the following five requirements: each trial can have only two outcomes, or outcomes that can be reduced to two categories (called pass and fail); there must be a fixed number of trials; the outcome of each trial must be independent; the probability of success must remain constant; and the outcome of interest is the number of successes.
Comments: Can sometimes be approximated by normal or by
Poisson distribution.
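The example above can be answered directly from the binomial distribution; a short Python/SciPy sketch:

from scipy import stats

p_seven_or_more = stats.binom.sf(6, n=10, p=0.5)   # P(X >= 7) = 1 - P(X <= 6)
print(p_seven_or_more)                             # about 0.172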
Poisson Distribution
Application: Gives the probability of exactly x occurrences during a given period of time if events take place independently and at a constant rate. May also represent the number of occurrences over constant areas or volumes. Used frequently in quality control, reliability, queuing theory, and so on.
Example: Used to represent distribution of number of defects in a
piece of material, customer arrivals, insurance claims, incoming
telephone calls, alpha particles emitted, and so on.
Comments: Frequently used as approximation to binomial
distribution.
Exponential Distribution
Application: Gives distribution of time between independent
events occurring at a constant rate. Equivalently, probability
distribution of life, presuming constant conditional failure (or
hazard) rate. Consequently, applicable in many, but not all
reliability situations.
Example: Distribution of time between arrival of particles at a
counter. Also life distribution of complex non redundant systems,
and usage life of some components - in particular, when these
are exposed to initial burn-in, and preventive maintenance
eliminates parts before wear-out.
Comments: Special case of both Weibull and gamma
distributions.
Uniform Distribution
Application: Gives probability that observation will occur within a
particular interval when probability of occurrence within that
interval is directly proportional to interval length.
Example: Used to generate random values.
Comments: Special case of beta distribution.
The density of the geometric mean of n independent Uniform(0,1) random variables is:
f(x) = n x^(n-1) (Log[1/x^n])^(n-1) / (n - 1)!.
zλ = [U^λ - (1 - U)^λ] / λ is said to have Tukey's symmetric lambda distribution.
Student's t-Distributions
The t distributions were discovered in 1908 by William Gosset, a chemist and statistician employed by the Guinness brewing company. He considered himself a student still learning statistics, so he signed his papers with the pseudonym "Student". Or perhaps he used a pseudonym because of "trade secrets" restrictions imposed by Guinness.
Note that there is not a single t distribution; it is a class of distributions. When we speak of a specific t distribution, we have
to specify the degrees of freedom. The t density curves are
symmetric and bell-shaped like the normal distribution and have
their peak at 0. However, the spread is more than that of the
standard normal distribution. The larger the degrees of freedom,
the closer the t-density is to the normal density.
Critical Values for the t-Distribution
Annotated Review of Statistical Tools on the Internet
Visit also the Web site Computational Tools and Demos on the
Internet
Introduction: Modern, web-based learning and computing
provides the means for fundamentally changing the way in which
statistical instruction is delivered to students. Multimedia learning
resources combined with CD-ROMs and workbooks attempt to
explore the essential concepts of a course by using the full
pedagogical power of multimedia. Many Web sites have nice
features such as interactive examples, animation, video,
narrative, and written text. These web sites are designed to
provide students with a "self-help" learning resource to
complement the traditional textbook.
In a few pilot studies, [Mann, B. (1997) Evaluation of
Presentation modalities in a hypermedia system, Computers &
Education, 28, 133-143. Ward M. and D. Newlands (1998) Use of
the Web in undergraduate teaching, Computers & Education, 31,
171-184.] compared the relative effectiveness of three versions
of hypermedia systems, namely, Text, Sound/Text, and Sound.
The results indicate that those working with Sound could focus
their attention on the critical information. Those working with the
Text and Sound/Text version however, did not learn as much and
stated their displeasure with reading so much text from the
screen. Based on this study, it is clear at least at this time that
such web-based innovations cannot serve as an adequate
substitute for face-to-face live instruction [See also Mcintyre D.,
and F. Wolff, An experiment with WWW interactive learning in
university education, Computers & Education, 31, 255-264,
1998].
Online learning education does for knowledge what just-in-time
delivery does for manufacturing: It delivers the right tools and
parts when you need them.
The Java applets are probably the most phenomenal way of
simplifying various concepts by way of interactive processes.
These applets help bring into life every concept from central limit
theorem to interactive random games and multimedia
applications.
The Flashlight Project develops survey items, interview plans,
cost analysis methods, and other procedures that institutions can
use to monitor the success of educational strategies that use
technology.
Read also: Critical notice: are we blessed with the emergence of the WWW?, edited by B. Khan and R. Goodfellow, Computers and Education, 30(1-2), 131-136, 1998.
The following compilation summarizes currently available public
domain web sites offering statistical instructional material. While
some sites may have been missed, I feel that this listing is fully
representative. I would welcome information regarding any
further sites for inclusion, E-mail.
Academic Assistance Access It is a free tutoring service designed
to offer assistance to your statistics questions.
Basic Definitions, by V. Easton and J. McColl, Contains glossary of
basic terms and concepts.
Basic principles of statistical analysis, by Bob Baker, Basics
concepts of statistical models, Mixed model, Choosing between
fixed and random effects, Estimating variances and covariance,
Estimating fixed effects, Predicting random effects, Inference
space, Conclusions, Some references.
Briefbook of Data Analysis, has many contributors. The most
comprehensive dictionary of statistics. Includes ANOVA, Analysis
of Variance, Attenuation, Average, Bayes Theorem, Bayesian
Statistics, Beta Distribution, Bias, Binomial Distribution, Bivariate
Normal Distribution, Bootstrap, Cauchy Distribution, Central Limit
Theorem, Bootstrap, Chi-square Distribution, Composite
Hypothesis, Confidence Level, Correlation Coefficient, Covariance,
Cramer-Rao Inequality, Cramer-Smirnov-Von Mises Test, Degrees
of Freedom, Discriminant Analysis, Estimator, Exponential
Distribution, F-Distribution, F-test, Factor Analysis, Fitting,
Geometric Mean, Goodness-of-fit Test, Histogram, Importance
Sampling, Jackknife, Kolmogorov Test, Kurtosis, Least Squares,
Likelihood, Linear Regression, Maximum Likelihood Method,
Mean, Median, Mode, Moment, Monte Carlo Methods, Multinomial
Distribution, Multivariate Normal, Distribution Normal
Distribution, Outlier, Poisson Distribution, Principal Component
Analysis, Probability, Probability Calculus, Random Numbers,
Random Variable, Regression Analysis, Residuals, Runs Test,
Sample Mean, Sample Variance, Sampling from a Probability
Density Function, Scatter Diagram, Significance of Test,
Skewness, Standard Deviation, Stratified Sampling, Student's t
Distribution, Student's test, Training Sample, Transformation of
Random Variables, Trimming, Truly Random Numbers, Uniform
Distribution, Validation Sample Variance, Weighted Mean, etc.,
References, and Index.
Calculus Applied to Probability and Statistics for Liberal Arts and
Business Majors, by Stefan Waner and Steven Costenoble,
contains: Continuous Random Variables and Histograms;
Probability Density Functions; Mean, Median, Variance and
Standard Deviation.
Computing Studio, by John Behrens, Each page is a data entry
form that will allow you to type data in and will write a page that
walks you through the steps of computing your statistic: Mean,
Median, Quartiles, Variance of a population, Sample variance for
estimating a population variance, Standard-deviation of a
population, Sample standard-deviation used to estimate a
population standard-deviation, Covariance for a sample, Pearson
Product-Moment Correlation Coefficient (r), Slope of a regression
line, Sums-of-squares for simple regression.
CTI Statistics, by Stuart Young, CTI Statistics is a statistical
resource center. Here you will find software reviews and articles,
a searchable guide to software for teaching, a diary of
forthcoming statistical events worldwide, a CBL software
developers' forum, mailing list information, contact addresses,
and links to a wealth of statistical resources worldwide.
Data and Story Library, It is an online library of datafiles and
stories that illustrate the use of basic statistics methods.
DAU Stat Refresher, has many contributors. Tutorial, Tests,
Probability, Random Variables, Expectations, Distributions, Data
Analysis, Linear Regression, Multiple Regression, Moving
Averages, Exponential Smoothing, Clustering Algorithms, etc.
Descriptive Statistics Computation, Enter a column of your data
so that the mean, standard deviation, etc. will be calculated.
Elementary Statistics Interactive, by Wlodzimierz Bryc,
Interactive exercises, including links to further reading materials,
includes on-line tests.
Elementary Statistics, by J. McDowell. Frequency distributions,
Statistical moments, Standard scores and the standard normal
distribution, Correlation and regression, Probability, Sampling
Theory, Inference: One Sample, Two Samples.
Evaluation of Intelligent Systems, by Paul Cohen (Editor-in-Chief), covers: Exploratory data analysis, Hypothesis testing,
Modeling, and Statistical terminology. It also serves as
community-building function.
First Bayes, by Tony O'Hagan, First Bayes is a teaching package
for elementary Bayesian Statistics.
Fisher's Exact Test, by Øyvind Langsrud, To categorical variables
with two levels.
Gallery of Statistics Jokes, by Gary Ramseyer, Collection of
Statistical Joks.
Glossary of Statistical Terms, by D. Hoffman, Glossary of major
keywords and phrases in suggested learning order is provided.
Graphing Studio, Data entry forms to produce plots for two-dimensional and three-dimensional scatterplots.
HyperStat Online, by David Lane. It is an introductory-level
statistics book.
Interactive Statistics, Contains some nice Java applets: guessing
correlations, scatterplots, Data Applet, etc.
Interactive Statistics Page, by John Pezzullo, Web pages that
perform mostly needed statistical calculations. A complete
collection on: Calculators, Tables, Descriptives, Comparisons,
Cross-Tabs, Regression, Other Tests, Power&Size, Specialized,
Textbooks, Other Stats Pages.
Internet Glossary of Statistical Terms, by By H. Hoffman, The
contents are arranged in suggested learning order and
alphabetical order, from Alpha to Z score.
Internet Project, by Neil Weiss, Helps students understand
statistics by analyzing real data and interacting with graphical
demonstrations of statistical concepts.
Introduction to Descriptive Statistics, by Jay Hill, Provides
everyday's applications of Mode, Median, Mean, Central
Tendency, Variation, Range, Variance, and Standard Deviation.
Introduction to Quantitative Methods, by Gene Glass. A basic
statistics course in the College of Education at Arizona State
University.
Introductory Statistics Demonstrations, Topics such as Variance
and Standard Deviation, Z-Scores, Z-Scores and Probability,
Sampling Distributions, Standard Error, Standard Error and Z-score Hypothesis Testing, Confidence Intervals, and Power.
Introductory Statistics: Concepts, Models, and Applications, by
David Stockburger. It represents over twenty years of experience
in teaching the material contained therein by the author. The high
price of textbooks and a desire to customize course material for
his own needs caused him to write this material. It contains
projects, interactive exercises, animated examples of the use of
statistical packages, and inclusion of statistical packages.
The Introductory Statistics Course: A New Approach, by D.
Macnaughton. Students frequently view statistics as the worst
course taken in college. To address that problem, this paper
proposes five concepts for discussion at the beginning of an
introductory course: (1) entities, (2) properties of entities, (3) a
goal of science: to predict and control the values of properties of
entities, (4) relationships between properties of entities as a key
to prediction and control, and (5) statistical techniques for
studying relationships between properties of entities as a means
to prediction and control. It is argued that the proposed approach
gives students a lasting appreciation of the vital role of the field
of statistics in scientific research. Successful testing of the
approach in three courses is summarized.
Java Applets, by many contributors. Distributions (Histograms,
Normal Approximation to Binomial, Normal Density, The T
distribution, Area Under Normal Curves, Z Scores & the Normal
Distribution. Probability & Stochastic Processes (Binomial
Probabilities, Brownian Motion, Central Limit Theorem, A Gamma
Process, Let's Make a Deal Game. Statistics (Guide to basic stats
labs, ANOVA, Confidence Intervals, Regression, Spearman's rank
correlation, T-test, Simple Least-Squares Regression, and
Discriminant Analysis.
The Knowledge Base, by Bill Trochim, The Knowledge Base is an
online textbook for an introductory course in research methods.
Lies, Damn Lies, and Psychology, by David Howell, This is the
homepage for a course modeled after the Chance course.
Math Titles: Full List of Math Lesson Titles, by University of
Illinois, Lessons on Statistics and Probability topics among others.
Nonparametric Statistical Methods, by Anthony Rossini, almost all
widely used nonparametric tests are presented.
On-Line Statistics, by Ronny Richardson, contains the contents of
his lecture notes on: Descriptive Statistics, Probability, Random
Variables, The Normal Distribution, Create Your Own Normal
Table, Sampling and Sampling Distributions, Confidence
Intervals, Hypothesis Testing, Linear Regression Correlation Using
Excel.
Online Statistical Textbooks, by Haiko Lüpsen.
Power Analysis for ANOVA Designs, by Michael Friendly, It runs a
SAS program that calculates power or sample size needed to
attain a given power for one effect in a factorial ANOVA design.
The program is based on specifying Effect Size in terms of the
range of treatment means, and calculating the minimum power,
or maximum required sample size.
Practice Questions for Business Statistics, by Brian Schott, Over
800 statistics quiz questions for introduction to business
statistics.
Prentice Hall Statistics, This site contains full description of the
materials covers in the following books coauthored by Prof.
McClave: A First Course In Statistics, Statistics, Statistics For
Business And Economics, A First Course In Business Statistics.
Probability Lessons, Interactive probability lessons for problem solving and activities.
Probability Theory: The logic of Science, by E. Jaynes. Plausible
Reasoning, The Cox Theorems, Elementary Sampling Theory,
Elementary Hypothesis Testing, Queer Uses for Probability
Theory, Elementary Parameter Estimation, The Central Gaussian,
or Normal, Distribution, Sufficiency, Ancillarity, and All That,
Repetitive Experiments: Probability and Frequency, Physics of
``Random Experiments'', The Entropy Principle, Ignorance Priors
-- Transformation Groups, Decision Theory: Historical Survey,
Simple Applications of Decision Theory, Paradoxes of Probability
Theory, Orthodox Statistics: Historical Background, Principles and
Pathology of Orthodox Statistics, The Ap Distribution and Rule of Succession, Physical Measurements, Regression and Linear
Models, Estimation with Cauchy and t--Distributions, Time Series
Analysis and Auto regressive Models, Spectrum / Shape Analysis,
Model Comparison and Robustness, Image Reconstruction,
Nationalization Theory, Communication Theory, Optimal Antenna
and Filter Design, Statistical Mechanics, Conclusions Other
Approaches to Probability Theory, Formalities and Mathematical
Style, Convolutions and Cumulants, Circlet Integrals and Generating Functions, The Binomial -- Gaussian Hierarchy of Distributions, Fourier Analysis, Infinite Series, Matrix Analysis and
Computation, Computer Programs.
Probability and Statistics, by Beth Chance. Covers the introductory materials supporting the textbook by Moore and McCabe, Introduction to the Practice of Statistics, W. H. Freeman, 1999.
Rice Virtual Lab in Statistics, by David Lane et al., An introductory
statistics course which uses Java script Monte Carlo.
Sampling distribution demo, by David Lane, Applet estimates and
plots the sampling distribution of various statistics given
population distribution, sample size, and statistic.
Selecting Statistics, Cornell University. Answer the questions
therein correctly, then Selecting Statistics leads you to an
appropriate statistical test for your data.
Simple Regression, Enter pairs of data so that a line can be fit to
the data.
Scatterplot, by John Behrens, Provides a two-dimensional
scatterplot.
Selecting Statistics, by Bill Trochim, An expert system for
statistical procedures selection.
Some experimental pages for teaching statistics, by Juha
Puranen, contains some - different methods for visualizing
statistical phenomena, such as Power and Box-Cox
transformations.
Statlets: Download Academic Version (Free), Contains Java
Applets for Plots, Summarize, One and two-Sample Analysis,
Analysis of Variance, Regression Analysis, Time Series Analysis,
Rates and Proportions, and Quality Control.
Statistical Analysis Tools, Part of Computation Tools of Hyperstat.
Statistical Demos and Monte Carlo, Provides demos for Sampling
Distribution Simulation, Normal Approximation to the Binomial
Distribution, and A "Small" Effect Size Can Make a Large
Difference.
Statistical Education Resource Kit, by Laura Simon, This web page
contains a collection of resources used by faculty in Penn State's
Department of Statistics in teaching a broad range of statistics
courses.
Statistical Instruction Internet Palette, For teaching and learning
statistics, with extensive computational capability.
Statistical Terms, by The Animated Software Company,
Definitions for terms via a standard alphabetical listing.
Statiscope, by Mikael Bonnier, Interactive environment (Java
applet) for summarizing data and descriptive statistical charts.
Statistical Calculators, Presided at UCLA, Material here includes:
Power Calculator, Statistical Tables, Regression and GLM
Calculator, Two Sample Test Calculator, Correlation and
Regression Calculator, and CDF/PDF Calculators.
Statistical Home Page, by David C. Howell, This is a Home Page
containing statistical material covered in the author's textbooks
(Statistical Methods for Psychology and Fundamental Statistics for
the Behavioral Sciences), but it will be useful to others not using those books. It is always under construction.
Statistics Page, by Berrie, Movies to illustrate some statistical
concepts.
Statistical Procedures, by Phillip Ingram, Descriptions of various
statistical procedures applicable to the Earth Sciences: Data
Manipulation, One and Two Variable Measures, Time Series
Analysis, Analysis of Variance, Measures of Similarity, Multivariate Procedures, Multiple Regression, and Geostatistical
Analysis.
Statistical Tests, Contains Probability Distributions (Binomial,
Gaussian, Student-t, Chi-Square), One-Sample and Matched-Pairs tests, Two-Sample tests, Regression and correlation, and
Test for categorical data.
Statistical Tools, Pointers for demos on Binomial and Normal
distributions, Normal approximation, Sample distribution, Sample
mean, Confidence intervals, Correlation, Regression, Leverage
points and Chisquare.
Statistics, This server will perform some elementary statistical
tests on your data. Test included are Sign Test, McNemar's Test,
Wilcoxon Matched-Pairs Signed-Ranks Test, Student-t test for one
sample, Two-Sample tests, Median Test, Binomial proportions,
Wilcoxon Test, Student-t test for two samples, Multiple-Sample
tests, Friedman Test, Correlations, Rank Correlation coefficient,
Correlation coefficient, Comparing Correlation coefficients,
Categorical data (Chi-square tests), Chi-square test for known
distributions, Chi-square test for equality of distributions.
Statistics Homepage, by StatSoft Co., Complete coverage of
almost all topics
Statistics: The Study of Stability in Variation, Editor: Jan de
Leeuw. It has components which can be used on all levels of
statistics teaching. It is disguised as an introductory textbook,
perhaps, but many parts are completely unsuitable for
introductory teaching. Its contents are Introduction, Analysis of a
Single Variable, Analysis of a Pair of Variables, and Analysis of
Multi-variables.
Statistics Every Writer Should Know, by Robert Niles and Laurie
Niles. Treatment of elementary concepts.
Statistics Glossary, by V. Easton and J. McColl, Alphabetical index
of all major keywords and phrases
Statistics Network A Web-based resource for almost all statistical
kinds of information.
Statistics Online A good collection of links on: Statistics to Use,
Confidence Intervals, Hypothesis Testing, Probability
Distributions, One-Sample and Matched-Pairs Tests, Two-Sample
Tests, Correlations, Categorical Data, and Statistical Tables.
Statistics on the Web, by Clay Helberg, Just as the Web itself
seems to have unlimited resources, Statistics on the web must
have hundreds of sites listing such statistical areas as:
Professional Organizations, Institutes and Consulting Groups,
Educational Resources, Web courses, and others too numerous to
mention. One could literally shop all day finding the joys and
treasures of Statistics!
Statistics To Use, by T. Kirkman, Among others it contains
computations on: Mean, Standard Deviation, etc., Student's t-Tests, chi-square distribution test, contingency tables, Fisher
Exact Test, ANOVA, Ordinary Least Squares, Ordinary Least
Squares with Plot option, Beyond Ordinary Least Squares, and Fit
to data with errors in both coordinates.
Stat Refresher, This module is an interactive tutorial which gives
a comprehensive view of Probability and Statistics. This
interactive module covers basic probability, random variables,
moments, distributions, data analysis including regression,
moving averages, exponential smoothing, and clustering.
Tables, by William Knight, Tables for: Confidence Intervals for the
Median, Binomial Coefficients, Normal, T, Chi-Square, F, and
other distributions.
Two-Population T-test
SURFSTAT Australia, by Keith Dear. Summarizing and Presenting
Data, Producing Data, Variation and Probability, Statistical
Inference, Control Charts.
UCLA Statistics, by Jan de Leeuw, On-line introductory textbook
with datasets, Lispstat archive, datasets, and live on-line
calculators for most distributions and equations.
VassarStats, by Richard Lowry, On-line elementary statistical
computation.
Web Interface for Statistics Education, by Dale Berger, Sampling
Distribution of the Means, Central Limit Theorem, Introduction to
Hypothesis Testing, t-test tutorial. Collection of links for Online
Tutorials, Glossaries, Statistics Links, On-line Journals, Online
Discussions, Statistics Applets.
WebStat, by Webster West. Offers many interactive test
procedures, graphics, such as Summary Statistics, Z tests (one
and two sample) for population means, T tests (one and two
sample) for population means, a chi-square test for population
variance, a F test for comparing population variances, Regression,
Histograms, Stem and Leaf plots, Box plots, Dot plots, Parallel
Coordinate plots, Means plots, Scatter plots, QQ plots, and Time
Series Plots.
WWW Resources for Teaching Statistics, by Robin Lock.
Interesting and Useful Sites
Selected Reciprocal Web Sites
| ABCentral | Bulletin Board Libraries |Business Problem
Solving |Business Math |Casebook |Chance |CTI
Statistics |Cursos de estadística |Demos for Learning
Statistics |Electronic texts and statistical tables |Epidemiology
and Biostatistics |Financial and Economic Links | Hyperstat |Intro.
to Stat. |Java Applets |Lecturesonline |Lecture
summaries | Maths & Stats Links|
| Online Statistical Textbooks and Courses |Probability
Tutorial | Research Methods & Statistics Resources | Statistical
Demos and Calculations |Statistical Education Resource Kit|
|Statistical Resources |Statistical Resources on the
Web |Statistical tests |Statistical Training on the Web |Statistics
Education-I |Statistics Education-II |
| Statistics Network |Statistics on the Web |Statistics, Statistical
Computing, and Mathematics |Statoo |Stats
Links |st@tserv |StatSoft | StatsNet |
| StudyWeb | SurfStat |Using Excel |Virtual Library |WebEc | Web
Tutorial Links |Yahoo:Statistics|
More reciprocal sites may be found by clicking on the following
search engines:
GoTo| HotBot| InfoSeek| LookSmart| Lycos|
General References
| The MBA Page | What is OPRE? | Desk Reference| Another Desk
Reference | Spreadsheets | All Topics on the Web | Contacts to
Statisticians | Statistics Departments (by country)|
| ABCentral | Syllabits | World Lecture Hall | Others Selected
Links | Virtual Library
| Argus Clearinghouse | TILE.NET | CataList | Maths and
Computing Lists
Statistics References
| Careers in Statistics | Conferences | | Statistical List
Subscription | Statistics Mailing Lists | Edstat-L | Mailbase
Lists | Stat-L | Stats-Discuss | Stat Discussion
Group | StatsNet | List Servers|
| Math Forum Search|
| Statistics Journals | Books and Journal | Main Journals | Journal
Web Sites|
Statistical Societies & Organizations
American Statistical Association (ASA)
ASA D.C. Chapter
Applied Probability Trust
Bernoulli Society
Biomathematics and Statistics Scotland
Biometric Society
Center for Applied Probability at Columbia
Center for Applied Probability at Georgia Tech
Center for Statistical and Mathematical Computing
Classification Society of North America
CTI Statistics
Dublin Applied Probability Group
Institute of Mathematical Statistics
International Association for Statistical Computing
International Biometric Society
International Environmetric Society
International Society for Bayesian Analysis
International Statistical Institute
National Institute of Statistical Sciences
RAND Statistics Group
Royal Statistical Society
Social Statistics
Statistical Engineering Division
Statistical Society of Australia
Statistical Society of Canada
Statistics Resources
| Statistics Main Resources | Statistics and OPRE
Resources | Statistics Links | STATS | StatsNet | Resources | UK
Statistical Resources|
| Mathematics Internet Resources | Mathematical and
Quantitative Methods |Stat Index | StatServ | Study
WEB | Ordination Methods for Ecologists|
WWW Resources | StatLib: Statistics Library | Guide for
Statisticians|
| Stat Links | Use and Abuse of Statistics | Statistics Links|
| Statistical Links | Statistics Handouts | Statistics Related
Links | Statistics Resources |OnLine Text Books|
Probability Resources
|Probability Tutorial |Probability | Probability & Statistics |Theory
of Probability | Virtual Laboratories in Probability and Statistics
|Let's Make a Deal Game |Central Limit Theorem | The Probability
Web | Probability Abstracts
| Coin Flipping |Java Applets on Probability | Uncertainty in
AI |Normal Curve Area | Topics in Probability | PQRS Probability
Plot | The Birthday Problem|
Data and Data Analysis
|Histograms | Statistical Data Analysis | Exploring Data | Data
Mining |Books on Statistical Data Analysis|
| Evaluation of Intelligent Systems | AI and Statistics|
Statistical Software
| Statistical Software Providers | SPLUS | WebStat | QDStat | Statistical Calculators on
Web | MODSTAT | The AssiStat|
| Statistical Software | Mathematical and Statistical
Software | NCSS Statistical Software|
| Choosing a Statistical Analysis Package | Statistical Software
Review| Descriptive Statistics by Spreadsheet | Statistics with
Microsoft Excel|
Learning Statistics
| How to Study Statistics | Statistics Education | Web and
Statistical Education | Statistics & Decision
Sciences | Statistics | Statistical Education through Problem
Solving|
| Exam, tests samples | INFORMS Education and Students
Affairs | CHANCE Magazine | Chance Web Index|
| Statistics Education Bibliography | Teacher
Network | Computers in Teaching Statistics|
Glossary Collections
The following sites provide a wide range of keywords & phrases.
Visit them frequently to learn the language of statisticians.
|Data Analysis Briefbook | Glossary of Statistical Terms |Glossary
of Terms |Glossary of Statistics |Internet Glossary of Statistical
Terms |Lexicon|Selecting Statistics Glossary |Statistics
Glossary | SurfStat glossary|
Selected Topics
|ANOVA |Confidence Intervals |Regression
| Kolmogorov-Smirnov Test | Topics in Statistics-I | Topics in
Statistics-II | Statistical Topics | Resampling | Pattern
Recognition | Statistical Sites by Applications | Statistics and
Computing|
| Biostatistics | Biomathematics and Statistics | Introduction to
Biostatistics Bartlett Corrections|
| Statistical Planning | Regression Analysis | AI-Geostats | Total Quality | Analysis of Variance and Covariance|
| Significance Testing | Hypothesis Testing | Two-Tailed
Hypothesis Testing | Commentaries on Significance
Testing | Bayesian | Philosophy of Testing|
Questionnaire Design, Surveys Sampling and
Analysis
|Questionnaire Design and Statistical Data Analysis |Summary of
Survey Analysis Software |Sample Size in Surveys
Sampling |Survey Samplings|
|Multilevel Statistical Models | Write more effective survey
questions|
| Sampling In Research | Sampling, Questionnaire Distribution
and Interviewing | SRMSNET: An Electronic Bulletin Board for
Survey|
| Sampling and Surveying Handbook |Surveys Sampling
Routines |Survey Software |Multilevel Models Project|
Econometric and Forecasting
| Time Series Analysis for Official Statisticians | Time Series and
Forecasting | Business Forecasting | International Association of
Business Forecasting |Institute of Business Forecasting |Principles
of Forecasting|
| Financial Statistics | Econometric-Research | Econometric
Links | Economists | RFE: Resources for Economists | Business &
Economics Scout Reports|
| A Business Forecasting Course | A Forecasting Course | Time
Series Data Library | Journal of Forecasting|
| Economics and Teaching |Box-Jenkins Methodology |
Statistical Tables
The following Web sites provide critical values useful in statistical
testing and construction of confidence intervals. The results are
identical to those given in almost all textbooks. However, in most
cases they are more extensive (and therefore more accurate).
|Normal Curve Area |Normal Calculator |Normal Probability
Calculation |Critical Values for the t-Distribution | Critical Values
for the F-Distribution |Critical Values for the Chi-square
Distribution|
A selection of:
Academic Info: Business, AOL: Science and Technology, Biz/ed: Business
and Economics, BUBL Catalogue, Business & Economics: Scout
Report, Business & Finance, Business & Industrial,
Business Nation, Dogpile: Statistics, HotBot Directory:
Statistics, IFORS, LookSmart: Statistics, LookSmart: Data &
Statistics, MathForum: Business,McGraw-Hill: Business Statistics, NEEDS:
The National Engineering Education Delivery System, Netscape:
Statistics, NetFirst,
SavvySearch Guide: Statistics, Small Business, Social Science Information
Gateway, WebEc, and Yahoo.
The Copyright Statement: The fair use, according to the 1996 Fair
Use Guidelines for Educational Multimedia, of materials presented
on this Web site is permitted for noncommercial and classroom
purposes.
This site may be mirrored intact (including these notices) on any
server with public access, and it may be linked to from any other
Web page.
Kindly e-mail me your comments, suggestions, and concerns.
Thank you.
Professor Hossein Arsham
Estimation theory
From Wikipedia, the free encyclopedia
For other uses, see Estimation (disambiguation).
"Parameter estimation" redirects here. It is not to be confused with Point estimation or Interval
estimation.
Estimation theory is a branch of statistics that deals with estimating the values of parameters
based on measured/empirical data that has a random component. The parameters describe an
underlying physical setting in such a way that their value affects the distribution of the measured
data. An estimator attempts to approximate the unknown parameters using the measurements.
For example, it is desired to estimate the proportion of a population of voters who will vote for a
particular candidate. That proportion is the parameter sought; the estimate is based on a small
random sample of voters.
Or, for example, in radar the goal is to estimate the range of objects (airplanes, boats, etc.) by
analyzing the two-way transit timing of received echoes of transmitted pulses. Since the reflected
pulses are unavoidably embedded in electrical noise, their measured values are randomly
distributed, so that the transit time must be estimated.
In estimation theory, two approaches are generally considered. [1]

The probabilistic approach (described in this article) assumes that the measured data is
random with probability distribution dependent on the parameters of interest

The set-membership approach assumes that the measured data vector belongs to a set
which depends on the parameter vector.
For example, in electrical communication theory, the measurements which contain information
regarding the parameters of interest are often associated with a noisy signal. Without
randomness, or noise, the problem would be deterministic and estimation would not be needed.
Contents
1 Basics
2 Estimators
3 Examples
  3.1 Unknown constant in additive white Gaussian noise
    3.1.1 Maximum likelihood
    3.1.2 Cramér–Rao lower bound
  3.2 Maximum of a uniform distribution
4 Applications
5 See also
6 Notes
7 References
Basics
To build a model, several statistical "ingredients" need to be known. These are needed to ensure
the estimator has some mathematical tractability.
The first is a set of statistical samples taken from a random vector (RV) of size N. Put into
a vector,

    x = [ x[0], x[1], ..., x[N-1] ]^T.

Secondly, there are the corresponding M parameters,

    θ = [ θ_1, θ_2, ..., θ_M ]^T,

which need to be established with their continuous probability density function (pdf) or its
discrete counterpart, the probability mass function (pmf),

    p(x; θ).

It is also possible for the parameters themselves to have a probability distribution
(e.g., Bayesian statistics). It is then necessary to define the Bayesian probability

    π(θ).

After the model is formed, the goal is to estimate the parameters, commonly
denoted θ̂, where the "hat" indicates the estimate.
One common estimator is the minimum mean squared error (MMSE) estimator, which
utilizes the error between the estimated parameters and the actual value of the
parameters,

    e = θ̂ - θ,

as the basis for optimality. This error term is then squared, and the expected value of the
squared error is minimized for the MMSE estimator.
Estimators
Main article: Estimator
Commonly used estimators and estimation methods, and topics related to them:
Maximum likelihood estimators
Bayes estimators
Method of moments estimators
Cramér–Rao bound
Minimum mean squared error (MMSE), also known as Bayes least squared error (BLSE)
Maximum a posteriori (MAP)
Minimum variance unbiased estimator (MVUE)
Nonlinear system identification
Best linear unbiased estimator (BLUE)
Unbiased estimators (see estimator bias)
Particle filter
Markov chain Monte Carlo (MCMC)
Kalman filter, and its various derivatives
Wiener filter
Examples
Unknown constant in additive white Gaussian noise
Consider a received discrete signal, x[n], of N independent samples that consists of an
unknown constant A with additive white Gaussian noise (AWGN) w[n] with known
variance σ². Since the variance is known, the only unknown parameter is A.
The model for the signal is then

    x[n] = A + w[n],   n = 0, 1, ..., N-1.

Two possible (of many) estimators for the parameter A are:

    Â₁ = x[0]
    Â₂ = (1/N) Σ x[n], which is the sample mean.

Both of these estimators have a mean of A, which can be shown by taking the expected
value of each estimator:

    E[Â₁] = E[x[0]] = A
and
    E[Â₂] = E[(1/N) Σ x[n]] = (1/N) Σ E[x[n]] = A.

At this point, these two estimators would appear to perform the same. However, the
difference between them becomes apparent when comparing the variances:

    var(Â₁) = var(x[0]) = σ²
and
    var(Â₂) = var((1/N) Σ x[n]) = (1/N²) Σ var(x[n]) = σ²/N.

It would seem that the sample mean is a better
estimator since its variance is lower for every N > 1.
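This comparison is easy to verify numerically. The following is a minimal NumPy sketch; the values of A, σ, N and the number of trials are illustrative choices, not values from the text:

import numpy as np

# Simulate x[n] = A + w[n] many times and compare the two estimators discussed above.
# A, sigma, N and trials are illustrative; any values show the same pattern.
rng = np.random.default_rng(0)
A, sigma, N, trials = 5.0, 2.0, 50, 10_000

x = A + sigma * rng.standard_normal((trials, N))

A_hat_1 = x[:, 0]          # estimator 1: the first sample
A_hat_2 = x.mean(axis=1)   # estimator 2: the sample mean

print("mean of estimator 1:", A_hat_1.mean())       # both means are close to A = 5
print("mean of estimator 2:", A_hat_2.mean())
print("variance of estimator 1:", A_hat_1.var())    # close to sigma^2 = 4
print("variance of estimator 2:", A_hat_2.var())    # close to sigma^2 / N = 0.08

Both estimators come out unbiased in the simulation, but the sample mean's variance is smaller by roughly a factor of N, as the analysis predicts.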
Maximum likelihood
Main article: Maximum likelihood
Continuing the example using the maximum likelihood estimator, the probability density
function (pdf) of the noise for one sample w[n] is

    p(w[n]) = (1 / (σ√(2π))) exp( -w[n]² / (2σ²) ),

and the probability of x[n] becomes (x[n] can be thought of as being distributed N(A, σ²))

    p(x[n]; A) = (1 / (σ√(2π))) exp( -(x[n] - A)² / (2σ²) ).

By independence, the probability of the whole vector x becomes

    p(x; A) = Π p(x[n]; A) = (1 / (σ√(2π)))^N exp( -(1 / (2σ²)) Σ (x[n] - A)² ).

Taking the natural logarithm of the pdf,

    ln p(x; A) = -N ln(σ√(2π)) - (1 / (2σ²)) Σ (x[n] - A)²,

the maximum likelihood estimator is

    Â = arg max ln p(x; A).

Taking the first derivative of the log-likelihood function,

    ∂/∂A ln p(x; A) = (1/σ²) Σ (x[n] - A) = (1/σ²) ( Σ x[n] - N A ),

and setting it to zero results in the maximum likelihood estimator

    Â = (1/N) Σ x[n],

which is simply the sample mean. From this example, it was found that the sample mean is
the maximum likelihood estimator for N samples of a fixed, unknown parameter corrupted
by AWGN.
Cramér–Rao lower bound
For more details on this topic, see Cramér–Rao bound.
To find the Cramér–Rao lower bound (CRLB) of the sample mean estimator, it is first
necessary to find the Fisher information number

    I(A) = E[ ( ∂/∂A ln p(x; A) )² ] = -E[ ∂²/∂A² ln p(x; A) ],

and, copying from above,

    ∂/∂A ln p(x; A) = (1/σ²) ( Σ x[n] - N A ).

Taking the second derivative,

    ∂²/∂A² ln p(x; A) = -N/σ²,

and finding the negative expected value is trivial, since it is now a deterministic constant:

    -E[ ∂²/∂A² ln p(x; A) ] = N/σ².

Finally, putting the Fisher information into

    var(Â) ≥ 1 / I(A)

results in

    var(Â) ≥ σ²/N.

Comparing this to the variance of the sample mean (determined previously) shows that the
sample mean is equal to the Cramér–Rao lower bound for all values of N and σ². In other
words, the sample mean is the (necessarily unique) efficient estimator, and thus also the
minimum variance unbiased estimator (MVUE), in addition to being the maximum likelihood
estimator.
Maximum of a uniform distribution
Main article: German tank problem
One of the simplest non-trivial examples of estimation is the estimation of the maximum of
a uniform distribution. It is used as a hands-on classroom exercise and to illustrate basic
principles of estimation theory. Further, in the case of estimation based on a single sample,
it demonstrates philosophical issues and possible misunderstandings in the use of maximum
likelihood estimators and likelihood functions.
Given a discrete uniform distribution 1, 2, ..., N with unknown maximum N, the UMVU
estimator for the maximum is given by

    N̂ = ((k + 1)/k) m - 1 = m + m/k - 1,

where m is the sample maximum and k is the sample size, sampling without
replacement.[2][3] This problem is commonly known as the German tank problem, due to the
application of maximum estimation to estimates of German tank production during World
War II.
The formula may be understood intuitively as
"The sample maximum plus the average gap between observations in the sample",
the gap being added to compensate for the negative bias of the sample maximum as an
estimator for the population maximum.[note 1]
This has a variance of [2]

    (1/k) (N - k)(N + 1)/(k + 2) ≈ N²/k²  for small samples k << N,

so a standard deviation of approximately N/k, the (population) average size of a gap
between samples; compare m/k above. This can be seen as a very simple case of maximum
spacing estimation.
The sample maximum is the maximum likelihood estimator for the population maximum,
but, as discussed above, it is biased.
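A quick simulation illustrates why the UMVU estimator above is preferred to the raw sample maximum. This is a minimal sketch; the true maximum N, the sample size k and the number of trials are arbitrary illustrative choices:

import numpy as np

# German tank problem: compare the biased sample maximum with the UMVU estimate m + m/k - 1.
rng = np.random.default_rng(1)
N_true, k, trials = 1000, 15, 20_000

sample_max, umvu = [], []
for _ in range(trials):
    sample = rng.choice(np.arange(1, N_true + 1), size=k, replace=False)  # sampling without replacement
    m = sample.max()
    sample_max.append(m)
    umvu.append(m + m / k - 1)

print("average sample maximum:", np.mean(sample_max))  # noticeably below 1000 (biased)
print("average UMVU estimate :", np.mean(umvu))        # close to 1000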
Applications
Numerous fields require the use of estimation theory. Some of these fields include (but are
by no means limited to):
Interpretation of scientific experiments
Signal processing
Clinical trials
Opinion polls
Quality control
Telecommunications
Project management
Software engineering
Control theory (in particular, adaptive control)
Network intrusion detection systems
Orbit determination
Measured data are likely to be subject to noise or uncertainty, and it is through statistical
probability that optimal solutions are sought to extract as much information from the data
as possible.
Estimation in Statistics
In statistics, estimation refers to the process by which one makes inferences about a population,
based on information obtained from a sample.
Point Estimate vs. Interval Estimate
Statisticians use sample statistics to estimate population parameters. For example, sample means are
used to estimate population means; sample proportions, to estimate population proportions.
An estimate of a population parameter may be expressed in two ways:
 Point estimate. A point estimate of a population parameter is a single value of a statistic. For
example, the sample mean x is a point estimate of the population mean μ. Similarly, the
sample proportion p is a point estimate of the population proportion P.
 Interval estimate. An interval estimate is defined by two numbers, between which a
population parameter is said to lie. For example, a < x < b is an interval estimate of the
population mean μ. It indicates that the population mean is greater than a but less than b.
Confidence Intervals
Statisticians use a confidence interval to express the precision and uncertainty associated with a
particular sampling method. A confidence interval consists of three parts:
 A confidence level.
 A statistic.
 A margin of error.
The confidence level describes the uncertainty of a sampling method. The statistic and the margin of
error define an interval estimate that describes the precision of the method. The interval estimate of a
confidence interval is defined by the sample statistic ± margin of error.
For example, suppose we compute an interval estimate of a population parameter. We might describe
this interval estimate as a 95% confidence interval. This means that if we used the same sampling
method to select different samples and compute different interval estimates, the true population
parameter would fall within a range defined by the sample statistic ± margin of error 95% of the time.
Confidence intervals are preferred to point estimates, because confidence intervals indicate (a) the
precision of the estimate and (b) the uncertainty of the estimate.
Confidence Level
The probability part of a confidence interval is called a confidence level. The confidence level
describes the likelihood that a particular sampling method will produce a confidence interval that
includes the true population parameter.
Here is how to interpret a confidence level. Suppose we collected all possible samples from a given
population, and computed confidence intervals for each sample. Some confidence intervals would
include the true population parameter; others would not. A 95% confidence level means that 95% of
the intervals contain the true population parameter; a 90% confidence level means that 90% of the
intervals contain the population parameter; and so on.
Margin of Error
In a confidence interval, the range of values above and below the sample statistic is called the margin
of error.
For example, suppose the local newspaper conducts an election survey and reports that the
independent candidate will receive 30% of the vote. The newspaper states that the survey had a 5%
margin of error and a confidence level of 95%. These findings result in the following confidence
interval: We are 95% confident that the independent candidate will receive between 25% and 35% of
the vote.
Note: Many public opinion surveys report interval estimates, but not confidence intervals. They
provide the margin of error, but not the confidence level. To clearly interpret survey results you need
to know both! We are much more likely to accept survey findings if the confidence level is high (say,
95%) than if it is low (say, 50%).
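To connect the pieces above numerically, here is a minimal sketch computing an approximate 95% confidence interval for a proportion as statistic ± margin of error. The 30% figure comes from the newspaper example; the sample size n is an assumed value, not one given in the text:

import math

# Approximate 95% CI for a proportion: p_hat ± z * sqrt(p_hat * (1 - p_hat) / n).
p_hat = 0.30   # reported proportion from the example
n = 350        # assumed sample size (not stated in the text)
z = 1.96       # critical value for 95% confidence

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"margin of error: {margin:.3f}")                         # about 0.05 with this n
print(f"95% CI: ({p_hat - margin:.3f}, {p_hat + margin:.3f})")  # roughly 25% to 35%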
Test Your Understanding
Problem 1
Which of the following statements is true?
I. When the margin of error is small, the confidence level is high.
II. When the margin of error is small, the confidence level is low.
III. A confidence interval is a type of point estimate.
IV. A population mean is an example of a point estimate.
(A) I only
(B) II only
(C) III only
(D) IV only.
(E) None of the above.
Solution
The correct answer is (E). The confidence level is not affected by the margin of error. When the margin
of error is small, the confidence level can be low or high or anything in between. A confidence interval is
a type of interval estimate, not a type of point estimate. A population mean is not an example of a
point estimate; a sample mean is an example of a point estimate.
Software Estimation Techniques - Common
Test Estimation Techniques used in SDLC
For a software project to succeed and its tasks to be executed properly, estimation
techniques play a vital role in the software development life cycle. A technique used to
calculate the time required to accomplish a particular task is called an estimation
technique, and different software estimation techniques can be used to obtain a better
estimate.
Before moving forward, let's ask some basic questions: What is estimation used for? Why is
it needed? Who does it? This article addresses these questions about estimation.
What is Estimation?
"Estimation is the process of finding an estimate, or approximation, which is a value
that is usable for some purpose even if input data may be incomplete, uncertain,
or unstable." [Wiki Definition]
An estimate is a prediction, or a rough idea, of how much effort it would take to complete a
defined task, where the effort could be time or cost. In particular, an estimate is an
approximate computation of the probable cost of a piece of work.
The calculation of test estimates is based on:
- Past data / past experience
- Available documents / knowledge
- Assumptions
- Calculated risks

Before starting, one common question that arises in a tester's mind is "Why do we
estimate?" The answer is simple: we estimate tasks in order to avoid exceeding
timescales and overshooting budgets for testing activities.
A few points need to be considered before estimating testing activities:
- Check whether all requirements are finalized or not; if they are not, how frequently are
they likely to change?
- Make sure all responsibilities and dependencies are clear.
- Check whether the required infrastructure is ready for testing.
- Check that all assumptions and risks are documented before estimating the task.
Software Estimation Techniques
There are different Software Testing Estimation Techniques which can be used for
estimating a task.
1) Delphi Technique
2) Work Breakdown Structure (WBS)
3) Three Point Estimation
4) Functional Point Method
1) Delphi Technique:
The Delphi technique is one of the most widely used software testing estimation techniques.
It is based on surveys and collects information from participants who are experts. In this
technique each task is assigned to a team member, and surveys are conducted over multiple
rounds until a final estimate for the task is agreed upon. In each round the thoughts about
the task are gathered and feedback is provided. This method yields both quantitative and
qualitative results.
Among the techniques discussed here, the Delphi technique gives good confidence in the
estimate. It can be used in combination with the other techniques.
2) Work Breakdown Structure (WBS):
A big project is made manageable by first breaking it down into individual components in
a hierarchical structure, known as the work breakdown structure, or WBS.
The WBS helps the project manager and the team create the task schedule and a detailed
cost estimate for the project. By going through the WBS, the project manager and team
will have a pretty good idea of whether they have captured all the necessary tasks,
based on the project requirements, that need to happen to get the job done.
In this technique the complex project is divided into smaller pieces. The modules are
divided into smaller sub-modules, each sub-module is further divided into functionality,
and each functionality can be divided into sub-functionalities. After breaking down the
work, all functionality should be reviewed to check whether every piece is covered in the
WBS.
Using the WBS you can easily figure out all the tasks that need to be completed, and
because they are broken down into detailed tasks, estimating each detailed task is much
easier than estimating the overall complex project in one shot.
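As a small illustration of this idea, a WBS can be represented as a nested structure whose leaf tasks carry estimates that roll up to a project total. The module names and hour figures below are invented for the example, not taken from the text:

# Hypothetical WBS: modules -> sub-modules -> task: estimated hours (all values invented).
wbs = {
    "Login module": {
        "UI": {"design login form": 6, "validation messages": 4},
        "API": {"authentication endpoint": 8, "session handling": 6},
    },
    "Reports module": {
        "Export": {"CSV export": 5, "PDF export": 9},
    },
}

def total_hours(node):
    # Recursively sum the leaf estimates of the breakdown.
    if isinstance(node, dict):
        return sum(total_hours(child) for child in node.values())
    return node

print("Total estimated hours:", total_hours(wbs))  # 38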
Work Breakdown Structure has four key benefits:
- Work Breakdown Structure forces the team to create detailed steps. In the WBS, all steps
required to build or deliver the service are divided into detailed tasks by the project
manager, team and customer. This helps bring out assumptions and ambiguities, narrow the
scope of the project, and raise critical issues early on.
- Work Breakdown Structure helps to improve the schedule and budget. The WBS enables
you to make an effective schedule and good budget plans. Because all tasks are already
listed, it helps in generating a meaningful schedule and makes planning a reliable budget
easier.
- Work Breakdown Structure creates accountability. The level of detail in the task breakdown
makes it possible to assign a particular module or task to an individual, which makes it
easier to hold that person accountable for completing it. With detailed tasks in the WBS,
people cannot hide under the "cover of broadness."
- Work Breakdown Structure creation breeds commitment. The process of developing and
completing a WBS breeds excitement and commitment. Although the project manager will
often develop the high-level WBS, he will seek the participation of his core team to flesh
out the extreme detail of the WBS. This participation sparks involvement in the project.
3) Three Point Estimation:
Three point estimation is an estimation method based on statistical data. It is very similar
to the WBS technique: tasks are broken down into subtasks, and three types of estimates
are made for each sub-piece.
Optimistic Estimate (best case scenario, in which nothing goes wrong and all conditions
are optimal) = A
Most Likely Estimate (most likely duration; there may be some problems, but most things
will go right) = M
Pessimistic Estimate (worst case scenario, in which everything goes wrong) = B
Formula to find the value of the estimate: E = (A + 4M + B) / 6
Standard Deviation: SD = (B - A) / 6
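A quick sketch of these two formulas; the A, M and B values used here are made-up example numbers:

# Three point (PERT-style) estimation: E = (A + 4M + B) / 6, SD = (B - A) / 6.
def three_point_estimate(a, m, b):
    estimate = (a + 4 * m + b) / 6
    std_dev = (b - a) / 6
    return estimate, std_dev

# Example in hours: optimistic 10, most likely 16, pessimistic 28 (illustrative values).
e, sd = three_point_estimate(10, 16, 28)
print(f"estimate = {e:.1f} hours, standard deviation = {sd:.1f} hours")  # 17.0 and 3.0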
Nowadays, planning poker and Delphi estimation are the most popular test estimation
techniques.
4) Functional Point Method:
Function points are measured from a functional, or user, point of view. The measure is
independent of the computer language, capability, technology or development methodology
of the team. It is based on available documents such as the SRS, the design documents, etc.
In this FP technique we give a weightage to each functional point. Before starting the actual
estimation, the functional points are divided into three groups: Complex, Medium and
Simple. Based on similar projects and organization standards, we define an estimate per
function point.
Total Effort Estimate = Total Function Points * Estimate defined per Functional Point
Let's take a simple example to make this clearer:
Group      Weightage   Function Points   Total
Complex    5           5                 25
Medium     3           20                60
Simple     1           35                35

Total Function Points: 120
Estimate defined per point: 4.15
Total Estimated Effort (Person Hours): 498
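The arithmetic in the table can be reproduced with a few lines; the weightages, counts and the 4.15 estimate per point are taken directly from the example above:

# Function point estimation from the example table: total FP = sum(weightage * count).
groups = {"Complex": (5, 5), "Medium": (3, 20), "Simple": (1, 35)}  # name: (weightage, function points)

total_fp = sum(weight * count for weight, count in groups.values())
estimate_per_point = 4.15
total_effort = total_fp * estimate_per_point

print("Total function points:", total_fp)                       # 120
print("Total estimated effort (person hours):", total_effort)   # 498.0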
Advantages of the Functional Point Method:
- Estimates can be prepared at the pre-project stage.
- Because it is based on requirement specification documents, the method's reliability is
relatively high.
Disadvantages of Software Estimation Techniques:
- Estimates can be over- or under-stated due to hidden factors.
- They are not perfectly accurate.
- They are based on judgment and assumptions.
- Risk is involved.
- They may give false results.
- They are prone to error.
- Sometimes the estimate cannot be trusted.
Software Estimation Techniques Conclusion:
There may be other methods that can also be used effectively for project test estimation;
in this article we have seen the most popular software estimation techniques used in
project estimation. There cannot be a single hard and fast rule for estimating the testing
effort for a project. It is recommended to keep adding to your knowledge base of test
estimation methods and to revise estimation templates constantly based on new findings.
Source: http://www.softwaretestingclass.com/software-estimation-techniques/
16.1. What is the difference between a statistic and a parameter?
A statistic is a numerical characteristic of a sample, and a parameter is a numerical
characteristic of a population.
16.2. What is the symbol for the population mean?
The symbol is the Greek letter mu (i.e., µ).
16.3. What is the symbol for the population correlation coefficient?
The symbol is the Greek letter rho (i.e., ρ ).
16.4. What is the definition of a sampling distribution?
The sampling distribution is the theoretical probability distribution of the values of a statistic
that results when all possible random samples of a particular size are drawn from a
population.
16.5. How does the idea of repeated sampling relate to the concept of a sampling
distribution?
Repeated sampling involves drawing many or all possible samples from a population.
16.6. Which of the two types of estimation do you like the most, and why?
This is an opinion question.
 Point estimation is nice because it provides an exact point estimate of the population
value. It provides you with the single best guess of the value of the population
parameter.
 Interval estimation is nice because it allows you to make statements of confidence
that an interval will include the true population value.
16.7. What are the advantages of using interval estimation rather than point
estimation?
The problem with using a point estimate is that although it is the single best guess you can
make about the value of a population parameter, it is also usually wrong.
 Take a look at the sampling distribution of the mean on page 468 and note that in that
case, if you had guessed $50,000 as the correct value (and this WAS the
correct value in this case), you would be wrong most of the time.
 A major advantage of using interval estimation is that you provide a range of values
with a known probability of capturing the population parameter (e.g., if you obtain
from SPSS a 95% confidence interval, you can claim to have 95% confidence that it
will include the true population parameter).
 An interval estimate (i.e., a confidence interval) also helps one not to be so confident
that the population value is exactly equal to the point estimate. That is, it makes us
more careful in how we interpret our data and helps keep us in proper perspective.
 Actually, perhaps the best thing of all to do is to provide both the point estimate and
the interval estimate. For example, our best estimate of the population mean is the
value $32,640 (the point estimate) and our 95% confidence interval is $30,913.71 to
$34,366.29.
 By the way, note that the bigger your sample size, the more narrow the confidence
interval will be.
 If you want narrow (i.e., very precise) confidence intervals, then remember to include
a lot of participants in your research study.
16.8 What is a null hypothesis?
A null hypothesis is a statement about a population parameter. It usually predicts no
difference or no relationship in the population. The null hypothesis is the “status quo,” the
“nothing new,” or the “business as usual” hypothesis. It is the hypothesis that is directly
tested in hypothesis testing.
16.9. To whom is the researcher similar to in hypothesis testing: the defense attorney or
the prosecuting attorney? Why?
The researcher is similar to the prosecuting attorney in the sense that the researcher brings the
null hypothesis "to trial" when she believes there is probably strong evidence against the
null.
 Just as the prosecutor usually believes that the person on trial is not innocent, the
researcher usually believes that the null hypothesis is not true.
 In the court system the jury must assume (by law) that the person is innocent until the
evidence clearly calls this assumption into question; analogously, in hypothesis
testing the researcher must assume (in order to use hypothesis testing) that the null
hypothesis is true until the evidence calls this assumption into question.
16.10. What is the difference between a probability value and the significance level?
Basically in hypothesis testing the goal is to see if the probability value is less than or equal
to the significance level (i.e., is p ≤ alpha).
 The probability value (also called the p-value) is the probability of the result found in
your research study occurring (or an even more extreme result occurring), under
the assumption that the null hypothesis is true.
 That is, you assume that the null hypothesis is true and then see how often your
finding would occur if this assumption were true.
 The significance level (also called the alpha level) is the cutoff value the researcher
selects and then uses to decide when to reject the null hypothesis.
 Most researchers select the significance or alpha level of .05 to use in their research;
hence, they reject the null hypothesis when the p-value (which is obtained from the
computer printout) is less than or equal to .05.
16.11. Why do educational researchers usually use .05 as their significance level?
It has become part of the statistical hypothesis testing culture.
 It is a convention.
 It reflects a concern over making type I errors (i.e., wanting to avoid the situation
where you reject the null when it is true, that is, wanting to avoid “false positive”
errors).
 If you set the significance level at .05, then you will only reject a true null hypothesis
5% of the time (i.e., you will only make a type I error 5% of the time) in the long run.
16.12. State the two decision making rules of hypothesis testing.
 Rule one: If the p-value is less than or equal to the significance level then reject the
null hypothesis and conclude that the research finding is statistically significant.
 Rule two: If the p-value is greater than the significance level then you “fail to reject”
the null hypothesis and conclude that the finding is not statistically significant.
16.13. Do the following statements sound like typical null or alternative hypotheses? (A)
The coin is fair. (B) There is no difference between male and female incomes in the
population. (C) There is no correlation in the population. (D) The patient is not sick
(i.e., is well). (E) The defendant is innocent.
All of these sound like null hypotheses (i.e., the "nothing new" or "status quo"
hypothesis). We usually assume that a coin is fair in games of chance; when testing the
difference between male and female incomes in hypothesis testing we assume the null of no
difference; when testing the statistical significance of a correlation coefficient using
hypothesis testing, we assume that the correlation in the population is zero; in medical testing
we assume the person does not have the illness until the medical tests suggest otherwise; and
in our system of jurisprudence we assume that a defendant is innocent until the evidence
strongly suggests otherwise.
16.14. What is a Type I error? What is a Type II error? How can you minimize the risk
of both of these types of errors?
In hypothesis testing there are two possible errors we can make: Type I and Type II errors.
 A Type I error occurs when you reject a true null hypothesis (remember that when
the null hypothesis is true you hope to retain it).
 A Type II error occurs when you fail to reject a false null hypothesis (remember that
when the null hypothesis is false you hope to reject it).
 The best way to allow yourself to set a low alpha level (i.e., to have a small chance of
making a Type I error) and to have a good chance of rejecting the null when it is false
(i.e., to have a small chance of making a Type II error) is to increase the sample size.
 The key in hypothesis testing is to use a large sample in your research study rather
than a small sample!
 If you do reject your null hypothesis, then it is also essential that you determine
whether the size of the relationship is practically significant (see the next question).
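To make the meaning of the significance level concrete (see 16.11 above), here is a minimal simulation sketch. The population values and sample size are made-up; the point is only that, when the null hypothesis is true, a test at alpha = .05 rejects it in roughly 5% of samples:

import numpy as np
from scipy import stats

# Draw many samples from a population where the null hypothesis (mu = 50) is TRUE,
# and count how often a one-sample t-test at alpha = .05 rejects it (Type I errors).
rng = np.random.default_rng(42)
alpha, trials, n = 0.05, 10_000, 30

rejections = 0
for _ in range(trials):
    sample = rng.normal(loc=50, scale=10, size=n)        # null is true: population mean really is 50
    t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
    if p_value <= alpha:
        rejections += 1

print("Observed Type I error rate:", rejections / trials)  # close to 0.05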
16.15. If a finding is statistically significant, why is it also important to consider
practical significance?
When your finding is statistically significant all you know is that your result would be
unlikely if the null hypothesis were true and that you therefore have decided to reject your
null hypothesis and to go with your alternative hypothesis. Unfortunately, this does not tell
you anything about how big of an effect is present or how important the effect would be for
practical purposes. That’s why once you determine that a finding is statistically significant
you must next use one of the effect size indicators to tell you how strong the relationship is.
Think about this effect size and the nature of your variables (e.g., is the IV easily manipulated
in the real world? Will the amount of change relative to the costs in bringing this about be
reasonable?).
 Once you consider these additional issues beyond statistical significance, you will be
ready to make a decision about the practical significance of your study results.
16.16. How do you write the null and alternative hypotheses for each of the following:
(A) The t-test for independent samples, (B) One-way analysis of variance, (C) The t-test
for correlation coefficients?, (D) The t-test for a regression coefficient.
In each of these, the null hypothesis says there is no relationship and the alternative
hypothesis says that there is a relationship.
(A)
In this case the null hypothesis says that the two population means (i.e., mu
one and mu two) are equal; the alternative hypothesis says that they are not
equal.
(B)
In this case the null hypothesis says that all of the population means are equal;
the alternative hypothesis says that at least two of the means are not equal.
(C)
In this case the null hypothesis says that the population correlation (i.e., rho)
is zero; the alternative hypothesis says that it is not equal to zero.
(D)
In this case the null hypothesis says that the population regression coefficient
(beta) is zero, and the alternative says that it is not equal to zero.
You can see examples of these null and alternative hypotheses written out in symbolic form for
cases A, B, C, and D in the following table.
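The table itself is not reproduced in this transcript; the standard symbolic forms, consistent with the descriptions in (A) through (D) above, are:

\begin{aligned}
\text{(A) Independent-samples } t\text{-test:} \quad & H_0: \mu_1 = \mu_2, \quad H_1: \mu_1 \neq \mu_2 \\
\text{(B) One-way ANOVA:} \quad & H_0: \mu_1 = \mu_2 = \cdots = \mu_k, \quad H_1: \text{at least two means differ} \\
\text{(C) } t\text{-test for a correlation:} \quad & H_0: \rho = 0, \quad H_1: \rho \neq 0 \\
\text{(D) } t\text{-test for a regression coefficient:} \quad & H_0: \beta = 0, \quad H_1: \beta \neq 0
\end{aligned}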
Hypothesis Testing for Means & Proportions
Introduction
This is the first of three modules that will address the second area of statistical inference,
hypothesis testing, in which a specific statement or hypothesis is generated about a
population parameter, and sample statistics are used to assess the likelihood that the
hypothesis is true. The hypothesis is based on available information and the investigator's
belief about the population parameters.
The process of hypothesis testing involves setting up two competing hypotheses, the null
hypothesis and the alternate hypothesis. One selects a random sample (or multiple samples
when there are more comparison groups), computes summary statistics and then assesses
the likelihood that the sample data support the research or alternative hypothesis. Similar to
estimation, the process of hypothesis testing is based on probability theory and the Central
Limit Theorem.
This module will focus on hypothesis testing for means and proportions. The next two
modules in this series will address analysis of variance and chi-squared tests.
Learning Objectives
After completing this module, the student will be able to:
1. Define null and research hypothesis, test statistic, level of significance and decision
rule
2. Distinguish between Type I and Type II errors and discuss the implications of each
3. Explain the difference between one and two sided tests of hypothesis
4. Estimate and interpret p-values
5. Explain the relationship between confidence interval estimates and p-values in
drawing inferences
6. Differentiate hypothesis testing procedures based on type of outcome variable and
number of samples
Introduction to Hypothesis Testing
Techniques for Hypothesis Testing
The techniques for hypothesis testing depend on
 the type of outcome variable being analyzed (continuous, dichotomous, discrete),
 the number of comparison groups in the investigation, and
 whether the comparison groups are independent (i.e., physically separate, such as
men versus women) or dependent (i.e., matched or paired, such as pre- and
post-assessments on the same participants).
In estimation we focused explicitly on techniques for one and two samples and discussed
estimation for a specific parameter (e.g., the mean or proportion of a population), for
differences (e.g., difference in means, the risk difference) and ratios (e.g., the relative risk
and odds ratio). Here we will focus on procedures for one and two samples when the
outcome is either continuous (and we focus on means) or dichotomous (and we focus on
proportions).
General Approach: A Simple Example
The Centers for Disease Control (CDC) reported on trends in weight, height and body mass
index from the 1960s through 2002 [1]. The general trend was that Americans were much
heavier and slightly taller in 2002 as compared to 1960; both men and women gained
approximately 24 pounds, on average, between 1960 and 2002. In 2002, the mean weight
for men was reported at 191 pounds. Suppose that an investigator hypothesizes that weights
are even higher in 2006 (i.e., that the trend continued over the subsequent 4 years).
The research hypothesis is that the mean weight in men in 2006 is more than 191 pounds.
The null hypothesis is that there is no change in weight, and therefore the mean weight is
still 191 pounds in 2006.
Null Hypothesis
H0: μ= 191
(no change)
Research Hypothesis
H1: μ> 191
(investigator's belief)
In order to test the hypotheses, we select a random sample of American males in 2006 and
measure their weights. Suppose we have resources available to recruit n=100 men into our
sample. We weigh each participant and compute summary statistics on the sample data.
Suppose in the sample we determine the following: n = 100, sample mean = 197.1 pounds,
and sample standard deviation s = 25.6 pounds.
Do the sample data support the null or research hypothesis? The sample mean of 197.1 is
numerically higher than 191. However, is this difference more than would be expected by
chance? In hypothesis testing, we assume that the null hypothesis holds until proven
otherwise. We therefore need to determine the likelihood of observing a sample mean of
197.1 or higher when the true population mean is 191 (i.e., if the null hypothesis is true or
under the null hypothesis). We can compute this probability using the Central Limit Theorem.
Specifically,

Z = (197.1 - 191) / (25.6 / √100) = 6.1 / 2.56 = 2.38, and P(Z > 2.38) = 0.0087.

(Notice that we use the sample standard deviation in computing the Z score. This is
generally an appropriate substitution as long as the sample size is large, n > 30.) Thus, there
is less than a 1% probability of observing a sample mean as large as 197.1 when the true
population mean is 191. Do you think that the null hypothesis is likely true? Based on how
unlikely it is to observe a sample mean of 197.1 under the null hypothesis (i.e., <1%
probability), we might infer, from our data, that the null hypothesis is probably not true.
Suppose that the sample data had turned out differently. Suppose that we instead observed
the following in 2006: n = 100, sample mean = 192.1 pounds, and s = 25.6 pounds.
How likely is it to observe a sample mean of 192.1 or higher when the true population mean
is 191 (i.e., if the null hypothesis is true)? We can again compute this probability using the
Central Limit Theorem. Specifically,

Z = (192.1 - 191) / (25.6 / √100) = 1.1 / 2.56 = 0.43, and P(Z > 0.43) = 0.334.
There is a 33.4% probability of observing a sample mean as large as 192.1 when the true
population mean is 191. Do you think that the null hypothesis is likely true?
Neither of the sample means that we obtained allows us to know with certainty whether the
null hypothesis is true or not. However, our computations suggest that, if the null hypothesis
were true, the probability of observing a sample mean >197.1 is less than 1%. In contrast, if
the null hypothesis were true, the probability of observing a sample mean >192.1 is about
33%. We can't know whether the null hypothesis is true, but the sample that provided a
mean value of 197.1 provides much stronger evidence in favor of rejecting the null
hypothesis, than the sample that provided a mean value of 192.1. Note that this does not
mean that a sample mean of 192.1 indicates that the null hypothesis is true; it just doesn't
provide compelling evidence to reject it.
In essence, hypothesis testing is a procedure to compute a probability that reflects the
strength of the evidence (based on a given sample) for rejecting the null hypothesis. In
hypothesis testing, we determine a threshold or cut-off point (called the critical value) to
decide when to believe the null hypothesis and when to believe the research hypothesis. It is
important to note that it is possible to observe any sample mean when the null hypothesis
is true (in this example, when the true population mean is 191), but some sample means
are very unlikely.
Based on the two samples above it would seem reasonable to believe the research
hypothesis when the sample mean is 197.1, but to believe the null hypothesis when the
sample mean is 192.1. What we need is a threshold value such that if the sample mean is
above that threshold then we believe that H1 is true, and if the sample mean is below that
threshold then we believe that H0 is true. The difficulty in determining a threshold for the
sample mean is that it depends on the scale of measurement. In this
example, the threshold, sometimes called the critical value, might be 195 (i.e., if the sample
mean is 195 or more then we believe that H1 is true and if the sample mean is less than 195
then we believe that H0 is true). Suppose we are interested in assessing an increase in
blood pressure over time, the critical value will be different because blood pressures are
measured in millimeters of mercury (mmHg) as opposed to in pounds. In the following we
will explain how the critical value is determined and how we handle the issue of scale.
First, to address the issue of scale in determining the critical value, we convert our sample
data (in particular the sample mean) into a Z score. We know from the module on probability
that the center of the Z distribution is zero and extreme values are those that exceed 2 or fall
below -2. Z scores above 2 and below -2 represent approximately 5% of all Z values. If the
observed sample mean is close to the mean specified in H0 (here μ = 191), then Z will be
close to zero. If the observed sample mean is much larger than the mean specified in H0,
then Z will be large.
In hypothesis testing, we select a critical value from the Z distribution. This is done by first
determining what is called the level of significance, denoted α ("alpha"). What we are doing
here is drawing a line at extreme values. The level of significance is the probability that we
reject the null hypothesis (in favor of the alternative) when it is actually true and is also called
the Type I error rate.
α = Level of significance = P(Type I error) = P(Reject H0 | H0 is true).
Because α is a probability, it ranges between 0 and 1. The most commonly used value in the
medical literature for α is 0.05, or 5%. Thus, if an investigator selects α=0.05, then they are
allowing a 5% probability of incorrectly rejecting the null hypothesis in favor of the alternative
when the null is in fact true. Depending on the circumstances, one might choose to use a
level of significance of 1% or 10%. For example, if an investigator wanted to reject the null
only if there were even stronger evidence than that ensured with α=0.05, they could choose
α = 0.01 as their level of significance. The typical values for α are 0.01, 0.05 and 0.10, with
α=0.05 the most commonly used value.
Suppose in our weight study we select α=0.05. We need to determine the value of Z that
holds 5% of the values above it (see below).
The critical value of Z for α =0.05 is Z = 1.645 (i.e., 5% of the distribution is above Z=1.645).
With this value we can set up what is called our decision rule for the test. The rule is to reject
H0 if the Z score is 1.645 or more.
With the first sample we have Z = (197.1 - 191) / (25.6 / √100) = 2.38.
Because 2.38 > 1.645, we reject the null hypothesis. (The same conclusion can be drawn by
comparing the 0.0087 probability of observing a sample mean as extreme as 197.1 to the
level of significance of 0.05. If the observed probability is smaller than the level of
significance we reject H0). Because the Z score exceeds the critical value, we conclude that
the mean weight for men in 2006 is more than 191 pounds, the value reported in 2002. If we
observed the second sample (i.e., sample mean =192.1), we would not be able to reject the
null hypothesis because the Z score is 0.43 which is not in the rejection region (i.e., the
region in the tail end of the curve above 1.645). With the second sample we do not have
sufficient evidence (because we set our level of significance at 5%) to conclude that weights
have increased. Again, the same conclusion can be reached by comparing probabilities. The
probability of observing a sample mean as extreme as 192.1 is 33.4% which is not below our
5% level of significance.
Hypothesis Testing: Upper-, Lower-, and Two-Tailed
Tests
The procedure for hypothesis testing is based on the ideas described above. Specifically, we
set up competing hypotheses, select a random sample from the population of interest and
compute summary statistics. We then determine whether the sample data supports the null
or alternative hypotheses. The procedure can be broken down into the following five steps.

Step 1. Set up hypotheses and select the level of significance α.
H0: Null hypothesis (no change, no difference); H1: Research hypothesis
(investigator's belief); α =0.05
Upper-tailed, Lower-tailed, Two-tailed Tests
The research or alternative hypothesis can take one of three forms. An investigator might believe that the parameter has
increased, decreased or changed. For example, an investigator might hypothesize:
1. H1: μ > μ0, where μ0 is the comparator or null value (e.g., μ0 = 191 in our example about weight in men) and an
increase is hypothesized; this type of test is called an upper-tailed test;
2. H1: μ < μ0, where a decrease is hypothesized; this is called a lower-tailed test; or
3. H1: μ ≠ μ0, where a difference is hypothesized; this is called a two-tailed test.
The exact form of the research hypothesis depends on the investigator's belief about the parameter of interest and whether
it has possibly increased, decreased or is different from the null value. The research hypothesis is set up by the investigator
before any data are collected.

Step 2. Select the appropriate test statistic.
The test statistic is a single number that summarizes the sample information. An example
of a test statistic is the Z statistic, computed as follows:

Z = (x̄ - μ0) / (s / √n).
When the sample size is small, we will use t statistics (just as we did when constructing
confidence intervals for small samples). As we present each scenario, alternative test
statistics are provided along with conditions for their appropriate use.

Step 3. Set up decision rule.
The decision rule is a statement that tells under what circumstances to reject the null
hypothesis. The decision rule is based on specific values of the test statistic (e.g., reject H0 if
Z > 1.645). The decision rule for a specific test depends on 3 factors: the research or
alternative hypothesis, the test statistic and the level of significance. Each is discussed
below.
1. The decision rule depends on whether an upper-tailed, lower-tailed, or two-tailed test
is proposed. In an upper-tailed test the decision rule has investigators reject H0 if the
test statistic is larger than the critical value. In a lower-tailed test the decision rule has
investigators reject H0 if the test statistic is smaller than the critical value. In a two-tailed test the decision rule has investigators reject H0 if the test statistic is extreme,
either larger than an upper critical value or smaller than a lower critical value.
2. The exact form of the test statistic is also important in determining the decision rule. If
the test statistic follows the standard normal distribution (Z), then the decision rule
will be based on the standard normal distribution. If the test statistic follows the t
distribution, then the decision rule will be based on the t distribution. The appropriate
critical value will be selected from the t distribution again depending on the specific
alternative hypothesis and the level of significance.
3. The third factor is the level of significance. The level of significance which is selected
in Step 1 (e.g., α =0.05) dictates the critical value. For example, in an upper tailed Z
test, if α =0.05 then the critical value is Z=1.645.
The following figures illustrate the rejection regions defined by the decision rule for upper-,
lower- and two-tailed Z tests with α=0.05. Notice that the rejection regions are in the upper,
lower and both tails of the curves, respectively. The decision rules are written below each
figure.
Rejection Region for Upper-Tailed Z Test (H1: μ > μ0) with α = 0.05
The decision rule is: Reject H0 if Z > 1.645.
Critical values of Z for upper-tailed tests:
α        Z
0.10     1.282
0.05     1.645
0.025    1.960
0.010    2.326
0.005    2.576
0.001    3.090
0.0001   3.719

Rejection Region for Lower-Tailed Z Test (H1: μ < μ0) with α = 0.05
The decision rule is: Reject H0 if Z < -1.645.
Critical values of Z for lower-tailed tests:
α        Z
0.10     -1.282
0.05     -1.645
0.025    -1.960
0.010    -2.326
0.005    -2.576
0.001    -3.090
0.0001   -3.719

Rejection Region for Two-Tailed Z Test (H1: μ ≠ μ0) with α = 0.05
The decision rule is: Reject H0 if Z < -1.960 or if Z > 1.960.
Critical values of Z for two-tailed tests:
α        Z
0.20     1.282
0.10     1.645
0.05     1.960
0.010    2.576
0.001    3.291
0.0001   3.819
The complete table of critical values of Z for upper, lower and two-tailed tests can be found
in the table of Z values to the right in "Other Resources."
Critical values of t for upper, lower and two-tailed tests can be found in the table of t values
in "Other Resources."

Step 4. Compute the test statistic.
Here we compute the test statistic by substituting the observed sample data into the test
statistic identified in Step 2.

Step 5. Conclusion.
The final conclusion is made by comparing the test statistic (which is a summary of the
information observed in the sample) to the decision rule. The final conclusion will be either to
reject the null hypothesis (because the sample data are very unlikely if the null hypothesis is
true) or not to reject the null hypothesis (because the sample data are not very unlikely).
If the null hypothesis is rejected, then an exact significance level is computed to describe the
likelihood of observing the sample data assuming that the null hypothesis is true. The exact
level of significance is called the p-value and it will be less than the chosen level of
significance if we reject H0.
Statistical computing packages provide exact p-values as part of their standard output for
hypothesis tests. In fact, when using a statistical computing package, the steps outlined
above can be abbreviated. The hypotheses (step 1) should always be set up in advance of
any analysis and the significance criterion should also be determined (e.g., α =0.05).
Statistical computing packages will produce the test statistic (usually reporting the test
statistic as t) and a p-value. The investigator can then determine statistical significance using
the following: If p < α then reject H0.
Things to Remember When Interpreting P-Values
1. P-values summarize statistical significance and do not address clinical significance. There are instances where results
are both statistically and clinically significant, and others where they are one or the other but not both. This is because
p-values depend upon both the magnitude of the effect and the precision of the estimate (the sample size). When the
sample size is large, results can reach statistical significance even when the effect is small and clinically unimportant.
Conversely, with small sample sizes, results can fail to reach statistical significance yet the effect is large and potentially
important. It is therefore essential to assess both the statistical and the clinical significance of results.
2. Statistical tests allow us to draw conclusions of significance or not based on a comparison of the p-value to the selected
level of significance. The conclusion is based on the selected level of significance (α) and could change with a different
level of significance. Statistically significant results should also be examined for clinical importance.
3. When conducting any statistical analysis, there is always a possibility of an incorrect conclusion. With multiple tests,
that possibility is increased. Investigators should only conduct the statistical analyses (e.g., tests) of interest and not
all possible tests.
4. Many investigators inappropriately believe that the p-value represents the probability that the null hypothesis is true.
P-values are computed assuming that the null hypothesis is true. The p-value is the probability that the data could
deviate from the null hypothesis as much as they did or more; it measures the compatibility of the data with the null
hypothesis, not the probability that the null hypothesis is correct.
5. Statistical significance does not take into account the possibility of bias or confounding; these issues must always be
investigated.
6. Evidence-based decision making is important in public health and in medicine, but decisions are rarely made on the
basis of a single study. It is always important to build a body of evidence to support findings.
We now use the five-step procedure to test the research hypothesis that the mean weight in
men in 2006 is more than 191 pounds. We will assume the sample data are as follows:
n=100, x̄ = 197.1 and s = 25.6.

Step 1. Set up hypotheses and determine level of significance
H0: μ = 191 H1: μ > 191
α =0.05
The research hypothesis is that weights have increased, and therefore an upper tailed test is
used.

Step 2. Select the appropriate test statistic.
Because the sample size is large (n>30) the appropriate test statistic is
Z = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
In this example, we are performing an upper tailed test (H1: μ> 191), with a Z test statistic
and selected α =0.05. Reject H0 if Z > 1.645.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2:
Z = (197.1 - 191) / (25.6/√100) = 6.1/2.56 = 2.38.
Step 5. Conclusion.
We reject H0 because 2.38 > 1.645. We have statistically significant evidence at α=0.05 to
show that the mean weight in men in 2006 is more than 191 pounds. Because we rejected
the null hypothesis, we now approximate the p-value which is the likelihood of observing the
sample data if the null hypothesis is true. An alternative definition of the p-value is the
smallest level of significance where we can still reject H0. In this example, we observed
Z=2.38 and for α=0.05, the critical value was 1.645. Because 2.38 exceeded 1.645 we
rejected H0. In our conclusion we reported a statistically significant increase in mean weight
at a 5% level of significance. Using the table of critical values for upper tailed tests, we can
approximate the p-value. If we select α=0.025, the critical value is 1.96, and we still reject
H0 because 2.38 > 1.960. If we select α=0.010 the critical value is 2.326, and we still reject
H0 because 2.38 > 2.326. However, if we select α=0.005, the critical value is 2.576, and we
cannot reject H0 because 2.38 < 2.576. Therefore, the smallest α where we still reject H0 is
0.010. This is the p-value. A statistical computing package would produce a more precise p-
value which would be in between 0.005 and 0.010. Here we are approximating the p-value
and would report p < 0.010.
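For reference, the exact upper-tailed p-value for Z = 2.38 can be read off the standard normal distribution; a quick check in Python, assuming SciPy is available:

```python
from scipy.stats import norm

p_value = norm.sf(2.38)     # upper-tail area above Z = 2.38
print(round(p_value, 4))    # 0.0087 -- between 0.005 and 0.010, so we report p < 0.010
```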
Type I and Type II Errors
In all tests of hypothesis, there are two types of errors that can be committed. The first is
called a Type I error and refers to the situation where we incorrectly reject H0 when in fact it
is true. This is also called a false positive result (as we incorrectly conclude that the research
hypothesis is true when in fact it is not). When we run a test of hypothesis and decide to
reject H0 (e.g., because the test statistic exceeds the critical value in an upper tailed test)
then either we make a correct decision because the research hypothesis is true or we
commit a Type I error. The different conclusions are summarized in the table below. Note
that we will never know whether the null hypothesis is really true or false (i.e., we will never
know which row of the following table reflects reality).
Conclusion in Test of Hypothesis
                 Do Not Reject H0      Reject H0
H0 is True       Correct Decision      Type I Error
H0 is False      Type II Error         Correct Decision
In the first step of the hypothesis test, we select a level of significance, α, and α= P(Type I
error). Because we purposely select a small value for α, we control the probability of
committing a Type I error. For example, if we select α=0.05, and our test tells us to reject H0,
then there is a 5% probability that we commit a Type I error. Most investigators are very
comfortable with this and are confident when rejecting H0 that the research hypothesis is
true (as it is the more likely scenario when we reject H0).
When we run a test of hypothesis and decide not to reject H0 (e.g., because the test statistic
is below the critical value in an upper tailed test) then either we make a correct decision
because the null hypothesis is true or we commit a Type II error. Beta (β) represents the
probability of a Type II error and is defined as follows: β=P(Type II error) = P(Do not Reject
H0 | H0 is false). Unfortunately, we cannot choose β to be small (e.g., 0.05) to control the
probability of committing a Type II error because β depends on several factors including the
sample size, α, and the research hypothesis. When we do not reject H0, it may be very likely
that we are committing a Type II error (i.e., failing to reject H0 when in fact it is false).
Therefore, when tests are run and the null hypothesis is not rejected we often make a weak
concluding statement allowing for the possibility that we might be committing a Type II error.
If we do not reject H0, we conclude that we do not have significant evidence to show that
H1 is true. We do not conclude that H0 is true.
The most common reason for
a Type II error is a small
sample size.
Tests with One Sample, Continuous Outcome
Hypothesis testing applications with a continuous outcome variable in a single population are
performed according to the five-step procedure outlined above. A key component is setting
up the null and research hypotheses. The objective is to compare the mean in a single
population to a known mean (μ0). The known value is generally derived from another study or
report, for example a study in a similar, but not identical, population or a study performed
some years ago. The latter is called a historical control. It is important in setting up the
hypotheses in a one sample test that the mean specified in the null hypothesis is a fair and
reasonable comparator. This will be discussed in the examples that follow.
In one sample tests for a continuous outcome, we set up our hypotheses against an
appropriate comparator. We select a sample and compute descriptive statistics on the
sample data - including the sample size (n), the sample mean (x̄), and the sample
standard deviation (s). We then determine the appropriate test statistic (Step 2) for the
hypothesis test. The formulas for test statistics depend on the sample size and are given
below.
Test Statistics for Testing H0: μ = μ0
Z = (x̄ - μ0) / (s/√n)    if n > 30
t = (x̄ - μ0) / (s/√n)    if n < 30, where df = n-1
Note that statistical computing packages will use the t statistic exclusively and make the
necessary adjustments for comparing the test statistic to appropriate values from probability
tables to produce a p-value.
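To illustrate how the formulas above translate into code, here is a small sketch; the helper name and interface are our own, not part of the module:

```python
import math

def one_sample_test_statistic(xbar, mu0, s, n):
    """Return (statistic, df) for testing H0: mu = mu0.

    The same formula (xbar - mu0) / (s / sqrt(n)) is used in both cases; it is
    compared to Z critical values when n > 30 and to t critical values with
    df = n - 1 otherwise.
    """
    statistic = (xbar - mu0) / (s / math.sqrt(n))
    return statistic, n - 1

# Weight example from above: n=100, xbar=197.1, s=25.6, mu0=191
z, _ = one_sample_test_statistic(197.1, 191, 25.6, 100)
print(round(z, 2))  # 2.38
```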
Example:
The National Center for Health Statistics (NCHS) published a report in 2005 entitled Health,
United States, containing extensive information on major trends in the health of Americans.
Data are provided for the US population as a whole and for specific ages, sexes and
races. The NCHS report indicated that in 2002 Americans paid an average of $3,302 per
year on health care and prescription drugs. An investigator hypothesizes that in 2005
expenditures have decreased primarily due to the availability of generic drugs. To test the
hypothesis, a sample of 100 Americans are selected and their expenditures on health care
and prescription drugs in 2005 are measured. The sample data are summarized as follows:
n=100, x̄ = $3,190 and s = $890. Is there statistical evidence of a reduction in expenditures
on health care and prescription drugs in 2005? Is the sample mean of $3,190 evidence of a
true reduction in the mean or is it within chance fluctuation? We will run the test using the
five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ = 3,302 H1: μ < 3,302
α =0.05
The research hypothesis is that expenditures have decreased, and therefore a
lower-tailed test is used.

Step 2. Select the appropriate test statistic.
Because the sample size is large (n>30) the appropriate test statistic is
Z = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
This is a lower tailed test, using a Z statistic and a 5% level of
significance. Reject H0 if Z < -1.645.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
Z = (3,190 - 3,302) / (890/√100) = -112/89 = -1.26.

Step 5. Conclusion.
We do not reject H0 because -1.26 > -1.645. We do not have statistically
significant evidence at α=0.05 to show that the mean expenditures on health
care and prescription drugs are lower in 2005 than the mean of $3,302 reported
in 2002.
Recall that when we fail to reject H0 in a test of hypothesis, either the null hypothesis is
true (here the mean expenditures in 2005 are the same as those in 2002 and equal to
$3,302) or we committed a Type II error (i.e., we failed to reject H0 when in fact it is false). In
summarizing this test, we conclude that we do not have sufficient evidence to reject H0. We
do not conclude that H0 is true, because there may be a moderate to high probability that we
committed a Type II error. It is possible that the sample size is not large enough to detect a
difference in mean expenditures.
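As a quick arithmetic check of this example, here is a sketch assuming SciPy is available for the exact lower-tailed p-value:

```python
import math
from scipy.stats import norm

z = (3190 - 3302) / (890 / math.sqrt(100))
print(round(z, 2))            # -1.26
print(round(norm.cdf(z), 3))  # lower-tailed p-value of about 0.104, well above 0.05
```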
Example.
The NCHS reported that the mean total cholesterol level in 2002 for all adults was 203. Total
cholesterol levels in participants who attended the seventh examination of the Offspring in
the Framingham Heart Study are summarized as follows: n=3,310, x̄ = 200.3, and s = 36.8.
Is there statistical evidence of a difference in mean cholesterol levels in the Framingham
Offspring?
Here we want to assess whether the sample mean of 200.3 in the Framingham sample is
statistically significantly different from 203 (i.e., beyond what we would expect by chance).
We will run the test using the five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ= 203 H1: μ≠ 203
α=0.05
The research hypothesis is that cholesterol levels are different in the
Framingham Offspring, and therefore a two-tailed test is used.

Step 2. Select the appropriate test statistic.
Because the sample size is large (n>30) the appropriate test statistic is
Z = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
This is a two-tailed test, using a Z statistic and a 5% level of significance. Reject
H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
Z = (200.3 - 203) / (36.8/√3,310) = -2.7/0.64 = -4.22.

Step 5. Conclusion.
We reject H0 because -4.22 < -1.960. We have statistically significant evidence
at α=0.05 to show that the mean total cholesterol level in the Framingham
Offspring is different from the national average of 203 reported in
2002. Because we reject H0, we also approximate a p-value. Using the two-sided significance levels, p < 0.0001.
Statistical Significance versus Clinical (Practical) Significance
This example raises an important concept of statistical versus clinical or practical
significance. From a statistical standpoint, the total cholesterol levels in the Framingham
sample are highly statistically significantly different from the national average with p < 0.0001
(i.e., there is less than a 0.01% chance that we are incorrectly rejecting the null hypothesis).
However, the sample mean in the Framingham Offspring study is 200.3, less than 3 units
different from the national mean of 203. The reason that the data are so highly statistically
significant is due to the very large sample size. It is always important to assess both
statistical and clinical significance of data. This is particularly relevant when the sample size
is large. Is a 3 unit difference in total cholesterol a meaningful difference?
Example
Consider again the NCHS-reported mean total cholesterol level in 2002 for all adults of 203.
Suppose a new drug is proposed to lower total cholesterol. A study is designed to evaluate
the efficacy of the drug in lowering cholesterol. Fifteen patients are enrolled in the study
and asked to take the new drug for 6 weeks. At the end of 6 weeks, each patient's total
cholesterol level is measured and the sample statistics are as follows: n=15, x̄ = 195.9 and
s=28.7. Is there statistical evidence of a reduction in mean total cholesterol in patients after
using the new drug for 6 weeks? We will run the test using the five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ= 203 H1: μ< 203

α=0.05
Step 2. Select the appropriate test statistic.
Because the sample size is small (n<30) the appropriate test statistic is
t = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
This is a lower tailed test, using a t statistic and a 5% level of significance. In
order to determine the critical value of t, we need degrees of freedom, df,
defined as df=n-1. In this example df=15-1=14. The critical value for a lower
tailed test with df=14 and α=0.05 is -2.145 and the decision rule is as
follows: Reject H0 if t < -2.145.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
t = (195.9 - 203) / (28.7/√15) = -7.1/7.41 = -0.96.

Step 5. Conclusion.
We do not reject H0 because -0.96 > -2.145. We do not have statistically
significant evidence at α=0.05 to show that the mean total cholesterol level is
lower than the national mean in patients taking the new drug for 6 weeks. Again,
because we failed to reject the null hypothesis we make a weaker concluding
statement allowing for the possibility that we may have committed a Type II error
(i.e., failed to reject H0 when in fact the drug is efficacious).
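As a quick arithmetic check, the test statistic for this small-sample example can be reproduced as follows (a sketch; only the statistic is computed, and the -2.145 it is compared against is the critical value quoted from the module's t table):

```python
import math

# One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)) with n = 15
t_stat = (195.9 - 203) / (28.7 / math.sqrt(15))
print(round(t_stat, 2))  # -0.96, which does not fall below the module's critical value of -2.145
```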
This example raises an important issue in terms of study design. In this example we assumed in the
null hypothesis that the mean cholesterol level is 203. This is taken to be the mean cholesterol level in
patients without treatment. Is this an appropriate comparator? Alternative and potentially more efficient
study designs to evaluate the effect of the new drug could involve two treatment groups, where one
group receives the new drug and the other does not, or we could measure each patient's baseline or
pre-treatment cholesterol level and then assess changes from baseline to 6 weeks post-treatment.
These designs are discussed in the sections that follow.
Tests with One Sample, Dichotomous Outcome
Hypothesis testing applications with a dichotomous outcome variable in a single population
are also performed according to the five-step procedure. Similar to tests for means, a key
component is setting up the null and research hypotheses. The objective is to compare the
proportion of successes in a single population to a known proportion (p0). That known
proportion is generally derived from another study or report and is sometimes called a
historical control. It is important in setting up the hypotheses in a one sample test that the
proportion specified in the null hypothesis is a fair and reasonable comparator.
In one sample tests for a dichotomous outcome, we set up our hypotheses against an
appropriate comparator. We select a sample and compute descriptive statistics on the
sample data. Specifically, we compute the sample size (n) and the sample proportion which
is computed by taking the ratio of the number of successes (x) to the sample size, p̂ = x/n.
We then determine the appropriate test statistic (Step 2) for the hypothesis test. The formula
for the test statistic is given below.
Test Statistic for Testing H0: p = p0
Z = (p̂ - p0) / √( p0(1-p0)/n )    if min(np0, n(1-p0)) > 5
The formula above is appropriate for large samples, defined when the smaller of np0 and
n(1-p0) is at least 5. This is similar, but not identical, to the condition required for appropriate
use of the confidence interval formula for a population proportion, namely min(np̂, n(1-p̂)) > 5.
Here we use the proportion specified in the null hypothesis as the true proportion of
successes rather than the sample proportion. If we fail to satisfy the condition, then
alternative procedures, called exact methods must be used to test the hypothesis about the
population proportion.
Example
The NCHS report indicated that in 2002 the prevalence of cigarette smoking among
American adults was 21.1%. Data on prevalent smoking in n=3,536 participants who
attended the seventh examination of the Offspring in the Framingham Heart Study indicated
that 482/3,536 = 13.6% of the respondents were currently smoking at the time of the exam.
Suppose we want to assess whether the prevalence of smoking is lower in the Framingham
Offspring sample given the focus on cardiovascular health in that community. Is there
evidence of a statistically lower prevalence of smoking in the Framingham Offspring study as
compared to the prevalence among all Americans?

Step 1. Set up hypotheses and determine level of significance
H0: p = 0.211 H1: p < 0.211

α=0.05
Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to
check min(np0, n(1-p0)) = min( 3,536(0.211), 3,536(1-0.211))=min(746,
2790)=746. The sample size is more than adequate so the following formula can
be used:
Z = (p̂ - p0) / √( p0(1-p0)/n ).

Step 3. Set up decision rule.
This is a lower tailed test, using a Z statistic and a 5% level of significance.
Reject H0 if Z < -1.645.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. The sample proportion is p̂ = 482/3,536 = 0.136, and
Z = (0.136 - 0.211) / √( 0.211(1 - 0.211)/3,536 ) = -10.93.

Step 5. Conclusion.
We reject H0 because -10.93 < -1.645. We have statistically significant evidence
at α=0.05 to show that the prevalence of smoking in the Framingham Offspring
is lower than the prevalence nationally (21.1%). Here, p < 0.0001.
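As a numerical check of this example, here is a sketch assuming SciPy is available; the sample proportion is rounded to 0.136 as in the text:

```python
import math
from scipy.stats import norm

n, p0 = 3536, 0.211
p_hat = 0.136                      # 482/3,536, rounded as in the text

# Large-sample condition: min(n*p0, n*(1-p0)) = min(746, 2790), which exceeds 5
assert min(n * p0, n * (1 - p0)) > 5

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
print(round(z, 2))            # -10.93
print(norm.cdf(z) < 0.0001)   # True: the lower-tailed p-value is far smaller than 0.0001
```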
The NCHS report indicated that in 2002, 75% of children aged 2 to 17 saw a dentist in the past year. An
investigator wants to assess whether use of dental services is similar in children living in the city of Boston. A
sample of 125 children aged 2 to 17 living in Boston are surveyed and 64 reported seeing a dentist over the
past 12 months. Is there a significant difference in use of dental services between children living in Boston and
the national data?
Calculate this on your own before checking the answer.
Tests with Two Independent Samples, Continuous
Outcome
There are many applications where it is of interest to compare two independent groups with
respect to their mean scores on a continuous outcome. Here we compare means between
groups, but rather than generating an estimate of the difference, we will test whether the
observed difference (increase, decrease or difference) is statistically significant or not.
Remember, that hypothesis testing gives an assessment of statistical significance, whereas
estimation gives an estimate of effect and both are important.
Here we discuss the comparison of means when the two comparison groups are
independent or physically separate. The two groups might be determined by a particular
attribute (e.g., sex, diagnosis of cardiovascular disease) or might be set up by the
investigator (e.g., participants assigned to receive an experimental treatment or placebo).
The first step in the analysis involves computing descriptive statistics on each of the two
samples. Specifically, we compute the sample size, mean and standard deviation in each
sample and we denote these summary statistics as follows:
n1, x̄1 and s1 for sample 1, and n2, x̄2 and s2 for sample 2.
The designation of sample 1 and sample 2 is arbitrary. In a clinical trial setting the
convention is to call the treatment group 1 and the control group 2. However, when
comparing men and women, for example, either group can be 1 or 2.
In the two independent samples application with a continuous outcome, the parameter of
interest in the test of hypothesis is the difference in population means, μ1-μ2. The null
hypothesis is always that there is no difference between groups with respect to means, i.e.,
H0: μ1 - μ2 = 0.
The null hypothesis can also be written as follows: H0: μ1 = μ2. In the research hypothesis,
an investigator can hypothesize that the first mean is larger than the second (H1: μ1 > μ2 ),
that the first mean is smaller than the second (H1: μ1 < μ2 ), or that the means are different
(H1: μ1 ≠ μ2 ). The three different alternatives represent upper-, lower-, and two-tailed tests,
respectively. The following test statistics are used to test these hypotheses.
Test Statistics for Testing H0: μ1 = μ2
Z = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2))    if n1 > 30 and n2 > 30
t = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2))    if n1 < 30 or n2 < 30, where df = n1 + n2 - 2.
NOTE: The formulas above assume equal variability in the two populations (i.e., the
population variances are equal, or σ1² = σ2²). This means that the outcome is equally variable
in each of the comparison populations. For analysis, we have samples from each of the
comparison populations. If the sample variances are similar, then the assumption about
variability in the populations is probably reasonable. As a guideline, if the ratio of the sample
variances, s1²/s2², is between 0.5 and 2 (i.e., if one variance is no more than double the
other), then the formulas above are appropriate. If the ratio of the sample variances is
greater than 2 or less than 0.5 then alternative formulas must be used to account for the
heterogeneity in variances.
The test statistics include Sp, which is the pooled estimate of the common standard
deviation (again assuming that the variances in the populations are similar) computed as the
weighted average of the standard deviations in the samples as follows:
Sp = √[ ((n1-1)s1² + (n2-1)s2²) / (n1+n2-2) ]
Because we are assuming equal variances between groups, we pool the information on
variability (sample variances) to generate an estimate of the variability in the
population. (Note: Because Sp is a weighted average of the standard deviations in the
sample, Sp will always be in between s1 and s2.)
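A minimal sketch of Sp and of the two-independent-samples test statistic in Python; the helper names and the summary statistics in the usage line are our own illustrations, not data from the module:

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate of the common standard deviation, Sp."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def two_sample_statistic(n1, xbar1, s1, n2, xbar2, s2):
    """Statistic for H0: mu1 = mu2 (treated as Z if both samples are large,
    as t with df = n1 + n2 - 2 otherwise)."""
    sp = pooled_sd(n1, s1, n2, s2)
    return (xbar1 - xbar2) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Made-up summary statistics, purely to show the call:
print(round(two_sample_statistic(40, 12.3, 4.1, 45, 10.8, 3.9), 2))  # 1.73
```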
Example
Data measured on n=3,539 participants who attended the seventh examination of the
Offspring in the Framingham Heart Study are shown below.
Men
Characteristic                 n        x̄        s
Systolic Blood Pressure        1,623    128.2    17.5
Diastolic Blood Pressure       1,622    75.6     9.8
Total Serum Cholesterol        1,544    192.4    35.2
Weight                         1,612    194.0    33.8
Height                         1,545    68.9     2.7
Body Mass Index                1,545    28.8     4.6
Suppose we now wish to assess whether there is a statistically significant difference in mean
systolic blood pressures between men and women using a 5% level of significance.

Step 1. Set up hypotheses and determine level of significance
H0: μ1 = μ2 H1: μ1 ≠ μ2

α=0.05
Step 2. Select the appropriate test statistic.
Because both samples are large (> 30), we can use the Z test statistic as
opposed to t. Note that statistical computing packages use t throughout. Before
implementing the formula, we first check whether the assumption of equality of
population variances is reasonable. The guideline suggests investigating the
ratio of the sample variances, s12/s22. Suppose we call the men group 1 and the
women group 2. Again, this is arbitrary; it only needs to be noted when
interpreting the results. The ratio of the sample variances is 17.5²/20.1² = 0.76,
which falls between 0.5 and 2 suggesting that the assumption of equality of
population variances is reasonable. The appropriate test statistic is
Z = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2)).

Step 3. Set up decision rule.
This is a two-tailed test, using a Z statistic and a 5% level of significance. Reject
H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. Before substituting, we will first compute Sp, the pooled
estimate of the common standard deviation.
Notice that the pooled estimate of the common standard deviation, Sp, falls in
between the standard deviations in the comparison groups (i.e., 17.5 and 20.1).
Sp is slightly closer in value to the standard deviation in the women (20.1) as
there were slightly more women in the sample. Recall, Sp is a weighted average
of the standard deviations in the comparison groups, weighted by the respective
sample sizes.
Now the test statistic:

Step 5. Conclusion.
We reject H0 because 2.66 > 1.960. We have statistically significant evidence at
α=0.05 to show that there is a difference in mean systolic blood pressures
between men and women. The p-value is p < 0.010.
Here again we find that there is a statistically significant difference in mean systolic blood
pressures between men and women at p < 0.010. Notice that there is a very small difference
in the sample means (128.2-126.5 = 1.7 units), but this difference is beyond what would be
expected by chance. Is this a clinically meaningful difference? The large sample size in this
example is driving the statistical significance. A 95% confidence interval for the difference in
mean systolic blood pressures is: 1.7 ± 1.26 or (0.44, 2.96). The confidence interval provides
an assessment of the magnitude of the difference between means whereas the test of
hypothesis and p-value provide an assessment of the statistical significance of the
difference.
Above we performed a study to evaluate a new drug designed to lower total cholesterol. The
study involved one sample of patients, each patient took the new drug for 6 weeks and had
their cholesterol measured. As a means of evaluating the efficacy of the new drug, the mean
total cholesterol following 6 weeks of treatment was compared to the NCHS-reported mean
total cholesterol level in 2002 for all adults of 203. At the end of the example, we discussed
the appropriateness of the fixed comparator as well as an alternative study design to
evaluate the effect of the new drug involving two treatment groups, where one group
receives the new drug and the other does not. Here, we revisit the example with a
concurrent or parallel control group, which is very typical in randomized controlled trials or
clinical trials (refer to the EP713 module on Clinical Trials).
Example
A new drug is proposed to lower total cholesterol. A randomized controlled trial is designed
to evaluate the efficacy of the medication in lowering cholesterol. Thirty participants are
enrolled in the trial and are randomly assigned to receive either the new drug or a placebo.
The participants do not know which treatment they are assigned. Each participant is asked
to take the assigned treatment for 6 weeks. At the end of 6 weeks, each patient's total
cholesterol level is measured and the sample statistics are as follows.
Treatment
Sample Size
Mean
Standard Deviation
New Drug
15
195.9
28.7
Placebo
15
217.4
30.3
Is there statistical evidence of a reduction in mean total cholesterol in patients taking the new
drug for 6 weeks as compared to participants taking placebo? We will run the test using the
five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ1 = μ2 H1: μ1 < μ2

α=0.05
Step 2. Select the appropriate test statistic.
Because both samples are small (< 30), we use the t test statistic. Before
implementing the formula, we first check whether the assumption of equality of
population variances is reasonable. The ratio of the sample variances,
s1²/s2² = 28.7²/30.3² = 0.90, which falls between 0.5 and 2, suggesting that the
assumption of equality of population variances is reasonable. The appropriate
test statistic is:
t = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2)).

Step 3. Set up decision rule.
This is a lower-tailed test, using a t statistic and a 5% level of significance. The
appropriate critical value can be found in the t Table (in More Resources to the
right). In order to determine the critical value of t we need degrees of freedom,
df, defined as df=n1+n2-2 = 15+15-2=28. The critical value for a lower tailed test
with df=28 and α=0.05 is -2.048 and the decision rule is: Reject H0 if t < -2.048.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. Before substituting, we will first compute Sp, the pooled
estimate of the common standard deviation.
Now the test statistic,

Step 5. Conclusion.
We reject H0 because -2.92 < -2.048. We have statistically significant evidence
at α=0.05 to show that the mean total cholesterol level is lower in patients taking
the new drug for 6 weeks as compared to patients taking placebo, p < 0.005.
The clinical trial in this example finds a statistically significant reduction in total cholesterol,
whereas in the previous example where we had a historical control (as opposed to a parallel
control group) we did not demonstrate efficacy of the new drug. Notice that the mean total
cholesterol level in patients taking placebo is 217.4 which is very different from the mean
cholesterol reported among all Americans in 2002 of 203 and used as the comparator in the
prior example. The historical control value may not have been the most appropriate
comparator as cholesterol levels have been increasing over time. In the next section, we
present another design that can be used to assess the efficacy of the new drug.
Tests with Matched Samples, Continuous Outcome
In the previous section we compared two groups with respect to their mean scores on a
continuous outcome. An alternative study design is to compare matched or paired samples.
The two comparison groups are said to be dependent, and the data can arise from a single
sample of participants where each participant is measured twice (possibly before and after
an intervention) or from two samples that are matched on specific characteristics (e.g.,
siblings). When the samples are dependent, we focus on difference scores in each
participant or between members of a pair and the test of hypothesis is based on the mean
difference, μd. The null hypothesis again reflects "no difference" and is stated as H0: μd =0 .
Note that there are some instances where it is of interest to test whether there is a difference
of a particular magnitude (e.g., μd =5) but in most instances the null hypothesis reflects no
difference (i.e., μd=0).
The appropriate formula for the test of hypothesis depends on the sample size. The formulas
are shown below and are identical to those we presented for estimating the mean of a single
sample presented (e.g., when comparing against an external or historical control), except
here we focus on difference scores.
Test Statistics for Testing H0: μd = 0
Z = x̄d / (sd/√n)    if n > 30
t = x̄d / (sd/√n)    if n < 30, where df = n-1
Example
A new drug is proposed to lower total cholesterol and a study is designed to evaluate the
efficacy of the drug in lowering cholesterol. Fifteen patients agree to participate in the study
and each is asked to take the new drug for 6 weeks. However, before starting the treatment,
each patient's total cholesterol level is measured. The initial measurement is a pre-treatment
or baseline value. After taking the drug for 6 weeks, each patient's total cholesterol level is
measured again and the data are shown below. The rightmost column contains difference
scores for each patient, computed by subtracting the 6 week cholesterol level from the
baseline level. The differences represent the reduction in total cholesterol over 6 weeks.
(The differences could have been computed by subtracting the baseline total cholesterol
level from the level measured at 6 weeks. The way in which the differences are computed
does not affect the outcome of the analysis only the interpretation.)
Subject Identification Number    Baseline    6 Weeks
1                                215         205
2                                190         156
3                                230         190
4                                220         180
5                                214         201
6                                240         227
7                                210         197
8                                193         173
9                                210         204
10                               230         217
11                               180         142
12                               260         262
13                               210         207
14                               190         184
15                               200         193
Because the differences are computed by subtracting the cholesterols measured at 6 weeks
from the baseline values, positive differences indicate reductions and negative differences
indicate increases (e.g., participant 12 increases by 2 units over 6 weeks). The goal here is
to test whether there is a statistically significant reduction in cholesterol. Because of the way
in which we computed the differences, we want to look for an increase in the mean
difference (i.e., a positive reduction). In order to conduct the test, we need to summarize the
differences. In this sample, we have n = 15, x̄d = 16.9 and sd = 14.2.
The calculations are shown below.
Subject Identification Number    Difference    Difference²
1                                10            100
2                                34            1156
3                                40            1600
4                                40            1600
5                                13            169
6                                13            169
7                                13            169
8                                20            400
9                                6             36
10                               13            169
11                               38            1444
12                               -2            4
13                               3             9
14                               6             36
15                               7             49
Total                            254           7110
Is there statistical evidence of a reduction in mean total cholesterol in patients after using the
new medication for 6 weeks? We will run the test using the five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μd = 0 H1: μd > 0
α=0.05
NOTE: If we had computed differences by subtracting the baseline level from
the level measured at 6 weeks then negative differences would have reflected
reductions and the research hypothesis would have been H1: μd < 0.

Step 2. Select the appropriate test statistic.
Because the sample size is small (n<30) the appropriate test statistic is
t = x̄d / (sd/√n).

Step 3. Set up decision rule.
This is an upper-tailed test, using a t statistic and a 5% level of significance. The
appropriate critical value can be found in the t Table at the right, with df=15-1=14. The critical value for an upper-tailed test with df=14 and α=0.05 is 2.145
and the decision rule is Reject H0 if t > 2.145.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
t = 16.9 / (14.2/√15) = 4.61.

Step 5. Conclusion.
We reject H0 because 4.61 > 2.145. We have statistically significant evidence at
α=0.05 to show that there is a reduction in cholesterol levels over 6 weeks.
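The difference-score calculations in this example can be reproduced directly from the baseline and 6-week values; a sketch in Python (carrying full precision gives t ≈ 4.63, in line with the 4.61 obtained above from the rounded summaries 16.9 and 14.2):

```python
import math

baseline = [215, 190, 230, 220, 214, 240, 210, 193, 210, 230, 180, 260, 210, 190, 200]
week6    = [205, 156, 190, 180, 201, 227, 197, 173, 204, 217, 142, 262, 207, 184, 193]

diffs = [b - w for b, w in zip(baseline, week6)]   # baseline minus 6 weeks, so positive = reduction
n = len(diffs)
mean_d = sum(diffs) / n                            # 254 / 15, about 16.9
sd_d = math.sqrt((sum(d**2 for d in diffs) - sum(diffs)**2 / n) / (n - 1))   # about 14.2
t_stat = mean_d / (sd_d / math.sqrt(n))
print(round(mean_d, 1), round(sd_d, 1), round(t_stat, 2))   # 16.9 14.2 4.63
```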
Here we illustrate the use of a matched design to test the efficacy of a new drug to lower
total cholesterol. We also considered a parallel design (randomized clinical trial) and a study
using a historical comparator. It is extremely important to design studies that are best suited
to detect a meaningful difference when one exists. There are often several alternatives and
investigators work with biostatisticians to determine the best design for each application. It is
worth noting that the matched design used here can be problematic in that observed
differences may only reflect a "placebo" effect. All participants took the assigned medication,
but is the observed reduction attributable to the medication or a result of their
participation in a study?
Tests with Two Independent Samples, Dichotomous
Outcome
Here we consider the situation where there are two independent comparison groups and the
outcome of interest is dichotomous (e.g., success/failure). The goal of the analysis is to
compare proportions of successes between the two groups. The relevant sample data are
the sample sizes in each comparison group (n1 and n2) and the sample proportions (p̂1
and p̂2), which are computed by taking the ratios of the numbers of successes to the
sample sizes in each group, i.e., p̂1 = x1/n1 and p̂2 = x2/n2.
There are several approaches that can be used to test hypotheses concerning two
independent proportions. Here we present one approach - the chi-square test of
independence is an alternative, equivalent, and perhaps more popular approach to the same
analysis. Hypothesis testing with the chi-square test is addressed in the third module in this
series: BS704_HypothesisTesting-ChiSquare.
In tests of hypothesis comparing proportions between two independent groups, one test is
performed and results can be interpreted to apply to a risk difference, relative risk or odds
ratio. As a reminder, the risk difference is computed by taking the difference in proportions
between comparison groups, the risk ratio is computed by taking the ratio of proportions, and
the odds ratio is computed by taking the ratio of the odds of success in the comparison
groups. Because the null values for the risk difference, the risk ratio and the odds ratio are
different, the hypotheses in tests of hypothesis look slightly different depending on which
measure is used. When performing tests of hypothesis for the risk difference, relative risk or
odds ratio, the convention is to label the exposed or treated group 1 and the unexposed or
control group 2.
For example, suppose a study is designed to assess whether there is a significant difference
in proportions in two independent comparison groups. The test of interest is as follows:
H0: p1 = p2 versus H1: p1 ≠ p2.
The following are the hypothesis for testing for a difference in proportions using the risk
difference, the risk ratio and the odds ratio. First, the hypotheses above are equivalent to the
following:


For the risk difference, H0: p1 - p2 = 0 versus H1: p1 - p2 ≠ 0 which are, by definition,
equal to H0: RD = 0 versus H1: RD ≠ 0.
If an investigator wants to focus on the risk ratio, the equivalent hypotheses are H0:
RR = 1 versus H1: RR ≠ 1.

If the investigator wants to focus on the odds ratio, the equivalent hypotheses are H0:
OR = 1 versus H1: OR ≠ 1.
Suppose a test is performed to test H0: RD = 0 versus H1: RD ≠ 0 and the test rejects H0 at
α=0.05. Based on this test we can conclude that there is significant evidence, α=0.05, of a
difference in proportions, significant evidence that the risk difference is not zero, significant
evidence that the risk ratio and odds ratio are not one. The risk difference is analogous to
the difference in means when the outcome is continuous. Here the parameter of interest is
the difference in proportions in the population, RD = p1-p2 and the null value for the risk
difference is zero. In a test of hypothesis for the risk difference, the null hypothesis is always
H0: RD = 0. This is equivalent to H0: RR = 1 and H0: OR = 1. In the research hypothesis, an
investigator can hypothesize that the first proportion is larger than the second (H1: p 1 > p 2 ,
which is equivalent to H1: RD > 0, H1: RR > 1 and H1: OR > 1), that the first proportion is
smaller than the second (H1: p 1 < p 2 , which is equivalent to H1: RD < 0, H1: RR < 1 and H1:
OR < 1), or that the proportions are different (H1: p 1 ≠ p 2 , which is equivalent to H1: RD ≠ 0,
H1: RR ≠ 1 and H1: OR ≠ 1). The three different alternatives represent upper-, lower- and
two-tailed tests, respectively.
The formula for the test of hypothesis for the difference in proportions is given below.
Test Statistic for Testing H0: p1 = p2
Z = (p̂1 - p̂2) / √[ p̂(1 - p̂)(1/n1 + 1/n2) ]
where p̂1 is the proportion of successes in sample 1, p̂2 is the proportion of successes in
sample 2, and p̂ is the proportion of successes in the pooled sample. p̂ is computed by
summing all of the successes and dividing by the total sample size, p̂ = (x1 + x2)/(n1 + n2)
(this is similar to the pooled estimate of the standard deviation, Sp, used in two independent
samples tests with a continuous outcome; just as Sp is in between s1 and s2, p̂ will be in
between p̂1 and p̂2).
The formula above is appropriate for large samples, defined as at least 5 successes (np̂ > 5)
and at least 5 failures (n(1-p̂) > 5) in each of the two samples. If there are fewer than 5
successes or failures in either comparison group, then alternative procedures, called exact
methods must be used to estimate the difference in population proportions.
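A minimal sketch of this test statistic in Python; the function name and the counts in the usage line are our own illustrations, not data from the module:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 = p2, using the pooled proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

# Made-up counts: 30/80 successes in group 1 versus 18/75 in group 2
print(round(two_proportion_z(30, 80, 18, 75), 2))  # 1.82 for these illustrative counts
```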
Example
The following table summarizes data from n=3,799 participants who attended the fifth
examination of the Offspring in the Framingham Heart Study. The outcome of interest is
prevalent CVD and we want to test whether the prevalence of CVD is significantly higher in
smokers as compared to non-smokers.
                    Free of CVD    History of CVD    Total
Non-Smoker          2,757          298               3,055
Current Smoker      663            81                744
Total               3,420          379               3,799
The prevalence of CVD (or proportion of participants with prevalent CVD) among non-smokers is 298/3,055 = 0.0975 and the prevalence of CVD among current smokers is
81/744 = 0.1089. Here smoking status defines the comparison groups and we will call the
current smokers group 1 (exposed) and the non-smokers (unexposed) group 2. The test of
hypothesis is conducted below using the five step approach.

Step 1. Set up hypotheses and determine level of significance
H0: p1 = p2

H1: p1 ≠ p2
α=0.05
Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to
ensure that we have at least 5 successes and 5 failures in each comparison
group. In this example, we have more than enough successes (cases of
prevalent CVD) and failures (persons free of CVD) in each comparison group.
The sample size is more than adequate so the following formula can be used:
Z = (p̂1 - p̂2) / √[ p̂(1 - p̂)(1/n1 + 1/n2) ].

Step 3. Set up decision rule.
Reject H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. We first compute the overall proportion of successes:
p̂ = (81 + 298) / (744 + 3,055) = 379/3,799 = 0.0998.
We now substitute to compute the test statistic:
Z = (0.1089 - 0.0975) / √[ 0.0998(1 - 0.0998)(1/744 + 1/3,055) ] = 0.0114/0.0123 = 0.927.

Step 5. Conclusion.
We do not reject H0 because -1.960 < 0.927 < 1.960. We do not have
statistically significant evidence at α=0.05 to show that there is a difference in
prevalent CVD between smokers and non-smokers.
A 95% confidence interval for the difference in prevalent CVD (or risk difference) between
smokers and non-smokers is 0.0114 ± 0.0247, or between -0.0133 and 0.0361. Because
the 95% confidence interval for the risk difference includes zero, we again conclude that
there is no statistically significant difference in prevalent CVD between smokers and non-smokers.
Smoking has been shown over and over to be a risk factor for cardiovascular disease. What
might explain the fact that we did not observe a statistically significant difference using data
from the Framingham Heart Study? HINT: Here we consider prevalent CVD, would the
results have been different if we considered incident CVD?
Example
A randomized trial is designed to evaluate the effectiveness of a newly developed pain
reliever designed to reduce pain in patients following joint replacement surgery. The trial
compares the new pain reliever to the pain reliever currently in use (called the standard of
care). A total of 100 patients undergoing joint replacement surgery agreed to participate in
the trial. Patients were randomly assigned to receive either the new pain reliever or the
standard pain reliever following surgery and were blind to the treatment assignment. Before
receiving the assigned treatment, patients were asked to rate their pain on a scale of 0-10
with higher scores indicative of more pain. Each patient was then given the assigned
treatment and after 30 minutes was again asked to rate their pain on the same scale. The
primary outcome was a reduction in pain of 3 or more scale points (defined by clinicians as a
clinically meaningful reduction). The following data were observed in the trial.
Treatment Group
n
Number with Reduction
of 3+ Points
New Pain Reliever
50
23
Standard Pain Reliever
50
11
We now test whether there is a statistically significant difference in the proportions of
patients reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) using
the five step approach.

Step 1. Set up hypotheses and determine level of significance
H0: p1 = p2
H1: p1 ≠ p2
α=0.05
Here the new or experimental pain reliever is group 1 and the standard pain
reliever is group 2.

Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to
ensure that we have at least 5 successes and 5 failures in each comparison
group, i.e., min(n1p̂1, n1(1-p̂1), n2p̂2, n2(1-p̂2)) > 5.
In this example, we have min(50(0.46), 50(1-0.46), 50(0.22), 50(1-0.22)) =
min(23, 27, 11, 39) = 11. The sample size is adequate so the following formula
can be used:
Z = (p̂1 - p̂2) / √[ p̂(1 - p̂)(1/n1 + 1/n2) ].

Step 3. Set up decision rule.
Reject H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. We first compute the overall proportion of successes:
p̂ = (23 + 11) / (50 + 50) = 34/100 = 0.34.
We now substitute to compute the test statistic:
Z = (0.46 - 0.22) / √[ 0.34(1 - 0.34)(1/50 + 1/50) ] = 0.24/0.095 = 2.526.

Step 5. Conclusion.
We reject H0 because 2.526 > 1.960. We have statistically significant evidence at
α=0.05 to show that there is a difference in the proportions of patients on the
new pain reliever reporting a meaningful reduction (i.e., a reduction of 3 or more
scale points) as compared to patients on the standard pain reliever.
A 95% confidence interval for the difference in proportions of patients on the new pain
reliever reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) as
compared to patients on the standard pain reliever is 0.24 ± 0.18, or between 0.06 and 0.42.
Because the 95% confidence interval does not include zero we concluded that there was a
statistically significant difference in proportions which is consistent with the test of hypothesis
result.
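As a quick numerical check of this example, here is a sketch in Python; the test statistic uses the pooled proportion, and the confidence interval uses the unpooled standard error, which reproduces the ± 0.18 margin reported above:

```python
import math

p1, p2, n1, n2 = 23 / 50, 11 / 50, 50, 50
p_pooled = (23 + 11) / (n1 + n2)

z = (p1 - p2) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
print(round(z, 2))   # about 2.53 (the module, carrying rounded intermediates, reports 2.526)

# 95% CI for the risk difference, using the unpooled standard error
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(round(p1 - p2 - 1.96 * se, 2), round(p1 - p2 + 1.96 * se, 2))   # 0.06 0.42
```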
Again, the procedures discussed here apply to applications where there are two independent
comparison groups and a dichotomous outcome. There are other applications in which it is
of interest to compare a dichotomous outcome in matched or paired samples. For example,
in a clinical trial we might wish to test the effectiveness of a new antibiotic eye drop for the
treatment of bacterial conjunctivitis. Participants use the new antibiotic eye drop in one eye
and a comparator (placebo or active control treatment) in the other. The success of the
treatment (yes/no) is recorded for each participant for each eye. Because the two
assessments (success or failure) are paired, we cannot use the procedures discussed here.
The appropriate test is called McNemar's test (sometimes called McNemar's test for
dependent proportions).
Summary
Here we presented hypothesis testing techniques for means and proportions in one and two
sample situations. Tests of hypothesis involve several steps, including specifying the null
and alternative or research hypothesis, selecting and computing an appropriate test statistic,
setting up a decision rule and drawing a conclusion. There are many details to consider in
hypothesis testing. The first is to determine the appropriate test. We discussed Z and t tests
here for different applications. The appropriate test depends on the distribution of the
outcome variable (continuous or dichotomous), the number of comparison groups (one, two)
and whether the comparison groups are independent or dependent. The following table
summarizes the different tests of hypothesis discussed here.
Outcome Variable, Number of Groups: Null Hypothesis; Test Statistic
Continuous Outcome, One Sample: H0: μ = μ0; Z = (x̄ - μ0)/(s/√n) if n > 30, t with df = n-1 if n < 30
Continuous Outcome, Two Independent Samples: H0: μ1 = μ2; Z or t = (x̄1 - x̄2)/(Sp√(1/n1 + 1/n2)), with df = n1+n2-2 for t
Continuous Outcome, Two Matched Samples: H0: μd = 0; Z or t = x̄d/(sd/√n), with df = n-1 for t
Dichotomous Outcome, One Sample: H0: p = p0; Z = (p̂ - p0)/√(p0(1-p0)/n)
Dichotomous Outcome, Two Independent Samples: H0: p1 = p2 (RD=0, RR=1, OR=1); Z = (p̂1 - p̂2)/√(p̂(1-p̂)(1/n1 + 1/n2))
Once the type of test is determined, the details of the test must be specified. Specifically, the
null and alternative hypotheses must be clearly stated. The null hypothesis always reflects
the "no change" or "no difference" situation. The alternative or research hypothesis reflects
the investigator's belief. The investigator might hypothesize that a parameter (e.g., a mean,
proportion, difference in means or proportions) will increase, will decrease or will be different
under specific conditions (sometimes the conditions are different experimental conditions
and other times the conditions are simply different groups of participants). Once the
hypotheses are specified, data are collected and summarized. The appropriate test is then
conducted according to the five step approach. If the test leads to rejection of the null
hypothesis, an approximate p-value is computed to summarize the significance of the
findings. When tests of hypothesis are conducted using statistical computing packages,
exact p-values are computed. Because the statistical tables in this textbook are limited, we
can only approximate p-values. If the test fails to reject the null hypothesis, then a weaker
concluding statement is made for the following reason.
In hypothesis testing, there are two types of errors that can be committed. A Type I error
occurs when a test incorrectly rejects the null hypothesis. This is referred to as a false
positive result, and the probability that this occurs is equal to the level of significance, α. The
investigator chooses the level of significance in Step 1, and purposely chooses a small value
such as α=0.05 to control the probability of committing a Type I error. A Type II error occurs
when a test fails to reject the null hypothesis when in fact it is false. The probability that this
occurs is equal to β. Unfortunately, the investigator cannot specify β at the outset because it
depends on several factors including the sample size (smaller samples have higher β), the
level of significance (β decreases as α increases), and the difference in the parameter under
the null and alternative hypothesis.
We noted in several examples in this chapter the relationship between confidence intervals
and tests of hypothesis. The approaches are different, yet related. It is possible to draw a
conclusion about statistical significance by examining a confidence interval. For example, if a
95% confidence interval does not contain the null value (e.g., zero when analyzing a mean
difference or risk difference, one when analyzing relative risks or odds ratios), then one can
conclude that a two-sided test of hypothesis would reject the null at α=0.05. It is important to
note that the correspondence between a confidence interval and test of hypothesis relates to
a two-sided test and that the confidence level corresponds to a specific level of significance
(e.g., 95% to α=0.05, 90% to α=0.10 and so on). The exact significance of the test, the p-value, can only be determined using the hypothesis testing approach, and the p-value
provides an assessment of the strength of the evidence and not an estimate of the effect.
Standard Error of the Mean (2 of 2)
A graph of the effect of sample size on the standard error for a standard deviation
of 10 is shown below:
As you can see, the function levels off. Increasing the sample size by a few
subjects makes a big difference when the sample size is small but makes much
less of a difference when the sample size is large. Notice that the graph is
consistent with the formulas. If σM is 10 for a sample size of 1, then σM should be
equal to 10/√25 = 2 for a sample size of 25. When s is used as an estimate of σ, the
estimated standard error of the mean is sM = s/√n. The standard error of the
mean is used in the computation of confidence intervals and significance tests for
the mean.
Fundamentals of Statistics 3: Sampling :: The standard error of the mean
We saw with the sampling distribution of the mean that every sample
we take to estimate the unknown population parameter will
overestimate or underestimate the mean by some amount. But what's
interesting is that the distribution of all these sample means will itself
be normally distributed, even if the population is not normally
distributed. The central limit theorem states that the mean of the
sampling distribution of the mean will be the unknown population
mean. The standard deviation of the sampling distribution of the mean
is called the standard error. In fact, it is just another standard
deviation, we just call it the standard error so we know we're talking
about the standard deviation of the sample means instead of the
standard deviation of the raw data. The standard deviation of data is
the average distance values are from the mean.
Ok, so, the variability of the sample means is called the standard
error of the mean or the standard deviation of the mean (these
terms will be used interchangeably since they mean the same thing)
and it looks like this.
Standard Error of the Mean (SEM) = σ / √n
The symbol σ (sigma) represents the population standard deviation and
n is the sample size. Population parameters are symbolized using
Greek symbols and we almost never know the population parameters.
That is also the case with the standard error. Just like we estimated
the population standard deviation using the sample standard
deviation, we can estimate the population standard error using the
sample standard deviation.
When we repeatedly sample from a population, the mean of each
sample will vary far less than any individual value. For example, when
we take random samples of women's heights, while any individual
height will vary by as much as 12 inches (a woman who is 5'10 and
one who is 4'10), the mean will only vary by a few inches.
The distribution of sample means varies far less than the individual values
in a sample. If we know the population mean height of women is 65 inches
then it would be extremely rare to have a sample mean of 30 women at 74
inches.
In fact, if we took a sample of 30 women and found an average height
of 6'1, then we would wonder whether these were really from the total
population of women. Perhaps it was a population of Olympic
Volleyball players. It is possible that a random sample of women from
the general population could be 6'1 but it is extremely rare (like
winning the lottery).
The standard deviation tells us how much variation we can expect in a
population. We know from the empirical rule that 95% of values will
fall within 2 standard deviations of the mean. Since the standard error
is just the standard deviation of the distribution of sample means, we
can also use this rule.
So how much variation in the standard error of the mean should we
expect from chance alone? Using the empirical rule we'd expect 68%
of our sample means to fall within 1 standard error of the true
unknown population mean. 95% would fall within 2 standard errors
and about 99.7% of the sample means will be within 3 standard errors
of the population mean. Just as z-scores can be used to understand
the probability of obtaining a raw value given the mean and standard
deviation, we can do the same thing with sample means.
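To make this concrete, here is a small numerical sketch; the population standard deviation of 3 inches is an assumed value for illustration, not taken from the text:

```python
import math

sigma = 3.0     # assumed population SD of women's heights, in inches (illustrative)
mu = 65.0       # population mean height from the text
n = 30

sem = sigma / math.sqrt(n)
print(round(sem, 2))   # about 0.55 inches

# How unusual is a sample mean of 73 inches (6'1") for 30 women?
z = (73 - mu) / sem
print(round(z, 1))     # about 14.6 standard errors above the mean, essentially impossible by chance
```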
Sampling Distribution of Difference Between Means
Author(s)
David M. Lane
Prerequisites
Sampling Distributions, Sampling Distribution of the
Mean, Variance Sum Law I
Learning Objectives
1. State the mean and variance of the sampling distribution of
the difference between means
2. Compute the standard error of the difference between
means
3. Compute the probability of a difference between means
being above a specified value
Statistical analyses are very often concerned with the difference
between means. A typical example is an experiment designed to
compare the mean of a control group with the mean of an
experimental group. Inferential statistics used in the analysis of
this type of experiment depend on the sampling distribution of
the difference between means.
The sampling distribution of the difference between means
can be thought of as the distribution that would result if we
repeated the following three steps over and over again: (1)
sample n1 scores from Population 1 and n2 scores from Population
2, (2) compute the means of the two samples (M1 and M2), and
(3) compute the difference between means, M1 - M2. The
distribution of the differences between means is the sampling
distribution of the difference between means.
As you might expect, the mean of the sampling distribution of
the difference between means is:
μM1-M2 = μ1 - μ2
which says that the mean of the distribution of differences
between sample means is equal to the difference between
population means. For example, say that the mean test score of
all 12-year-olds in a population is 34 and the mean of 10-year-olds is 25. If numerous samples were taken from each age group
and the mean difference computed each time, the mean of these
numerous differences between sample means would be 34 - 25 =
9.
From the variance sum law, we know that:
σ²M1-M2 = σ²M1 + σ²M2
which says that the variance of the sampling distribution of the
difference between means is equal to the variance of the
sampling distribution of the mean for Population 1 plus the
variance of the sampling distribution of the mean for Population
2. Recall the formula for the variance of the sampling distribution
of the mean:
σ²M = σ² / n
Since we have two populations and two sample sizes, we
need to distinguish between the two variances and sample sizes.
We do this by using the subscripts 1 and 2. Using this convention,
we can write the formula for the variance of the sampling
distribution of the difference between means as:
σ²M1-M2 = σ1²/n1 + σ2²/n2
Since the standard error of a sampling distribution is the standard
deviation of the sampling distribution, the standard error of the
difference between means is:
σM1-M2 = √(σ1²/n1 + σ2²/n2)
Just to review the notation, the symbol on the left contains a
sigma (σ), which means it is a standard deviation. The subscripts
M1 - M2 indicate that it is the standard deviation of the sampling
distribution of M1 - M2.
Now let's look at an application of this formula. Assume there
are two species of green beings on Mars. The mean height of
Species 1 is 32 while the mean height of Species 2 is 22. The
variances of the two species are 60 and 70, respectively and the
heights of both species are normally distributed. You randomly
sample 10 members of Species 1 and 14 members of Species 2.
What is the probability that the mean of the 10 members of
Species 1 will exceed the mean of the 14 members of Species 2
by 5 or more? Without doing any calculations, you probably know
that the probability is pretty high since the difference in
population means is 10. But what exactly is the probability?
First, let's determine the sampling distribution of the
difference between means. Using the formulas above, the mean is
μM1-M2 = 32 - 22 = 10.
The standard error is:
σM1-M2 = √(60/10 + 70/14) = √11 = 3.317.
The sampling distribution is shown in Figure 1. Notice that it is
normally distributed with a mean of 10 and a standard deviation
of 3.317. The area above 5 is shaded blue.
Figure 1. The sampling distribution of the difference between
means.
The last step is to determine the area that is shaded blue. Using
either a Z table or the normal calculator, the area can be
determined to be 0.934. Thus the probability that the mean of
the sample from Species 1 will exceed the mean of the sample
from Species 2 by 5 or more is 0.934.
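As a quick check of this result with software, here is a minimal Python sketch of the same calculation, assuming scipy is available; the mean and standard error are the ones computed above.

```python
# Mars example: mean difference 32 - 22 = 10, standard error
# sqrt(60/10 + 70/14) = sqrt(11) ~ 3.317, probability the sample
# mean difference exceeds 5.
from math import sqrt
from scipy.stats import norm

mean_diff = 32 - 22                      # 10
se_diff = sqrt(60 / 10 + 70 / 14)        # about 3.317
p = norm.sf(5, loc=mean_diff, scale=se_diff)
print(round(se_diff, 3), round(p, 3))    # 3.317, 0.934
```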
As shown below, the formula for the standard error of the
difference between means is much simpler if the sample sizes
and the population variances are equal. When the variances and sample sizes are the same, there is no need to use the subscripts 1 and 2 to differentiate these terms, and the standard error reduces to
σM1-M2 = √(2σ²/n).
This simplified version of the formula can be used for the
following problem: The mean height of 15-year-old boys (in cm)
is 175 and the variance is 64. For girls, the mean is 165 and the
variance is 64. If eight boys and eight girls were sampled, what is
the probability that the mean height of the sample of girls would
be higher than the mean height of the sample of boys? In other
words, what is the probability that the mean height of girls minus
the mean height of boys is greater than 0?
As before, the problem can be solved in terms of the sampling
distribution of the difference between means (girls - boys). The
mean of the distribution is 165 - 175 = -10. The standard deviation of the distribution is:
σM1-M2 = √(2 × 64/8) = 4.
A graph of the distribution is shown in Figure 2. It is clear that
it is unlikely that the mean height for girls would be higher than
the mean height for boys since in the population boys are quite a
bit taller. Nonetheless it is not inconceivable that the girls' mean
could be higher than the boys' mean.
Figure 2. Sampling distribution of the difference between mean
heights.
A difference between means of 0 or higher is a difference of
10/4 = 2.5 standard deviations above the mean of -10. The
probability of a score 2.5 or more standard deviations above the
mean is 0.0062.
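The same calculation can be sketched in Python with scipy, using the mean of -10 and standard deviation of 4 derived above.

```python
# Height example: girls minus boys has mean 165 - 175 = -10 and
# standard deviation sqrt(2 * 64 / 8) = 4; we want P(difference > 0).
from math import sqrt
from scipy.stats import norm

mean_diff = 165 - 175                 # -10
sd_diff = sqrt(2 * 64 / 8)            # 4.0
p = norm.sf(0, loc=mean_diff, scale=sd_diff)
print(sd_diff, round(p, 4))           # 4.0, 0.0062
```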
Question 1 out of 4.
Population 1 has a mean of 20 and a variance of 100. Population
2 has a mean of 15 and a variance of 64. You sample 20 scores
from Pop 1 and 16 scores from Pop 2. What is the mean of the sampling distribution of the difference between means (Pop 1 - Pop 2)?
Test of significance for small samples
So far we have discussed problems belonging to large samples. When a small sample (size < 30) is considered, the above tests are inapplicable because the assumptions we made for large-sample tests do not hold for small samples. In the case of small samples it is not possible to assume (i) that the random sampling distribution of a statistic is normal, and (ii) that the sample values are sufficiently close to the population values to calculate the S.E. of the estimate.
Thus an entirely new approach is required to deal with problems of small samples.
But one should note that the methods and theory of small samples are applicable to large samples, while the converse is not true.
Degrees of freedom (df): By degrees of freedom we mean the number of classes to which values can be assigned arbitrarily, or at will, without violating the restrictions or limitations placed on them.
For example, suppose we are asked to choose any 4 numbers whose total is 50. Clearly we are free to choose any 3 numbers, say 10, 23 and 7, but the fourth number is then fixed at 10, since the total must be 50 [50 - (10 + 23 + 7) = 10]. Thus we are given one restriction, and hence the freedom of selection is 4 - 1 = 3.
The degrees of freedom (df) are denoted by ν (nu) or df and given by ν = n - k, where n = number of classes and k = number of independent constraints (or restrictions).
In general, for a Binomial distribution, ν = n - 1.
For a Poisson distribution, ν = n - 2 (since we use the total frequency and the arithmetic mean).
For a normal distribution, ν = n - 3 (since we use the total frequency, the mean and the standard deviation), etc.
Student's t-distribution
This concept was introduced by W. S. Gosset (1876 - 1937), who adopted the pen name "Student." The distribution is therefore known as 'Student's t-distribution'. It is used to establish confidence limits and to test hypotheses when the population variance is not known and the sample size is small (n < 30).
If a random sample x1, x2, ......., xn of n values is drawn from a normal population with mean μ and standard deviation σ, then the mean of the sample is
x̄ = (x1 + x2 + ... + xn)/n.
Estimate of the variance: Let s² be the estimate of the variance of the sample; then s² is given by
s² = Σ(xi - x̄)²/(n - 1),
with (n - 1) as the denominator in place of 'n'.
(I) The statistic 't' is defined as
t = (x̄ - μ)/(s/√n) = (x̄ - μ)√n/s,
where x̄ = sample mean, μ = actual or hypothetical mean of the population, n = sample size, s = standard deviation of the sample,
with s = √[Σ(xi - x̄)²/(n - 1)].
Note: 't' follows Student's t-distribution with (n - 1) degrees of freedom (df).
(II) 1) The variable 't' ranges from minus infinity to plus infinity.
2) Like the standard normal distribution, it is symmetrical and has mean zero.
3) The variance σ² of the t-distribution is greater than 1, but approaches 1 as the df increase, that is, as the sample size becomes large. Thus the variance of the t-distribution approaches the variance of the normal distribution as the sample size increases; for ν (df) = ∞, the t-distribution matches the normal distribution (observe the adjoining figure).
Also note that the t-distribution is lower at the mean and higher at the tails than the
normal distribution. The t-distribution has proportionally greater area at its tails
than the normal distribution.
(III) 1) If | t | exceeds t0.05, then the difference between x̄ and μ is significant at the 0.05 level of significance.
2) If | t | exceeds t0.01, then the difference is said to be highly significant at the 0.01 level of significance.
3) If | t | < t0.05, we conclude that the difference between x̄ and μ is not significant and the sample might have been drawn from a population with mean = μ, i.e. the data are consistent with the hypothesis.
(IV) Fiducial limits of the population mean: the 95% limits are given by x̄ ± t0.05 · s/√n, with s as defined above.
Example A random sample of 16 values from a normal population is found to have a mean of 41.5 and standard deviation of 2.795. On this basis is there any
reason to reject the hypothesis that the population mean μ = 43? Also find the confidence limits for μ.
Solution: Here n = 16, so df = n - 1 = 15, x̄ = 41.5, σ = 2.795 and μ = 43.
Null hypothesis Ho: μ = 43; alternative hypothesis H1: μ ≠ 43.
Now t = (x̄ - μ)√(n - 1)/σ = (41.5 - 43)√15/2.795, so | t | = 2.078.
From the t-table for 15 degrees of freedom, at the 0.05 probability level the value of t is 2.13. Since 2.078 < 2.13, the difference between x̄ and μ is not significant.
Thus there is no reason to reject Ho. To find the limits: 41.5 ± 2.13 × 2.795/√15 = 41.5 ± 1.54, i.e. approximately 39.96 to 43.04.
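As a software check, here is a minimal Python sketch of this example, assuming scipy is available. It mirrors the calculation above, where the given standard deviation 2.795 is paired with √(n - 1).

```python
# One-sample t-test sketch: x̄ = 41.5, σ = 2.795, n = 16, testing μ = 43
# with 15 degrees of freedom.
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 16, 41.5, 2.795, 43
t_stat = (xbar - mu0) * sqrt(n - 1) / s          # about -2.08 (|t| = 2.078 above)
t_crit = t.ppf(0.975, df=n - 1)                  # two-sided 5% point, about 2.13
print(round(t_stat, 3), round(t_crit, 2))

# |t| < 2.13, so Ho: μ = 43 is not rejected; approximate 95% limits:
half_width = t_crit * s / sqrt(n - 1)
print(round(xbar - half_width, 2), round(xbar + half_width, 2))  # ~39.96, ~43.04
```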
Example Ten individuals are chosen at random from the population and their heights (in inches) are found to be 63, 63, 64, 65, 66, 69, 69, 70, 70, 71. Discuss the suggestion that the mean height in the universe is 65 inches, given that for 9 degrees of freedom the value of Student's 't' at the 0.05 level of significance is 2.262.
Solution: xi = 63, 63, 64, 65, 66, 69, 69, 70, 70, 71 and
n = 10
The difference is not significant at the 0.05 level; thus Ho is accepted and we conclude that the mean height is 65 inches.
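The same conclusion can be sketched with scipy's one-sample t-test, which uses the (n - 1) divisor internally; this is not part of the original worked solution, just a quick check.

```python
# Ten-heights example with scipy's one-sample t-test.
from scipy.stats import ttest_1samp

heights = [63, 63, 64, 65, 66, 69, 69, 70, 70, 71]
t_stat, p_value = ttest_1samp(heights, popmean=65)
print(round(t_stat, 2), round(p_value, 3))
# t is about 2.02 < 2.262 (and the two-sided p is above 0.05),
# so Ho: mean = 65 is not rejected, as concluded above.
```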
Example Nine items of a sample have the following values: 45, 47, 52, 48, 47, 49, 53, 51, 50. Does the mean of the 9 items differ significantly from the assumed population mean of 47.5?
Given that for 8 degrees of freedom, P = 0.945 for t = 1.8 and P = 0.953 for t = 1.9.
Solution:
Σxi = 45 + 47 + 52 + 48 + 47 + 49 + 53 + 51 + 50 = 442
n = 9
Therefore for a difference in t of 0.043, the difference in P is 0.0034. Hence for t = 1.843, P = 0.9484. Therefore the probability of getting a value of t > 1.843 is (1 - 0.9484) = 0.0516, which for a two-sided test is 2 × 0.0516 = 0.103, greater than 0.05. Thus Ho is accepted, i.e. the mean of the 9 items does not differ significantly from the assumed population mean.
Example A certain stimulus administered to each of 12 patients resulted in the
following increments in 'Blood pressure' 5, 2, 8, -1, 3, 0, 6, -2, 1, 5, 0, 4. Can it be
concluded that the stimulus will in general be accompanied by an increase in blood
pressure, given that for 11 df the value of t0.05 = 2.201?
Solution:
The null hypothesis is Ho: μ = 0, i.e. assuming that the stimulus will not be accompanied
by an increase in blood pressure (or the mean increase in blood pressure for the
population is zero).
Now, computing t for these 12 increments gives t = 2.924.
The table value t0.05 for ν = 11 is 2.201.
Therefore, 2.924 > 2.201.
Thus the null hypothesis Ho is rejected i.e. we find that our assumption is wrong
and we say that as a result of the stimulus the blood pressure will increase.
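A quick scipy sketch of this example follows; it is not part of the original solution. The computed t comes out close to, though not exactly equal to, the 2.924 obtained above, presumably because of rounding in the hand calculation, and the conclusion is unchanged.

```python
# Blood-pressure example: test Ho: μ = 0 against the one-sided alternative
# that the stimulus increases blood pressure.
from scipy.stats import ttest_1samp

increments = [5, 2, 8, -1, 3, 0, 6, -2, 1, 5, 0, 4]
t_stat, p_two_sided = ttest_1samp(increments, popmean=0)
print(round(t_stat, 2), round(p_two_sided / 2, 4))
# t is about 2.9 > 2.201, so Ho is rejected; the one-sided p is well below 0.05.
```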
Statistics – Textbook
Nonparametric Statistics
Last revised: 5/8/2015
Contents
Nonparametric Statistics: How to Analyze Data with Low Quality or Small Samples
• General Purpose
• Brief Overview of Nonparametric Procedures
• When to Use Which Method
• Nonparametric Correlations
General Purpose
Brief review of the idea of significance testing. To understand the idea of nonparametric statistics (the term nonparametric was first used by Wolfowitz, 1942) first requires a basic understanding of parametric statistics. Elementary Concepts introduces the concept of statistical significance testing based on the sampling distribution of a particular statistic (you
may want to review that topic before reading on). In short, if we have a basic knowledge of the
underlying distribution of a variable, then we can make predictions about how, in repeated samples
of equal size, this particular statistic will "behave," that is, how it is distributed. For example, if we
draw 100 random samples of 100 adults each from the general population, and compute the mean
height in each sample, then the distribution of the standardized means across samples will likely
approximate the normal distribution (to be precise, Student's t distribution with 99 degrees of
freedom; see below). Now imagine that we take an additional sample in a particular city
("Tallburg") where we suspect that people are taller than the average population. If the mean
height in that sample falls outside the upper 95% tail area of the t distribution then we conclude
that, indeed, the people of Tallburg are taller than the average population.
Are most variables normally distributed? In the above example we relied on our knowledge that,
in repeated samples of equal size, the standardized means (for height) will be distributed following
the t distribution (with a particular mean and variance). However, this will only be true if in the
population the variable of interest (height in our example) is normally distributed, that is, if the
distribution of people of particular heights follows the normal distribution (the bell-shape
distribution).
For many variables of interest, we simply do not know for sure that this is the case. For example, is
income distributed normally in the population? -- probably not. The incidence rates of rare diseases
are not normally distributed in the population, the number of car accidents is also not normally
distributed, and neither are very many other variables in which a researcher might be interested.
For more information on the normal distribution, see Elementary Concepts; for information on tests
of normality, see Normality tests.
Sample size. Another factor that often limits the applicability of tests based on the assumption
that the sampling distribution is normal is the size of the sample of data available for the analysis
(sample size; n). We can assume that the sampling distribution is normal even if we are not sure
that the distribution of the variable in the population is normal, as long as our sample is large
enough (e.g., 100 or more observations). However, if our sample is very small, then those tests can
be used only if we are sure that the variable is normally distributed, and there is no way to test this
assumption if the sample is small.
Problems in measurement. Applications of tests that are based on the normality assumptions are
further limited by a lack of precise measurement. For example, let us consider a study where grade
point average (GPA) is measured as the major variable of interest. Is an A average twice as good as
a C average? Is the difference between a B and an A average comparable to the difference between
a D and a C average? In reality, the GPA is a crude measure of scholastic accomplishments that only
allows us to establish a rank ordering of students from "good" students to "poor" students. This
general measurement issue is usually discussed in statistics textbooks in terms of types of
measurement or scale of measurement. Without going into too much detail, most common
statistical techniques such as analysis of variance (and t- tests), regression, etc., assume that the
underlying measurements are at least of interval, meaning that equally spaced intervals on the
scale can be compared in a meaningful manner (e.g., B minus A is equal to D minus C). However, as in our example, this assumption is very often not tenable, and the data represent a rank ordering of observations (ordinal) rather than precise measurements.
Parametric and nonparametric methods. Hopefully, after this somewhat lengthy introduction, the
need is evident for statistical procedures that enable us to process data of "low quality," from small
samples, on variables about which nothing is known (concerning their distribution). Specifically,
nonparametric methods were developed to be used in cases when the researcher knows nothing
about the parameters of the variable of interest in the population (hence the
name nonparametric). In more technical terms, nonparametric methods do not rely on the
estimation of parameters (such as the mean or the standard deviation) describing the distribution
of the variable of interest in the population. Therefore, these methods are also sometimes (and
more appropriately) called parameter-free methods or distribution-free methods.
Brief Overview of Nonparametric Methods
Basically, there is at least one nonparametric equivalent for each parametric general type of test.
In general, these tests fall into the following categories:
• Tests of differences between groups (independent samples);
• Tests of differences between variables (dependent samples);
• Tests of relationships between variables.
Differences between independent groups. Usually, when we have two samples that we want to
compare concerning their mean value for some variable of interest, we would use the t-test for independent samples; nonparametric alternatives for this test are the Wald-Wolfowitz runs test, the Mann-Whitney U test, and the Kolmogorov-Smirnov two-sample test. If we have multiple groups, we would use analysis of variance (see ANOVA/MANOVA); the nonparametric equivalents to this method are the Kruskal-Wallis analysis of ranks and the Median test.
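As a small illustration of one of these alternatives, the sketch below runs a Mann-Whitney U test with scipy; the two groups are made-up values used only to show the call.

```python
# Nonparametric comparison of two independent groups with the
# Mann-Whitney U test (illustrative, made-up data).
from scipy.stats import mannwhitneyu

group_a = [12, 15, 14, 10, 18, 11, 13]
group_b = [22, 19, 24, 17, 21, 20, 25]
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, round(p_value, 4))
```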
Differences between dependent groups. If we want to compare two variables measured in the
same sample we would customarily use the t-test for dependent samples (in Basic Statistics for
example, if we wanted to compare students' math skills at the beginning of the semester with their
skills at the end of the semester). Nonparametric alternatives to this test are the Sign test
and Wilcoxon's matched pairs test. If the variables of interest are dichotomous in nature (i.e.,
"pass" vs. "no pass") then McNemar's Chi-square test is appropriate. If there are more than two
variables that were measured in the same sample, then we would customarily use repeated
measures ANOVA. Nonparametric alternatives to this method are Friedman's two-way analysis of
variance and Cochran Q test (if the variable was measured in terms of categories, e.g., "passed" vs.
"failed"). Cochran Q is particularly useful for measuring changes in frequencies (proportions) across
time.
Relationships between variables. To express a relationship between two variables one usually
computes the correlation coefficient. Nonparametric equivalents to the standard correlation
coefficient are Spearman R, Kendall Tau, and coefficient Gamma (see Nonparametric correlations).
If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs.
"female") appropriate nonparametric statistics for testing the relationship between the two
variables are the Chi-square test, the Phi coefficient, and the Fisher exact test. In addition, a
simultaneous test for relationships between multiple cases is available: Kendall coefficient of
concordance. This test is often used for expressing inter-rater agreement among independent
judges who are rating (ranking) the same stimuli.
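For the categorical case just mentioned, here is a minimal scipy sketch of the chi-square and Fisher exact tests on a made-up 2x2 "passed/failed by male/female" table; the counts are invented for illustration.

```python
# Relationship between two categorical variables via a 2x2 frequency table.
from scipy.stats import chi2_contingency, fisher_exact

table = [[30, 10],   # males: passed, failed (illustrative counts)
         [20, 20]]   # females: passed, failed
chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)
print(round(chi2, 2), round(p_chi2, 3), round(p_fisher, 3))
```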
Descriptive statistics. When one's data are not normally distributed, and the measurements at best
contain rank order information, then computing the standard descriptive statistics (e.g., mean,
standard deviation) is sometimes not the most informative way to summarize the data. For
example, in the area of psychometrics it is well known that the rated intensity of a stimulus (e.g.,
perceived brightness of a light) is often a logarithmic function of the actual intensity of the
stimulus (brightness as measured in objective units of Lux). In this example, the simple mean rating
(sum of ratings divided by the number of stimuli) is not an adequate summary of the average actual
intensity of the stimuli. (In this example, one would probably rather compute the geometric mean.)
Nonparametrics and Distributions will compute a wide variety of measures of location
(mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.) to
provide the "complete picture" of one's data.
When to Use Which Method
It is not easy to give simple advice concerning the use of nonparametric procedures. Each
nonparametric procedure has its peculiar sensitivities and blind spots. For example, the
Kolmogorov-Smirnov two-sample test is not only sensitive to differences in the location of
distributions (for example, differences in means) but is also greatly affected by differences in their
shapes. The Wilcoxon matched pairs test assumes that one can rank order the magnitude of
differences in matched observations in a meaningful manner. If this is not the case, one should
rather use the Sign test. In general, if the result of a study is important (e.g., does a very expensive
and painful drug therapy help people get better?), then it is always advisable to run different
nonparametric tests; should discrepancies in the results occur contingent upon which test is used,
one should try to understand why some tests give different results. On the other hand,
nonparametric statistics are less statistically powerful (sensitive) than their parametric
counterparts, and if it is important to detect even small effects (e.g., is this food additive harmful
to people?) one should be very careful in the choice of a test statistic.
Large data sets and nonparametric methods. Nonparametric methods are most appropriate when
the sample sizes are small. When the data set is large (e.g., n > 100) it often makes little sense to
use nonparametric statistics at all. Elementary Concepts briefly discusses the idea of the central
limit theorem. In a nutshell, when the samples become very large, then the sample means will
follow the normal distribution even if the respective variable is not normally distributed in the
population, or is not measured very well. Thus, parametric methods, which are usually much more
sensitive (i.e., have more statistical power) are in most cases appropriate for large samples.
However, the tests of significance of many of the nonparametric statistics described here are based
on asymptotic (large sample) theory; therefore, meaningful tests can often not be performed if the
sample sizes become too small. Please refer to the descriptions of the specific tests to learn more
about their power and efficiency.
Nonparametric Correlations
The following are three types of commonly used nonparametric correlation coefficients (Spearman
R, Kendall Tau, and Gamma coefficients). Note that the chi-square statistic computed for two-way frequency tables also provides a careful measure of a relation between the two (tabulated)
variables, and unlike the correlation measures listed below, it can be used for variables that are
measured on a simple nominal scale.
Spearman R. Spearman R (Siegel & Castellan, 1988) assumes that the variables under consideration
were measured on at least an ordinal (rank order) scale, that is, that the individual observations
can be ranked into two ordered series. Spearman R can be thought of as the regular Pearson
product moment correlation coefficient, that is, in terms of proportion of variability accounted for,
except that Spearman R is computed from ranks.
Kendall tau. Kendall tau is equivalent to Spearman R with regard to the underlying assumptions. It
is also comparable in terms of its statistical power. However, Spearman R and Kendall tau are
usually not identical in magnitude because their underlying logic as well as their computational
formulas are very different. Siegel and Castellan (1988) express the relationship of the two measures in terms of the inequality:
-1 ≤ 3 × Kendall tau - 2 × Spearman R ≤ 1
More importantly, Kendall tau and Spearman R imply different interpretations: Spearman R can be thought of as the regular Pearson product moment correlation coefficient, that is, in terms of proportion of variability accounted for, except that Spearman R is computed from ranks. Kendall tau, on the other hand, represents a probability, that is, it is the difference between the probability that in the observed data the two variables are in the same order versus the probability that the two variables are in different orders.
Gamma. The Gamma statistic (Siegel & Castellan, 1988) is preferable to Spearman R or Kendall tau
when the data contain many tied observations. In terms of the underlying assumptions, Gamma is
equivalent to Spearman R or Kendall tau; in terms of its interpretation and computation it is more
similar to Kendall tau than Spearman R. In short, Gamma is also a probability; specifically, it is
computed as the difference between the probability that the rank ordering of the two variables
agree minus the probability that they disagree, divided by 1 minus the probability of ties. Thus,
Gamma is basically equivalent to Kendall tau, except that ties are explicitly taken into account.
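A short scipy sketch of the first two coefficients is given below on made-up ranked data; the Goodman and Kruskal Gamma statistic is not shown here.

```python
# Rank-based correlation of two made-up ordinal variables.
from scipy.stats import spearmanr, kendalltau

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
rho, p_rho = spearmanr(x, y)
tau, p_tau = kendalltau(x, y)
print(round(rho, 3), round(tau, 3))
```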
Introduction to Statistical Decision
Theory
By John Pratt, Howard Raiffa and Robert Schlaifer
Overview
The Bayesian revolution in statistics—where statistics is integrated with decision making
in areas such as management, public policy, engineering, and clinical medicine—is here
to stay. Introduction to Statistical Decision Theory states the case and, in a self-contained, comprehensive way, shows how the approach is operational and relevant for
real-world decision making under uncertainty.
Starting with an extensive account of the foundations of decision theory, the authors
develop the intertwining concepts of subjective probability and utility. They then
systematically and comprehensively examine the Bernoulli, Poisson, and Normal
(univariate and multivariate) data generating processes. For each process they consider
how prior judgments about the uncertain parameters of the process are modified given
the results of statistical sampling, and they investigate typical decision problems in
which the main sources of uncertainty are the population parameters. They also discuss
the value of sampling information and optimal sample sizes given sampling costs and the
economics of the terminal decision problems.
Unlike most introductory texts in statistics, Introduction to Statistical Decision
Theory integrates statistical inference with decision making and discusses real-world
actions involving economic payoffs and risks. After developing the rationale and
demonstrating the power and relevance of the subjective, decision approach, the text
also examines and critiques the limitations of the objective, classical approach.
Reviews
“An excellent introduction to Bayesian statistical theory.”—Frank Windmeijer, Times
Higher Education Supplement
“This book is a classic.... The strengths of this text are twofold. First, it gives a general
and well-motivated introduction to the principles of Bayesian decision theory that
should be accessible to anyone with a good mathematical statistics background. Second,
it provides a good introduction to Bayesian inference in general with particular emphasis
on the use of subjective information to choose prior distributions.”—Mark J. Schervish, Journal of the American Statistical Association
“This is the authoritative introductory treatise (almost 900 pages) on applied Bayesian
statistical theory. It is self-contained and well-presented, developed with great care and
obvious affection by the founders of the subject.”—James M. Dickey, Mathematical
Reviews
Decision Trees for Decision Making
John F. Magee
From the July 1964 issue of Harvard Business Review
The management of a company that I shall call Stygian Chemical Industries, Ltd., must
decide whether to build a small plant or a large one to manufacture a new product with an
expected market life of ten years. The decision hinges on what size the market for the product
will be.
Possibly demand will be high during the initial two years but, if many initial users find the
product unsatisfactory, will fall to a low level thereafter. Or high initial demand might
indicate the possibility of a sustained high-volume market. If demand is high and the
company does not expand within the first two years, competitive products will surely be
introduced.
If the company builds a big plant, it must live with it whatever the size of market demand. If
it builds a small plant, management has the option of expanding the plant in two years in the
event that demand is high during the introductory period; while in the event that demand is
low during the introductory period, the company will maintain operations in the small plant
and make a tidy profit on the low volume.
Management is uncertain what to do. The company grew rapidly during the 1950’s; it kept
pace with the chemical industry generally. The new product, if the market turns out to be
large, offers the present management a chance to push the company into a new period of
profitable growth. The development department, particularly the development project
engineer, is pushing to build the large-scale plant to exploit the first major product
development the department has produced in some years.
The chairman, a principal stockholder, is wary of the possibility of large unneeded plant
capacity. He favors a smaller plant commitment, but recognizes that later expansion to meet
high-volume demand would require more investment and be less efficient to operate. The
chairman also recognizes that unless the company moves promptly to fill the demand which
develops, competitors will be tempted to move in with equivalent products.
The Stygian Chemical problem, oversimplified as it is, illustrates the uncertainties and issues
that business management must resolve in making investment decisions. (I use the term
“investment” in a broad sense, referring to outlays not only for new plants and equipment but
also for large, risky orders, special marketing facilities, research programs, and other
purposes.) These decisions are growing more important at the same time that they are
increasing in complexity. Countless executives want to make them better—but how?
In this article I shall present one recently developed concept called the “decision tree,” which
has tremendous potential as a decision-making tool. The decision tree can clarify for
management, as can no other analytical tool that I know of, the choices, risks, objectives,
monetary gains, and information needs involved in an investment problem. We shall be
hearing a great deal about decision trees in the years ahead. Although a novelty to most
businessmen today, they will surely be in common management parlance before many more
years have passed.
Later in this article we shall return to the problem facing Stygian Chemical and see how
management can proceed to solve it by using decision trees. First, however, a simpler
example will illustrate some characteristics of the decision-tree approach.
Displaying Alternatives
Let us suppose it is a rather overcast Saturday morning, and you have 75 people coming for
cocktails in the afternoon. You have a pleasant garden and your house is not too large; so if
the weather permits, you would like to set up the refreshments in the garden and have the
party there. It would be more pleasant, and your guests would be more comfortable. On the
other hand, if you set up the party for the garden and after all the guests are assembled it
begins to rain, the refreshments will be ruined, your guests will get damp, and you will
heartily wish you had decided to have the party in the house. (We could complicate this
problem by considering the possibility of a partial commitment to one course or another and
opportunities to adjust estimates of the weather as the day goes on, but the simple problem is
all we need.)
This particular decision can be represented in the form of a “payoff” table:
Much more complex decision questions can be portrayed in payoff table form. However,
particularly for complex investment decisions, a different representation of the information
pertinent to the problem—the decision tree—is useful to show the routes by which the
various possible outcomes are achieved. Pierre Massé, Commissioner General of the National
Agency for Productivity and Equipment Planning in France, notes:
“The decision problem is not posed in terms of an isolated decision (because today’s decision
depends on the one we shall make tomorrow) nor yet in terms of a sequence of decisions
(because under uncertainty, decisions taken in the future will be influenced by what we have
learned in the meanwhile). The problem is posed in terms of a tree of decisions.”1
Exhibit I illustrates a decision tree for the cocktail party problem. This tree is a different way
of displaying the same information shown in the payoff table. However, as later examples
will show, in complex decisions the decision tree is frequently a much more lucid means of
presenting the relevant information than is a payoff table.
Exhibit I. Decision Tree for Cocktail Party
The tree is made up of a series of nodes and branches. At the first node on the left, the host
has the choice of having the party inside or outside. Each branch represents an alternative
course of action or decision. At the end of each branch or alternative course is another node
representing a chance event—whether or not it will rain. Each subsequent alternative course
to the right represents an alternative outcome of this chance event. Associated with each
complete alternative course through the tree is a payoff, shown at the end of the rightmost or
terminal branch of the course.
When I am drawing decision trees, I like to indicate the action or decision forks with square
nodes and the chance-event forks with round ones. Other symbols may be used instead, such
as single-line and double-line branches, special letters, or colors. It does not matter so much
which method of distinguishing you use so long as you do employ one or another. A decision
tree of any size will always combine (a) action choices with (b) different possible events or results of action which are partially affected by chance or other uncontrollable circumstances.
Decision-event chains
The previous example, though involving only a single stage of decision, illustrates the
elementary principles on which larger, more complex decision trees are built. Let us take a
slightly more complicated situation:
You are trying to decide whether to approve a development budget for an improved product.
You are urged to do so on the grounds that the development, if successful, will give you a
competitive edge, but if you do not develop the product, your competitor may—and may
seriously damage your market share. You sketch out a decision tree that looks something like
the one in Exhibit II.
Exhibit II. Decision Tree with Chains of Actions and Events
Your initial decision is shown at the left. Following a decision to proceed with the project, if
development is successful, is a second stage of decision at Point A. Assuming no important
change in the situation between now and the time of Point A, you decide now what
alternatives will be important to you at that time. At the right of the tree are the outcomes of
different sequences of decisions and events. These outcomes, too, are based on your present
information. In effect you say, “If what I know now is true then, this is what will happen.”
Of course, you do not try to identify all the events that can happen or all the decisions you
will have to make on a subject under analysis. In the decision tree you lay out only those
decisions and events or results that are important to you and have consequences you wish to
compare. (For more illustrations, see the Appendix.)
Appendix
For readers interested in further examples of decision-tree structure, I shall describe in this
appendix two representative situations with which I am familiar and show the trees that might
be drawn to analyze management’s decision-making alternatives. We shall not concern
ourselves here with costs, yields, probabilities, or expected values.
New Facility
The choice of alternatives in building a plant depends upon market forecasts. The alternative
chosen will, in turn, affect the market outcome. For example, the military products division of
a diversified firm, after some period of low profits due to intense competition, has won a
contract to produce a new type of military engine suitable for Army transport vehicles. The
division has a contract to build productive capacity and to produce at a specified contract
level over a period of three years.
Figure A illustrates the situation. The dotted line shows the contract rate. The solid line
shows the proposed buildup of production for the military. Some other possibilities are
portrayed by dashed lines. The company is not sure whether the contract will be continued at
a relatively high rate after the third year, as shown by Line A, or whether the military will
turn to another newer development, as indicated by Line B. The company has no guarantee of
compensation after the third year. There is also the possibility, indicated by Line C, of a large
additional commercial market for the product, this possibility being somewhat dependent on
the cost at which the product can be made and sold.
If this commercial market could be tapped, it would represent a major new business for the
company and a substantial improvement in the profitability of the division and its importance
to the company.
Management wants to explore three ways of producing the product as follows:
1. It might subcontract all fabrication and set up a simple assembly with limited need for
investment in plant and equipment; the costs would tend to be relatively high and the
company’s investment and profit opportunity would be limited, but the company assets which
are at risk would also be limited.
2. It might undertake the major part of the fabrication itself but use general-purpose machine
tools in a plant of general-purpose construction. The division would have a chance to retain
more of the most profitable operations itself, exploiting some technical developments it has
made (on the basis of which it got the contract). While the cost of production would still be
relatively high, the nature of the investment in plant and equipment would be such that it
could probably be turned to other uses or liquidated if the business disappeared.
3. The company could build a highly mechanized plant with specialized fabrication and
assembly equipment, entailing the largest investment but yielding a substantially lower unit
manufacturing cost if manufacturing volume were adequate. Following this plan would
improve the chances for a continuation of the military contract and penetration into the
commercial market and would improve the profitability of whatever business might be
obtained in these markets. Failure to sustain either the military or the commercial market,
however, would cause substantial financial loss.
Either of the first two alternatives would be better adapted to low-volume production than
would the third.
Some major uncertainties are: the cost-volume relationships under the alternative
manufacturing methods; the size and structure of the future market—this depends in part on
cost, but the degree and extent of dependence are unknown; and the possibilities of
competitive developments which would render the product competitively or technologically
obsolete.
How would this situation be shown in decision-tree form? (Before going further you might
want to draw a tree for the problem yourself.) Figure B shows my version of a tree. Note that
in this case the chance alternatives are somewhat influenced by the decision made. A
decision, for example, to build a more efficient plant will open possibilities for an expanded
market.
Plant Modernization
A company management is faced with a decision on a proposal by its engineering staff
which, after three years of study, wants to install a computer-based control system in the
company’s major plant. The expected cost of the control system is some $30 million. The
claimed advantages of the system will be a reduction in labor cost and an improved product
yield. These benefits depend on the level of product throughput, which is likely to rise over
the next decade. It is thought that the installation program will take about two years and will
cost a substantial amount over and above the cost of equipment. The engineers calculate that
the automation project will yield a 20% return on investment, after taxes; the projection is
based on a ten-year forecast of product demand by the market research department, and an
assumption of an eight-year life for the process control system.
What would this investment yield? Will actual product sales be higher or lower than forecast?
Will the process work? Will it achieve the economies expected? Will competitors follow if
the company is successful? Are they going to mechanize anyway? Will new products or
processes make the basic plant obsolete before the investment can be recovered? Will the
controls last eight years? Will something better come along sooner?
The initial decision alternatives are (a) to install the proposed control system, (b) postpone
action until trends in the market and/or competition become clearer, or (c) initiate more
investigation or an independent evaluation. Each alternative will be followed by resolution of
some uncertain aspect, in part dependent on the action taken. This resolution will lead in turn
to a new decision. The dotted lines at the right of Figure C indicate that the decision tree
continues indefinitely, though the decision alternatives do tend to become repetitive. In the
case of postponement or further study, the decisions are to install, postpone, or restudy; in the
case of installation, the decisions are to continue operation or abandon.
An immediate decision is often one of a sequence. It may be one of a number of sequences.
The impact of the present decision in narrowing down future alternatives and the effect of
future alternatives in affecting the value of the present choice must both be considered.
Adding Financial Data
Now we can return to the problems faced by the Stygian Chemical management. A decision
tree characterizing the investment problem as outlined in the introduction is shown in Exhibit
III. At Decision #1 the company must decide between a large and a small plant. This is all
that must be decided now. But if the company chooses to build a small plant and then finds
demand high during the initial period, it can in two years—at Decision #2—choose to expand
its plant.
Exhibit III. Decisions and Events for Stygian Chemical Industries, Ltd.
But let us go beyond a bare outline of alternatives. In making decisions, executives must take
account of the probabilities, costs, and returns which appear likely. On the basis of the data
now available to them, and assuming no important change in the company’s situation, they
reason as follows:
Marketing estimates indicate a 60% chance of a large market in the long run and a 40% chance of a low demand, developing initially as follows: high initial demand sustained at a high level, 60%; high initial demand followed by low long-run demand, 10%; low initial demand continuing low, 30%.
Therefore, the chance that demand initially will be high is 70% (60 + 10). If demand is high
initially, the company estimates that the chance it will continue at a high level is 86% (60 ÷
70). Comparing 86% to 60%, it is apparent that a high initial level of sales changes the
estimated chance of high sales in the subsequent periods. Similarly, if sales in the initial
period are low, the chances are 100% (30 ÷ 30) that sales in the subsequent periods will be
low. Thus the level of sales in the initial period is expected to be a rather accurate indicator of
the level of sales in the subsequent periods.
Estimates of annual income are made under the assumption of each alternative outcome:
1. A large plant with high volume would yield $1,000,000 annually in cash flow.
2. A large plant with low volume would yield only $100,000 because of high fixed costs and
inefficiencies.
3. A small plant with low demand would be economical and would yield annual cash income
of $400,000.
4. A small plant, during an initial period of high demand, would yield $450,000 per year, but
this would drop to $300,000 yearly in the long run because of competition. (The market
would be larger than under Alternative 3, but would be divided up among more competitors.)
5. If the small plant were expanded to meet sustained high demand, it would yield $700,000 cash flow annually, and so would be less efficient than a large plant built initially.
6. If the small plant were expanded but high demand were not sustained, estimated annual
cash flow would be $50,000.
It is estimated further that a large plant would cost $3 million to put into operation, a small
plant would cost $1.3 million, and the expansion of the small plant would cost an
additional $2.2 million.
When the foregoing data are incorporated, we have the decision tree shown in Exhibit IV.
Bear in mind that nothing is shown here which Stygian Chemical’s executives did not know
before; no numbers have been pulled out of hats. However, we are beginning to see dramatic
evidence of the value of decision trees in laying out what management knows in a way that
enables more systematic analysis and leads to better decisions. To sum up the requirements of
making a decision tree, management must:
1. Identify the points of decision and alternatives available at each point.
2. Identify the points of uncertainty and the type or range of alternative outcomes at each
point.
3. Estimate the values needed to make the analysis, especially the probabilities of different
events or results of action and the costs and gains of various events and actions.
4. Analyze the alternative values to choose a course.
Exhibit IV. Decision Tree with Financial Data
Choosing Course of Action
We are now ready for the next step in the analysis—to compare the consequences of different
courses of action. A decision tree does not give management the answer to an investment
problem; rather, it helps management determine which alternative at any particular choice
point will yield the greatest expected monetary gain, given the information and alternatives
pertinent to the decision.
Of course, the gains must be viewed with the risks. At Stygian Chemical, as at many
corporations, managers have different points of view toward risk; hence they will draw
different conclusions in the circumstances described by the decision tree shown in Exhibit IV.
The many people participating in a decision—those supplying capital, ideas, data, or
decisions, and having different values at risk—will see the uncertainty surrounding the
decision in different ways. Unless these differences are recognized and dealt with, those who
must make the decision, pay for it, supply data and analyses to it, and live with it will judge
the issue, relevance of data, need for analysis, and criterion of success in different and
conflicting ways.
For example, company stockholders may treat a particular investment as one of a series of
possibilities, some of which will work out, others of which will fail. A major investment may
pose risks to a middle manager—to his job and career—no matter what decision is made.
Another participant may have a lot to gain from success, but little to lose from failure of the
project. The nature of the risk—as each individual sees it—will affect not only the
assumptions he is willing to make but also the strategy he will follow in dealing with the risk.
The existence of multiple, unstated, and conflicting objectives will certainly contribute to the
“politics” of Stygian Chemical’s decision, and one can be certain that the political element
exists whenever the lives and ambitions of people are affected. Here, as in similar cases, it is
not a bad exercise to think through who the parties to an investment decision are and to try to
make these assessments:
What is at risk? Is it profit or equity value, survival of the business, maintenance of a job, opportunity for a major career?
Who is bearing the risk? The stockholder is usually bearing risk in one form. Management, employees, the community—all may be bearing different risks.
What is the character of the risk that each person bears? Is it, in his terms, unique, once-in-a-lifetime, sequential, insurable? Does it affect the economy, the industry, the company, or a portion of the company?
Considerations such as the foregoing will surely enter into top management’s thinking, and
the decision tree in Exhibit IV will not eliminate them. But the tree will show management
what decision today will contribute most to its long-term goals. The tool for this next step in
the analysis is the concept of “rollback.”
“Rollback” concept
Here is how rollback works in the situation described. At the time of making Decision #1 (see
Exhibit IV), management does not have to make Decision #2 and does not even know if it
will have the occasion to do so. But if it were to have the option at Decision #2, the company
would expand the plant, in view of its current knowledge. The analysis is shown in Exhibit V.
(I shall ignore for the moment the question of discounting future profits; that is introduced
later.) We see that the total expected value of the expansion alternative is $160,000 greater
than the no-expansion alternative, over the eight-year life remaining. Hence that is the
alternative management would choose if faced with Decision #2 with its existing information
(and thinking only of monetary gain as a standard of choice).
Exhibit V. Analysis of Possible Decision #2 (Using Maximum Expected Total Cash Flow as
Criterion)
Readers may wonder why we started with Decision #2 when today’s problem is Decision #1.
The reason is the following: We need to be able to put a monetary value on Decision #2 in
order to “roll back” to Decision #1 and compare the gain from taking the lower branch
(“Build Small Plant”) with the gain from taking the upper branch (“Build Big Plant”). Let us
call that monetary value for Decision #2 its position value. The position value of a decision is
the expected value of the preferred branch (in this case, the plant-expansion fork). The
expected value is simply a kind of average of the results you would expect if you were to
repeat the situation over and over—getting a $5,600 thousand yield 86% of the time and
a $400 thousand yield 14% of the time.
Stated in another way, it is worth $2,672 thousand to Stygian Chemical to get to the position
where it can make Decision #2. The question is: Given this value and the other data shown in
Exhibit IV, what now appears to be the best action at Decision #1?
Turn now to Exhibit VI. At the right of the branches in the top half we see the yields for
various events if a big plant is built (these are simply the figures in Exhibit IV multiplied
out). In the bottom half we see the small plant figures, including Decision #2 position value
plus the yield for the two years prior to Decision #2. If we reduce all these yields by their
probabilities, we get the following comparison:
Build big plant: ($10 × .60) + ($2.8 × .10) + ($1 × .30) – $3 = $3,600 thousand
Build small plant: ($3.6 × .70) + ($4 × .30) – $1.3 = $2,400 thousand
Exhibit VI. Cash Flow Analysis for Decision #1
The choice which maximizes expected total cash yield at Decision #1, therefore, is to build
the big plant initially.
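The rollback arithmetic just described can be sketched in a few lines of Python (all figures in thousands of dollars). The expansion-branch figures and the Decision #1 inputs are the ones quoted above; the no-expansion branch totals (eight years at $300 thousand or $400 thousand per year) are reconstructed from the annual cash flows listed earlier and happen to reproduce the $160 thousand margin, so treat them as an assumption consistent with the text rather than a quotation of Exhibit V.

```python
# Rollback sketch for the Stygian Chemical example (thousands of dollars).

# Decision #2: expand vs. do not expand the small plant, 8 years remaining.
expand = 0.86 * 5_600 + 0.14 * 400 - 2_200         # 4872 - 2200 = 2672 (position value)
no_expand = 0.86 * (8 * 300) + 0.14 * (8 * 400)    # reconstructed branch totals = 2512
print(expand, no_expand, expand - no_expand)       # margin of about 160 in favor of expanding

# Decision #1: build big vs. build small, using the figures quoted in the text.
build_big = 0.60 * 10_000 + 0.10 * 2_800 + 0.30 * 1_000 - 3_000   # 3580, rounded to 3600 above
build_small = 0.70 * 3_600 + 0.30 * 4_000 - 1_300                 # 2420, rounded to 2400 above
print(build_big, build_small)                      # the big plant maximizes expected cash yield
```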
Accounting for Time
What about taking differences in the time of future earnings into account? The time between
successive decision stages on a decision tree may be substantial. At any stage, we may have
to weigh differences in immediate cost or revenue against differences in value at the next
stage. Whatever standard of choice is applied, we can put the two alternatives on a
comparable basis if we discount the value assigned to the next stage by an appropriate
percentage. The discount percentage is, in effect, an allowance for the cost of capital and is
similar to the use of a discount rate in the present value or discounted cash flow techniques
already well known to businessmen.
When decision trees are used, the discounting procedure can be applied one stage at a time.
Both cash flows and position values are discounted.
For simplicity, let us assume that a discount rate of 10% per year for all stages is decided on
by Stygian Chemical’s management. Applying the rollback principle, we again begin with
Decision #2. Taking the same figures used in previous exhibits and discounting the cash
flows at 10%, we get the data shown in Part A of Exhibit VII. Note particularly that these are
the present values as of the time Decision #2 is made.
Exhibit VII. Analysis of Decision #2 with Discounting
Note: For simplicity, the first year cash flow is not discounted, the second year cash flow is discounted one year, and so on.
Now we want to go through the same procedure used in Exhibit V when we obtained
expected values, only this time using the discounted yield figures and obtaining a discounted
expected value. The results are shown in Part B of Exhibit VII. Since the discounted expected
value of the no-expansion alternative is higher, that figure becomes the position value of
Decision #2 this time.
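A small Python sketch of this stage-by-stage discounting is given below, using the 10% rate and the convention noted in Exhibit VII (the first-year cash flow is not discounted) together with the annual cash flows listed earlier. The printed values are my own computation and are meant only to illustrate the method, not to reproduce the exhibit's exact figures, although they give the same qualitative result: the no-expansion branch comes out higher and becomes the position value.

```python
# Discounted expected values for Decision #2 (thousands of dollars).
def present_value(annual_cash_flow, years, rate=0.10):
    """Discount a level annual cash flow; year 1 is not discounted."""
    return sum(annual_cash_flow / (1 + rate) ** year for year in range(years))

# Expansion: 700/yr if demand stays high (86%), 50/yr otherwise (14%),
# less the 2,200 expansion cost.
expand = 0.86 * present_value(700, 8) + 0.14 * present_value(50, 8) - 2_200
# No expansion: 300/yr if demand stays high, 400/yr otherwise.
no_expand = 0.86 * present_value(300, 8) + 0.14 * present_value(400, 8)
print(round(expand), round(no_expand))   # no-expansion is higher, so it is the position value
```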
Having done this, we go back to work through Decision #1 again, repeating the same
analytical procedure as before only with discounting. The calculations are shown in Exhibit
VIII. Note that the Decision #2 position value is treated at the time of Decision #1 as if it
were a lump sum received at the end of the two years.
Exhibit VIII. Analysis of Decision #1
The large-plant alternative is again the preferred one on the basis of discounted expected cash
flow. But the margin of difference over the small-plant alternative ($290 thousand) is smaller
than it was without discounting.
Uncertainty Alternatives
In illustrating the decision-tree concept, I have treated uncertainty alternatives as if they were
discrete, well-defined possibilities. For my examples I have made use of uncertain situations
depending basically on a single variable, such as the level of demand or the success or failure
of a development project. I have sought to avoid unnecessary complication while putting
emphasis on the key interrelationships among the present decision, future choices, and the
intervening uncertainties.
In many cases, the uncertain elements do take the form of discrete, single-variable
alternatives. In others, however, the possibilities for cash flow during a stage may range
through a whole spectrum and may depend on a number of independent or partially related
variables subject to chance influences—cost, demand, yield, economic climate, and so forth.
In these cases, we have found that the range of variability or the likelihood of the cash flow
falling in a given range during a stage can be calculated readily from knowledge of the key
variables and the uncertainties surrounding them. Then the range of cash-flow possibilities
during the stage can be broken down into two, three, or more “subsets,” which can be used as
discrete chance alternatives.
Conclusion
Peter F. Drucker has succinctly expressed the relation between present planning and future
events: “Long-range planning does not deal with future decisions. It deals with the futurity of
present decisions.”2 Today’s decision should be made in light of the anticipated effect it and
the outcome of uncertain events will have on future values and decisions. Since today’s
decision sets the stage for tomorrow’s decision, today’s decision must balance economy with
flexibility; it must balance the need to capitalize on profit opportunities that may exist with
the capacity to react to future circumstances and needs.
The unique feature of the decision tree is that it allows management to combine analytical
techniques such as discounted cash flow and present value methods with a clear portrayal of
the impact of future decision alternatives and events. Using the decision tree, management
can consider various courses of action with greater ease and clarity. The interactions between
present decision alternatives, uncertain events, and future choices and their results become
more visible.
Of course, there are many practical aspects of decision trees in addition to those that could be
covered in the space of just one article. When these other aspects are discussed in subsequent
articles,3 the whole range of possible gains for management will be seen in greater detail.
Surely the decision-tree concept does not offer final answers to managements making
investment decisions in the face of uncertainty. We have not reached that stage, and perhaps
we never will. Nevertheless, the concept is valuable for illustrating the structure of
investment decisions, and it can likewise provide excellent help in the evaluation of capital
investment opportunities.
1. Optimal Investment Decisions: Rules for Action and Criteria for Choice (Englewood
Cliffs, New Jersey, Prentice-Hall, Inc., 1962), p. 250.
2. “Long-Range Planning,” Management Science, April 1959, p. 239.
3. We are expecting another article by Mr. Magee in a forthcoming issue.—The Editors
A version of this article appeared in the July 1964 issue of Harvard Business Review.
John F. Magee is chairman of the board of directors of Arthur D. Little, Inc. Over the past
three decades, his professional consulting assignments have taken him to Europe frequently.
Decision Making Tools
Good managers do not simply make decisions. Instead, they use tools to determine the best course of action, making it possible for the manager to make an informed decision. That does not
mean that good managers always make the right decisions, but they certainly are making
decisions that are more informed than they would be based purely on guesswork.
There are many different tools managers use to make decisions, but the ones that we see the
most are the decision tree, payback analysis and simulation. While there are more, these are
the three we see the most, and it's important to understand them and know how to use them
before you start making decisions.
The Decision Tree
Example of a decision tree
The first tool we will look at is the decision tree. This tool has us write down an issue or
problem, and then, as we think through the problem, we draw solutions or steps that branch out
from the original issue. You start your decision tree by taking a piece of paper and drawing a
small square to represent the decision you need to make. It could look something like this:
'Should we continue to produce dress shoes only, or should we look at making sneakers as well?'
This first block represents the issue that requires you to make a decision. From there, just like
the name 'decision tree' implies, branches start to sprout out with your thoughts for different
solutions or directions for this issue.
You can make any number of branches, and there is no set pattern to the decision tree - it is
defined more by its functionality than its form. Using a decision tree, you can capture your
thoughts, review them and, if needed, add more branches, hopefully continuing on until you find
your answer. Each decision you make leads you to another decision (or a 'would/could' choice),
and that choice leads you to yet another.
Payback Analysis
I'm happy to say that payback analysis is much easier and much more clear-cut than the decision
tree. This tool helps you analyze financial investments. When we use payback analysis, we
look at an investment and the anticipated savings or added income that will result from it.
Then, we use a calculation that tells us how long it will take to make back the money spent on
the initial investment.
For example, let's say we are going to invest in energy-efficient lighting. We know that the initial
investment would be $15,000, but we also know that our energy cost savings would be $5,000 a
year. We can simply use some basic math to understand how long it would take for us to earn
back our initial investment of $15,000. Did he say math? Nooo! Don't worry though - it's pretty
simple.
We can divide $15,000 (our investment) by the amount we would save each year ($5,000), and
from that basic calculation, we can see that it would take 3 years to make back our investment. If
that time frame works for you, you make the investment. If not, see whether another investment
or additional savings would change the calculation. Either way, you will have used the payback
analysis tool to make a well-informed decision.
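For readers who want the arithmetic spelled out, here is a minimal Python sketch of the payback calculation described above; the figures come from the lighting example, and the function name is just illustrative.

```python
def payback_period(initial_investment, annual_savings):
    """Years needed to earn back the initial investment from annual savings."""
    return initial_investment / annual_savings

# Figures from the lighting example above.
investment = 15_000        # up-front cost of the energy-efficient lighting
savings_per_year = 5_000   # expected yearly energy cost savings

print(payback_period(investment, savings_per_year))  # -> 3.0 years
```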
Decision Tree Definition
A decision tree is a graphical representation of possible solutions to a decision based on certain
conditions. It's called a decision tree because it starts with a single box (or root), which then
branches off into a number of solutions, just like a tree.
Decision trees are helpful, not only because they are graphics that help you 'see' what you are
thinking, but also because making a decision tree requires a systematic, documented thought
process. Often, the biggest limitation of our decision making is that we can only select from the
known alternatives. Decision trees help formalize the brainstorming process so we can identify
more potential solutions.
Decision Tree Example
Applied in real life, decision trees can be very complex and end up including pages of options.
But, regardless of the complexity, decision trees are all based on the same principles. Here is a
basic example of a decision tree:
You are making your weekend plans and find out that your parents might come to town. You'd
like to have plans in place, but there are a few unknown factors that will determine what you can,
and can't, do. Time for a decision tree.
First, you draw your decision box. This is the box that includes the event that starts your decision
tree. In this case it is your parents coming to town. Out of that box, you have a branch for each
possible outcome. In our example, it's easy: yes or no - either your parents come or they don't.
Your parents love the movies, so if they come to town, you'll go to the cinema. Since the goal of
the decision tree is to decide your weekend plans, you have an answer. But, what about if your
parents don't come to town? We can go back up to the 'no branch' from the decision box and
finish our decision tree.
If your parents don't come to town, you need to decide what you are going to do. As you think of
options, you realize the weather is an important factor. Weather becomes your next box. Since
it's springtime, you know it will be rainy, sunny, or windy. Those three possibilities become
your branches.
If it's sunny or rainy, you know what you'll do - play tennis or stay in, respectively. But, what if it's
windy? If it's windy, you want to get out of the house, but you probably won't be able to play
tennis. You could either go to the movies or go shopping. What will determine if you go shopping
or go see a movie? Money.
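One simple way to capture a small tree like this in code is as nested dictionaries, with each question mapping its branches to either another question or a final plan. The sketch below is only an illustration of the weekend example; the lesson does not say whether having money points to shopping or to the movies, so that mapping is an assumption.

```python
# A minimal sketch of the weekend-plans decision tree as nested dictionaries.
# Keys are the questions at each node; values map each branch to the next
# node (a dict) or to a final plan (a string).
weekend_tree = {
    "Do your parents come to town?": {
        "yes": "Go to the cinema",
        "no": {
            "What is the weather?": {
                "sunny": "Play tennis",
                "rainy": "Stay in",
                "windy": {
                    "Do you have spending money?": {
                        # Assumed mapping: the lesson leaves this branch open.
                        "yes": "Go shopping",
                        "no": "See a movie",
                    }
                },
            }
        },
    }
}

def decide(node, answers):
    """Walk the tree using a dict of answers until a plan (string) is reached."""
    while isinstance(node, dict):
        question, branches = next(iter(node.items()))
        node = branches[answers[question]]
    return node

print(decide(weekend_tree, {
    "Do your parents come to town?": "no",
    "What is the weather?": "windy",
    "Do you have spending money?": "no",
}))  # -> "See a movie"
```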
Decision Trees for Decision Making
by John F. Magee
The management of a company that I shall call Stygian Chemical Industries, Ltd., must decide whether to build a small plant or a
large one to manufacture a new product with an expected market life of ten years. The decision hinges on what size the market
for the product will be.
Possibly demand will be high during the initial two years but, if many initial users find the product unsatisfactory, will fall to a low
level thereafter. Or high initial demand might indicate the possibility of a sustained high-volume market. If demand is high and the
company does not expand within the first two years, competitive products will surely be introduced.
If the company builds a big plant, it must live with it whatever the size of market demand. If it builds a small plant, management
has the option of expanding the plant in two years in the event that demand is high during the introductory period; while in the
event that demand is low during the introductory period, the company will maintain operations in the small plant and make a tidy
profit on the low volume.
Management is uncertain what to do. The company grew rapidly during the 1950’s; it kept pace with the chemical industry
generally. The new product, if the market turns out to be large, offers the present management a chance to push the company
into a new period of profitable growth. The development department, particularly the development project engineer, is pushing to
build the large-scale plant to exploit the first major product development the department has produced in some years.
The chairman, a principal stockholder, is wary of the possibility of large unneeded plant capacity. He favors a smaller plant
commitment, but recognizes that later expansion to meet high-volume demand would require more investment and be less
efficient to operate. The chairman also recognizes that unless the company moves promptly to fill the demand which develops,
competitors will be tempted to move in with equivalent products.
The Stygian Chemical problem, oversimplified as it is, illustrates the uncertainties and issues that business management must
resolve in making investment decisions. (I use the term “investment” in a broad sense, referring to outlays not only for new plants
and equipment but also for large, risky orders, special marketing facilities, research programs, and other purposes.) These
decisions are growing more important at the same time that they are increasing in complexity. Countless executives want to make
them better—but how?
In this article I shall present one recently developed concept called the “decision tree,” which has tremendous potential as a
decision-making tool. The decision tree can clarify for management, as can no other analytical tool that I know of, the choices,
risks, objectives, monetary gains, and information needs involved in an investment problem. We shall be hearing a great deal
about decision trees in the years ahead. Although a novelty to most businessmen today, they will surely be in common
management parlance before many more years have passed.
Later in this article we shall return to the problem facing Stygian Chemical and see how management can proceed to solve it by
using decision trees. First, however, a simpler example will illustrate some characteristics of the decision-tree approach.
Displaying Alternatives
Let us suppose it is a rather overcast Saturday morning, and you have 75 people coming for cocktails in the afternoon. You have
a pleasant garden and your house is not too large; so if the weather permits, you would like to set up the refreshments in the
garden and have the party there. It would be more pleasant, and your guests would be more comfortable. On the other hand, if
you set up the party for the garden and after all the guests are assembled it begins to rain, the refreshments will be ruined, your
guests will get damp, and you will heartily wish you had decided to have the party in the house. (We could complicate this
problem by considering the possibility of a partial commitment to one course or another and opportunities to adjust estimates of
the weather as the day goes on, but the simple problem is all we need.)
This particular decision can be represented in the form of a “payoff” table:
Much more complex decision questions can be portrayed in payoff table form. However, particularly for complex investment
decisions, a different representation of the information pertinent to the problem—the decision tree—is useful to show the routes by
which the various possible outcomes are achieved. Pierre Massé, Commissioner General of the National Agency for Productivity
and Equipment Planning in France, notes:
“The decision problem is not posed in terms of an isolated decision (because today’s decision depends on the one we shall make
tomorrow) nor yet in terms of a sequence of decisions (because under uncertainty, decisions taken in the future will be influenced
by what we have learned in the meanwhile). The problem is posed in terms of a tree of decisions.”1
Exhibit I illustrates a decision tree for the cocktail party problem. This tree is a different way of displaying the same information
shown in the payoff table. However, as later examples will show, in complex decisions the decision tree is frequently a much more
lucid means of presenting the relevant information than is a payoff table.
The tree is made up of a series of nodes and branches. At the first node on the left, the host has the choice of having the party
inside or outside. Each branch represents an alternative course of action or decision. At the end of each branch or alternative
course is another node representing a chance event—whether or not it will rain. Each subsequent alternative course to the right
represents an alternative outcome of this chance event. Associated with each complete alternative course through the tree is a
payoff, shown at the end of the rightmost or terminal branch of the course.
When I am drawing decision trees, I like to indicate the action or decision forks with square nodes and the chance-event forks with
round ones. Other symbols may be used instead, such as single-line and double-line branches, special letters, or colors. It does
not matter so much which method of distinguishing you use so long as you do employ one or another. A decision tree of any size
will always combine (a) action choices with (b) different possible events or results of action which are partially affected by chance
or other uncontrollable circumstances.
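As a rough illustration of this structure (not code from the article), the sketch below represents decision forks and chance forks as two small Python classes and evaluates a tree by picking the best branch at each decision fork and averaging over probabilities at each chance fork, which is the "rollback" idea the article develops later. The cocktail-party payoffs and the 40% chance of rain are placeholders, not figures from Exhibit I.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """A square (action) fork: the decision maker picks the best branch."""
    branches: dict   # label -> Decision, Chance, or numeric payoff

@dataclass
class Chance:
    """A round (chance) fork: each branch occurs with a given probability."""
    branches: list   # list of (probability, Decision, Chance, or numeric payoff)

def rollback(node):
    """Expected value of a node, choosing the best action at each decision fork."""
    if isinstance(node, Decision):
        return max(rollback(child) for child in node.branches.values())
    if isinstance(node, Chance):
        return sum(p * rollback(child) for p, child in node.branches)
    return node  # a leaf is simply its payoff

# Cocktail-party tree with placeholder payoffs and an assumed 40% chance of rain.
party = Decision({
    "hold the party outdoors": Chance([(0.4, -20.0), (0.6, 100.0)]),  # rain ruins it / ideal
    "hold the party indoors":  Chance([(0.4, 60.0),  (0.6, 50.0)]),   # dry but crowded
})
print(rollback(party))  # expected payoff of the better choice (54.0 here)
```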
Decision-event chains
The previous example, though involving only a single stage of decision, illustrates the elementary principles on which larger, more
complex decision trees are built. Let us take a slightly more complicated situation:
You are trying to decide whether to approve a development budget for an improved product. You are urged to do so on the
grounds that the development, if successful, will give you a competitive edge, but if you do not develop the product, your
competitor may—and may seriously damage your market share. You sketch out a decision tree that looks something like the one
in Exhibit II.
Your initial decision is shown at the left. Following a decision to proceed with the project, if development is successful, is a second
stage of decision at Point A. Assuming no important change in the situation between now and the time of Point A, you decide now
what alternatives will be important to you at that time. At the right of the tree are the outcomes of different sequences of decisions
and events. These outcomes, too, are based on your present information. In effect you say, “If what I know now is true then, this
is what will happen.”
Of course, you do not try to identify all the events that can happen or all the decisions you will have to make on a subject under
analysis. In the decision tree you lay out only those decisions and events or results that are important to you and have
consequences you wish to compare. (For more illustrations, see the Appendix.)
Appendix (Located at the end of this article)
Adding Financial Data
Now we can return to the problems faced by the Stygian Chemical management. A decision tree characterizing the investment
problem as outlined in the introduction is shown in Exhibit III. At Decision #1 the company must decide between a large and a
small plant. This is all that must be decided now. But if the company chooses to build a small plant and then finds demand high
during the initial period, it can in two years—at Decision #2—choose to expand its plant.
But let us go beyond a bare outline of alternatives. In making decisions, executives must take account of the probabilities, costs,
and returns which appear likely. On the basis of the data now available to them, and assuming no important change in the
company’s situation, they reason as follows:
• Marketing estimates indicate a 60% chance of a large market in the long run and a 40% chance of a low demand, developing
initially as follows:
High initial demand followed by sustained high demand: 60%
High initial demand followed by low subsequent demand: 10%
Low initial demand (remaining low): 30%
• Therefore, the chance that demand initially will be high is 70% (60 + 10). If demand is high initially, the company estimates that
the chance it will continue at a high level is 86% (60 ÷ 70). Comparing 86% to 60%, it is apparent that a high initial level of sales
changes the estimated chance of high sales in the subsequent periods. Similarly, if sales in the initial period are low, the chances
are 100% (30 ÷ 30) that sales in the subsequent periods will be low. Thus the level of sales in the initial period is expected to be a
rather accurate indicator of the level of sales in the subsequent periods. (A short numerical check of these figures appears after this list.)
• Estimates of annual income are made under the assumption of each alternative outcome:
1. A large plant with high volume would yield $1,000,000 annually in cash flow.
2. A large plant with low volume would yield only $100,000 because of high fixed costs and inefficiencies.
3. A small plant with low demand would be economical and would yield annual cash income of $400,000.
4. A small plant, during an initial period of high demand, would yield $450,000 per year, but this would drop to $300,000 yearly in
the long run because of competition. (The market would be larger than under Alternative 3, but would be divided up among more
competitors.)
5. If the small plant were expanded to meet sustained high demand, it would yield $700,000 cash flow annually, and so would be
less efficient than a large plant built initially.
6. If the small plant were expanded but high demand were not sustained, estimated annual cash flow would be $50,000.
• It is estimated further that a large plant would cost $3 million to put into operation, a small plant would cost $1.3 million, and the
expansion of the small plant would cost an additional $2.2 million.
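The conditional probabilities quoted in the first bullet follow directly from the three joint probabilities listed above. A minimal check in Python (the variable names are mine, not the article's):

```python
# Joint probabilities from the marketing estimates above.
p_high_then_high = 0.60   # demand high initially and sustained
p_high_then_low  = 0.10   # demand high initially, low thereafter
p_low_throughout = 0.30   # demand low from the start (and stays low)

p_initial_high = p_high_then_high + p_high_then_low            # 0.70
p_sustained_given_high = p_high_then_high / p_initial_high     # ~0.857, i.e. 86%
p_low_given_low_start = p_low_throughout / p_low_throughout    # 1.00, i.e. 100%

print(round(p_initial_high, 2),
      round(p_sustained_given_high, 2),
      round(p_low_given_low_start, 2))  # 0.7 0.86 1.0
```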
When the foregoing data are incorporated, we have the decision tree shown in Exhibit IV. Bear in mind that nothing is shown here
which Stygian Chemical’s executives did not know before; no numbers have been pulled out of hats. However, we are beginning
to see dramatic evidence of the value of decision trees in laying out what management knows in a way that enables more
systematic analysis and leads to better decisions. To sum up the requirements of making a decision tree, management must:
1. Identify the points of decision and alternatives available at each point.
2. Identify the points of uncertainty and the type or range of alternative outcomes at each point.
3. Estimate the values needed to make the analysis, especially the probabilities of different events or results of action and the
costs and gains of various events and actions.
4. Analyze the alternative values to choose a course.
Choosing Course of Action
We are now ready for the next step in the analysis—to compare the consequences of different courses of action. A decision tree
does not give management the answer to an investment problem; rather, it helps management determine which alternative at any
particular choice point will yield the greatest expected monetary gain, given the information and alternatives pertinent to the
decision.
Of course, the gains must be viewed with the risks. At Stygian Chemical, as at many corporations, managers have different points
of view toward risk; hence they will draw different conclusions in the circumstances described by the decision tree shown in
Exhibit IV. The many people participating in a decision—those supplying capital, ideas, data, or decisions, and having different
values at risk—will see the uncertainty surrounding the decision in different ways. Unless these differences are recognized and
dealt with, those who must make the decision, pay for it, supply data and analyses to it, and live with it will judge the issue,
relevance of data, need for analysis, and criterion of success in different and conflicting ways.
For example, company stockholders may treat a particular investment as one of a series of possibilities, some of which will work
out, others of which will fail. A major investment may pose risks to a middle manager—to his job and career—no matter what
decision is made. Another participant may have a lot to gain from success, but little to lose from failure of the project. The nature
of the risk—as each individual sees it—will affect not only the assumptions he is willing to make but also the strategy he will follow
in dealing with the risk.
The existence of multiple, unstated, and conflicting objectives will certainly contribute to the “politics” of Stygian Chemical’s
decision, and one can be certain that the political element exists whenever the lives and ambitions of people are affected. Here,
as in similar cases, it is not a bad exercise to think through who the parties to an investment decision are and to try to make these
assessments:
• What is at risk? Is it profit or equity value, survival of the business, maintenance of a job, opportunity for a major career?
• Who is bearing the risk? The stockholder is usually bearing risk in one form. Management, employees, the community—all may be
bearing different risks.
• What is the character of the risk that each person bears? Is it, in his terms, unique, once-in-a-lifetime, sequential, insurable? Does
it affect the economy, the industry, the company, or a portion of the company?
Considerations such as the foregoing will surely enter into top management’s thinking, and the decision tree in Exhibit IV will not
eliminate them. But the tree will show management what decision today will contribute most to its long-term goals. The tool for this
next step in the analysis is the concept of “rollback.”
“Rollback” concept
Here is how rollback works in the situation described. At the time of making Decision #1 (see Exhibit IV), management does not
have to make Decision #2 and does not even know if it will have the occasion to do so. But if it were to have the option at
Decision #2, the company would expand the plant, in view of its current knowledge. The analysis is shown in Exhibit V. (I shall
ignore for the moment the question of discounting future profits; that is introduced later.) We see that the total expected value of
the expansion alternative is $160,000 greater than the no-expansion alternative, over the eight-year life remaining. Hence that is
the alternative management would choose if faced with Decision #2 with its existing information (and thinking only of monetary
gain as a standard of choice).
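Using only the estimates listed earlier (eight remaining years, an 86%/14% split between sustained high and low demand after a high start, and the $2.2 million expansion cost), the Decision #2 comparison can be reproduced with a short, undiscounted calculation. This is a sketch of the arithmetic behind Exhibit V, not the exhibit itself; all figures are in thousands of dollars.

```python
YEARS_REMAINING = 8
P_HIGH, P_LOW = 0.86, 0.14   # chances of sustained high / low demand after a high start

# Expand the small plant: $700k/yr if high demand is sustained, $50k/yr if it is not,
# less the $2,200k cost of the expansion.
expand = (P_HIGH * 700 + P_LOW * 50) * YEARS_REMAINING - 2_200

# Do not expand: $300k/yr under sustained high demand (the market is shared),
# $400k/yr if demand falls back to a low level.
no_expand = (P_HIGH * 300 + P_LOW * 400) * YEARS_REMAINING

print(round(expand), round(no_expand), round(expand - no_expand))  # 2672 2512 160
```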
Readers may wonder why we started with Decision #2 when today’s problem is Decision #1. The reason is the following: We
need to be able to put a monetary value on Decision #2 in order to “roll back” to Decision #1 and compare the gain from taking the
lower branch (“Build Small Plant”) with the gain from taking the upper branch (“Build Big Plant”). Let us call that monetary value
for Decision #2 its position value. The position value of a decision is the expected value of the preferred branch (in this case, the
plant-expansion fork). The expected value is simply a kind of average of the results you would expect if you were to repeat the
situation over and over—getting a $5,600 thousand yield 86% of the time and a $400 thousand yield 14% of the time.
Stated in another way, it is worth $2,672 thousand to Stygian Chemical to get to the position where it can make Decision #2. The
question is: Given this value and the other data shown in Exhibit IV, what now appears to be the best action at Decision #1?
Turn now to Exhibit VI. At the right of the branches in the top half we see the yields for various events if a big plant is built (these
are simply the figures in Exhibit IV multiplied out). In the bottom half we see the small plant figures, including Decision #2 position
value plus the yield for the two years prior to Decision #2. If we reduce all these yields by their probabilities, we get the following
comparison:
Build big plant: ($10 × .60) + ($2.8 × .10) + ($1 × .30) – $3 = $3,600 thousand
Build small plant: ($3.6 × .70) + ($4 × .30) – $1.3 = $2,400 thousand
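The same roll-back arithmetic for Decision #1, written as a sketch using the figures quoted above (all amounts in thousands of dollars). The small difference from the article's rounded totals, such as 3,580 versus $3,600 thousand for the big plant, is only rounding.

```python
# Position value of Decision #2: expected value of the preferred (expansion) branch,
# i.e. eight years of expanded-plant cash flow less the $2,200k expansion cost.
position_value = 0.86 * 5_600 + 0.14 * 400 - 2_200          # about 2,672

# Build big plant: ten years of cash flow under each demand pattern, less the $3,000k cost.
big_plant = (0.60 * 10_000      # high demand sustained: $1,000k/yr for 10 years
             + 0.10 * 2_800     # high start, then low: 2 yrs at $1,000k + 8 yrs at $100k
             + 0.30 * 1_000     # low throughout: $100k/yr for 10 years
             - 3_000)

# Build small plant: two introductory years of cash flow plus, if demand starts high,
# the Decision #2 position value, all less the $1,300k plant cost.
small_plant = (0.70 * (2 * 450 + position_value)   # high start
               + 0.30 * 4_000                      # low throughout: $400k/yr for 10 years
               - 1_300)

print(round(position_value), round(big_plant), round(small_plant))  # 2672 3580 2400
```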
The choice which maximizes expected total cash yield at Decision #1, therefore, is to build the big plant initially.
Accounting for Time
What about taking differences in the time of future earnings into account? The time between successive decision stages on a
decision tree may be substantial. At any stage, we may have to weigh differences in immediate cost or revenue against
differences in value at the next stage. Whatever standard of choice is applied, we can put the two alternatives on a comparable
basis if we discount the value assigned to the next stage by an appropriate percentage. The discount percentage is, in effect, an
allowance for the cost of capital and is similar to the use of a discount rate in the present value or discounted cash flow
techniques already well known to businessmen.
When decision trees are used, the discounting procedure can be applied one stage at a time. Both cash flows and position values
are discounted.
For simplicity, let us assume that a discount rate of 10% per year for all stages is decided on by Stygian Chemical’s management.
Applying the rollback principle, we again begin with Decision #2. Taking the same figures used in previous exhibits and
discounting the cash flows at 10%, we get the data shown in Part A of Exhibit VII. Note particularly that these are the present
values as of the time Decision #2 is made.
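The discounting step itself is just the ordinary present-value formula applied year by year. A generic sketch follows; the cash-flow stream shown is a hypothetical illustration, not the Exhibit VII figures.

```python
def present_value(cash_flows, rate):
    """Discount a list of year-end cash flows back to the start of the stage."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

# Hypothetical stream: $700k per year for eight years, discounted at 10%.
stream = [700] * 8                         # thousands of dollars, years 1 through 8
print(round(present_value(stream, 0.10)))  # about 3,734 (thousands)
```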
Now we want to go through the same procedure used in Exhibit V when we obtained expected values, only this time using the
discounted yield figures and obtaining a discounted expected value. The results are shown in Part B of Exhibit VII. Since the
discounted expected value of the no-expansion alternative is higher, that figure becomes the position value of Decision #2 this
time.
Having done this, we go back to work through Decision #1 again, repeating the same analytical procedure as before only with
discounting. The calculations are shown in Exhibit VIII. Note that the Decision #2 position value is treated at the time of Decision
#1 as if it were a lump sum received at the end of the two years.
The large-plant alternative is again the preferred one on the basis of discounted expected cash flow. But the margin of difference
over the small-plant alternative ($290 thousand) is smaller than it was without discounting.
Uncertainty Alternatives
In illustrating the decision-tree concept, I have treated uncertainty alternatives as if they were discrete, well-defined possibilities.
For my examples I have made use of uncertain situations depending basically on a single variable, such as the level of demand or
the success or failure of a development project. I have sought to avoid unnecessary complication while putting emphasis on the
key interrelationships among the present decision, future choices, and the intervening uncertainties.
In many cases, the uncertain elements do take the form of discrete, single-variable alternatives. In others, however, the
possibilities for cash flow during a stage may range through a whole spectrum and may depend on a number of independent or
partially related variables subject to chance influences—cost, demand, yield, economic climate, and so forth. In these cases, we
have found that the range of variability or the likelihood of the cash flow falling in a given range during a stage can be calculated
readily from knowledge of the key variables and the uncertainties surrounding them. Then the range of cash-flow possibilities
during the stage can be broken down into two, three, or more “subsets,” which can be used as discrete chance alternatives.
Conclusion
Peter F. Drucker has succinctly expressed the relation between present planning and future events: “Long-range planning does
not deal with future decisions. It deals with the futurity of present decisions.”2 Today’s decision should be made in light of the
anticipated effect it and the outcome of uncertain events will have on future values and decisions. Since today’s decision sets the
stage for tomorrow’s decision, today’s decision must balance economy with flexibility; it must balance the need to capitalize on
profit opportunities that may exist with the capacity to react to future circumstances and needs.
The unique feature of the decision tree is that it allows management to combine analytical techniques such as discounted cash
flow and present value methods with a clear portrayal of the impact of future decision alternatives and events. Using the decision
tree, management can consider various courses of action with greater ease and clarity. The interactions between present decision
alternatives, uncertain events, and future choices and their results become more visible.
Of course, there are many practical aspects of decision trees in addition to those that could be covered in the space of just one
article. When these other aspects are discussed in subsequent articles,3 the whole range of possible gains for management will
be seen in greater detail.
Surely the decision-tree concept does not offer final answers to managements making investment decisions in the face of
uncertainty. We have not reached that stage, and perhaps we never will. Nevertheless, the concept is valuable for illustrating the
structure of investment decisions, and it can likewise provide excellent help in the evaluation of capital investment opportunities.
1. Optimal Investment Decisions: Rules for Action and Criteria for Choice (Englewood Cliffs, New Jersey, Prentice-Hall, Inc., 1962), p. 250.
2. “Long-Range Planning,” Management Science, April 1959, p. 239.
3. We are expecting another article by Mr. Magee in a forthcoming issue.—The Editors
Appendix
For readers interested in further examples of decision-tree structure, I shall describe in this appendix two representative situations
with which I am familiar and show the trees that might be drawn to analyze management’s decision-making alternatives. We shall
not concern ourselves here with costs, yields, probabilities, or expected values.
New Facility
The choice of alternatives in building a plant depends upon market forecasts. The alternative chosen will, in turn, affect the market
outcome. For example, the military products division of a diversified firm, after some period of low profits due to intense
competition, has won a contract to produce a new type of military engine suitable for Army transport vehicles. The division has a
contract to build productive capacity and to produce at a specified contract level over a period of three years.
Figure A illustrates the situation. The dotted line shows the contract rate. The solid line shows the proposed buildup of production
for the military. Some other possibilities are portrayed by dashed lines. The company is not sure whether the contract will be
continued at a relatively high rate after the third year, as shown by Line A, or whether the military will turn to another newer
development, as indicated by Line B. The company has no guarantee of compensation after the third year. There is also the
possibility, indicated by Line C, of a large additional commercial market for the product, this possibility being somewhat dependent
on the cost at which the product can be made and sold.
If this commercial market could be tapped, it would represent a major new business for the company and a substantial
improvement in the profitability of the division and its importance to the company.
Management wants to explore three ways of producing the product as follows:
1. It might subcontract all fabrication and set up a simple assembly with limited need for investment in plant and equipment; the
costs would tend to be relatively high and the company’s investment and profit opportunity would be limited, but the company
assets which are at risk would also be limited.
2. It might undertake the major part of the fabrication itself but use general-purpose machine tools in a plant of general-purpose
construction. The division would have a chance to retain more of the most profitable operations itself, exploiting some technical
developments it has made (on the basis of which it got the contract). While the cost of production would still be relatively high, the
nature of the investment in plant and equipment would be such that it could probably be turned to other uses or liquidated if the
business disappeared.
3. The company could build a highly mechanized plant with specialized fabrication and assembly equipment, entailing the largest
investment but yielding a substantially lower unit manufacturing cost if manufacturing volume were adequate. Following this plan
would improve the chances for a continuation of the military contract and penetration into the commercial market and would
improve the profitability of whatever business might be obtained in these markets. Failure to sustain either the military or the
commercial market, however, would cause substantial financial loss.
Either of the first two alternatives would be better adapted to low-volume production than would the third.
Some major uncertainties are: the cost-volume relationships under the alternative manufacturing methods; the size and structure
of the future market—this depends in part on cost, but the degree and extent of dependence are unknown; and the possibilities of
competitive developments which would render the product competitively or technologically obsolete.
How would this situation be shown in decision-tree form? (Before going further you might want to draw a tree for the problem
yourself.) Figure B shows my version of a tree. Note that in this case the chance alternatives are somewhat influenced by the
decision made. A decision, for example, to build a more efficient plant will open possibilities for an expanded market.
Plant Modernization
A company management is faced with a decision on a proposal by its engineering staff which, after three years of study, wants to
install a computer-based control system in the company’s major plant. The expected cost of the control system is some $30
million. The claimed advantages of the system will be a reduction in labor cost and an improved product yield. These benefits
depend on the level of product throughput, which is likely to rise over the next decade. It is thought that the installation program
will take about two years and will cost a substantial amount over and above the cost of equipment. The engineers calculate that
the automation project will yield a 20% return on investment, after taxes; the projection is based on a ten-year forecast of product
demand by the market research department, and an assumption of an eight-year life for the process control system.
What would this investment yield? Will actual product sales be higher or lower than forecast? Will the process work? Will it
achieve the economies expected? Will competitors follow if the company is successful? Are they going to mechanize anyway?
Will new products or processes make the basic plant obsolete before the investment can be recovered? Will the controls last eight
years? Will something better come along sooner?
The initial decision alternatives are (a) to install the proposed control system, (b) postpone action until trends in the market and/or
competition become clearer, or (c) initiate more investigation or an independent evaluation. Each alternative will be followed by
resolution of some uncertain aspect, in part dependent on the action taken. This resolution will lead in turn to a new decision. The
dotted lines at the right of Figure C indicate that the decision tree continues indefinitely, though the decision alternatives do tend
to become repetitive. In the case of postponement or further study, the decisions are to install, postpone, or restudy; in the case of
installation, the decisions are to continue operation or abandon.
An immediate decision is often one of a sequence. It may be one of a number of sequences. The impact of the present decision in
narrowing down future alternatives and the effect of future alternatives in affecting the value of the present choice must both be
considered