Measures of Central Tendency
(a quick review)
You are already familiar with measures of central tendency used with single data
sets:
mean, median and mode.
Let's quickly refresh our memories on these methods of indicating the center of a
data set:
Mean (or average):
• The mean is the number found by adding all of the values in the data set and dividing by the total number of values in that set.
Median (middle):
• The median is the middle number in an ordered data set. The number of values that precede the median will be the same as the number of values that follow it.
To find the median (n is the number of values in the data set):
1. Arrange the values in the data set into increasing or decreasing order.
2. If n is odd, the number in the middle is the median.
3. If n is even, the median is the average of the two middle numbers.
Mode (most):
(least reliable indicator of the center of
the data set)
• is the value in the data set that
occurs most often. When in table
form, the mode is the value with the
highest frequency.
If there is no repeated number in the
set, there is no mode.
It is possible that a set has more than
one mode.
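These three measures can also be checked with a short program. The following is a minimal Python sketch using the standard library's statistics module; the data set is made up purely for illustration:

from statistics import mean, median, multimode

data = [4, 7, 7, 2, 9, 4, 7, 5]      # illustrative data set

print(mean(data))        # add all values and divide by how many there are -> 5.625
print(median(data))      # middle of the ordered set; average of the two middle values here -> 6.0
print(multimode(data))   # most frequently occurring value(s); a set can have more than one mode -> [7]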
See how to use your TI-83+/TI-84+ graphing calculator with mean, mode and median, and with grouped data (frequency tables).
It is possible to get a sense of a data set's distribution by examining a five
statistical summary, the (1) minimum, (2) maximum,
(3) median (or second quartile), (4) the first quartile, and (5) the third
quartile. Such information will show the extent to which the data is located
near the median or near the extremes.
Quartiles:
We know that the median of a set of data separates the data into two equal
parts. Data can be further separated into quartiles. Quartiles separate the
original set of data into four equal parts. Each of these parts contains one-fourth of the data.
Quartiles are percentiles that divide the data into fourths.
• The first quartile is the middle (the median) of the lower half of the data. One-fourth of the data lies below the first quartile and three-fourths lies above. (This is the 25th percentile.)
• The second quartile is another name for the median of the entire set of data: median of data set = second quartile of data set. (This is the 50th percentile.)
• The third quartile is the middle (the median) of the upper half of the data. Three-fourths of the data lies below the third quartile and one-fourth lies above. (This is the 75th percentile.)
A quartile is a number; it is not a range of values. A value can be described
as "above" or "below" the first quartile, but a value is never "in" the first
quartile.
Consider:
Check out this five statistical summary for a set of test scores:
minimum: 65
first quartile: 70
second quartile (median): 80
third quartile: 90
maximum: 100
While we do not know every test score, we do know that half of the scores are below 80 and half are above 80. We also know that half of the scores are between 70 and 90.
The difference between the third and first quartiles is called the interquartile
range, IQR.
For this example, the interquartile range is 90 - 70 = 20.
The interquartile range (IQR), also called the midspread or middle fifty, is the
range between the third and first quartiles and is considered a more stable
statistic than the total range. The IQR contains 50% of the data.
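The five statistical summary and the IQR can also be computed with a few lines of Python, using the "median of each half" approach described above; the data set below is invented for illustration:

from statistics import median

data = sorted([65, 70, 72, 78, 80, 82, 88, 90, 100])   # hypothetical ordered test scores (n = 9)

n = len(data)
lower_half = data[: n // 2]           # values below the median position
upper_half = data[(n + 1) // 2 :]     # values above the median position

q1 = median(lower_half)               # first quartile
q2 = median(data)                     # second quartile (the median)
q3 = median(upper_half)               # third quartile

print(min(data), q1, q2, q3, max(data))   # the five statistical summary
print(q3 - q1)                            # the interquartile range (IQR)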
Box and Whisker Plots:
A five statistical summary can be represented graphically as a box and
whisker plot. The first and third quartiles are at the ends of the box, the
median is indicated with a vertical line in the interior of the box, and the
maximum and minimum are at the ends of the whiskers.
See how to use your TI-83+/TI-84+ graphing calculator with box and whisker plots.
Box-and-whisker plots are helpful in interpreting the
distribution of data.
NOTE: You may see a box-and-whisker plot which contains an asterisk.
Sometimes there is ONE piece of
data that falls well outside the range
of the other values. This single
piece of data is called an outlier. If
the outlier is included in the
whisker, readers may think that there are values dispersed throughout the whole range from the first quartile to the outlier,
which is not true. To avoid this
misconception, an * is used to mark
this "out of the ordinary" value.
Example of working with grouped data:
A survey was taken in a biology class regarding the number of siblings of each student. The table shows
the class data with the frequency of responses. The mean of this data is 2.5. Find
the value of k in the table.
Siblings:    1   2   3   4   5
Frequency:   5   k   8   4   1
Solution: Set up for finding the average (mean), simplify, and solve.
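One way to set this up, using the table and the given mean of 2.5:

(1·5 + 2·k + 3·8 + 4·4 + 5·1) / (5 + k + 8 + 4 + 1) = 2.5
(50 + 2k) / (18 + k) = 2.5
50 + 2k = 45 + 2.5k
5 = 0.5k, so k = 10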
Measures of Dispersion
While knowing the mean value for a set of data may give us some information
about the set itself, many varying sets can have the same mean value. To
determine how the sets are different, we need more information. Another way of
examining single variable data is to look at how the data is spread out, or dispersed
about the mean.
We will discuss 4 ways of examining the dispersion of data.
The smaller the values from these methods, the more consistent the data.
1. Range:
The simplest of our methods for measuring dispersion
is range. Range is the difference between the largest value and the smallest value
in the data set. While being simple to compute, the range is often unreliable as a
measure of dispersion since it is based on only two values in the set.
A range of 50 (say, from a low value of 12 to a high value of 62) tells us very little about how the values are dispersed.
Are the values all clustered to one end, with the low value (12) or the high value (62) being an outlier?
Or are the values more evenly dispersed across the range?
Before discussing our next methods, let's establish some vocabulary:
Population form:
The population form is used when the data being analyzed includes the entire set of possible data (for example, all people living in the US). When using this form, divide by n, the number of values in the data set.
Sample form:
The sample form is used when the data is a random sample taken from the entire set of data (for example, Sam, Pete and Claire, who live in the US). When using this form, divide by n - 1. (It can be shown that dividing by n - 1 makes S², the sample variance, a better estimate of σ², the variance of the population from which the sample was taken.)
The population form should be used unless you know a random sample is being analyzed.
2. Mean Absolute Deviation (MAD):
The mean absolute deviation is the mean (average) of the absolute value of the
difference between the individual values in the data set and the mean. The method
tries to measure the average distance between the values in the data set and the
mean.
3. Variance:
To find the variance:
• subtract the mean from each of the values in the data set,
• square each result,
• add all of these squares,
• and divide by the number of values in the data set.
4. Standard Deviation:
Standard deviation is the square root of the variance. In symbols, for a data set with mean x̄ (sample) or μ (population), the formulas are:
population variance: σ² = Σ(x - μ)² / n,  population standard deviation: σ = √( Σ(x - μ)² / n )
sample variance: s² = Σ(x - x̄)² / (n - 1),  sample standard deviation: s = √( Σ(x - x̄)² / (n - 1) )
Mean absolute deviation, variance and standard deviation are ways to describe the difference
between the mean and the values in the data set without worrying about the signs of these
differences.
These values are usually computed using a calculator.
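For readers who prefer to double-check these measures with a short program, here is a minimal Python sketch using the standard library; it uses the data set from Example 2 below, and the MAD line simply codes the definition given above:

from statistics import mean, pvariance, pstdev, variance, stdev

data = [2, 5, 7, 9, 1, 3, 4, 2, 6, 7, 11, 5, 8, 2, 4]

m = mean(data)
mad = sum(abs(x - m) for x in data) / len(data)   # mean absolute deviation

print(round(mad, 1))     # 2.3
print(pvariance(data))   # population variance (divide by n)
print(pstdev(data))      # population standard deviation
print(variance(data))    # sample variance (divide by n - 1)
print(stdev(data))       # sample standard deviation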
Warning!!!
Be sure you know where to find "population" forms versus
"sample" forms on the calculator. If you are unsure, check out the information at
these links.
See how to use your TI-83+/TI-84+ graphing calculator with measures of dispersion, both for individual data and for grouped data.
Examples:
1. Find, to the nearest tenth, the standard deviation and variance of the
distribution:
Score:      100   200   300   400   500
Frequency:   15    21    19    24    17
Solution: Grab your graphing calculator. Enter the scores and frequencies in lists, choose 1-Var Stats, and enter as grouped data. The population variance is 17967.7 and the population standard deviation is 134.0.
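The same grouped-data result can be cross-checked with a short Python sketch (not part of the original lesson) by expanding the frequency table into a full list of scores:

from statistics import pvariance, pstdev

scores = [100, 200, 300, 400, 500]
freqs = [15, 21, 19, 24, 17]

# Expand the frequency table into the full list of 96 individual scores.
data = [s for s, f in zip(scores, freqs) for _ in range(f)]

print(round(pvariance(data), 1))   # population variance, about 17967.7
print(round(pstdev(data), 1))      # population standard deviation, about 134.0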
2. Find, to the nearest tenth, the mean absolute deviation for the set
{2, 5, 7, 9, 1, 3, 4, 2, 6, 7, 11, 5, 8, 2, 4}.
Solution: Enter the data in a list. Be sure to have the calculator first determine the mean. The mean absolute deviation is 2.3.
For more detailed information on using the graphing calculator, follow the links provided
above.
Created by Donna Roberts
Copyright 1998-2012 http://regentsprep.org
Oswego City School District Regents Exam Prep Center
Practice with Central Tendency and Dispersion
Choose the best answer to the following questions.
Grab your calculator.
1. The table displays the frequency of scores on a twenty point quiz. The mean of the quiz scores is 18. Find the value of k in the table.
   Score:      15  16  17  18  19  20
   Frequency:   2   4   7  13   k   5
   Choose:  8    11    12

2. The table displays the frequency of scores on a 10 point quiz. Find the median of the scores.
   Score:      5   6   7   8   9   10
   Frequency:  1   5   8  14  12    7
   Choose:  7    8    9

3. The table displays the number of uncles of each student in a class of Algebra 2. Find the mean, median and mode of the uncles per student for this data set. Express answers to the nearest hundredth.
   Number of uncles:  0   1   2   3   4   5
   Frequency:         2   5   4   6  10   8
   Choose (answers are stated in the order mean, median, mode):
   3.17, 4, 4     3.17, 3, 4     3.18, 4, 4

4. The average amount earned by 110 juniors for a week was $35, while during the same week 90 seniors averaged $50. What were the average earnings for that week for the combined group?
   Choose:  $41.75    $43.50    $47.55

5. For the data set {5, 4, 2, 5, 9, 3, 4, 5, 3, 1, 6, 7, 5, 8, 3, 7}, find the interquartile range.
   Choose:  3    3.5    6.5

6. Find, to the nearest tenth, the standard deviation of the distribution:
   Score:       1   2   3   4   5
   Frequency:  14  15  14  17  10
   Choose:  1.3    1.4    2.9

7. If all of the data in a set were multiplied by 8, the variance of the new data set would be changed by a factor of ____.
   Choose:  4    8    16    64

8. If the five numbers {3, 4, 7, x, y} have a mean of 5 and a standard deviation of ___, find x and y given that y > x.
   Choose:  x = 0, y = 1    x = 0, y = 4    x = 0, y = 6    x = 5, y = 6
Sampling (statistics)
From Wikipedia, the free encyclopedia
Not to be confused with Sample (statistics).
For computer simulation, see pseudo-random number sampling.
[Figure: A visual representation of the sampling process.]
In statistics, quality assurance, and survey methodology, sampling is concerned with the
selection of a subset of individuals from within a statistical population to estimate characteristics
of the whole population. Each observation measures one or more properties (such as weight,
location, color) of observable bodies distinguished as independent objects or individuals.
In survey sampling, weights can be applied to the data to adjust for the sample design,
particularly stratified sampling. Results from probability theory and statistical theory are employed
to guide practice. In business and medical research, sampling is widely used for gathering
information about a population.[1]
The sampling process comprises several stages:
• Defining the population of concern
• Specifying a sampling frame, a set of items or events possible to measure
• Specifying a sampling method for selecting items or events from the frame
• Determining the sample size
• Implementing the sampling plan
• Sampling and data collecting
• Data which can be selected
Population definition
Successful statistical practice is based on focused problem definition. In sampling, this includes
defining the population from which our sample is drawn. A population can be defined as including
all people or items with the characteristic one wishes to understand. Because there is very rarely
enough time or money to gather information from everyone or everything in a population, the
goal becomes finding a representative sample (or subset) of that population.
Sometimes what defines a population is obvious. For example, a manufacturer needs to decide
whether a batch of material from production is of high enough quality to be released to the
customer, or should be sentenced for scrap or rework due to poor quality. In this case, the batch
is the population.
Although the population of interest often consists of physical objects, sometimes we need to
sample over time, space, or some combination of these dimensions. For instance, an
investigation of supermarket staffing could examine checkout line length at various times, or a
study on endangered penguins might aim to understand their usage of various hunting grounds
over time. For the time dimension, the focus may be on periods or discrete occasions.
In other cases, our 'population' may be even less tangible. For example, Joseph Jagger studied
the behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased
wheel. In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the
wheel (i.e. the probability distribution of its results over infinitely many trials), while his 'sample'
was formed from observed results from that wheel. Similar considerations arise when taking
repeated measurements of some physical characteristic such as the electrical
conductivity of copper.
This situation often arises when we seek knowledge about the cause system of which
the observed population is an outcome. In such cases, sampling theory may treat the observed
population as a sample from a larger 'superpopulation'. For example, a researcher might study
the success rate of a new 'quit smoking' program on a test group of 100 patients, in order to
predict the effects of the program if it were made available nationwide. Here the superpopulation
is "everybody in the country, given access to this treatment" – a group which does not yet exist,
since the program isn't yet available to all.
Note also that the population from which the sample is drawn may not be the same as the
population about which we actually want information. Often there is large but not complete
overlap between these two groups due to frame issues etc. (see below). Sometimes they may be
entirely separate – for instance, we might study rats in order to get a better understanding of
human health, or we might study records from people born in 2008 in order to make predictions
about people born in 2009.
Time spent in making the sampled population and population of concern precise is often well
spent, because it raises many issues, ambiguities and questions that would otherwise have been
overlooked at this stage.
Sampling frame
Main article: Sampling frame
In the most straightforward case, such as the sentencing of a batch of material from production
(acceptance sampling by lots), it is possible to identify and measure every single item in the
population and to include any one of them in our sample. However, in the more general case this
is not possible. There is no way to identify all rats in the set of all rats. Where voting is not
compulsory, there is no way to identify which people will actually vote at a forthcoming election
(in advance of the election). These imprecise populations are not amenable to sampling in any of
the ways below and to which we could apply statistical theory.
As a remedy, we seek a sampling frame which has the property that we can identify every single
element and include any in our sample.[2][3][4][5] The most straightforward type of frame is a list of
elements of the population (preferably the entire population) with appropriate contact information.
For example, in an opinion poll, possible sampling frames include an electoral register and
a telephone directory.
Probability and nonprobability sampling
Probability sampling
A probability sample is a sample in which every unit in the population has a chance (greater
than zero) of being selected in the sample, and this probability can be accurately determined.
The combination of these traits makes it possible to produce unbiased estimates of population
totals, by weighting sampled units according to their probability of selection.
Example: We want to estimate the total income of adults living in a given street. We visit each
household in that street, identify all adults living there, and randomly select one adult from each
household. (For example, we can allocate each person a random number, generated from
a uniform distribution between 0 and 1, and select the person with the highest number in each
household). We then interview the selected person and find their income.
People living on their own are certain to be selected, so we simply add their income to our
estimate of the total. But a person living in a household of two adults has only a one-in-two
chance of selection. To reflect this, when we come to such a household, we would count the
selected person's income twice towards the total. (The person who is selected from that
household can be loosely viewed as also representing the person who isn't selected.)
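The weighting logic in this example amounts to multiplying each selected person's income by the inverse of their selection probability (their household size). A minimal Python sketch, with invented household data, looks like this:

# Estimate the street's total income by weighting each sampled adult
# by 1 / (probability of selection) = household size.
sampled = [
    (1, 30000),   # lives alone: selected with probability 1
    (2, 42000),   # one of two adults: selected with probability 1/2
    (3, 25000),   # one of three adults: selected with probability 1/3
]

estimated_total = sum(size * income for size, income in sampled)
print(estimated_total)   # 30000 + 2*42000 + 3*25000 = 189000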
In the above example, not everybody has the same probability of selection; what makes it a
probability sample is the fact that each person's probability is known. When every element in the
population does have the same probability of selection, this is known as an 'equal probability of
selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all
sampled units are given the same weight.
Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified
Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These
various ways of probability sampling have two things in common:
1. Every element has a known nonzero probability of being sampled, and
2. random selection is involved at some point.
Nonprobability sampling
Main article: Nonprobability sampling
Nonprobability sampling is any sampling method where some elements of the population
have no chance of selection (these are sometimes referred to as 'out of
coverage'/'undercovered'), or where the probability of selection can't be accurately determined. It
involves the selection of elements based on assumptions regarding the population of interest,
which forms the criteria for selection. Hence, because the selection of elements is nonrandom,
nonprobability sampling does not allow the estimation of sampling errors. These conditions give
rise to exclusion bias, placing limits on how much information a sample can provide about the
population. Information about the relationship between sample and population is limited, making
it difficult to extrapolate from the sample to the population.
Example: We visit every household in a given street, and interview the first person to answer the
door. In any household with more than one occupant, this is a nonprobability sample, because
some people are more likely to answer the door (e.g. an unemployed person who spends most of
their time at home is more likely to answer than an employed housemate who might be at work
when the interviewer calls) and it's not practical to calculate these probabilities.
Nonprobability sampling methods include convenience sampling, quota sampling and purposive
sampling. In addition, nonresponse effects may turn any probability design into a nonprobability
design if the characteristics of nonresponse are not well understood, since nonresponse
effectively modifies each element's probability of being sampled.
Sampling methods
Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination. Factors commonly influencing the choice between these designs include:
• Nature and quality of the frame
• Availability of auxiliary information about units on the frame
• Accuracy requirements, and the need to measure accuracy
• Whether detailed analysis of the sample is expected
• Cost/operational concerns
Simple random sampling
Main article: Simple random sampling
[Figure: A visual representation of selecting a simple random sample]
In a simple random sample (SRS) of a given size, all subsets of the frame of that size are given an
equal probability of selection. Furthermore, any given pair of elements has the same chance of selection as
any other such pair (and similarly for triples, and so on). This minimises bias and simplifies
analysis of results. In particular, the variance between individual results within the sample is a
good indicator of variance in the overall population, which makes it relatively easy to estimate the
accuracy of results.
SRS can be vulnerable to sampling error because the randomness of the selection may result in
a sample that doesn't reflect the makeup of the population. For instance, a simple random
sample of ten people from a given country will on average produce five men and five women, but
any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and
stratified techniques attempt to overcome this problem by "using information about the
population" to choose a more "representative" sample.
SRS may also be cumbersome and tedious when sampling from an unusually large target
population. In some cases, investigators are interested in "research questions specific" to
subgroups of the population. For example, researchers might be interested in examining whether
cognitive ability as a predictor of job performance is equally applicable across racial groups. SRS
cannot accommodate the needs of researchers in this situation because it does not provide
subsamples of the population. "Stratified sampling" addresses this weakness of SRS.
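As an illustrative sketch (not from the article), drawing a simple random sample from a frame takes one call to Python's standard library:

import random

frame = list(range(1, 1001))       # hypothetical sampling frame of 1000 numbered units

srs = random.sample(frame, k=10)   # without replacement; every subset of size 10 is equally likely
print(sorted(srs))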
Systematic sampling
Main article: Systematic sampling
[Figure: A visual representation of selecting a random sample using the systematic sampling technique]
Systematic sampling (also known as interval sampling) relies on arranging the study population
according to some ordering scheme and then selecting elements at regular intervals through that
ordered list. Systematic sampling involves a random start and then proceeds with the selection of
every kth element from then onwards. In this case, k = (population size/sample size). It is important
that the starting point is not automatically the first in the list, but is instead randomly chosen from
within the first to the kth element in the list. A simple example would be to select every 10th
name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a
skip of 10').
As long as the starting point is randomized, systematic sampling is a type of probability sampling.
It is easy to implement and the stratification induced can make it efficient, if the variable by which
the list is ordered is correlated with the variable of interest. 'Every 10th' sampling is especially
useful for efficient sampling from databases.
For example, suppose we wish to sample people from a long street that starts in a poor area
(house No. 1) and ends in an expensive district (house No. 1000). A simple random selection of
addresses from this street could easily end up with too many from the high end and too few from
the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th
street number along the street ensures that the sample is spread evenly along the length of the
street, representing all of these districts. (Note that if we always start at house #1 and end at
#991, the sample is slightly biased towards the low end; by randomly selecting the start between
#1 and #10, this bias is eliminated.)
However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is
present and the period is a multiple or factor of the interval used, the sample is especially likely to
be unrepresentative of the overall population, making the scheme less accurate than simple
random sampling.
For example, consider a street where the odd-numbered houses are all on the north (expensive)
side of the road, and the even-numbered houses are all on the south (cheap) side. Under the
sampling scheme given above, it is impossible to get a representative sample; either the houses
sampled will all be from the odd-numbered, expensive side, or they will all be from the even-numbered, cheap side, unless the researcher has previous knowledge of this bias and avoids it
by using a skip which ensures jumping between the two sides (any odd-numbered skip).
Another drawback of systematic sampling is that even in scenarios where it is more accurate
than SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two
examples of systematic sampling that are given above, much of the potential sampling error is
due to variation between neighbouring houses – but because this method never selects two
neighbouring houses, the sample will not give us any information on that variation.)
As described above, systematic sampling is an EPS method, because all elements have the
same probability of selection (in the example given, one in ten). It is not 'simple random sampling'
because different subsets of the same size have different selection probabilities – e.g. the set
{4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero
probability of selection.
Systematic sampling can also be adapted to a non-EPS approach; for an example, see
discussion of PPS samples below.
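A minimal sketch of the basic procedure, assuming a hypothetical ordered frame of 1000 addresses (illustrative code, not from the article):

import random

frame = list(range(1, 1001))    # hypothetical ordered list of 1000 addresses
sample_size = 100
k = len(frame) // sample_size   # sampling interval: every kth element

start = random.randint(1, k)    # random start within the first k elements
systematic_sample = frame[start - 1::k]

print(len(systematic_sample), systematic_sample[:5])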
Stratified sampling
Main article: Stratified sampling
[Figure: A visual representation of selecting a random sample using the stratified sampling technique]
Where the population embraces a number of distinct categories, the frame can be organized by
these categories into separate "strata." Each stratum is then sampled as an independent subpopulation, out of which individual elements can be randomly selected.[2] There are several
potential benefits to stratified sampling.
First, dividing the population into distinct, independent strata can enable researchers to draw
inferences about specific subgroups that may be lost in a more generalized random sample.
Second, utilizing a stratified sampling method can lead to more efficient statistical estimates
(provided that strata are selected based upon relevance to the criterion in question, instead of
availability of the samples). Even if a stratified sampling approach does not lead to increased
statistical efficiency, such a tactic will not result in less efficiency than would simple random
sampling, provided that each stratum is proportional to the group's size in the population.
Third, it is sometimes the case that data are more readily available for individual, pre-existing
strata within a population than for the overall population; in such cases, using a stratified
sampling approach may be more convenient than aggregating data across groups (though this
may potentially be at odds with the previously noted importance of utilizing criterion-relevant
strata).
Finally, since each stratum is treated as an independent population, different sampling
approaches can be applied to different strata, potentially enabling researchers to use the
approach best suited (or most cost-effective) for each identified subgroup within the population.
There are, however, some potential drawbacks to using stratified sampling. First, identifying
strata and implementing such an approach can increase the cost and complexity of sample
selection, as well as leading to increased complexity of population estimates. Second, when
examining multiple criteria, stratifying variables may be related to some, but not to others, further
complicating the design, and potentially reducing the utility of the strata. Finally, in some cases
(such as designs with a large number of strata, or those with a specified minimum sample size
per group), stratified sampling can potentially require a larger sample than would other methods
(although in most cases, the required sample size would be no larger than would be required for simple random sampling).
A stratified sampling approach is most effective when three conditions are met:
1. Variability within strata is minimized
2. Variability between strata is maximized
3. The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
Advantages over other sampling methods
1. Focuses on important subpopulations and ignores irrelevant ones.
2. Allows use of different sampling techniques for different subpopulations.
3. Improves the accuracy/efficiency of estimation.
4. Permits greater balancing of statistical power of tests of differences between strata by sampling equal numbers from strata varying widely in size.
Disadvantages
1. Requires selection of relevant stratification variables which can be difficult.
2. Is not useful when there are no homogeneous subgroups.
3. Can be expensive to implement.
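As an illustrative sketch of proportional allocation (the strata, sizes and sample size below are invented), each stratum is sampled independently and in proportion to its share of the population:

import random

# Hypothetical frame grouped into two strata, e.g. by region.
strata = {
    "urban": list(range(0, 800)),      # 800 units
    "rural": list(range(800, 1000)),   # 200 units
}
total = sum(len(units) for units in strata.values())
sample_size = 50

stratified_sample = {}
for name, units in strata.items():
    # Proportional allocation: the stratum's share of the sample matches its share of the population.
    n_h = round(sample_size * len(units) / total)
    stratified_sample[name] = random.sample(units, n_h)

print({name: len(s) for name, s in stratified_sample.items()})   # {'urban': 40, 'rural': 10}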
Poststratification
Stratification is sometimes introduced after the sampling phase in a process called
"poststratification".[2] This approach is typically implemented due to a lack of prior knowledge of
an appropriate stratifying variable or when the experimenter lacks the necessary information to
create a stratifying variable during the sampling phase. Although the method is susceptible to the
pitfalls of post hoc approaches, it can provide several benefits in the right situation.
Implementation usually follows a simple random sample. In addition to allowing for stratification
on an ancillary variable, poststratification can be used to implement weighting, which can
improve the precision of a sample's estimates.[2]
Oversampling
Choice-based sampling is one of the stratified sampling strategies. In choice-based
sampling,[6] the data are stratified on the target and a sample is taken from each stratum so that
the rare target class will be more represented in the sample. The model is then built on
this biased sample. The effects of the input variables on the target are often estimated with more
precision with the choice-based sample even when a smaller overall sample size is taken,
compared to a random sample. The results usually must be adjusted to correct for the
oversampling.
Probability-proportional-to-size sampling
In some cases the sample designer has access to an "auxiliary variable" or "size measure",
believed to be correlated to the variable of interest, for each element in the population. These
data can be used to improve accuracy in sample design. One option is to use the auxiliary
variable as a basis for stratification, as discussed above.
Another option is probability proportional to size ('PPS') sampling, in which the selection
probability for each element is set to be proportional to its size measure, up to a maximum of 1.
In a simple PPS design, these selection probabilities can then be used as the basis for Poisson
sampling. However, this has the drawback of variable sample size, and different portions of the
population may still be over- or under-represented due to chance variation in selections.
Systematic sampling theory can be used to create a probability proportionate to size sample.
This is done by treating each count within the size variable as a single sampling unit. Samples
are then identified by selecting at even intervals among these counts within the size variable.
This method is sometimes called PPS-sequential or monetary unit sampling in the case of audits
or forensic sampling.
Example: Suppose we have six schools with populations of 150, 180, 200, 220, 260, and 490
students respectively (total 1500 students), and we want to use student population as the basis
for a PPS sample of size three. To do this, we could allocate the first school numbers 1 to 150,
the second school 151 to 330 (= 150 + 180), the third school 331 to 530, and so on to the last
school (1011 to 1500). We then generate a random start between 1 and 500 (equal to 1500/3)
and count through the school populations by multiples of 500. If our random start was 137, we
would select the schools which have been allocated numbers 137, 637, and 1137, i.e. the first,
fourth, and sixth schools.
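The school example can be reproduced with a short systematic-PPS sketch (the school sizes come from the example above; the code itself is only illustrative):

import random

sizes = [150, 180, 200, 220, 260, 490]   # school populations from the example (total 1500)
n = 3                                     # number of schools to select
interval = sum(sizes) // n                # 500

# Cumulative totals: school i covers the numbers (cum[i-1], cum[i]].
cum = []
running = 0
for s in sizes:
    running += s
    cum.append(running)

start = random.randint(1, interval)       # the article's example uses 137
selection_numbers = [start + i * interval for i in range(n)]

selected = [next(i + 1 for i, c in enumerate(cum) if number <= c) for number in selection_numbers]
print(selection_numbers, selected)        # with start 137: [137, 637, 1137] -> schools [1, 4, 6]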
The PPS approach can improve accuracy for a given sample size by concentrating sample on
large elements that have the greatest impact on population estimates. PPS sampling is
commonly used for surveys of businesses, where element size varies greatly and auxiliary
information is often available—for instance, a survey attempting to measure the number of guest-nights spent in hotels might use each hotel's number of rooms as an auxiliary variable. In some
cases, an older measurement of the variable of interest can be used as an auxiliary variable
when attempting to produce more current estimates.[7]
Cluster sampling
[Figure: A visual representation of selecting a random sample using the cluster sampling technique]
Sometimes it is more cost-effective to select respondents in groups ('clusters'). Sampling is often
clustered by geography, or by time periods. (Nearly all samples are in some sense 'clustered' in
time – although this is rarely taken into account in the analysis.) For instance, if surveying
households within a city, we might choose to select 100 city blocks and then interview every
household within the selected blocks.
Clustering can reduce travel and administrative costs. In the example above, an interviewer can
make a single trip to visit several households in one block, rather than having to drive to a
different block for each household.
It also means that one does not need a sampling frame listing all elements in the target
population. Instead, clusters can be chosen from a cluster-level frame, with an element-level
frame created only for the selected clusters. In the example above, the sample only requires a
block-level city map for initial selections, and then a household-level map of the 100 selected
blocks, rather than a household-level map of the whole city.
Cluster sampling (also known as clustered sampling) generally increases the variability of sample
estimates above that of simple random sampling, depending on how the clusters differ between
themselves, as compared with the within-cluster variation. For this reason, cluster sampling
requires a larger sample than SRS to achieve the same level of accuracy – but cost savings from
clustering might still make this a cheaper option.
Cluster sampling is commonly implemented as multistage sampling. This is a complex form of
cluster sampling in which two or more levels of units are embedded one in the other. The first
stage consists of constructing the clusters that will be used to sample from. In the second stage,
a sample of primary units is randomly selected from each cluster (rather than using all units
contained in all selected clusters). In following stages, in each of those selected clusters,
additional samples of units are selected, and so on. All ultimate units (individuals, for instance)
selected at the last step of this procedure are then surveyed. This technique, thus, is essentially
the process of taking random subsamples of preceding random samples.
Multistage sampling can substantially reduce sampling costs, where the complete population list
would need to be constructed (before other sampling methods could be applied). By eliminating
the work involved in describing clusters that are not selected, multistage sampling can reduce the
large costs associated with traditional cluster sampling.[7] However, each sample may not be a full
representative of the whole population.
Quota sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as
in stratified sampling. Then judgement is used to select the subjects or units from each segment
based on a specified proportion. For example, an interviewer may be told to sample 200 females
and 300 males between the age of 45 and 60.
It is this second step which makes the technique one of non-probability sampling. In quota
sampling the selection of the sample is non-random. For example, interviewers might be tempted
to interview those who look most helpful. The problem is that these samples may be biased
because not everyone gets a chance of selection. This non-random element is its greatest weakness,
and quota versus probability has been a matter of controversy for several years.
Minimax sampling
In imbalanced datasets, where the sampling ratio does not follow the population statistics, one can resample the dataset in a conservative manner called minimax sampling.[8] Minimax sampling has its origin in the Anderson minimax ratio, whose value is proved to be 0.5: in a binary classification, the class-sample sizes should be chosen equally.[9] This ratio can be proved to be the minimax ratio only under the assumption of an LDA classifier with Gaussian distributions.[9] The notion of minimax sampling has recently been developed for a general class of classification rules, called class-wise smart classifiers. In this case, the sampling ratio of classes is selected so that the worst-case classifier error over all possible population statistics for class prior probabilities is minimized.
Accidental sampling
Accidental sampling (sometimes known as grab, convenience or opportunity sampling) is a
type of nonprobability sampling which involves the sample being drawn from that part of the
population which is close to hand. That is, a population is selected because it is readily available
and convenient. It may be through meeting the person or including a person in the sample when
one meets them or chosen by finding them through technological means such as the internet or
through phone. The researcher using such a sample cannot scientifically make generalizations
about the total population from this sample because it would not be representative enough. For
example, if the interviewer were to conduct such a survey at a shopping center early in the
morning on a given day, the people that he/she could interview would be limited to those present
there at that given time, which would not represent the views of other members of society in such
an area as would be captured if the survey were conducted at different times of day and several times per week.
This type of sampling is most useful for pilot testing. Several important considerations for
researchers using convenience samples include:
1. Are there controls within the research design or experiment which can serve to lessen
the impact of a non-random convenience sample, thereby ensuring the results will be
more representative of the population?
2. Is there good reason to believe that a particular convenience sample would or should
respond or behave differently than a random sample from the same population?
3. Is the question being asked by the research one that can adequately be answered using
a convenience sample?
In social science research, snowball sampling is a similar technique, where existing study
subjects are used to recruit more subjects into the sample. Some variants of snowball sampling,
such as respondent driven sampling, allow calculation of selection probabilities and are
probability sampling methods under certain conditions.
Line-intercept sampling
Line-intercept sampling is a method of sampling elements in a region whereby an element is
sampled if a chosen line segment, called a "transect", intersects the element.
Panel sampling
Panel sampling is the method of first selecting a group of participants through a random
sampling method and then asking that group for (potentially the same) information several times
over a period of time. Therefore, each participant is interviewed at two or more time points; each
period of data collection is called a "wave". The method was developed by sociologist Paul
Lazarsfeld in 1938 as a means of studying political campaigns.[10] This longitudinal sampling method allows estimates of changes in the population, for example with regard to chronic illness,
job stress, or weekly food expenditures. Panel sampling can also be used to inform
researchers about within-person health changes due to age or to help explain changes in
continuous dependent variables such as spousal interaction.[11] There have been several
proposed methods of analyzing panel data, including MANOVA, growth curves, and structural
equation modeling with lagged effects.
Snowball sampling
Snowball sampling involves finding a small group of initial respondents and using them to recruit
more respondents. It is particularly useful in cases where the population is hidden or difficult to
enumerate.
Theoretical sampling
Theoretical sampling[12] occurs when samples are selected on the basis of the results of the data collected so far, with a goal of developing a deeper understanding of the area or of developing theory.
Replacement of selected units
Sampling schemes may be without replacement ('WOR'—no element can be selected more than
once in the same sample) or with replacement ('WR'—an element may appear multiple times in
the one sample). For example, if we catch fish, measure them, and immediately return them to
the water before continuing with the sample, this is a WR design, because we might end up
catching and measuring the same fish more than once. However, if we do not return the fish to
the water (e.g., if we eat the fish), this becomes a WOR design.
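In Python's standard library the two designs correspond to two different functions; a quick illustrative sketch with a hypothetical pond of 20 fish:

import random

pond = ["fish_%d" % i for i in range(1, 21)]   # hypothetical population of 20 fish

wor = random.sample(pond, k=5)    # without replacement: no fish can appear twice
wr = random.choices(pond, k=5)    # with replacement: the same fish may be 'caught' again

print(wor)
print(wr)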
Sample size
Main article: Sample size
Formulas, tables, and power function charts are well-known approaches to determining sample size.
Steps for using sample size tables
1. Postulate the effect size of interest, α, and β.
2. Check sample size table:[13]
   1. Select the table corresponding to the selected α.
   2. Locate the row corresponding to the desired power.
   3. Locate the column corresponding to the estimated effect size.
   4. The intersection of the column and row is the minimum sample size required.
Sampling and data collection
Good data collection involves:
• Following the defined sampling process
• Keeping the data in time order
• Noting comments and other contextual events
• Recording non-responses
Applications of Sampling
Sampling enables the selection of right data points from within the larger data set to estimate the
characteristics of the whole population. For example, there are about 600 million tweets
produced every day. Is it necessary to look at all of them to determine the topics that are
discussed during the day? Is it necessary to look at all the tweets to determine the sentiment on
each of the topics? In manufacturing, different types of sensory data such as acoustics, vibration,
pressure, current, voltage and controller data are available at short time intervals. To predict
down-time it may not be necessary to look at all the data but a sample may be sufficient.
A theoretical formulation for sampling Twitter data has been developed.[14]
Errors in sample surveys
Main article: Sampling error
Survey results are typically subject to some error. Total errors can be classified into sampling
errors and non-sampling errors. The term "error" here includes systematic biases as well as
random errors.
Sampling errors and biases
Sampling errors and biases are induced by the sample design. They include:
1. Selection bias: When the true selection probabilities differ from those assumed in
calculating the results.
2. Random sampling error: Random variation in the results due to the elements in the
sample being selected at random.
Non-sampling error
Non-sampling errors are other errors which can impact the final survey estimates, caused by
problems in data collection, processing, or sample design. They include:
1. Over-coverage: Inclusion of data from outside of the population.
2. Under-coverage: Sampling frame does not include elements in the population.
3. Measurement error: e.g. when respondents misunderstand a question, or find it difficult
to answer.
4. Processing error: Mistakes in data coding.
5. Non-response: Failure to obtain complete data from all selected individuals.
After sampling, a review should be held of the exact process followed in sampling, rather than
that intended, in order to study any effects that any divergences might have on subsequent
analysis. A particular problem is that of non-response.
Two major types of non-response exist: unit nonresponse (referring to lack of completion of any
part of the survey) and item non-response (submission or participation in survey but failing to
complete one or more components/questions of the survey).[15][16] In survey sampling, many of the
individuals identified as part of the sample may be unwilling to participate, not have the time to
participate (opportunity cost),[17] or survey administrators may not have been able to contact
them. In this case, there is a risk of differences, between respondents and nonrespondents,
leading to biased estimates of population parameters. This is often addressed by improving
survey design, offering incentives, and conducting follow-up studies which make a repeated
attempt to contact the unresponsive and to characterize their similarities and differences with the
rest of the frame.[18] The effects can also be mitigated by weighting the data when population
benchmarks are available or by imputing data based on answers to other questions.
Nonresponse is particularly a problem in internet sampling. Reasons for this problem include
improperly designed surveys,[16] over-surveying (or survey fatigue),[11][19] and the fact that potential
participants hold multiple e-mail addresses, which they don't use anymore or don't check
regularly.
Survey weights
In many situations the sample fraction may be varied by stratum and data will have to be
weighted to correctly represent the population. Thus for example, a simple random sample of
individuals in the United Kingdom might include some in remote Scottish islands who would be
inordinately expensive to sample. A cheaper method would be to use a stratified sample with
urban and rural strata. The rural sample could be under-represented in the sample, but weighted
up appropriately in the analysis to compensate.
More generally, data should usually be weighted if the sample design does not give each
individual an equal chance of being selected. For instance, when households have equal
selection probabilities but one person is interviewed from within each household, this gives
people from large households a smaller chance of being interviewed. This can be accounted for
using survey weights. Similarly, households with more than one telephone line have a greater
chance of being selected in a random digit dialing sample, and weights can adjust for this.
Weights can also serve other purposes, such as helping to correct for non-response.
Methods of producing random samples
• Random number table
• Mathematical algorithms for pseudo-random number generators
• Physical randomization devices such as coins, playing cards or sophisticated devices such as ERNIE
History
Random sampling by using lots is an old idea, mentioned several times in the Bible. In 1786
Pierre Simon Laplace estimated the population of France by using a sample, along with ratio
estimator. He also computed probabilistic estimates of the error. These were not expressed as
modern confidence intervals but as the sample size that would be needed to achieve a particular
upper bound on the sampling error with probability 1000/1001. His estimates used Bayes'
theorem with a uniform prior probability and assumed that his sample was random. Alexander
Ivanovich Chuprov introduced sample surveys to Imperial Russia in the 1870s.[citation needed]
In the USA the 1936 Literary Digest prediction of a Republican win in the presidential
election went badly awry, due to severe bias [1]. More than two million people responded to the
study with their names obtained through magazine subscription lists and telephone directories. It
was not appreciated that these lists were heavily biased towards Republicans and the resulting
sample, though very large, was deeply flawed.[20][21]
See also
• Data collection
• Gy's sampling theory
• Horvitz–Thompson estimator
• Official statistics
• Ratio estimator
• Replication (statistics)
• Sampling (case studies)
• Sampling error
• Random-sampling mechanism
Notes
The textbook by Groves et alia provides an overview of survey methodology, including recent literature on questionnaire development (informed by cognitive psychology):
• Robert Groves, et alia. Survey methodology (2010). Second edition of the (2004) first edition. ISBN 0-471-48348-6.
The other books focus on the statistical theory of survey sampling and require some knowledge of basic statistics, as discussed in the following textbooks:
• David S. Moore and George P. McCabe (February 2005). Introduction to the practice of statistics (5th edition). W.H. Freeman & Company. ISBN 0-7167-6282-X.
• Freedman, David; Pisani, Robert; Purves, Roger (2007). Statistics (4th ed.). New York: Norton. ISBN 0-393-92972-8.
The elementary book by Scheaffer et alia uses quadratic equations from high-school algebra:
• Scheaffer, Richard L., William Mendenhal and R. Lyman Ott. Elementary survey sampling, Fifth Edition. Belmont: Duxbury Press, 1996.
More mathematical statistics is required for Lohr, for Särndal et alia, and for Cochran (classic):
• Cochran, William G. (1977). Sampling techniques (Third ed.). Wiley. ISBN 0-471-16240-X.
• Lohr, Sharon L. (1999). Sampling: Design and analysis. Duxbury. ISBN 0-534-35361-4.
• Särndal, Carl-Erik, and Swensson, Bengt, and Wretman, Jan (1992). Model assisted survey sampling. Springer-Verlag. ISBN 0-387-40620-4.
The historically important books by Deming and Kish remain valuable for insights for social scientists (particularly about the U.S. census and the Institute for Social Research at the University of Michigan):
• Deming, W. Edwards (1966). Some Theory of Sampling. Dover Publications. ISBN 0-486-64684-X. OCLC 166526.
• Kish, Leslie (1995). Survey Sampling. Wiley. ISBN 0-471-10949-5.
References
1. Salant, Priscilla, I. Dillman, and A. Don. How to conduct your own survey. No. 300.723 S3. 1994.
2. Robert M. Groves; et al. Survey methodology. ISBN 0470465468.
3. Lohr, Sharon L. Sampling: Design and analysis.
4. Särndal, Carl-Erik, and Swensson, Bengt, and Wretman, Jan. Model Assisted Survey Sampling.
5. Scheaffer, Richard L., William Mendenhal and R. Lyman Ott. Elementary survey sampling.
6. Scott, A.J.; Wild, C.J. (1986). "Fitting logistic models under case-control or choice-based sampling". Journal of the Royal Statistical Society, Series B 48: 170–182. JSTOR 2345712.
7. Lohr, Sharon L. Sampling: Design and Analysis; Särndal, Carl-Erik, and Swensson, Bengt, and Wretman, Jan. Model Assisted Survey Sampling.
8. Shahrokh Esfahani, Mohammad; Dougherty, Edward (2014). "Effect of separate sampling on classification accuracy". Bioinformatics 30 (2): 242–250. doi:10.1093/bioinformatics/btt662.
9. Anderson, Theodore (1951). "Classification by multivariate analysis". Psychometrika 16 (1): 31–50. doi:10.1007/bf02313425.
10. Lazarsfeld, P., & Fiske, M. (1938). The "panel" as a new tool for measuring opinion. The Public Opinion Quarterly, 2(4), 596–612.
11. Groves, et alia. Survey Methodology.
12. "Examples of sampling methods" (PDF).
13. Cohen, 1988.
14. Deepan Palguna, Vikas Joshi, Venkatesan Chakaravarthy, Ravi Kothari and L. V. Subramaniam (2015). Analysis of Sampling Algorithms for Twitter. International Joint Conference on Artificial Intelligence.
15. Berinsky, A. J. (2008). Survey non-response. In W. Donsbach & M. W. Traugott (Eds.), The SAGE handbook of public opinion research (pp. 309–321). Thousand Oaks, CA: Sage Publications.
16. Dillman, D. A., Eltinge, J. L., Groves, R. M., & Little, R. J. A. (2002). Survey nonresponse in design, data collection, and analysis. In R. M. Groves, D. A. Dillman, J. L. Eltinge, & R. J. A. Little (Eds.), Survey nonresponse (pp. 3–26). New York: John Wiley & Sons.
17. Dillman, D.A., Smyth, J.D., & Christian, L. M. (2009). Internet, mail, and mixed-mode surveys: The tailored design method. San Francisco: Jossey-Bass.
18. Vehovar, V., Batagelj, Z., Manfreda, K.L., & Zaletel, M. (2002). Nonresponse in web surveys. In R. M. Groves, D. A. Dillman, J. L. Eltinge, & R. J. A. Little (Eds.), Survey nonresponse (pp. 229–242). New York: John Wiley & Sons.
19. Porter, Whitcomb, Weitzer (2004). Multiple surveys of students and survey fatigue. In S. R. Porter (Ed.), Overcoming survey research problems: Vol. 121. New directions for institutional research (pp. 63–74). San Francisco, CA: Jossey-Bass.
20. David S. Moore and George P. McCabe. Introduction to the Practice of Statistics.
21. Freedman, David; Pisani, Robert; Purves, Roger. Statistics.
Standards
ISO
• ISO 2859 series
• ISO 3951 series
ASTM
• ASTM E105 Standard Practice for Probability Sampling Of Materials
• ASTM E122 Standard Practice for Calculating Sample Size to Estimate, With a Specified Tolerable Error, the Average for Characteristic of a Lot or Process
• ASTM E141 Standard Practice for Acceptance of Evidence Based on the Results of Probability Sampling
• ASTM E1402 Standard Terminology Relating to Sampling
• ASTM E1994 Standard Practice for Use of Process Oriented AOQL and LTPD Sampling Plans
• ASTM E2234 Standard Practice for Sampling a Stream of Product by Attributes Indexed by AQL
ANSI, ASQ
• ANSI/ASQ Z1.4
U.S. federal and military standards
• MIL-STD-105
• MIL-STD-1916
3. THE PRINCIPAL STEPS IN A SAMPLE SURVEY
As a preliminary to a discussion of the role that theory plays in a sample survey, it is useful to
describe briefly the steps involved in the planning and execution of a survey.
The principal steps in a survey are grouped somewhat arbitrarily under 11 headings.
3.1 Objectives of the survey
The first step in planning a sample survey is to clearly identify the general objectives of the survey.
Without a lucid statement of the objectives, it is easy in a complex survey to forget the objectives
when engrossed in the details of planning, and to make decisions that are at variance with the
objectives.
One of the principal choices is between estimating average values (population means) and total values.
Depending on this choice, the techniques for determining the optimal sample size and the estimators
are different.
A number of measures exist that have been used by various agencies to measure the economic
significance of fisheries to the regional economy. In addition, a number of performance indicators also
exist that can be used to assess the performance of fisheries management in achieving its economic
objectives (see chapter 1 and related annexes).
3.2 Population to be sampled
The word population is used to denote the aggregate from which the sample is chosen. The definition
of the population may present some problems in the fishing sector, as it should consider the complete
list of vessels and their physical and technical characteristics.
The population to be sampled (the sampled population) should coincide with the population about
which information is wanted (the target population). Sometimes, for reasons of practicability or
convenience, the sampled population is more restricted than the target population. If so, it should be
remembered that conclusions drawn from the sample apply to the sampled population. Judgement
about the extent to which these conclusions will also apply to the target population must depend on
other sources of information. Any supplementary information that can be gathered about the nature of
the differences between sampled and target population may be helpful.
For example, let us consider the Italian statistical sampling design for the estimation of “quantity and
average price of fishery products landed each calendar month in Italy by Community and EFTA
vessels” (Reg. CE n. 1382/91, modified by Reg. CE n. 2104/93). The aim of the survey is to estimate total
catches and average prices for individual species. Therefore, the sampling basis consists of the more
than 800 landing points spread over the 8 000 km of Italian coasts. It is not however feasible to
consider the list of the landing points as the list of elementary units. To overcome these difficulties, a
sampled population, distinct from the target population but including units in which the considered
phenomenon takes place, has been considered. In summary, the elementary units considered are the
landings of the vessels belonging to the sampled fleet. Thus, the list from which the sampling units
are extracted is constituted by all the vessels belonging to the Italian fishery fleet.
3.3 Data to be collected
It is well to verify that all the data are relevant to the purposes of the survey and that no essential data
are omitted. There is frequently a tendency to ask too many questions, some of which are never
subsequently analysed. An overlong questionnaire lowers the quality of the answers to important as
well as unimportant questions.
3.4 Degree of precision desired
The results of sample surveys are always subject to some uncertainty because only part of the
population has been measured and because of errors of measurement. This uncertainty can be
reduced by taking larger samples and by using superior instruments of measurement. But this usually
costs time and money. Consequently, the specification of the degree of precision wanted in the
results is an important step. This step is the responsibility of the person who is going to use the data.
It may present difficulties, since many administrators are unaccustomed to thinking in terms of the
amount of error that can be tolerated in estimates, consistent with making good decisions. The
statistician can often help at this stage.
3.5 The questionnaire and the choice of the data collectors
There may be a choice of measuring instrument and of method of approach to the population. The
survey may employ a self-administered questionnaire, an interviewer who reads a standard set of
questions with no discretion, or an interviewing process that allows much latitude in the form and
ordering of the questions. The approach may be by mail, by telephone, by personal visit, or by a
combination of the three. Much study has been made of interviewing methods and problems.
A major part of the preliminary work is the construction of record forms on which the questions and
answers are to be entered. With simple questionnaires, the answers can sometimes be pre-coded,
that is, entered in a manner in which they can be routinely transferred to mechanical equipment. In
fact, for the construction of good record forms, it is necessary to visualise the structure of the final
summary tables that will be used for drawing conclusions.
Information may be collected using a number of different survey methods. These include personal
interview, telephone interview or postal survey. The questionnaire design needs to vary based on the
approach taken.
Personal interviews involve visiting the individual from whom data are to be collected. The
interviewer controls the questionnaire, and fills in the required data. The questionnaire can be less
detailed in terms of explanatory information as the interviewer can be trained on its completion before
starting the interview process. This type of survey is best for long, complex surveys and it allows the
interviewer and fisher to agree a time convenient for both parties. It is particularly useful when the
respondent may have to go and find information such as accounts, log book records etc. The
personal interview approach also allows the interviewer to probe more fully if he/she feels that the
fisher has misunderstood a question, or information provided conflicts with other earlier statements.
Data collectors are usually external to the phenomenon that is being examined and, moreover, they
are often part of some public structure, in order to avoid possible influences due to personal interests.
However, on the basis of the experience acquired in this field by Irepa, it has been demonstrated
(Istat, Irepa 2000) that it is essential to have data collectors belonging to the fishery productive chain
in order to obtain correct and timely data. Therefore, data collectors should belong to the productive
or management fishery sectors.
During meetings on socio-economic indicators, the partners involved presented several questionnaires.
These questionnaires are designed to collect the information required to calculate the socio-economic
indicators, and some of them are reported in appendix C.
3.6 Selection of the sample design
There is a variety of plans by which the sample may be selected (simple random sample, stratified
random sample, two-stage sampling, etc.). For each plan that is considered, rough estimates of the
size of sample can be made from a knowledge of the degree of precision desired. The relative costs
and time involved for each plan are also compared before making a decision.
3.7 Sampling units
Sample units have to be drawn according to the sample design.
To draw sample units from the population, several methods can be used, depending on the type of
the chosen sample strategy:
 sample with equal probabilities
 sample with probabilities proportional to size (PPS).
In the first case, each unit of the population has the same probability of being included in the sample, while
in the case of a PPS sample each unit has a different inclusion probability, proportional to the
measure Pi = Xi / Xh, where i is a generic vessel, h is the stratum, Xi is a size parameter for vessel i
(for example its overall length) and Xh is the total of that size parameter over the stratum.
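For illustration only, here is a minimal Python sketch of PPS-style selection within a single stratum; the vessel identifiers and lengths are invented, the draw is with replacement for simplicity, and a real PPS design (e.g. systematic PPS) would also need to track joint inclusion probabilities.

    import random

    # Hypothetical stratum of vessels with their overall lengths (the size measure Xi).
    lengths = {"V01": 12.0, "V02": 18.5, "V03": 7.5, "V04": 24.0, "V05": 15.0}

    total = sum(lengths.values())                        # Xh, the stratum total of the size measure
    probs = {v: x / total for v, x in lengths.items()}   # Pi = Xi / Xh

    # Draw 2 vessels with probability proportional to size (with replacement, for simplicity).
    sample = random.choices(list(lengths), weights=list(lengths.values()), k=2)
    print(probs, sample)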
3.8 The pre-test
It has been found useful to try out the questionnaire and the field methods on a small scale. This
nearly always results in improvements in the questionnaire and may reveal other troubles that will be
serious on a large scale, for example, that the cost will be much greater than expected.
3.9 Organization of the field work
In a survey, many problems of business administration are met. The personnel must receive training
in the purpose of the survey and in the methods of measurement to be employed and must be
adequately supervised in their work.
A procedure for early checking of the quality of the returns is invaluable.
Plans must be made for handling non-response, that is, the failure of the enumerator to obtain
information from certain of the units in the sample.
3.10 Summary and analysis of the data
The first step is to edit the completed questionnaires, in the hope of amending recording errors, or at
least of deleting data that are obviously erroneous. The check on the elementary data to eliminate
non-sampling errors can be achieved by means of computer programmes implemented to correct the
erroneous values and to permit statistical data analysis. These programmes are mainly based on
graphical analysis of elementary data.
Thereafter, the computations that lead to the estimates are performed. Different methods of
estimation may be available for the same data.
In the presentation of results it is good practice to report the amount of error to be expected in the
most important estimates. One of the advantages of probability sampling is that such statements can
be made, although they have to be severely qualified if the amount of non-response is substantial.
3.11 Information gained for future surveys
The more information we have initially about a population, the easier it is to devise a sample that will
give accurate estimates. Any completed sample is potentially a guide to improved future sampling, in
the data that it supplies about the means, standard deviations, and nature of the variability of the
principal measurements and about the costs involved in getting the data. Sampling practice advances
more rapidly when provisions are made to assemble and record information of this type.
Figure 1: The principal steps in a sample survey
What is Sampling? What are its Characteristics,
Advantages and Disadvantages?
Introduction and Meaning
In research methodology, the practical formulation of the research is very important and should be
done carefully, with proper concentration and under good guidance.
During the practical formulation of the research, however, one tends to run into a large number of
problems. These problems generally relate to learning the features of the universe, or population, by
studying the characteristics of a specific part or portion of it, generally called the sample.
Sampling can therefore be defined as the method or technique of selecting a part or portion (the
sample) for study, with a view to drawing conclusions about the universe or population.
According to Mildred Parton, “Sampling method is the process or the method of drawing
a definite number of the individuals, cases or the observations from a particular
universe, selecting part of a total group for investigation.”
Basic Principles of Sampling
The theory of sampling is based on the following laws:
• Law of Statistical Regularity – This law comes from the mathematical theory of probability. According to King, "the Law of Statistical Regularity says that a moderately large number of items chosen at random from a large group are almost sure, on the average, to possess the features of the large group." According to this law, the units of the sample must be selected at random.
• Law of Inertia of Large Numbers – According to this law, other things being equal, the larger the size of the sample, the more accurate the results are likely to be.
Characteristics of the sampling technique
1. Much cheaper.
2. Saves time.
3. More reliable.
4. Very suitable for carrying out different surveys.
5. Scientific in nature.
Advantages of sampling
1. Very accurate.
2. Economical in nature.
3. Very reliable.
4. Highly suitable for different kinds of surveys.
5. Takes less time.
6. When the universe is very large, the sampling method is the only practical method for collecting the data.
Disadvantages of sampling
1. Inadequacy of the samples.
2. Chances of bias.
3. Problems of accuracy.
4. Difficulty of obtaining a representative sample.
5. Untrained manpower.
6. Absence of the informants.
7. Chances of committing errors in sampling.
This article was written by KJ Singh, an MBA graduate from a prestigious business school in India.
Types of Sampling
In applications:
Probability Sampling: Simple Random Sampling, Stratified Random Sampling,
Multi-Stage Sampling
 What is each and how is it done?
 How do we decide which to use?
 How do we analyze the results differently depending on the type of sampling?
Non-probability Sampling: Why don't we use non-probability sampling schemes?
Two reasons:
 We can't use the mathematics of probability to analyze the results.
 In general, we can't count on a non-probability sampling scheme to produce representative samples.
In mathematical statistics books (for courses that assume you have already taken
a probability course):
 Described as assumptions about random variables
 Sampling with replacement versus sampling without replacement
What are the main types of sampling and how is each done?
Simple Random Sampling: A simple random sample (SRS) of size n is produced
by a scheme which ensures that each subgroup of the population of size n has an
equal probability of being chosen as the sample.
Stratified Random Sampling: Divide the population into "strata". There can be
any number of these. Then choose a simple random sample from each stratum.
Combine those into the overall sample. That is a stratified random sample.
(Example: Church A has 600 women and 400 men as members. One way to get
a stratified random sample of size 30 is to take an SRS of 18 women from the 600
women and another SRS of 12 men from the 400 men.)
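A minimal Python sketch of this proportional allocation, using the church example above (the membership lists are hypothetical placeholders):

    import random

    # Hypothetical membership lists for Church A.
    women = ["W%d" % i for i in range(600)]
    men = ["M%d" % i for i in range(400)]

    n = 30
    total = len(women) + len(men)

    # Allocate the sample proportionally to stratum size, then take an SRS within each stratum.
    n_women = round(n * len(women) / total)   # 18
    n_men = n - n_women                       # 12
    sample = random.sample(women, n_women) + random.sample(men, n_men)
    print(n_women, n_men, len(sample))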
Multi-Stage Sampling: Sometimes the population is too large and scattered for it
to be practical to make a list of the entire population from which to draw a SRS.
For instance, when a polling organization samples US voters, they do not do a
SRS. Since voter lists are compiled by counties, they might first do a sample of the
counties and then sample within the selected counties. This illustrates two stages.
In some instances, they might use even more stages. At each stage, they might do a
stratified random sample on sex, race, income level, or any other useful variable on
which they could get information before sampling.
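A toy Python sketch of the two-stage idea just described; the counties and voter lists are invented, and a real design would also record the inclusion probabilities at each stage:

    import random

    # Hypothetical frame: voters grouped by county (the first-stage units).
    counties = {
        "Adams": ["A1", "A2", "A3", "A4"],
        "Baker": ["B1", "B2", "B3"],
        "Clark": ["C1", "C2", "C3", "C4", "C5"],
        "Dixon": ["D1", "D2", "D3"],
    }

    # Stage 1: simple random sample of counties.
    chosen_counties = random.sample(list(counties), 2)

    # Stage 2: simple random sample of voters within each chosen county.
    sample = [v for c in chosen_counties for v in random.sample(counties[c], 2)]
    print(chosen_counties, sample)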
How does one decide which type of sampling to use?
The formulas in almost all statistics books assume simple random sampling.
Unless you are willing to learn the more complex techniques to analyze the data
after it is collected, it is appropriate to use simple random sampling. To learn the
appropriate formulas for the more complex sampling schemes, look for a book or
course on sampling.
Stratified random sampling gives more precise information than simple random
sampling for a given sample size. So, if information on all members of the
population is available that divides them into strata that seem relevant, stratified
sampling will usually be used.
If the population is large and enough resources are available, usually one will use
multi-stage sampling. In such situations, usually stratified sampling will be done at
some stages.
How do we analyze the results differently depending on the type of sampling?
The main difference is in the computation of the estimates of the variance (or
standard deviation). An excellent book for self-study is A Sampler on Sampling, by
Williams, Wiley. In this, you see a rather small population and then a complete
derivation and description of the sampling distribution of the sample mean for a
particular small sample size. I believe that is accessible for any student who has
had an upper-division mathematical statistics course and for some strong students
who have had a freshman introductory statistics course. A very simple statement of
the conclusion is that the variance of the estimator is smaller if it came from a
stratified random sample than from a simple random sample of the same size. Since
small variance means more precise information from the sample, we see that this is
consistent with stratified random sampling giving better estimators for a given
sample size.
Non-probability sampling schemes
These include voluntary response sampling, judgement sampling, convenience
sampling, and maybe others.
In the early part of the 20th century, many important samples were done that
weren't based on probability sampling schemes. They led to some memorable
mistakes. Look in an introductory statistics text at the discussion of sampling for
some interesting examples. The introductory statistics books I usually teach from
are Basic Practice of Statistics by David Moore, Freeman, and Introduction to the
Practice of Statistics by Moore and McCabe, also from Freeman. A particularly
good book for a discussion of the problems of non-probability sampling
is Statistics by Freedman, Pisani, and Purves. The detail is fascinating. Or, ask a
statistics teacher to lunch and have them tell you the stories they tell in class. Most
of us like to talk about these! Someday when I have time, maybe I'll write some of
them here.
Mathematically, the important thing to recognize is that the discipline of statistics
is based on the mathematics of probability. That's about random variables. All of
our formulas in statistics are based on probabilities in sampling distributions of
estimators. To create a sampling distribution of an estimator for a sample size of
30, we must be able to consider all possible samples of size 30 and base our
analysis on how likely each individual result is.
In mathematical statistics books (for courses that assume you have already taken
a probability course) the part of the problem relating to the sampling is described
as assumptions about random variables.
Mathematical statistics texts almost always say to consider the X's (or Y's) to be
independent with a common distribution. How does this correspond to some
description of how to sample from a population? Answer: simple random
sampling with replacement.
Mary Parker.
Different Types of Sample
There are 5 different types of sample you should be able to define. You should
also understand when to use them, and what their advantages and disadvantages
are.
Simple Random Sample
Obtaining a genuine random sample is difficult. We usually use Random Number Tables, and use the following
procedure:
1. Number the population from 0 to n − 1
2. Pick a random place in the number table
3. Work in a random direction
4. Organise numbers into the required number of digits (e.g. if the size of the population is 80, use 2 digits)
5. Reject any numbers not applicable (in our example, numbers between 80 and 99)
6. Continue until the required number of samples has been collected
7. [If the sample is "without replacement", discard any repetitions of any number]
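A small Python sketch of this procedure, with a pseudo-random digit stream standing in for the random number table (population size 80, sample size 5, without replacement; all values are chosen for illustration):

    import random

    N, n, digits = 80, 5, 2      # population of 80, sample of 5, two-digit numbers
    stream = (random.randint(0, 9) for _ in iter(int, 1))   # stands in for the random number table

    sample = set()
    while len(sample) < n:
        number = int("".join(str(next(stream)) for _ in range(digits)))  # organise digits into groups
        if number < N:           # reject numbers not applicable (80 to 99)
            sample.add(number)   # a set discards repetitions ("without replacement")
    print(sorted(sample))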
Advantages:
The sample will be free from Bias (i.e. it's random!)
Disadvantages: Difficult to obtain
Due to its very randomness, "freak" results can sometimes be
obtained that are not representative of the population. In addition,
these freak results may be difficult to spot. Increasing the sample size
is the best way to eradicate this problem.
Systematic Sample
With this method, items are chosen from the population according to a fixed rule, e.g. every 10th house along a
street. This method should yield a more representative sample than the random sample (especially if the sample
size is small). It seeks to eliminate sources of bias, e.g. an inspector checking sweets on a conveyor belt might
unconsciously favour red sweets. However, a systematic method can also introduce bias, e.g. the period chosen
might coincide with the period of a faulty machine, thus yielding an unrepresentative number of faulty sweets.
Advantages:
Can eliminate other sources of bias
Disadvantages:
Can introduce bias where the pattern used for the samples coincides
with a pattern in the population.
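A minimal Python sketch of such a systematic sample (the population and the interval are placeholders):

    import random

    population = list(range(1, 101))   # e.g. 100 houses along a street
    k = 10                             # the fixed rule: every 10th house

    start = random.randrange(k)        # random starting point within the first interval
    sample = population[start::k]      # then take every k-th item
    print(start, sample)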
Stratified Sampling
The population is broken down into categories, and a random sample is taken of each category. The proportions
of the sample sizes are the same as the proportion of each category to the whole.
Advantages:
Yields more accurate results than simple random sampling
Can show different tendencies within each category (e.g. men and
women)
Disadvantages: Nothing major, hence it's used a lot
Quota Sampling
As with stratified samples, the population is broken down into different categories. However, the size of the
sample of each category does not reflect the population as a whole. This can be used where an unrepresentative
sample is desirable (e.g. you might want to interview more children than adults for a survey on computer
games), or where it would be too difficult to undertake a stratified sample.
Advantages:
Simpler to undertake than a stratified sample
Sometimes a deliberately biased sample is desirable
Disadvantages: Not a genuine random sample
Likely to yield a biased result
Cluster Sampling
Used when populations can be broken down into many different categories, or clusters (e.g. church parishes).
Rather than taking a sample from each cluster, a random selection of clusters is chosen to represent the whole.
Within each cluster, a random sample is taken.
Advantages:
Less expensive and time consuming than a fully random sample
Can show "regional" variations
Disadvantages: Not a genuine random sample
Likely to yield a biased result (especially if only a few clusters are
sampled)
Types of samples
The best sampling is probability sampling, because it increases the likelihood of
obtaining samples that are representative of the population.
Probability sampling (Representative samples)
Probability samples are selected in such a way as to be representative of the
population. They provide the most valid or credible results because they reflect the
characteristics of the population from which they are selected (e.g., residents of a
particular community, students at an elementary school, etc.). There are two types of
probability samples: random and stratified.
Random sample
The term random has a very precise meaning. Each individual in
the population of interest has an equal likelihood of
selection. This is a very strict meaning -- you can't just collect
responses on the street and have a random sample.
The assumption of an equal chance of selection means that sources such as a telephone
book or voter registration lists are not adequate for providing a random sample of a
community. In both these cases there will be a number of residents whose names are
not listed. Telephone surveys get around this problem by random-digit dialing -- but that
assumes that everyone in the population has a telephone. The key to random selection is
that there is no bias involved in the selection of the sample. Any variation between the
sample characteristics and the population characteristics is only a matter of chance.
Stratified sample
A stratified sample is a mini-reproduction of the population. Before
sampling, the population is divided into characteristics of
importance for the research. For example, by gender, social class,
education level, religion, etc. Then the population is randomly
sampled within each category or stratum. If 38% of the population
is college-educated, then 38% of the sample is randomly selected
from the college-educated population.
Stratified samples are as good as or better than random samples, but they require a
fairly detailed advance knowledge of the population characteristics, and therefore are
more difficult to construct.
Nonprobability samples (Non-representative samples)
As they are not truly representative, non-probability samples are less desirable than
probability samples. However, a researcher may not be able to obtain a random or
stratified sample, or it may be too expensive. A researcher may not care about
generalizing to a larger population. The validity of non-probability samples can be
increased by trying to approximate random selection, and by eliminating as many
sources of bias as possible.
Quota sample
The defining characteristic of a quota sample is that the
researcher deliberately sets the proportions of levels or strata
within the sample. This is generally done to ensure the inclusion of
a particular segment of the population. The proportions may or
may not differ dramatically from the actual proportions in the
population. The researcher sets a quota, independent of
population characteristics.
Example: A researcher is interested in the attitudes of members of different religions
towards the death penalty. In Iowa a random sample might miss Muslims (because there
are not many in that state). To be sure of their inclusion, a researcher could set a quota
of 3% Muslim for the sample. However, the sample will no longer be representative of
the actual proportions in the population. This may limit generalizing to the state
population. But the quota will guarantee that the views of Muslims are represented in the
survey.
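As a rough illustration, the following Python sketch fills researcher-set quotas from a simulated stream of willing respondents; the categories, weights and quota sizes are invented, following the 3% Muslim example above:

    import random

    # Simulated stream of willing respondents and their religion (not a probability sample).
    religions = ["Christian", "Muslim", "Other", "None"]
    weights = [0.80, 0.01, 0.02, 0.17]                    # made-up population proportions
    stream = (random.choices(religions, weights)[0] for _ in iter(int, 1))

    quota = {"Christian": 60, "Muslim": 3, "Other": 7, "None": 30}   # quotas for a sample of 100 (3% Muslim)
    sample = {r: 0 for r in quota}

    # Accept respondents only while their group's quota is unfilled.
    while sum(sample.values()) < sum(quota.values()):
        r = next(stream)
        if sample[r] < quota[r]:
            sample[r] += 1
    print(sample)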
Purposive sample
A purposive sample is a non-representative subset of some larger
population, and is constructed to serve a very specific need or
purpose. A researcher may have a specific group in mind, such as
high level business executives. It may not be possible to specify the
population -- they would not all be known, and access will be
difficult. The researcher will attempt to zero in on the target group,
interviewing whomever is available.
A subset of a purposive sample is a snowball sample -- so named because one picks up the
sample along the way, analogous to a snowball accumulating snow. A snowball sample is achieved
by asking a participant to suggest someone else who might be willing or appropriate for the study.
Snowball samples are particularly useful in hard-to-track populations, such as truants, drug users,
etc.
Convenience sample
A convenience sample is a matter of taking what you can get. It is
an accidental sample. Although selection may be unguided, it probably is
not random, using the correct definition of everyone in the population
having an equal chance of being selected. Volunteers would constitute a
convenience sample.
Non-probability samples are limited with regard to generalization. Because they do not
truly represent a population, we cannot make valid inferences about the larger group
from which they are drawn. Validity can be increased by approximating random selection
as much as possible, and making every attempt to avoid introducing bias into sample
selection.
Sampling error and nonsampling error
Posted on 4 September, 2014 by Dr Nic
The subject of statistics is rife with misleading terms. I have written about this before in such
posts as Teaching Statistical Language and It is so random. But the terms sampling error and non-sampling error win the Dr Nic prize for counter-intuitivity and confusion generation.
Confusion abounds
To start with, the word error implies that a mistake has been made, so the term sampling error
makes it sound as if we made a mistake while sampling. Well, this is wrong. And the term non-sampling error (why is this even a term?) sounds as if it is the error we make from not sampling.
And that is wrong too. However, these terms are used extensively in the NZ statistics curriculum,
so it is important that we clarify what they are about.
Fortunately the Glossary has some excellent explanations:
Sampling Error
“Sampling error is the error that arises in a data collection process as a result of taking a sample
from a population rather than using the whole population.
Sampling error is one of two reasons for the difference between an estimate of a population
parameter and the true, but unknown, value of the population parameter. The other reason is
non-sampling error. Even if a sampling process has no non-sampling errors then estimates from
different random samples (of the same size) will vary from sample to sample, and each estimate is
likely to be different from the true value of the population parameter.
The sampling error for a given sample is unknown but when the sampling is random, for some
estimates (for example, sample mean, sample proportion) theoretical methods may be used to
measure the extent of the variation caused by sampling error.”
Non-sampling error:
“Non-sampling error is the error that arises in a data collection process as a result of factors
other than taking a sample.
Non-sampling errors have the potential to cause bias in polls, surveys or samples.
There are many different types of non-sampling errors and the names used to describe them are
not consistent. Examples of non-sampling errors are generally more useful than using names to
describe them.”
And it proceeds to give some helpful examples.
These are great definitions, and I thought about turning them into a diagram, so here it is:
Table summarising types of error.
And there are now two videos to go with the diagram, to help explain sampling error and non-sampling error. Here is a link to the first:
Video about sampling error
One of my earliest posts, Sampling Error Isn’t, introduced the idea of using variation due to
sampling and other variation as a way to make sense of these ideas. The sampling video above is
based on this approach.
Students need lots of practice identifying potential sources of error in their own work, and in
critiquing reports. In addition I have found True/False questions surprisingly effective in
practising the correct use of the terms. Whatever engages the students for a time in consciously
deciding which term to use, is helpful in getting them to understand and be aware of the concept.
Then the odd terminology will cease to have its original confusing connotations.
Non-sampling error
In statistics, non-sampling error is a catch-all term for the deviations of estimates from their true
values that are not a function of the sample chosen, including various systematic
errors and random errors that are not due to sampling.[1] Non-sampling errors are much harder to
quantify than sampling errors.[2]
Non-sampling errors in survey estimates can arise from:[3]

Coverage errors, such as failure to accurately represent all population units in the sample, or
the inability to obtain information about all sample cases;

Response errors by respondents due for example to definitional differences,
misunderstandings, or deliberate misreporting;

Mistakes in recording the data or coding it to standard classifications;

Other errors of collection, nonresponse, processing, or imputation of values for missing or
inconsistent data.[3]
An excellent discussion of issues pertaining to non-sampling error can be found in several
sources such as Kalton (1983)[4] and Salant and Dillman (1995).[5]
Sampling and non-sampling errors
Beyond the conceptual differences, many kinds of error can help explain differences
in the output of the programs that generate data on income. They are often classified
into two broad types: sampling errors and non-sampling errors.
Sampling errors occur because inferences about the entire population are based on
information obtained from only a sample of that population. Because SLID and the
long-form Census are sample surveys, their estimates are subject to this type of
error. The coefficient of variation is a measure of the extent to which the estimate
could vary, if a different sample had been used. This measure gives an indication of
the confidence that can be placed in a particular estimate. This data quality measure
will be used later in this paper to help explain why some of SLID's estimates, which
are based on a smaller sample, might differ from those of the other programs
generating income data. While the Census is also subject to this type of error,
reliable estimates can be made for much smaller populations because the sampling
rate is much higher for the Census (20%)1.
Non-sampling errors can be further divided into coverage errors, measurement
errors (respondent, interviewer, questionnaire, collection method…), non-response
errors and processing errors. The coverage errors are generally not well measured
for income and are usually inferred from exercises of data confrontation such as
this. Section 3 will review the population exclusions and other known coverage
differences between the sources.
The issues of various collection methods or mixed response modes and the different
types of measurement errors that could arise will be approached in section 4.
Non-response can be an issue in the case of surveys. It is not always possible to
contact and convince household members to respond to a survey. Sometimes as
well, even if the household responded, there may not be valid responses to all
questions. In both cases adjustments are performed to the data but error may result
as the quality of the adjustments often depends on the non-respondents being
similar to the respondents. For the 2005 income year, SLID had a response rate of
73.3% and for the Census, it was close to 97%. Still for 2005, because of item non-response, all income components were imputed for 2.7% of SLID's respondents and
at least some components were imputed for another 23.5%2. In the case of the
Census, income was totally imputed for 9.3% and partially imputed for 29.3%.
In administrative data – in particular the personal tax returns – the filing rates for
specific populations may depend on a variety of factors (amount owed, financial
activity during the year, personal interest, requirement for eligibility to support
programs, etc.) and this could also result in differences in the estimates generated
by the programs producing income data.
The systems and procedures used to process the data in each of the programs are
different and may have design variations that impact the data in special ways. When
such discrepancies have been identified, they will be mentioned in section 5. Beyond
the design variations, most processing errors in these data sources are thought to be
detected and corrected before the release of data to the public. However due to the
complexity and to the yearly modifications of processing systems, some errors may
remain undetected and they are therefore quite difficult to quantify.
More detail on the quality and methods of individual statistical programs is
accessible through the Surveys and statistical programs by subject section on
Statistics Canada's website.
Notes
1. The sampling error from one-year estimates of individual income based on
the LAD would also be of a similar magnitude as its sampling rate is also one in five.
2. Data Quality in the 2005 Survey of Labour and Income Dynamics , C. Duddek,
Income Research Paper Series, Statistics Canada catalogue no. 75F0002-No.003,
May 2007.
6 Sampling and Non-sampling Errors
The statistical quality or reliability of a survey may obviously be influenced by the errors that for
various reasons affect the observations. Error components are commonly divided into two major
categories: Sampling and non-sampling errors. In sampling literature the terms "variable errors" and
"bias" are also frequently used, though having a precise meaning which is slightly different from the
former concepts. The total error of a survey statistic is labeled the mean square error, being the sum
of variable errors and all biases. In this section we will give a fairly general and brief description of
the most common error components related to household sample surveys, and discuss their
presence in and impacts on this particular survey. Secondly, we will go into more detail as to those
components which can be assessed numerically.
Error Components and their Presence in the Survey
(1) Sampling errors are related to the sample design itself and the estimators used, and may
be seen as a consequence of surveying only a random sample of, and not the complete,
population. Within the family of probability sample designs - that is designs enabling the
establishment of inclusion probabilities (random samples) - sampling errors can be estimated.
The most common measure for the sampling error is the variance of an estimate, or
derivatives thereof. The derivative mostly used is the standard error, which is simply the
square root of the variance.
The variance or the standard error does not tell us exactly how great the error is in each
particular case. It should rather be interpreted as a measure of uncertainty, i.e. how much the
estimate is likely to vary if repeatedly selected samples (with the same design and of the same
size) had been surveyed. The variance is discussed in more detail in section 6.2.
(2) Non-sampling errors form a "basket" comprising all errors which are not sampling errors.
These types of errors may induce systematic bias in the estimates, as opposed
to the random errors caused by sampling. The category may be further divided into
subgroups according to the various origins of the error components:
Imperfections in the sampling frame, i.e. when the population frame from which the sample
is selected does not comprise the complete population under study, or includes foreign
elements. Exclusion of certain groups of the population from the sampling frame is one
example. As described in the Gaza section, it was decided to exclude "outside localities"
from being surveyed for cost reasons. It was maintained that the exclusion would have
negligible effects on survey results.
Errors imposed by implementary deviations from the theoretical sample design and field
work procedures. Examples: non-response, "wrong" households selected or visited, "wrong"
persons interviewed, etc. Except for non-response, which will be further discussed
subsequently, there were some cases in the present survey in which the standard
instructions for "enumeration walks" had to be modified in order to make sampling feasible.
Any departure from the standard rules has been particularly considered within the context
of inclusion probabilities. None of the practical solutions adopted imply substantial
alterations of the theoretical probabilities described in the previous sections.
The field work procedures themselves may imply unforeseen systematic biases in the
sample selection. In the present survey one procedure has been given particular
consideration as a potential source of error: the practical modification of choosing road
crossing corners - instead of any randomly selected spot - as starting points for the
enumeration walks. This choice might impose systematic biases as to the kind of households
being sampled. However, numerous inspection trials in the field proved it highly unlikely that
such bias would occur. According to the field work instructions, the starting points
themselves were never to be included in the sample. Such inclusion would have implied a
systematic over-representation of road corner households, and thus may have caused biases
for certain variables. (Instead, road corner households may now be slightly under-represented in so far as they, as starting points, are excluded from the sample. Possible bias
induced by this under-representation is, however, negligible compared to the potential bias
accompanying the former alternative.)
Improper wording of questions, misquotations by the interviewer, misinterpretations and
other factors that may cause failure in obtaining the intended response. "Fake response"
(questions being answered by the interviewer himself/herself) may also be included in this
group of possible errors. Irregularities of this kind are generally difficult to detect. The best
ways of preventing them is to have well trained data collectors, to apply various verification
measures, and to introduce the internal control mechanisms by letting data collectors work
in pairs - possibly supplemented by the presence of the supervisor. A substantial part of the
training of supervisors and data collectors was devoted to such measures. Verification
interviews were carried out by the supervisors among a 10% randomly selected subsample.
No fake interviews were detected. However, a few additional re-interviews were carried out,
on suspicion of misunderstandings and incorrect responses.
Data processing errors include errors arising incidentally during the stages of response
recording, data entry and programming. In this survey the data entry programme used
included consistency controls wherever possible, aiming at correcting any logical
contradictions in the data. Furthermore, verification punch work was applied in order to
correct mis-entries not detected by the consistency control, implying that each and all
questionnaires have been punched twice.
Sampling Error - Variance of an Estimate
Generally, the prime objective of sample designing is to keep sampling error at the lowest level
possible (within a given budget). There is thus a unique theoretical correspondence between the
sampling strategy and the sampling error, which can be expressed mathematically by the variance of
the estimator applied. Unfortunately, design complexity very soon implies variance expressions to
be mathematically uncomfortable and sometimes practically "impossible" to handle. Therefore,
approximations are frequently applied in order to achieve interpretable expressions of the
theoretical variance itself, and even more to estimate it.
In real life practical shortcomings frequently challenge mathematical comfort. Absence of
sampling frames or other prior information forces one to use mathematically complex
strategies in order to find feasible solutions. The design of the present survey - stratified, 4-5
stage sampling with varying inclusion probabilities - is probably among the extremes in this
respect, implying that the variance of the estimator (5.2) will be of the utmost complexity - as
will be seen subsequently.
The (approximate) variance of the estimator (5.2) is in its simplest form:
The variances and covariances on the right hand side of (6.1) may be expressed in terms of
the stratum variances and covariances:
Proceeding one step further the stratum variance may be expressed as follows9:
where we have introduced the notation ps (k) = P1 (s, k). The ps (k, l) is the joint probability
of inclusion for PSU (s,k) and PSU (s,l), and
the variance of the PSU (s,k) unbiased
estimate
. The variance of
is obtained similarly by substituting x with N in the
above formula. The stratum covariance formula is somewhat more complicated and is not
expressed here.
The PSU (s,k) variance components in the latter formula have a structure similar to the
stratum one, as is realized by regarding the PSUs as separate "strata" and the cells as "PSUs".
Again, another variance component emerges for each of the cells, the structure of which is
similar to the preceding one. In order to arrive at the "ultimate" variance expression yet
another two or three similar stages have to be passed. It should be realized that the final
variance formula is extremely complicated, even if simplifying modifications and
approximations may reduce the complexities stemming from the 2nd - 5th sampling stages.
It should also be understood that attempts to estimate this variance properly and exhaustively
(unbiased or close to unbiased) would be beyond any realistic effort. Furthermore, for such
estimation to be accomplished certain preconditions have to be met. Some of these conditions
cannot, however, be satisfied (for instance: at least two PSUs have to be selected from each
stratum comprising more than one PSU). We thus have to apply a more simple method for
appraising the uncertainty of our estimates.
Any sampling strategy (sample selection approach and estimator) may be characterized by its
performance relative to a simple random sampling (SRS) design, applying the sample
average as the estimator for proportions. The design factor of a strategy is thus defined as the
fraction between the variances of the two estimators. If the design factor is, for instance, less
than 1, the strategy under consideration would be better than SRS. Usually, multi-stage
strategies are inferior to SRS, implying the design factor being greater than 1.
The design factor is usually determined empirically. Although there is no overwhelming
evidence in its favour, a factor of 1.5 is frequently used for stratified, multi-stage designs.
(The design factor may vary among survey variables). The rough approximate variance
estimator is thus:
var(p) ≈ 1.5 · p(1 − p) / nT
where p is the estimate produced by (5.2) and nT is the number of observations underlying
the estimate (the "100%"). Although this formula oversimplifies the variance, it still takes
care of some of the basic features of the real variance; the variance decreases by increasing
sample size (n), and tends to be larger for proportions around 50% than at the tails (0% or
100%).
The square root of the variance, √(var(p)), or briefly s, is called the standard error, and is
tabulated in table A.12 for various values of p and n.
Table A.12 Standard error estimates for proportions (s and p are specified as percentages).

Number of obs.   Estimated proportion (p %)
(n)              5/95   10/90   20/80   30/70   40/60    50
10                8.4    11.6    15.5    17.7    19.0   19.4
20                6.0     8.2    11.0    12.5    13.4   13.7
50                3.8     5.2     6.9     7.9     8.5    8.7
75                3.1     4.2     5.7     6.5     6.9    7.1
100               2.7     3.7     4.9     5.6     6.0    6.1
150               2.2     3.0     4.0     4.6     4.9    5.0
200               1.9     2.6     3.5     4.0     4.2    4.3
250               1.7     2.3     3.1     3.5     3.8    3.9
300               1.5     2.1     2.8     3.2     3.5    3.5
350               1.4     2.0     2.6     3.0     3.2    3.3
400               1.3     1.8     2.5     2.8     3.0    3.1
500               1.2     1.6     2.2     2.5     2.7    2.7
700               1.0     1.4     1.9     2.1     2.3    2.3
1000              0.8     1.2     1.5     1.8     1.9    1.9
1500              0.7     0.9     1.3     1.4     1.5    1.6
2000              0.6     0.8     1.1     1.3     1.3    1.4
2500              0.5     0.7     1.0     1.2     1.2    1.2
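A short Python sketch of this rough standard error calculation (design factor 1.5, as above); it reproduces the tabulated values, for example about 6.1% for p = 50% and n = 100:

    import math

    def rough_standard_error(p, n, design_factor=1.5):
        # Approximate standard error of an estimated proportion p from n observations,
        # inflating the simple random sampling variance by the design factor.
        return math.sqrt(design_factor * p * (1 - p) / n)

    for n in (10, 100, 1000):
        for p in (0.05, 0.20, 0.50):
            print(n, p, round(100 * rough_standard_error(p, n), 1))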
Confidence Intervals
The sample which has been surveyed is one specific outcome of an "infinite" number of
random selections which might have been done within the sample design. Other sample
selections would most certainly have yielded survey results slightly different from the present
ones. The survey estimates should thus not be interpreted as accurately as the figures
themselves indicate.
A confidence interval is a formal measure for assessing the variability of survey estimates
from such hypothetically repeated sample selections. The confidence interval is usually
derived from the survey estimate itself and its standard error:
Confidence interval: [p − c·s, p + c·s], where c is a constant which must be determined by
the choice of a confidence coefficient, fixing the probability of the interval including the true,
but unknown, population proportion for which p is an estimate. For instance, c=1 corresponds
to a confidence probability of 67%, i.e. one will expect that 67 out of 100 intervals will
include the true proportion if repeated surveys are carried out. In most situations, however, a
chance of one out of three to arrive at a wrong conclusion is not considered satisfactory.
Usually, confidence coefficients of 90% or 95% are preferred, 95% corresponding to
approximately c=2. Although our assessment as to the location of the true population
proportion thus becomes less uncertain, the assessment itself, however, becomes less precise
as the length of the interval increases.
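A minimal Python sketch of such an interval, combining the rough standard error above with c = 2 (roughly 95% confidence); the estimate and sample size are placeholders:

    import math

    p, n, c = 0.40, 500, 2.0                 # estimated proportion, observations, confidence constant
    s = math.sqrt(1.5 * p * (1 - p) / n)     # rough standard error with design factor 1.5

    lower, upper = p - c * s, p + c * s
    print("interval: %.1f%% to %.1f%%" % (100 * lower, 100 * upper))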
Comparisons between groups
Comparing the occurrence of an attribute between different sub-groups of the population is
probably the most frequently used method for making inference from survey data. For
illustration of the problems involved in such comparisons, let us consider two separate sub-groups
for which the estimated proportions sharing the attribute are p̂1 and p̂2, respectively,
while the unknown true proportions are denoted p1 and p2. The corresponding standard error
estimates are s1 and s2. The problem of inference is thus to evaluate the significance of the
difference between the two sub-group estimates: Can the observed difference be caused by
sampling error alone, or is it so great that there must be more substantive reasons for it?
We will assume that the estimate p̂1 is the larger of the two proportions observed. Our
problem of judgement will thus be equivalent to testing the following hypothesis:
Hypothesis: p1 = p2
Alternative: p1 > p2
In case the test rejects the hypothesis we will accept the alternative as a "significant" statement, and
thus conclude that the observed difference between the two estimates is too great to be caused by
randomness alone. However, as is the true nature of statistical inference, one can (almost) never
draw absolutely certain conclusions. The uncertainty of the test is indicated by the choice of a
"significance level", which is the probability of making a wrong decision by rejecting
a true hypothesis. This probability should obviously be as small as possible. Usually it is set at 2.5% or
5% - depending on the risk or loss involved in drawing wrong conclusions.
The test implies that the hypothesis is rejected if p̂1 − p̂2 > c · √(s1² + s2²),
where the constant c depends on the choice of significance level:
Significance level    c-value
2.5%                  2.0
5.0%                  1.6
10.0%                 1.3
As is seen, the test criteria comprise the two standard error estimates and thus imply some
calculation. It is also seen that smaller significance levels imply the requirement of larger observed
differences between sub-groups in order to arrive at significant conclusions. One should be aware
that the non-rejection of a hypothesis leaves one with no conclusions at all, rather than the
acceptance of the hypothesis itself.
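A small Python sketch of this comparison, reusing the rough standard errors from above; all figures are invented for illustration:

    import math

    def rough_se(p, n, design_factor=1.5):
        # Rough standard error of an estimated proportion (design factor 1.5, as above).
        return math.sqrt(design_factor * p * (1 - p) / n)

    # Hypothetical sub-group estimates; p1_hat is taken to be the larger proportion.
    p1_hat, n1 = 0.55, 400
    p2_hat, n2 = 0.48, 350
    c = 1.6                                  # 5% significance level

    s1, s2 = rough_se(p1_hat, n1), rough_se(p2_hat, n2)
    if p1_hat - p2_hat > c * math.sqrt(s1 ** 2 + s2 ** 2):
        print("reject the hypothesis: the difference is significant")
    else:
        print("the hypothesis is not rejected: no conclusion")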
Non-response
Non-response occurs when one fails to obtain an interview with a properly pre-selected
individual (unit non-response). The most frequent reasons for this kind of non-response are
refusals and absence ("not-at-homes"). Item non-response occurs when a single question is
left unanswered.
Non-response is generally the most important single source of bias in surveys. Most exposed
to non-response bias are variables related to the very phenomenon of being a (frequent) "not-at-homer" or not (example: cinema attendance). In Western societies non-response rates of
15-30% are normal.
Various measures have been undertaken to keep non-response at the lowest level possible.
Most of all confidence-building has been of concern, implying contacts with local community
representatives have been made in order to enlist their support and approval. Furthermore,
many hours have been spent explaining the scope of the survey to respondents and anyone
else wanting to know, assuring them that the survey makers would neither impose taxes on people
nor demolish their homes, nor - equally important for the reliability of the survey - bring direct
material aid.
Furthermore, up to 4 call-backs were applied if selected respondents were not at home.
Usually the data collectors were able to get an appointment for a subsequent visit at the first
attempt, so that only one revisit was required in most cases. Unit non-response thus
comprises refusals and those not being at home at all after four attempts.
Table A.13 shows the net number of respondents and non-responses in each of the three parts
of the survey. The initial sizes of the various samples are deduced from the table by adding
responses and non-responses. For the household and RSI samples, the total size was 2,518
units, while the female sample size was 1,247. It is seen from the bottom line that the non-response rates are remarkably small compared to the "normal" magnitudes of 10 - 20% in
similar surveys. Consequently, there should be fairly good evidence for maintaining that the
effects of non-response in this survey are insignificant.
Table A.13 Number of (net) respondents and non-respondents in the three parts of the survey

                     Households            RSIs               Women
Region              Resp.  Non-resp.   Resp.  Non-resp.   Resp.  Non-resp.
Gaza                  970      8         968     10         482      4
West Bank           1,023     16       1,004     35         502     14
Arab Jerusalem        486     15         478     23         240      5
Total               2,479     39       2,450     68       1,224     23
Non-response rate    1.5%               2.7%               1.8%
Statistical Language - Census and Sample
Census and Sample
How do we study a population?
A population may be studied using one of two approaches: taking a census, or selecting a sample.
It is important to note that whether a census or a sample is used, both provide information that can be used to draw conclusions about the whole population.
What is a census (complete enumeration)?
A census is a study of every unit, everyone or everything, in a population. It is known as a complete enumeration, which means a complete count.
What is a sample (partial enumeration)?
A sample is a subset of units in a population, selected to represent all units in a population of interest. It is a partial enumeration because it is a count of only part of the population.
Information from the sampled units is used to estimate the characteristics for the entire population of interest.
When to use a census or a sample?
Once a population has been identified, a decision needs to be made about whether taking a census or selecting a sample will be the more suitable option. There are advantages and disadvantages to using a census or sample to study a population:
Pros of a CENSUS
• provides a true measure of the population (no sampling error)
• benchmark data may be obtained for future studies
• detailed information about small sub-groups within the population is more likely to be available

Pros of a SAMPLE
• costs would generally be lower than for a census
• results may be available in less time
• if good sampling techniques are used, the results can be very representative of the actual population

Cons of a CENSUS
• may be difficult to enumerate all units of the population within the available time
• higher costs, both in staff and monetary terms, than for a sample
• generally takes longer to collect, process, and release data than from a sample

Cons of a SAMPLE
• data may not be representative of the total population, particularly where the sample size is small
• often not suitable for producing benchmark data
• as data are collected from a subset of units and inferences made about the whole population, the data are subject to 'sampling' error
• decreased number of units will reduce the detailed information available about sub-groups within a population
How are samples selected?
A sample must be robust in its design and large enough to provide a reliable representation of the whole population. Aspects to be considered when designing a sample include the level of accuracy required, cost, and timing. Sampling can be random or non-random.
In a random (or probability) sample each unit in the population has a chance of being selected, and this probability can be accurately determined. Probability or random sampling includes, but is not limited to, simple random sampling, systematic sampling, and stratified sampling; a short code sketch of these three designs appears after this list. Random sampling makes it possible to produce reliable population estimates from the data obtained from the units included in the sample.
Simple random sample: All members of the sample are chosen at random and have the same chance of being in the sample. A lottery draw is a good example: the numbers are randomly generated from a defined range of numbers (i.e. 1 through to 45) with each number having an equal chance of being selected.
Systematic random sample: The first member of the sample is chosen at random, then the other members of the sample are taken at regular intervals (i.e. every nth unit).
Stratified random sample: Relevant subgroups within the population are identified and random samples are selected from within each stratum.
In a non-random (or non-probability) sample some units of the population have no chance of selection, the selection is non-random, or the probability of selection cannot be accurately determined. In this method the sampling error cannot be estimated, making it difficult to infer population estimates from the sample. Non-random sampling includes convenience sampling, purposive sampling, quota sampling, and volunteer sampling:
Convenience sampling: Units are chosen based on their ease of access;
Purposive sampling: The sample is chosen based on what the researcher thinks is appropriate for the study;
Quota sampling: The researcher can select units as they choose, as long as they reach a defined quota; and
Volunteer sampling: Participants volunteer to be a part of the survey (a common method used for internet-based opinion surveys).
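The following minimal Python sketch illustrates the three random designs just described on a toy population of 100 numbered units; the population, sample sizes, and strata are illustrative assumptions, not part of the ABS material.

```python
import random

population = list(range(1, 101))            # toy population of 100 numbered units

# Simple random sample: every unit has the same chance of selection.
simple = random.sample(population, 10)

# Systematic random sample: random start, then every k-th unit.
k = len(population) // 10                   # sampling interval
start = random.randrange(k)
systematic = population[start::k]

# Stratified random sample: identify strata, then sample randomly within each.
strata = {"low": [u for u in population if u <= 50],
          "high": [u for u in population if u > 50]}
stratified = [u for units in strata.values() for u in random.sample(units, 5)]

print(simple)
print(systematic)
print(stratified)
```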
[Flowchart: Collecting data about a population - census and sample]
Further information:
ABS:
1299.0 - An Introduction to Sample Surveys: A User's Guide
External links:
Sample Size calculator
Basic Survey Design: Samples and Censuses
Difference Between Census and Sampling
Posted on January 31, 2011 by Nedha Last updated on: May 28, 2015
Census vs Sampling
Census and sampling are two methods of collecting data between which certain
differences can be identified. Before we move forward to enumerate the differences
between census and sampling, it is better to understand what these two techniques of
generating information mean. A census can simply be defined as a periodic collection
of information from the entire population. Conducting a census can be very
time-consuming and costly. However, the advantage is that it allows the researcher to
gain accurate information. On the other hand, sampling is when the researcher selects
a sample from the population and gathers information. This is less time-consuming,
but the reliability of the information gained is doubtful. Through this article let us
examine the differences between a census and sampling.
What is a Census?
Census refers to a periodic collection of information from the entire population. It is a
time-consuming affair as it involves counting all heads and generating information
about them. For better governance, every government requires specific data and
information about the populace to make programs and policies that match the needs
and requirements of the population. A census allows the government to gain such
information.
What is Sampling?
There are times when a government cannot wait for the next Census and needs to
gather current information about the population. This is when a different technique of
collecting information that is less elaborate and cheaper than Census is employed.
This is called Sampling. This method of collecting information requires generating a
sample that is representative of the entire population.
When using a sample for data collection the researcher can use various methods of
sampling. Simple random sampling, stratified sampling, the snowball method, and
non-random sampling are some of the most commonly used sampling methods.
There are stark differences between census and sampling, though both serve the
purpose of providing data and information about a population. However accurately a
sample from a population may be generated, there will always be a margin of error,
whereas in the case of a census the entire population is taken into account and as
such it is most accurate. Data obtained from both census and sampling are extremely
important for a government for various purposes such as planning developmental
programs and policies for weaker sections of the society.
What is the Difference Between Census and
Sampling?
Definitions of Census and Sampling:
Census: Census refers to a periodic collection of information about the populace
from the entire population.
Sampling: Sampling is a method of collecting information from a sample that is
representative of the entire population.
Characteristics of Census and Sampling:
Reliability:
Census: Data from the census is reliable and accurate.
Sampling: There is a margin of error in data obtained from sampling.
Time:
Census: Census is very time-consuming.
Sampling: Sampling is quick.
Cost:
Census: Census is very expensive.
Sampling: Sampling is inexpensive.
Convenience:
Census: Census is not very convenient as the researcher has to allocate a lot of
effort in collecting data.
Sampling: Sampling is the most convenient method of obtaining data about the
population.
Image Courtesy:
1. “Volkstelling 1925 Census“. [Public Domain] via Wikimedia Commons
2. "Simple random sampling" by Dan Kernler [CC BY-SA 4.0] via Wikimedia Commons
4. Enumerations versus Samples
Sixteen U.S. Marshals and 650 assistants conducted the first U.S. census in 1791. They counted
some 3.9 million individuals, although as then-Secretary of State, Thomas Jefferson, reported to
President George Washington, the official number understated the actual population by at least
2.5 percent (Roberts, 1994). By 1960, when the U.S. population had reached 179 million, it was
no longer practical to have a census taker visit every household. The Census Bureau then began
to distribute questionnaires by mail. Of the 116 million households to which questionnaires were
sent in 2000, 72 percent responded by mail. A mostly-temporary staff of over 800,000 was
needed to visit the remaining households, and to produce the final count of 281,421,906. Using
statistically reliable estimates produced from exhaustive follow-up surveys, the Bureau's
permanent staff determined that the final count was accurate to within 1.6 percent of the actual
number (although the count was less accurate for young and minority residents than it was for
older and white residents). It was the largest and most accurate census to that time.
(Interestingly, Congress insists that the original enumeration or "head count" be used as the
official population count, even though the estimate calculated from samples by Census Bureau
statisticians is demonstrably more accurate.)
The mail-in response rate for the 2010 census was also 72 percent. As with most of the 20th
century censuses the official 2010 census count, by state, had to be delivered to the Office of the
President by December 31 of the census year. Then within one week of the opening of the next
session of the Congress, the President reported to the House of Representatives the
apportionment population counts and the number of Representatives to which each state was
entitled.
In 1791, census takers asked relatively few questions. They wanted to know the numbers of free
persons, slaves, and free males over age 16, as well as the sex and race of each individual. (You
can view replicas of historical census survey forms here.) As the U.S. population has
grown, and as its economy and government have expanded, the amount and variety of data
collected has expanded accordingly. In the 2000 census, all 116 million U.S. households were
asked six population questions (names, telephone numbers, sex, age and date of birth, Hispanic
origin, and race), and one housing question (whether the residence is owned or rented). In
addition, a statistical sample of one in six households received a "long form" that asked 46 more
questions, including detailed housing characteristics, expenses, citizenship, military service,
health problems, employment status, place of work, commuting, and income. From the sampled
data, the Census Bureau produced estimated data on all these variables for the entire
population.
In the parlance of the Census Bureau, data associated with questions asked of all households
are called 100% data, and data estimated from samples are called sample data. Both types of data
are available aggregated by various enumeration areas, including census block, block group,
tract, place, county, and state (see the illustration below). Through 2000, the Census Bureau
distributed the 100% data in a package called the "Summary File 1" (SF1) and the sample data
as "Summary File 3" (SF3). In 2005, the Bureau launched a new project called American
Community Survey that surveys a representative sample of households on an ongoing basis.
Every month, one household out of every 480 in each county or equivalent area receives a
survey similar to the old "long form." Annual or semi-annual estimates produced from American
Community Survey samples replaced the SF3 data product in 2010.
To protect respondents' confidentiality, as well as to make the data most useful to legislators, the
Census Bureau aggregates the data it collects from household surveys to several different types
of geographic areas. SF1 data, for instance, are reported at the block or tract level. There were
about 8.5 million census blocks in 2000. By definition, census blocks are bounded on all sides by
streets, streams, or political boundaries. Census tracts are larger areas that have between 2,500
and 8,000 residents. When first delineated, tracts were relatively homogeneous with respect to
population characteristics, economic status, and living conditions. A typical census tract consists
of about five or six sub-areas called block groups. As the name implies, block groups are
composed of several census blocks. American Community Survey estimates, like the SF3 data
that preceded them, are reported at the block group level or higher.
Figure 3.4.1 Relationships among the various census geographies (U.S. Census Bureau, American FactFinder, 2005, http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml; an updated source for the diagram can be found at https://www.census.gov/geo/reference/hierarchy.html).
Try This!
Acquiring U.S. Census Data via the World Wide Web
The purpose of this practice activity is to guide you through the process of finding and acquiring
2000 census data from the U.S. Census Bureau via the Web. Your objective is to look up
the total population of each county in your home state (or an adopted state of the U.S.).
1. Go to the U.S. Census Bureau site at http://www.census.gov.
2. At the Census Bureau home page, hover your mouse cursor over the Data tab, then
over Data Tools and App and select American FactFinder. American FactFinder is the
Census Bureau's primary medium for distributing census data to the public.
3. Expand the ADVANCED SEARCH list, and click on the SHOW ME ALL button. Take note
of the three numbered steps featured on the page you are taken to. That’s what we are about
to do in this exercise.
4. Click the Topics search option box. In the Select Topics overlay window expand the People list.
Next expand the Basic Count/Estimate list. Then choose Population Total. Note that a
Population Total entry is placed in the Your Selections box in the upper left, and it disappears
from the Basic Count/Estimate list.
Close the Select Topics window.
The list of datasets in the resulting Search Results window is for the entire United States. We
want to narrow the search to county-level data for your home or adopted state.
5. Click the Geographies search options box. In the Select Geographies overlay window that
opens make sure the List tab is selected. Under Select a geographic type:, click County - 050.
Next, select the entry for your state from the Select a state list, and then, from the Select one or
more geographic areas.... list, select All counties within <your state> .
Last, click ADD TO YOUR SELECTIONS. This will place your All Counties… choice in
the Your Selections box.
Close the Select Geographies window.
6. The list of datasets in the Search Results window now pertains to the counties in your state.
Take a few moments to review the datasets that are listed. Note that there are SF1, SF2,
ACS (American Community Survey), etc., datasets, and that if you page through the list far
enough you will see that data from past years is listed. We are going to focus our effort on
the 2010 SF1 100% Data.
7. Given that our goal is to find the population of the counties in your home state, can you
determine which dataset we should look at?
There is a TOTAL POPULATION entry for 2010. Find it, and make certain you have located
the 2010 SF1 100% Data dataset. (You can use the Refine your search results: slot above the
dataset list to help narrow the search.)
Check the box for it, and then click View.
In the new results window that opens, you should be able to find the population of the
counties of your chosen state.
Note the row of Actions:, which includes Print and Download buttons.
I encourage you to experiment some with the American FactFinder site. Start slow, and just click
the BACK TO ADVANCED SEARCH button, un-check the TOTAL POPULATION dataset and
choose a different dataset to investigate. Registered students will need to answer a couple of
quiz questions based on using this site.
Pay attention to what is in the Your Selections window. You can easily remove entries by clicking
the red circle with the white X.
On the SEARCH page, with nothing in the Your Selections box, you might try typing “QT” or
“GCT” in the Search for: slot. QT stands for Quick Tables which are preformatted tables that
show several related themes for one or more geographic areas. GCT stands for Geographic
Comparison Tables which are the most convenient way to compare data collected for all the
counties, places, or congressional districts in a state, or all the census tracts in a county.
Probability distribution
From Wikipedia, the free encyclopedia
In probability and statistics, a probability distribution assigns a probability to each measurable
subset of the possible outcomes of a random experiment, survey, or procedure of statistical
inference. Examples are found in experiments whose sample space is non-numerical, where the
distribution would be a categorical distribution; experiments whose sample space is encoded by
discrete random variables, where the distribution can be specified by a probability mass function;
and experiments with sample spaces encoded by continuous random variables, where the
distribution can be specified by a probability density function. More complex experiments, such
as those involving stochastic processes defined in continuous time, may demand the use of more
general probability measures.
In applied probability, a probability distribution can be specified in a number of different ways,
often chosen for mathematical convenience:
• by supplying a valid probability mass function or probability density function
• by supplying a valid cumulative distribution function or survival function
• by supplying a valid hazard function
• by supplying a valid characteristic function
• by supplying a rule for constructing a new random variable from other random variables whose joint probability distribution is known.
A probability distribution can either be univariate or multivariate. A univariate distribution gives
the probabilities of a single random variable taking on various alternative values; a multivariate
distribution (a joint probability distribution) gives the probabilities of a random vector—a set of
two or more random variables—taking on various combinations of values. Important and
commonly encountered univariate probability distributions include the binomial distribution,
the hypergeometric distribution, and the normal distribution. The multivariate normal
distribution is a commonly encountered multivariate distribution.
Introduction
The probability mass function (pmf) p(S) specifies the probability distribution for the sum S of counts from
two dice. For example, the figure shows that p(11) = 1/18. The pmf allows the computation of probabilities
of events such as P(S > 9) = 1/12 + 1/18 + 1/36 = 1/6, and all other probabilities in the distribution.
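As a quick check of the dice example in the caption above, here is a small Python sketch (mine, not part of the article) that tabulates the pmf of the sum of two fair dice and reproduces p(11) = 1/18 and P(S > 9) = 1/6:

```python
from collections import Counter
from fractions import Fraction

# pmf of the sum S of two fair dice: count the outcomes, divide by 36.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in counts.items()}

print(pmf[11])                                  # 1/18
print(sum(p for s, p in pmf.items() if s > 9))  # 1/6
```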
To define probability distributions for the simplest cases, one needs to distinguish
between discrete and continuous random variables. In the discrete case, one can easily assign
a probability to each possible value: for example, when throwing a fair die, each of the six
values 1 to 6 has the probability 1/6. In contrast, when a random variable takes values from a
continuum then, typically, probabilities can be nonzero only if they refer to intervals: in quality
control one might demand that the probability of a "500 g" package containing between 490 g
and 510 g should be no less than 98%.
The probability density function (pdf) of the normal distribution, also called Gaussian or "bell curve", the
most important continuous random distribution. As notated on the figure, the probabilities of intervals of
values correspond to the area under the curve.
If the random variable is real-valued (or more generally, if a total order is defined for its possible
values), the cumulative distribution function (CDF) gives the probability that the random variable
is no larger than a given value; in the real-valued case, the CDF is the integral of the probability
density function (pdf) provided that this function exists.
Terminology
As probability theory is used in quite diverse applications, terminology is not uniform and
sometimes confusing. The following terms are used for non-cumulative probability distribution
functions:
• Probability mass, Probability mass function, p.m.f.: for discrete random variables.
• Categorical distribution: for discrete random variables with a finite set of values.
• Probability density, Probability density function, p.d.f.: most often reserved for continuous random variables.
The following terms are somewhat ambiguous as they can refer to non-cumulative or cumulative
distributions, depending on authors' preferences:
• Probability distribution function: continuous or discrete, non-cumulative or cumulative.
• Probability function: even more ambiguous, can mean any of the above or other things.
Finally,
• Probability distribution: sometimes the same as probability distribution function, but usually refers to the more complete assignment of probabilities to all measurable subsets of outcomes, not just to specific outcomes or ranges of outcomes.
Basic terms
• Mode: for a discrete random variable, the value with highest probability (the location at which the probability mass function has its peak); for a continuous random variable, the location at which the probability density function has its peak.
• Support: the smallest closed set whose complement has probability zero.
• Head: the range of values where the pmf or pdf is relatively high.
• Tail: the complement of the head within the support; the large set of values where the pmf or pdf is relatively low.
• Expected value or mean: the weighted average of the possible values, using their probabilities as their weights; or the continuous analog thereof.
• Median: the value such that the set of values less than the median has a probability of one-half.
• Variance: the second moment of the pmf or pdf about the mean; an important measure of the dispersion of the distribution.
• Standard deviation: the square root of the variance, and hence another measure of dispersion.
• Symmetry: a property of some distributions in which the portion of the distribution to the left of a specific value is a mirror image of the portion to its right.
• Skewness: a measure of the extent to which a pmf or pdf "leans" to one side of its mean.
Cumulative distribution function
Because a probability distribution Pr on the real line is determined by the probability of
a scalar random variable X being in a half-open interval (−∞, x], the probability distribution is
completely characterized by its cumulative distribution function:
F(x) = Pr[X ≤ x]   for all x.
Discrete probability distribution
See also: Probability mass function and Categorical distribution
[Figures: the probability mass function of a discrete probability distribution, where the probabilities of the singletons {1}, {3}, and {7} are respectively 0.2, 0.5, and 0.3, and a set not containing any of these points has probability zero; plus the cdf of a discrete probability distribution, of a continuous probability distribution, and of a distribution which has both a continuous part and a discrete part.]
A discrete probability distribution is a probability distribution characterized by a probability
mass function. Thus, the distribution of a random variable X is discrete, and X is called
a discrete random variable, if
Σ_u Pr(X = u) = 1
as u runs through the set of all possible values of X. Hence, a random variable can
assume only a finite or countably infinite number of values; that is, it is a discrete variable.
For the number of potential values to be countably infinite, even though their probabilities
sum to 1, the probabilities have to decline to zero fast enough: for example, if
Pr(X = n) = 1/2^n for n = 1, 2, ..., we have the sum of probabilities 1/2 + 1/4 + 1/8 + ... = 1.
Well-known discrete probability distributions used in statistical modeling include
the Poisson distribution, the Bernoulli distribution, the binomial distribution, the geometric
distribution, and the negative binomial distribution. Additionally, the discrete uniform
distribution is commonly used in computer programs that make equal-probability random
selections between a number of choices.
Measure theoretic formulation
A measurable function X: Ω → A between a probability space (Ω, F, P) and a measurable space (A, 𝒜) is called a discrete random variable provided its image X(Ω) is a countable set and the pre-images of singleton sets are measurable, i.e., X⁻¹({a}) ∈ F for all a ∈ A. The latter requirement induces a probability mass function f_X: A → [0, 1] via f_X(a) = P(X⁻¹({a})). Since the pre-images of disjoint sets are disjoint,
Σ_{a ∈ A} f_X(a) = P(∪_{a ∈ A} X⁻¹({a})) = P(Ω) = 1.
This recovers the definition given above.
Cumulative density
Equivalently to the above, a discrete random variable can be defined as a random
variable whose cumulative distribution function (cdf) increases only by jump
discontinuities—that is, its cdf increases only where it "jumps" to a higher value, and
is constant between those jumps. The points where jumps occur are precisely the
values which the random variable may take.
Delta-function representation
Consequently, a discrete probability distribution is often represented as a
generalized probability density function involving Dirac delta functions, which
substantially unifies the treatment of continuous and discrete distributions. This is
especially useful when dealing with probability distributions involving both a
continuous and a discrete part.
Indicator-function representation
For a discrete random variable X, let u0, u1, ... be the values it can take with non-zero
probability. Denote
Ω_i = X⁻¹(u_i) = {ω : X(ω) = u_i}, for i = 0, 1, 2, ...
These are disjoint sets, and by the summation formula above
Pr(∪_i Ω_i) = Σ_i Pr(X = u_i) = 1.
It follows that the probability that X takes any value except for u0, u1, ... is
zero, and thus one can write X as
X(ω) = Σ_i u_i 1_{Ω_i}(ω)
except on a set of probability zero, where 1_A is the indicator
function of A. This may serve as an alternative definition of discrete
random variables.
Continuous probability distribution
See also: Probability density function
A continuous probability distribution is a probability distribution that
has a cumulative distribution function that is continuous. Most often they
are generated by having a probability density function. Mathematicians
call distributions with probability density functions absolutely
continuous, since their cumulative distribution function is absolutely
continuous with respect to the Lebesgue measure λ. If the distribution
of X is continuous, then X is called a continuous random variable.
There are many examples of continuous probability
distributions: normal, uniform, chi-squared, and others.
Intuitively, a continuous random variable is the one which can take
a continuous range of values, as opposed to a discrete distribution,
where the set of possible values for the random variable is at
most countable. While for a discrete distribution
an event with probability zero is impossible (e.g., rolling 3½ on a
standard die is impossible, and has probability zero), this is not so in the
case of a continuous random variable. For example, if one measures
the width of an oak leaf, the result of 3½ cm is possible; however, it has
probability zero because uncountably many other potential values exist
even between 3 cm and 4 cm. Each of these individual outcomes has
probability zero, yet the probability that the outcome will fall into
the interval (3 cm, 4 cm) is nonzero. This apparent paradox is resolved
by the fact that the probability that X attains some value within
an infinite set, such as an interval, cannot be found by naively adding the
probabilities for individual values. Formally, each value has
an infinitesimally small probability, which statistically is equivalent to
zero.
Formally, if X is a continuous random variable, then it has a probability
density function f(x), and therefore its probability of falling into a given
interval, say [a, b], is given by the integral
Pr[a ≤ X ≤ b] = ∫_{a}^{b} f(x) dx
In particular, the probability for X to take any single value a (that
is, a ≤ X ≤ a) is zero, because an integral with coinciding upper and
lower limits is always equal to zero.
The definition states that a continuous probability distribution must
possess a density, or equivalently, its cumulative distribution
function be absolutely continuous. This requirement is stronger than
simple continuity of the cumulative distribution function, and there is
a special class of distributions, singular distributions, which are
neither continuous nor discrete nor a mixture of those. An example
is given by the Cantor distribution. Such singular distributions
however are never encountered in practice.
Note on terminology: some authors use the term "continuous
distribution" to denote the distribution with continuous cumulative
distribution function. Thus, their definition includes both the
(absolutely) continuous and singular distributions.
By one convention, a probability distribution is called continuous if its cumulative distribution
function F(x) = Pr[X ≤ x] is continuous and, therefore, the probability measure of singletons is zero:
Pr[X = x] = 0 for all x.
Another convention reserves the term continuous probability
distribution for absolutely continuous distributions. These
distributions can be characterized by a probability density function:
a non-negative Lebesgue integrable function f defined on the real
numbers such that
F(x) = Pr[X ≤ x] = ∫_{−∞}^{x} f(t) dt
Discrete distributions and some continuous distributions (like
the Cantor distribution) do not admit such a density.
Some properties
• The probability distribution of the sum of two independent random variables is the convolution of each of their distributions.
• Probability distributions are not a vector space (they are not closed under linear combinations, as these do not preserve non-negativity or total integral 1) but they are closed under convex combination, thus forming a convex subset of the space of functions (or measures).
Kolmogorov definition
Main articles: Probability space and Probability measure
In the measure-theoretic formalization of probability theory,
a random variable is defined as a measurable function X from
a probability space (Ω, F, P) to a measurable space (𝒳, 𝒜).
A probability distribution of X is the pushforward measure X*P of X, which is a probability measure
on (𝒳, 𝒜) satisfying X*P = P X⁻¹.
Random number generation
Main article: Pseudo-random number sampling
A frequent problem in statistical simulations (the Monte Carlo
method) is the generation of pseudo-random numbers that are
distributed in a given way. Most algorithms are based on
a pseudorandom number generator that produces
numbers X that are uniformly distributed in the interval [0,1).
These random variates X are then transformed via some
algorithm to create a new random variate having the required
probability distribution.
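A minimal sketch of that transformation idea, using inverse-transform sampling to turn uniform [0, 1) variates into exponentially distributed ones; the choice of the exponential distribution and of the rate parameter is an illustrative assumption:

```python
import math
import random

def exponential_variate(rate=1.0):
    """Inverse-transform sampling: if U is uniform on [0, 1), then
    -ln(1 - U) / rate follows an exponential distribution with the given rate."""
    u = random.random()
    return -math.log(1.0 - u) / rate

samples = [exponential_variate(rate=2.0) for _ in range(10000)]
print(sum(samples) / len(samples))   # should be close to 1 / rate = 0.5
```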
Applications
The concept of the probability distribution and the random
variables which they describe underlies the mathematical
discipline of probability theory, and the science of statistics.
There is spread or variability in almost any value that can be
measured in a population (e.g. height of people, durability of a
metal, sales growth, traffic flow, etc.); almost all measurements
are made with some intrinsic error; in physics many processes
are described probabilistically, from the kinetic properties of
gases to the quantum mechanical description of fundamental
particles. For these and many other reasons,
simple numbers are often inadequate for describing a quantity,
while probability distributions are often more appropriate.
As a more specific example of an application, the cache
language models and other statistical language models used
in natural language processing to assign probabilities to the
occurrence of particular words and word sequences do so by
means of probability distributions.
Common probability distributions
Main article: List of probability distributions
The following is a list of some of the most common probability
distributions, grouped by the type of process that they are
related to. For a more complete list, see list of probability
distributions, which groups by the nature of the outcome being
considered (discrete, continuous, multivariate, etc.)
Note also that all of the univariate distributions below are singly
peaked; that is, it is assumed that the values cluster around a
single point. In practice, actually observed quantities may
cluster around multiple values. Such quantities can be modeled
using a mixture distribution.
Related to real-valued quantities that grow linearly (e.g. errors, offsets)
• Normal distribution (Gaussian distribution), for a single such quantity; the most common continuous distribution
Related to positive real-valued quantities that grow exponentially (e.g. prices, incomes, populations)
• Log-normal distribution, for a single such quantity whose log is normally distributed
• Pareto distribution, for a single such quantity whose log is exponentially distributed; the prototypical power law distribution
Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region
• Discrete uniform distribution, for a finite set of values (e.g. the outcome of a fair die)
• Continuous uniform distribution, for continuously distributed values
Related to Bernoulli trials (yes/no events, with a given probability)
• Basic distributions:
  • Bernoulli distribution, for the outcome of a single Bernoulli trial (e.g. success/failure, yes/no)
  • Binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of independent occurrences
  • Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
  • Geometric distribution, for binomial-type observations but where the quantity of interest is the number of failures before the first success; a special case of the negative binomial distribution
• Related to sampling schemes over a finite population:
  • Hypergeometric distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, using sampling without replacement
  • Beta-binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, sampling using a Polya urn scheme (in some sense, the "opposite" of sampling without replacement)
Related to categorical outcomes (events with K possible outcomes, with a given probability for each outcome)
• Categorical distribution, for a single categorical outcome (e.g. yes/no/maybe in a survey); a generalization of the Bernoulli distribution
• Multinomial distribution, for the number of each type of categorical outcome, given a fixed number of total outcomes; a generalization of the binomial distribution
• Multivariate hypergeometric distribution, similar to the multinomial distribution, but using sampling without replacement; a generalization of the hypergeometric distribution
Related to events in a Poisson process (events that occur independently with a given rate)
• Poisson distribution, for the number of occurrences of a Poisson-type event in a given period of time
• Exponential distribution, for the time before the next Poisson-type event occurs
• Gamma distribution, for the time before the next k Poisson-type events occur
Related to the absolute values of vectors with normally distributed components
• Rayleigh distribution, for the distribution of vector magnitudes with Gaussian distributed orthogonal components. Rayleigh distributions are found in RF signals with Gaussian real and imaginary components.
• Rice distribution, a generalization of the Rayleigh distributions for where there is a stationary background signal component. Found in Rician fading of radio signals due to multipath propagation and in MR images with noise corruption on non-zero NMR signals.
Related to normally distributed quantities operated with sum of squares (for hypothesis testing)
• Chi-squared distribution, the distribution of a sum of squared standard normal variables; useful e.g. for inference regarding the sample variance of normally distributed samples (see chi-squared test)
• Student's t distribution, the distribution of the ratio of a standard normal variable and the square root of a scaled chi squared variable; useful for inference regarding the mean of normally distributed samples with unknown variance (see Student's t-test)
• F-distribution, the distribution of the ratio of two scaled chi squared variables; useful e.g. for inferences that involve comparing variances or involving R-squared (the squared correlation coefficient)
Useful as conjugate prior distributions in Bayesian inference
Main article: Conjugate prior
• Beta distribution, for a single probability (real number between 0 and 1); conjugate to the Bernoulli distribution and binomial distribution
• Gamma distribution, for a non-negative scaling parameter; conjugate to the rate parameter of a Poisson distribution or exponential distribution, the precision (inverse variance) of a normal distribution, etc.
• Dirichlet distribution, for a vector of probabilities that must sum to 1; conjugate to the categorical distribution and multinomial distribution; generalization of the beta distribution
• Wishart distribution, for a symmetric non-negative definite matrix; conjugate to the inverse of the covariance matrix of a multivariate normal distribution; generalization of the gamma distribution
See also
• Copula (statistics)
• Empirical probability
• Histogram
• Joint probability distribution
• Likelihood function
• List of statistical topics
• Kirkwood approximation
• Moment-generating function
• Quasiprobability distribution
• Riemann–Stieltjes integral application to probability theory
5.6.2 - "Greater than" Probabilities
Sometimes we want to know the probability that a variable has a value greater than some value.
For instance, we might want to know the probability that a randomly selected vehicle speed is
greater than 73 mph, written P(X > 73).
Previously we found P(X < 73) = 0.9452. The general rule for a "greater than" situation is
P(X > x) = 1 − P(X ≤ x).
Thus, P(X > 73) = 1 − 0.9452 = 0.0548. The probability that a randomly selected vehicle will be
going 73 mph or greater is 0.0548.
If we did not know P(X ≤ 73) we could compute this probability by constructing a probability
distribution in Minitab Express or Minitab.
In Minitab Express:
1. Open Minitab Express without any data.
2. From the menu bar, select Statistics > Probability Distributions > Distribution Plot.
3. Click Display Probability.
4. For Distribution, select Normal (this is the default). In this scenario, our mean is 65 and our standard deviation is 5.
5. Under Shade the area corresponding to the following: select A specified x value and Right tail. The X value is 73.
The result is the following output which shows us that 0.0547993 of the distribution is greater
than 73 mph.
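If Minitab is not available, the same probability can be checked with a short Python/SciPy sketch; the mean of 65 mph and standard deviation of 5 mph come from the example above, while the use of scipy.stats is my own choice rather than part of the course:

```python
from scipy.stats import norm

speed = norm(loc=65, scale=5)     # X ~ N(65, 5)
p_greater = 1 - speed.cdf(73)     # P(X > 73) = 1 - P(X <= 73)
print(round(p_greater, 4))        # approximately 0.0548
```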
5.6.3 - "In between" Probabilities
Suppose we want to know the probability that a normal random variable is within a specified interval.
For instance, suppose we want to know the probability that a randomly selected vehicle is travelling
between 60 and 73 mph.
We could compute the probability that the speed is less than 73 mph and the probability that the
speed is less than 60 mph and subtract the two. In other words:
P(60 < X < 73) = P(X < 73) − P(X < 60)
Or, we could use statistical software to find this range:
In Minitab Express:
1. Open Minitab Express without any data.
2. From the menu bar, select Statistics > Probability Distributions > Distribution Plot.
3. Click Display Probability.
4. For Distribution, select Normal (this is the default). In this scenario, our mean is 65 and our standard deviation is 5.
5. Under Shade the area corresponding to the following: select A specified x value and Middle. The X value 1 is 60 and X value 2 is 73.
The result is the following output which shows us that 0.786545 of the distribution is between 60
mph and 73 mph.
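The same "in between" probability can be verified with a brief SciPy sketch (again an illustrative alternative to Minitab, not part of the original lesson):

```python
from scipy.stats import norm

speed = norm(loc=65, scale=5)               # X ~ N(65, 5)
p_between = speed.cdf(73) - speed.cdf(60)   # P(60 < X < 73)
print(round(p_between, 4))                  # approximately 0.7865
```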
5.6.4 - Finding Percentiles
Percentile: Proportion of values below a given value
For example, if your test score is in the 88th percentile, then you scored better than 88% of test
takers.
We may wish to know the value of a variable that is a specified percentile. For example, what
speed is the 99.99th percentile of speeds at the highway location in our earlier example? Recall,
the mean vehicle speed is 65 mph with a standard deviation of 5 mph.
To calculate percentiles in Minitab Express:
1. Open Minitab Express without data.
2. On a PC: From the menu bar, select Statistics > Probability Distributions > CDF/PDF > Inverse (ICDF). On a Mac: From the menu bar, select Statistics > Probability Distributions > Inverse Cumulative Distribution Function.
3. Form of input is A single value.
4. Value is .9999.
5. Distribution is Normal.
6. Our mean is 65 and our standard deviation is 5.
7. Under Output, select Display a table of inverse cumulative probabilities.
8. Click OK.
The result should be the following output:
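As a cross-check on the Minitab result, the inverse cumulative distribution function (percent point function) in SciPy gives the percentile directly; this sketch is illustrative and not part of the course output:

```python
from scipy.stats import norm

# 99.99th percentile of vehicle speeds, X ~ N(65, 5)
print(round(norm.ppf(0.9999, loc=65, scale=5), 1))   # approximately 83.6 mph
```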
Video Review
5.8 - Review of Finding the Proportion
Under the Normal Curve
Video Review: Working with Continuous Random Variables
Finding the Score for a Given Proportion
This video walks through one example. A group of instructors have decided to assign grades on a
curve. Given the mean and standard deviation of their students' scores, they want to know what
point ranges are associated with which grades. Minitab Express is used.
On Your Own
Practice finding the proportion of observations under the normal curve. Each question can be
answered using either Minitab Express or the z table. Work through each example then click the
icon to view the solution and compare your answers.
HINT: Drawing the normal curve and shading in the region you are looking for is often helpful.
1. What proportion of the standard normal curve is less than a z score of 1.64?
2. What proportion of the standard normal curve falls above a z score of 1.33?
3. What proportion of the standard normal curve falls between a z score of -.50 and a z
score of +.50?
4. At one private school, a minimum IQ score of 125 is necessary to be considered for
admission. IQ scores have a mean of 100 and standard deviation of 15. Given this information,
what proportion of children are eligible for consideration for admission to this school?
5. ACT scores have a mean of 18 and a standard deviation of 6. What proportion of test
takers score between a 20 and 26?
6. A men’s clothing company is doing research on the height of adult American men in
order to inform the sizing of the clothing that they offer. The height of males in the United States
is normally distributed with a mean of 175 cm and a standard deviation of 15 cm. Men who are
more than 30 cm different (shorter or taller) from the mean are classified by the apparel company
as special cases because they do not fit in their regular length clothing. Given this information,
what proportion of men would be classified as special cases?
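For readers working in Python rather than Minitab Express or a z table, here is a sketch of how question 4 above could be checked (norm.sf is simply 1 minus the cdf; the call is an illustrative aid, not the official solution):

```python
from scipy.stats import norm

# Question 4: proportion of children with IQ of at least 125, where IQ ~ N(100, 15)
print(round(norm.sf(125, loc=100, scale=15), 4))   # approximately 0.0478
```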
Statistical Distributions (e.g. Normal,
Poisson, Binomial) and their uses
Statistics: Distributions
Summary
Normal distribution describes continuous data which have a symmetric
distribution, with a characteristic 'bell' shape.
Binomial distribution describes the distribution of binary data from a finite
sample. Thus it gives the probability of getting r events out of n trials.
Poisson distribution describes the distribution of binary data from an
infinite sample. Thus it gives the probability of getting r events in a
population.
The Normal Distribution
It is often the case with medical data that the histogram of a continuous
variable obtained from a single measurement on different subjects will
have a characteristic `bell-shaped' distribution known as a Normal
distribution. One such example is the histogram of the birth weight (in
kilograms) of the 3,226 new born babies shown in Figure 1.
Figure 1: Distribution of birth weight in 3,226 newborn babies (data from O'Cathain et al 2002)
To distinguish the use of the same word in normal range and Normal
distribution we have used a lower and upper case convention throughout
this book.
The histogram of the sample data is an estimate of the population
distribution of birth weights in new born babies. This population
distribution can be estimated by the superimposed smooth `bell-shaped'
curve or `Normal' distribution shown. We presume that if we were able to
look at the entire population of new born babies then the distribution of
birth weight would have exactly the Normal shape. We often infer, from a
sample whose histogram has the approximate Normal shape, that the
population will have exactly, or as near as makes no practical difference,
that Normal shape.
The Normal distribution is completely described by two parameters μ and
σ, where μ represents the population mean or centre of the distribution
and σ the population standard deviation. Populations with small values of
the standard deviation σ have a distribution concentrated close to the
centre μ; those with large standard deviation have a distribution widely
spread along the measurement axis. One mathematical property of the
Normal distribution is that exactly 95% of the distribution lies between
μ - (1.96 x σ) and μ + (1.96 x σ)
Changing the multiplier 1.96 to 2.58, exactly 99% of the Normal
distribution lies in the corresponding interval.
In practice the two parameters of the Normal distribution, μ and σ, must
be estimated from the sample data. For this purpose a random sample
from the population is first taken. The sample mean x̄ and the sample
standard deviation, SD(x) = s, are then calculated. If a sample is taken
from such a Normal distribution, and provided the sample is not too
small, then approximately 95% of the sample will be covered by
x̄ − [1.96 × SD(x)] to x̄ + [1.96 × SD(x)]
This is calculated by merely replacing the population parameters μ and σ
by the sample estimates x̄ and s in the previous expression.
In appropriate circumstances this interval may estimate the reference
interval for a particular laboratory test which is then used for diagnostic
purposes.
We can use the fact that our sample birth weight data appear Normally
distributed to calculate a reference range. We have already mentioned
that about 95% of the observations (from a Normal distribution) lie within
+/-1.96SDs of the mean. So a reference range for our sample of babies
is:
3.39 − [1.96 × 0.55] to 3.39 + [1.96 × 0.55]
2.31 kg to 4.47 kg
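A small Python sketch of this reference-range calculation, using the sample mean of 3.39 kg and standard deviation of 0.55 kg quoted above (the function name is my own):

```python
def normal_reference_range(mean, sd, multiplier=1.96):
    """Approximate 95% reference range assuming Normally distributed data."""
    return mean - multiplier * sd, mean + multiplier * sd

low, high = normal_reference_range(3.39, 0.55)
print(f"{low:.2f} kg to {high:.2f} kg")   # 2.31 kg to 4.47 kg
```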
A baby's weight at birth is strongly associated with mortality risk during
the first year and, to a lesser degree, with developmental problems in
childhood and the risk of various diseases in adulthood. If the data are
not Normally distributed then we can base the normal reference range on
the observed percentiles of the sample. i.e. 95% of the observed data lie
between the 2.5 and 97.5 percentiles. So a percentile-based reference
range for our sample is:
2.19kg to 4.43kg.
Most reference ranges are based on samples larger than 3500 people.
Over many years, and millions of births, the WHO has come up with a
normal birth weight range for new born babies. These ranges
represent results that are acceptable in newborn babies and actually
cover the middle 80% of the population distribution, i.e. the 10th and 90th
centiles. Low birth weight babies are usually defined (by the WHO) as
weighing less than 2500 g (the 10th centile) regardless of gestational age,
and large birth weight babies are defined as weighing above 4000 g
(the 90th centile). Hence the normal birth weight range is around 2.5 kg
to 4 kg. For our sample data, the 10th to 90th centile range was similar,
2.75 to 4.03 kg.
The Binomial Distribution
If a group of patients is given a new drug for the relief of a particular
condition, then the proportion p being successfully treated can be
regarded as estimating the population treatment success rate π.
The sample proportion p is analogous to the sample mean x̄, in that if we
score zero for those s patients who fail on treatment, and unity for those r
who succeed, then p = r/n, where n = r + s is the total number of patients
treated. Thus p also represents a mean.
Data which can take only a 0 or 1 response, such as treatment failure or
treatment success, follow the binomial distribution provided the
underlying population response rate π does not change. The binomial
probabilities are calculated from
Prob(R responses out of n) = [n! / (R!(n − R)!)] × π^R × (1 − π)^(n − R)
for successive values of R from 0 through to n. In the above n! is read as
n factorial and R! as R factorial. For R = 4, R! = 4 × 3 × 2 × 1 = 24. Both 0! and
1! are taken as equal to unity. The shaded area marked in Figure 2
corresponds to the above expression for the binomial distribution
calculated for each of R = 8, 9, ..., 20 and then added. This area totals
0.1018. So the probability of eight or more responses out of 20 is 0.1018.
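This tail probability can be reproduced with a short SciPy sketch; n = 20 and π = 0.25 are taken from the example, while scipy.stats.binom is my own choice of tool:

```python
from scipy.stats import binom

n, pi = 20, 0.25
# P(R >= 8) = 1 - P(R <= 7) for R ~ Binomial(n = 20, pi = 0.25)
print(round(1 - binom.cdf(7, n, pi), 4))   # approximately 0.1018
```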
For a fixed sample size n the shape of the binomial distribution depends
only on π. Suppose n = 20 patients are to be treated, and it is known that
on average a quarter, or π = 0.25, will respond to this particular treatment.
The number of responses actually observed can only take integer values
between 0 (no responses) and 20 (all respond). The binomial distribution
for this case is illustrated in Figure 2.
The distribution is not symmetric; it has a maximum at five responses, and
the height of the blocks corresponds to the probability of obtaining the
particular number of responses from the 20 patients yet to be treated. It
should be noted that the expected value for r, the number of successes
yet to be observed if we treated n patients, is nπ. The potential variation
about this expectation is expressed by the corresponding standard
deviation
SE(r) = √[nπ(1 − π)]
Figure 2 also shows the Normal distribution arranged to have μ = nπ = 5
and σ = √[nπ(1 − π)] = 1.94, superimposed on to a binomial distribution with
π = 0.25 and n = 20. The Normal distribution describes fairly precisely the
binomial distribution in this case.
If n is small, however, or π is close to 0 or 1, the disparity between the
Normal and binomial distributions with the same mean and standard
deviation, similar to those illustrated in Figure 2, increases and the
Normal distribution can no longer be used to approximate the binomial
distribution. In such cases the probabilities generated by the binomial
distribution itself must be used.
It is also only in situations in which reasonable agreement exists between
the distributions that we would use the confidence interval expression
given previously. For technical reasons, the expression given for a
confidence interval for π is an approximation. The approximation will
usually be quite good provided p is not too close to 0 or 1, situations in
which either almost none or nearly all of the patients respond to
treatment. The approximation improves with increasing sample size n.
Figure 2: Binomial distribution for n = 20 with π = 0.25 and the Normal approximation
The Poisson Distribution
The Poisson distribution is used to describe discrete quantitative data
such as counts in which the population size n is large, the probability of
an individual event π is small, but the expected number of events, nπ, is
moderate (say five or more). Typical examples are the number of deaths
in a town from a particular disease per day, or the number of admissions
to a particular hospital.
Example
Wight et al (2004) looked at the variation in cadaveric heart beating
organ donor rates in the UK. They found that there were 1330 organ
donors, aged 15-69, across the UK for the two years 1999 and 2000
combined. Heart-beating donors are patients who are seriously ill in an
intensive care unit (ICU) and are placed on a ventilator.
Now it is clear that the distribution of number of donors takes integer
values only, thus the distribution is similar in this respect to the binomial.
However, there is no theoretical limit to the number of organ donors that
could happen on a particular day. Here the population is the UK
population aged 15-69, over two years, which is over 82 million people,
so in this case each member can be thought to have a very small
probability of actually suffering an event, in this case being admitted to a
hospital ICU and placed on a ventilator with a life threatening condition.
The mean number of organ donors per day over the two year period is
calculated as
r = 1330/(2 × 365) = 1330/730 = 1.82 organ donations per day.
It should be noted that the expression for the mean is similar to that for
the sample mean x̄, except here multiple data values are common; and so instead of writing
each as a distinct figure in the numerator they are first grouped and
counted. For data arising from a Poisson distribution the standard error,
that is the standard deviation of r, is estimated by SE(r) = √(r/n), where
n is the total number of days. Provided the organ donation rate is not too
low, a 95% confidence interval for the underlying (true) organ donation
rate λ can be calculated by
r-1.96×SE(r) to r+1.96× SE(r).
In the above example r=1.82, SE(r)=√(1.82/730)=0.05, and therefore
the 95% confidence interval for λ is 1.72 to 1.92 organ donations per
day. Exact confidence intervals can be calculated as described by Altman
et al. (2000).
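The rate and its confidence interval can be reproduced with a few lines of Python (a sketch using only the figures quoted above):

from math import sqrt

donors, days = 1330, 730          # 1330 donors over the two years 1999 and 2000
r = donors / days                 # mean donations per day
se = sqrt(r / days)               # SE(r) = sqrt(r / n)
print(round(r, 2), round(se, 2))                          # 1.82 0.05
print(round(r - 1.96 * se, 2), round(r + 1.96 * se, 2))   # 1.72 1.92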
The Poisson probabilities are calculated from
Prob(R responses) = e^(−λ) λ^R / R!
for successive values of R from 0 to infinity. Here e is the exponential
constant 2.7182…, and λ is the population rate which is estimated by r in
the example above.
Example
Suppose that before the study of Wight et al. (2004) was conducted it
was expected that the number of organ donations per day was
approximately two. Then assuming λ = 2, we would anticipate the
probability of 0 organ donations to be e^(−2) 2^0/0! = e^(−2) = 0.1353. (Remember
that 2^0 and 0! are both equal to 1.) The probability of one organ donation
would be e^(−2) 2^1/1! = 2e^(−2) = 0.2707. Similarly the probability of two organ
donations per day is e^(−2) 2^2/2! = 2e^(−2) = 0.2707; and so on to give for three
donations 0.1804, four donations 0.0902, five donations 0.0361, six
donations 0.0120, etc. If the study is then to be conducted over 2 years
(730 days), each of these probabilities is multiplied by 730 to give the
expected number of days during which 0, 1, 2, 3, etc. donations will
occur. These expectations are 98.8, 197.6, 197.6, 131.7, 65.8, 26.3, 8.8 days.
A comparison can then be made between what is expected and what is
actually observed.
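The expected-days calculation can likewise be checked with a short script (a sketch of the same arithmetic, assuming λ = 2 and 730 days as above):

from math import exp, factorial

lam, days = 2.0, 730

def poisson_prob(k, lam):
    # Prob(R = k) = e^(-lambda) * lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

for k in range(7):
    p = poisson_prob(k, lam)
    print(k, round(p, 4), round(p * days, 1))
# prints 0.1353 and 98.8 for k=0, 0.2707 and 197.6 for k=1, and so on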
References
Altman D.G., Machin D., Bryant T.N., & Gardner M.J. Statistics with
Confidence: Confidence intervals and statistical guidelines (2nd Edition).
London: British Medical Journal, 2000.
Campbell M.J., & Machin D. Medical Statistics: A Commonsense Approach.
Chichester: Wiley, 1999.
O'Cathain A., Walters S.J., Nicholl J.P., Thomas K.J., & Kirkham M. Use of
evidence based leaflets to promote informed choice in maternity care:
randomised controlled trial in everyday practice. British Medical Journal 2002;
324: 643-646.
Melchart D., Streng A., Hoppe A., Brinkhaus B., Witt C., et al. Acupuncture in
patients with tension-type headache: randomised controlled trial. British Medical Journal
2005; 331: 376-382.
Wight J., Jakubovic M., Walters S., Maheswaran R., White P., & Lennon V.
Variation in cadaveric organ donor rates in the UK. Nephrology Dialysis
Transplantation 2004; 19(4): 963-968.
Binomial, Poisson and Gaussian distributions
Binomial distribution
The binomial distribution applies when there are two possible outcomes. You know the
probability of obtaining either outcome (traditionally called "success" and "failure") and want to
know the chance of obtaining a certain number of successes in a certain number of trials.
Poisson distribution
The Poisson distribution applies when you are counting the number of objects in a certain
volume or the number of events in a certain time period. You know the average number of
counts, and wish to know the chance of actually observing various numbers of objects or
events.
Gaussian distribution
The Gaussian distribution applies when the outcome is expressed as a number that can have
a fractional value. If there are numerous reasons why any particular measurement is different
than the mean, the distribution of measurements will tend to follow a Gaussian bell-shaped
distribution. If you know the mean and SD of this distribution, you can compute the fraction of
the population that is greater (or less) than any particular value.
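If you prefer to compute these quantities yourself rather than use an online calculator, the following sketch shows one way to do it (it assumes SciPy is installed; the parameter values are arbitrary examples, not taken from any calculator above):

from scipy.stats import binom, poisson, norm

print(binom.pmf(3, n=10, p=0.2))   # binomial: P(exactly 3 successes in 10 trials)
print(poisson.pmf(2, mu=4.5))      # Poisson: P(exactly 2 events when 4.5 are expected)
print(1 - norm.cdf(1.0))           # Gaussian: fraction lying more than 1 SD above the mean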
Normal Distribution, Binomial Distribution, Poisson Distribution
1. Binomial Distribution and Applications
2. Binomial Probability Distribution. Is the binomial distribution a continuous
distribution? Why? Notation: X ~ B(n,p). There are 4 conditions that need to be satisfied for a
binomial experiment: 1. There is a fixed number of n trials carried out. 2. The outcome of
a given trial is either a "success" or "failure". 3. The probability of success (p) remains
constant from trial to trial. 4. The trials are independent; the outcome of a trial is not
affected by the outcome of any other trial.
3. Comparison between binomial and normal distributions
4. Binomial Distribution. If X ~ B(n, p), then
P(X = r) = [n! / (r!(n−r)!)] p^r (1−p)^(n−r),   r = 0, 1, ..., n,
where r = number of successes in n trials, p = probability of success,
n! = n(n−1)(n−2)...1, and 0! = 1! = 1.
5. Exam Question: Ten percent of computer parts produced by a certain supplier are
defective. What is the probability that a sample of 10 parts contains more than 3
defective ones?
6. Solution: Method 1 (Using the Binomial Formula):
7. Method 2 (Using the Binomial Table):
8. From the table of the binomial distribution:
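Since the solution slides above only point to a table lookup, here is a quick computational check of the exam question (a sketch, not part of the original slides): X ~ B(10, 0.1) and P(X > 3) = 1 − P(X ≤ 3).

from math import comb

n, p = 10, 0.1
p_at_most_3 = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(4))
print(round(1 - p_at_most_3, 4))   # about 0.0128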
9. Example 2: If X is binomially distributed with 6 trials and a probability of success equal
to ¼ at each attempt, what is the probability of a) exactly 4 successes, b) at least one
success?
10. Example 3: Jeremy sells a magazine which is produced in order to raise money for
homeless people. The probability of making a sale is, independently, 0.50 for each
person he approaches. Given that he approaches 12 people, find the probability that he
will make: (a) 2 or fewer sales; (b) exactly 4 sales; (c) more than 5 sales.
11. Normal Distribution
12. Normal Distribution: In general, when we gather data, we expect to see a particular
pattern to the data, called a normal distribution. A normal distribution is one where the
data are distributed symmetrically around the mean, which when plotted as a histogram will result
in a bell curve, also known as a Gaussian distribution.
13. Thus, values tend towards the mean – the closer a value is to the mean, the more
often you'll see it; and the numbers of values on either side of the mean at any particular
distance are equal, or in symmetry.
15. Z-score: with the mean and standard deviation of a set of scores which are normally
distributed, we can standardize each "raw" score, x, by converting it into a z-score by
using the following formula on each individual score: z = (x − μ)/σ
16. Example 1: a) Find the z-score corresponding to a raw score of 132 from a normal
distribution with mean 100 and standard deviation 15. b) A z-score of 1.7 was found from
an observation coming from a normal distribution with mean 14 and standard deviation 3.
Find the raw score. Solution: a) We compute z = (132 − 100)/15 = 2.133. b) We
have 1.7 = (x − 14)/3. To solve this we just multiply both sides by the denominator 3:
(1.7)(3) = x − 14, so 5.1 = x − 14 and x = 19.1.
17. Example 2: Find a) P(z < 2.37), b) P(z > 1.82). Solution: a) We use the table. Notice the
picture on the table has a shaded region corresponding to the area to the left of (below) a
z-score. This is exactly what we want. Hence P(z < 2.37) = .9911. b) In this case, we want
the area to the right of 1.82. This is not what is given in the table. We can use the identity
P(z > 1.82) = 1 − P(z < 1.82). Reading the table gives P(z < 1.82) = .9656. Our answer is
P(z > 1.82) = 1 − .9656 = .0344.
18. Example 3: Find P(−1.18 < z < 2.1). Solution: Once again, the table does not exactly
handle this type of area. However, the area between −1.18 and 2.1 is equal to the area to
the left of 2.1 minus the area to the left of −1.18. That is, P(−1.18 < z < 2.1) = P(z < 2.1) − P(z < −1.18).
To find P(z < 2.1) we rewrite it as P(z < 2.10) and use the table to get P(z < 2.10) = .9821.
The table also tells us that P(z < −1.18) = .1190. Now subtract to get
P(−1.18 < z < 2.1) = .9821 − .1190 = .8631.
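The three worked z-score examples can be verified with the Python standard library (a sketch; statistics.NormalDist requires Python 3.8+):

from statistics import NormalDist

z = NormalDist()                              # standard Normal: mean 0, SD 1
print(round((132 - 100) / 15, 3))             # Example 1a: z = 2.133
print(round(z.cdf(2.37), 4))                  # Example 2a: P(z < 2.37) = 0.9911
print(round(1 - z.cdf(1.82), 4))              # Example 2b: P(z > 1.82) = 0.0344
print(round(z.cdf(2.10) - z.cdf(-1.18), 4))   # Example 3: 0.8631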
19. Poisson Distribution
20. Definitions: a discrete probability distribution for the count of events that occur
randomly in a given time; a discrete frequency distribution which gives the probability of
a number of independent events occurring in a fixed time.
21. The Poisson distribution uses only one formula:
P(X = x) = e^(−λ) λ^x / x!
where X = the number of events, λ = the mean number of events per interval, and e is
the constant, Euler's number (e = 2.71828...).
22. Example: Births in a hospital occur randomly at an average rate of 1.8 births per
hour. What is the probability of observing 4 births in a given hour at the hospital?
Assume X = No. of births in a given hour: i) events occur randomly, ii) mean rate λ = 1.8.
Using the Poisson formula, we can simply calculate the probability: P(X = 4) = (e^(−1.8))(1.8^4)/(4!). Ans: 0.0723.
23. If the probability of an item failing is 0.001, what is the probability of 3 failing out of
a population of 2000? λ = n × p = 2000 × 0.001 = 2. Hence, use the Poisson formula with X =
3: P(X = 3) = (e^(−2))(2^3)/(3!). Ans: 0.1804.
24. Example: A small life insurance company has determined that on the average it
receives 6 death claims per day. Find the probability that the company receives at least
seven death claims on a randomly selected day.
25. Analysis method: 1st, analyse the given data. 2nd, label the values of x and λ. At
least seven claims means X must be ≥ 7, but the values run up to infinity.
Hence, apply the probability rule P(X ≥ 7) = 1 − P(X ≤ 6), where P(X ≤ 6)
means that the value of x runs over 0, 1, 2, 3, 4, 5, 6. Total these probabilities using the
Poisson formula, then subtract the total from 1. Ans = 0.3938.
26. Example: The number of traffic accidents that occur on a particular stretch of road
during a month follows a Poisson distribution with a mean of 9.4. Find the probability that
fewer than two accidents will occur on this stretch of road during a randomly selected
month. P(X < 2) = P(X = 0) + P(X = 1). Ans: 0.000860.
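All four Poisson slide examples reduce to the same formula, so they can be checked with one helper function (a sketch of the arithmetic only):

from math import exp, factorial

def pois(k, lam):
    # P(X = k) = e^(-lambda) * lambda^k / k!
    return exp(-lam) * lam**k / factorial(k)

print(round(pois(4, 1.8), 4))                           # births: 0.0723
print(round(pois(3, 2000 * 0.001), 4))                  # item failures: 0.1804
print(round(1 - sum(pois(k, 6) for k in range(7)), 4))  # claims, P(X >= 7): about 0.394
print(round(pois(0, 9.4) + pois(1, 9.4), 6))            # accidents, P(X < 2): 0.00086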
Business Statistics:
Revealing Facts From Figures
URL for this site is:
http://ubmail.ubalt.edu/~harsham/Business-stat/opre504.htm
Professor Hossein Arsham
Introduction
Towards Statistical Thinking For Decision Making Under
Uncertainties
The Birth of Statistics
What is Business Statistics
Belief, Opinion, and Fact
Kinds of Lies: Lies, Damned Lies and Statistics
Probability for Statistical Inference
Different Schools of Thought in Inferential Statistics
Bayesian, Frequentist, and Classical Methods
Probability, Chance, Likelihood, and Odds
How to Assign Probabilities
General Laws of Probability
Mutually Exclusive versus Independent Events
Entropy Measure
Applications of and Conditions for Using Statistical Tables
Relationships Among Distributions and Unification of Statistical
Tables
Normal Distribution
Binomial Distribution
Poisson Distribution
Exponential Distribution
Uniform Distribution
Student's t-Distributions
Topics in Business Statistics
Greek Letters Commonly Used in Statistics
Type of Data and Levels of Measurement
Sampling Methods
Number of Class Intervals in a Histogram
How to Construct a Box Plot
Outlier Removal
Statistical Summaries
Representative of a Sample: Measures of Central Tendency
Selecting Among the Mean, Median, and Mode
Quality of a Sample: Measures of Dispersion
Guess a Distribution to Fit Your Data: Skewness & Kurtosis
A Numerical Example & Discussions
What Is So Important About the Normal Distributions
What Is a Sampling Distribution
What Is Central Limit Theorem
What Is "Degrees of Freedom"
Parameters' Estimation and Quality of a 'Good' Estimate
Procedures for Statistical Decision Making
Statistics with Confidence and Determining Sample Size
Hypothesis Testing: Rejecting a Claim
The Classical Approach to the Test of Hypotheses
The Meaning and Interpretation of P-values (what the data say)
Blending the Classical and the P-value Based Approaches in Test
of Hypotheses
Conditions Under Which Most Statistical Testings Apply
Homogeneous Population (Don't mix apples and oranges)
Test for Randomness: The Runs Test
Lilliefors Test for Normality
Statistical Tests for Equality of Populations Characteristics
Two-Population Independent Means (T-test)
Two Dependent Means (T-test for paired data sets)
More Than Two Independent Means (ANOVA)
More Than Two Dependent Means (ANOVA)
Power of a Test
Parametric vs. Non-Parametric vs. Distribution-free Tests
Chi-square Tests
Bonferroni Method
Goodness-of-fit Test for Discrete Random Variables
When We Should Pool Variance Estimates
Resampling Techniques: Jackknifing, and Bootstrapping
What is a Linear Least Squares Model
Pearson's and Spearman's Correlations
How to Compare Two Correlations Coefficients
Independence vs. Correlated
Correlation, and Level of Significance
Regression Analysis: Planning, Development, and Maintenance
Predicting Market Response
Warranties: Statistical Planning and Analysis
Factor Analysis
Interesting and Useful Sites (topical category)
Selected Reciprocal Web Sites
Review of Statistical Tools on the Internet
General References
Statistical Societies & Organizations
Statistics References
Statistics Resources
Statistical Data Analysis
Probability Resources
Data and Data Analysis
Computational Probability and Statistics Resources
Questionnaire Design, Surveys Sampling and Analysis
Statistical Software
Learning Statistics
Econometric and Forecasting
Selected Topics
Glossary Collections Sites
Statistical Tables
Introduction
This Web site is a course in statistics appreciation, i.e. a course to
acquire a feel for the statistical way of thinking. It is an
introductory course in statistics designed to provide you with
the basic concepts and methods of statistical analysis for
processes and products. Materials in this Web site are
tailored to meet your needs in business decision making. It
promotes thinking statistically. The cardinal objective for this
Web site is to increase the extent to which statistical
thinking is embedded in management thinking for decision
making under uncertainties. It is already an accepted fact
that "Statistical thinking will one day be as necessary for
efficient citizenship as the ability to read and write." So, let's
be ahead of our time.
To be competitive, businesses must design quality into
products and processes. Further, they must facilitate a
process of never-ending improvement at all stages of
manufacturing. A strategy employing statistical methods,
particularly statistically designed experiments, produces
processes that provide high yield and products that seldom
fail. Moreover, it facilitates development of robust products
that are insensitive to changes in the environment and
internal component variation. Carefully planned statistical
studies remove hindrances to high quality and productivity
at every stage of production, saving time and money. It is
well recognized that quality must be engineered into
products as early as possible in the design process. One
must know how to use carefully planned, cost-effective
experiments to improve, optimize and make robust products
and processes.
Business Statistics is a science assisting you to make
business decisions under uncertainties based on some
numerical and measurable scales. The decision making process
must be based on data, not on personal opinion or
belief.
Knowledge is more than knowing something technical.
Knowledge needs wisdom, and wisdom comes with age and
experience. Wisdom is about knowing how something
technical can be best used to meet the needs of the
decision-maker. Wisdom, for example, creates statistical
software that is useful, rather than technically brilliant.
The Devil is in the Deviations: Variation is an inevitability
in life! Every process has variation. Every measurement.
Every sample! Managers need to understand variation for
two key reasons: first, so that they can lead others to apply
statistical thinking in day-to-day activities; and second, so that
they can apply the concept for the purpose of continuous
improvement. This course will provide you with hands-on
experience to promote the use of statistical thinking and
techniques to apply them to make educated decisions
whenever you encounter variation in business data. You will
learn techniques to intelligently assess and manage the risks
inherent in decision-making. Therefore, remember that:
Just like weather, if you cannot control something,
you should learn how to measure and analyze, in
order to predict it, effectively.
If you have taken statistics before, and have a feeling of
inability to grasp concepts, it is largely due to your former
non-statistician instructors teaching statistics. Their
deficiencies lead students to develop phobias for the sweet
science of statistics. In this respect, the following remark
is made by Professor Herman Chernoff, in Statistical
Science, Vol. 11, No. 4, 335-350, 1996:
"Since everybody in the world thinks he can
teach statistics even though he does not know
any, I shall put myself in the position of teaching
biology even though I do not know any"
Plugging numbers into the formulas and crunching them has
no value by itself. You should continue to put effort
into the concepts and concentrate on interpreting the
results.
Even when you solve a small-sized problem by hand, I would
like you to use the available computer software and Web-based computation to do the dirty work for you.
You must be able to read off the logical secret in any
formula, not memorize it. For example, in computing
the variance, consider its formula. Instead of memorizing it,
you should start with some whys:
i. Why do we square the deviations from the mean?
Because if we add up all the deviations we always get zero. So,
to get away from this problem, we square the deviations.
Why not raise them to the power of four (three will not work)?
Since squaring does the trick, why should we make life more
complicated than it is? Notice also that squaring
magnifies the deviations; therefore it works to our
advantage for measuring the quality of the data.
ii. Why is there a summation notation in the formula?
To add up the squared deviation of each data point to
compute the total sum of squared deviations.
iii. Why do we divide the sum of squares by n-1?
The amount of deviation should also reflect how large the
sample is, so we must bring in the sample size. That is, in
general, larger sample sizes have a larger sum of squared
deviations from the mean. Okay, but why n-1 and not n? The
reason is that when you divide by n-1 the sample
variance provides, on average, a much closer estimate of the population variance
than when you divide by n. You will note that for
large sample sizes n (say over 30) it really does not matter
whether you divide by n or n-1. The results are almost the
same and acceptable. The factor n-1 is the so-called
"degrees of freedom".
This was just an example to show you how to
question the formulas rather than memorize them. In fact,
when you try to understand the formulas you do not need to
remember them; they become part of your brain's
connectivity. Clear thinking is always more important than
the ability to do a lot of arithmetic.
When you look at a statistical formula, the formula should
talk to you, as when a musician looks at a piece of musical
notes he/she hears the music. How do you become a statistician
who is also a musician?
The objectives for this course are to learn statistical
thinking; to emphasize more data and concepts, less theory
and fewer recipes; and finally to foster active learning using,
e.g., the useful and interesting Web-sites.
Some Topics in Business Statistics
Greek Letters Commonly Used as Statistical Notations
We use Greek letters in statistics and other scientific areas
to honor the ancient Greek philosophers who invented
science (such as Socrates, the inventor of dialectic
reasoning).
Greek Letters Commonly Used as Statistical Notations
alpha  beta  chi-square  delta  mu  nu  pi  rho  sigma  tau  theta
α      β     χ²          δ      μ   ν   π   ρ    σ      τ    θ
Note: Chi-square, χ², is not the square
of anything; its name is simply "Chi-square" (read: ki-square). "Ki"
by itself does not exist in statistics. I'm glad that you're overcoming
all the confusions that exist in learning statistics.
The Birth of Statistics
The original idea of "statistics" was the collection of
information about and for the "State".
The birth of statistics occurred in the mid-17th century. A
commoner named John Graunt, who was a native of
London, began reviewing a weekly church publication issued
by the local parish clerk that listed the number of births,
christenings, and deaths in each parish. These so-called Bills
of Mortality also listed the causes of death. Graunt, who was
a shopkeeper, organized these data in the form we call
descriptive statistics, which was published as Natural and
Political Observations Made upon the Bills of Mortality.
Shortly thereafter, he was elected as a member of the Royal
Society. Thus, statistics has to borrow some concepts from
sociology, such as the concept of "Population". It has been
argued that since statistics usually involves the study of
human behavior, it cannot claim the precision of the
physical sciences.
Probability has a much longer history. It originated from the
study of games of chance and gambling during the sixteenth
century. Probability theory was a branch of mathematics
studied by Blaise Pascal and Pierre de Fermat in the
seventeenth century. Currently, in the 21st century,
probabilistic modeling is used to control the flow of traffic
through a highway system, a telephone interchange, or a
computer processor; find the genetic makeup of individuals
or populations; quality control; insurance; investment; and
other sectors of business and industry.
New and ever-growing diverse fields of human activity are
using statistics; however, it seems that this field itself
remains obscure to the public. Professor Bradley Efron
expressed this fact nicely:
During the 20th Century statistical thinking and
methodology have become the scientific framework for
literally dozens of fields including education,
agriculture, economics, biology, and medicine, and
with increasing influence recently on the hard sciences
such as astronomy, geology, and physics. In other
words, we have grown from a small obscure field into a
big obscure field.
For the history of probability, and history of statistics,
visit History of Statistics Material. I also recommend the
following books.
Further Readings:
Daston L., Classical Probability in the Enlightenment,
Princeton University Press, 1988.
The book points out that early Enlightenment thinkers could
not face uncertainty. A mechanistic, deterministic machine,
was the Enlightenment view of the world.
Gillies D., Philosophical Theories of Probability, Routledge,
2000. Covers the classical, logical, subjective, frequency,
and propensity views.
Hacking I., The Emergence of Probability, Cambridge
University Press, London, 1975.
A philosophical study of early ideas about probability,
induction and statistical inference.
Peters W., Counting for Something: Statistical Principles and
Personalities, Springer, New York, 1987.
It teaches the principles of applied economic and social
statistics in a historical context. Featured topics include
public opinion polls, industrial quality control, factor
analysis, Bayesian methods, program evaluation, nonparametric and robust methods, and exploratory data
analysis.
Porter T., The Rise of Statistical Thinking, 1820-1900,
Princeton University Press, 1986.
The author states that statistics has become known in the
twentieth century as the mathematical tool for analyzing
experimental and observational data. Enshrined by public
policy as the only reliable basis for judgments as the efficacy
of medical procedures or the safety of chemicals, and
adopted by business for such uses as industrial quality
control, it is evidently among the products of science whose
influence on public and private life has been most pervasive.
Statistical analysis has also come to be seen in many
scientific disciplines as indispensable for drawing reliable
conclusions from empirical results. This new field of
mathematics found so extensive a domain of applications.
Stigler S., The History of Statistics: The Measurement of
Uncertainty Before 1900, U. of Chicago Press, 1990.
It covers the people, ideas, and events underlying the birth
and development of early statistics.
Tankard J., The Statistical Pioneers, Schenkman Books, New
York, 1984.
This work provides the detailed lives and times of theorists
whose work continues to shape much of the modern
statistics.
What is Business Statistics?
In this diverse world of ours, no two things are exactly the
same. A statistician is interested in both
the differences and the similarities, i.e. both patterns and
departures.
The actuarial tables published by insurance companies
reflect their statistical analysis of the average life
expectancy of men and women at any given age. From
these numbers, the insurance companies then calculate the
appropriate premiums for a particular individual to purchase
a given amount of insurance.
Exploratory analysis of data makes use of numerical and
graphical techniques to study patterns and departures from
patterns. The widely used descriptive statistical techniques
are: Frequency Distribution Histograms; Box & Whisker and
Spread plots; Normal plots; Cochrane (odds ratio) plots;
Scattergrams and Error Bar plots; Ladder, Agreement and
Survival plots; Residual, ROC and diagnostic plots; and
Population pyramid. Graphical modeling is a collection of
powerful and practical techniques for simplifying and
describing inter-relationships between many variables,
based on the remarkable correspondence between the
statistical concept of conditional independence and the
graph-theoretic concept of separation.
The controversial "Million Man March" on Washington in
1995 demonstrated that the size of a rally can have important
political consequences. March organizers steadfastly
maintained that the official attendance estimates offered by the
U.S. Park Service (300,000) were too low. Were they?
In examining distributions of data, you should be able to
detect important characteristics, such as shape, location,
variability, and unusual values. From careful observations of
patterns in data, you can generate conjectures about
relationships among variables. The notion of how one
variable may be associated with another permeates almost
all of statistics, from simple comparisons of proportions
through linear regression. The difference between
association and causation must accompany this conceptual
development.
Data must be collected according to a well-developed plan if
valid information on a conjecture is to be obtained. The plan
must identify important variables related to the conjecture
and specify how they are to be measured. From the data
collection plan, a statistical model can be formulated from
which inferences can be drawn.
Statistical models are currently used in various fields of
business and science. However, the terminology differs from
field to field. For example, the fitting of models to data,
called calibration, history matching, and data assimilation,
are all synonymous with parameter estimation.
Know that data are only crude information and not
knowledge by themselves. The sequence from data to
knowledge is: from Data to Information, from
Information to Facts, and finally, from Facts to
Knowledge. Data becomes information when it becomes
relevant to your decision problem. Information becomes fact
when the data can support it. Fact becomes knowledge
when it is used in the successful completion of decision
process. The following figure illustrates the statistical
thinking process based on data in constructing statistical
models for decision making under uncertainties.
That's why we need Business Statistics. Statistics arose
from the need to place knowledge on a systematic evidence
base. This required a study of the laws of probability, the
development of measures of data properties and
relationships, and so on.
The main objective of Business Statistics is to make
inference (prediction, making decisions) about certain
characteristics of a population based on information
contained in a random sample from the entire population, as
depicted below:
Business Statistics is the science of 'good' decision making
in the face of uncertainty and is used in many disciplines
such as financial analysis, econometrics, auditing,
production and operations including services improvement,
and marketing research. It provides knowledge and skills to
interpret and use statistical techniques in a variety of
business applications. A typical Business Statistics course is
intended for business majors, and covers statistical study,
descriptive statistics (collection, description, analysis, and
summary of data), probability, and the binomial and normal
distributions, test of hypotheses and confidence intervals,
linear regression, and correlation.
The following discussion refers to the above chart. Statistics
is a science of making decisions with respect to the
characteristics of a group of persons or objects on the basis
of numerical information obtained from a randomly selected
sample of the group.
At the planning stage of a statistical investigation the
question of sample size (n) is critical. This course provides a
practical introduction to sample size determination in the
context of some commonly used significance tests.
Population: A population is any entire collection of people,
animals, plants or things from which we may collect data. It
is the entire group we are interested in, which we wish to
describe or draw conclusions about. In the above figure, the
life of the light bulbs manufactured, say, by GE is the
population of concern.
Statistical Experiment
In order to make any generalization about a population, a
random sample from the entire population, that is meant to
be representative of the population, is often studied. For
each population there are many possible samples. A sample
statistic gives information about a corresponding population
parameter. For example, the sample mean for a set of data
would give information about the overall population
mean .
It is important that the investigator carefully and completely
defines the population before collecting the sample,
including a description of the members to be included.
Example: The population for a study of infant health might
be all children born in the U.S.A. in the 1980's. The sample
might be all babies born on 7th May in any of the years.
An experiment is any process or study which results in the
collection of data, the outcome of which is unknown. In
statistics, the term is usually restricted to situations in which
the researcher has control over some of the conditions
under which the experiment takes place.
Example: Before introducing a new drug treatment to
reduce high blood pressure, the manufacturer carries out an
experiment to compare the effectiveness of the new drug
with that of one currently prescribed. Newly diagnosed
subjects are recruited from a group of local general
practices. Half of them are chosen at random to receive the
new drug, the remainder receive the present one. So, the
researcher has control over the type of subject recruited and
the way in which they are allocated to treatment.
Experimental (or Sampling) Unit: A unit is a person, animal,
plant or thing which is actually studied by a researcher; the
basic objects upon which the study or experiment is carried
out. For example, a person; a monkey; a sample of soil; a
pot of seedlings; a postcode area; a doctor's practice.
Design of experiments is a key tool for increasing the rate
of acquiring new knowledge–knowledge that in turn can be
used to gain competitive advantage, shorten the product
development cycle, and produce new products and
processes which will meet and exceed your customer's
expectations.
The major task of statistics is to study the characteristics of
populations whether these populations are people, objects,
or collections of information. For two major reasons, it is
often impossible to study an entire population:
The process would be too expensive or time consuming.
The process would be destructive.
In either case, we would resort to looking at a sample
chosen from the population and trying to infer information
about the entire population by only examining the smaller
sample. Very often the numbers which interest us most
about the population are the mean μ and standard
deviation σ. Any number -- like the mean or standard
deviation -- which is calculated from an entire population is
called a Parameter. If the very same numbers are derived
only from the data of a sample, then the resulting numbers
are called Statistics. Frequently, parameters are represented
by Greek letters and statistics by Latin letters (as shown in
the above Figure). The step function in this figure is
the Empirical Distribution Function (EDF), known also
as Ogive, which is used to graph cumulative frequency. An
EDF is constructed by placing a point corresponding to
the middle point of each class at a height equal to the
cumulative frequency of the class. EDF represents the
distribution function Fx.
Parameter
A parameter is a value, usually unknown (and therefore has
to be estimated), used to represent a certain population
characteristic. For example, the population mean is a
parameter that is often used to indicate the average value of
a quantity.
Within a population, a parameter is a fixed value which does
not vary. Each sample drawn from the population has its
own value of any statistic that is used to estimate this
parameter. For example, the mean of the data in a sample
is used to give information about the overall mean in the
population from which that sample was drawn.
Statistic: A statistic is a quantity that is calculated from a
sample of data. It is used to give information about
unknown values in the corresponding population. For
example, the average of the data in a sample is used to give
information about the overall average in the population from
which that sample was drawn.
It is possible to draw more than one sample from the same
population and the value of a statistic will in general vary
from sample to sample. For example, the average value in a
sample is a statistic. The average values in more than one
sample, drawn from the same population, will not
necessarily be equal.
Statistics are often assigned Roman letters (e.g. x̄ and s),
whereas the equivalent unknown values in the population
(parameters) are assigned Greek letters (e.g. µ, σ).
The word estimate means to esteem, that is giving a value
to something. A statistical estimate is an indication of the
value of an unknown quantity based on observed data.
More formally, an estimate is the particular value of an
estimator that is obtained from a particular sample of data
and used to indicate the value of a parameter.
Example: Suppose the manager of a shop wanted to
know μ, the mean expenditure of customers in her shop in
the last year. She could calculate the average expenditure of
the hundreds (or perhaps thousands) of customers who
bought goods in her shop, that is, the population mean μ.
Instead she could use an estimate of this population
mean μ by calculating the mean of a representative sample
of customers. If this value was found to be $25, then $25
would be her estimate.
There are two broad subdivisions of statistics: Descriptive
statistics and Inferential statistics.
The principal descriptive quantity derived from sample data
is the mean (x̄), which is the arithmetic average of the
sample data. It serves as the most reliable single measure
of the value of a typical member of the sample. If the
sample contains a few values that are so large or so small
that they have an exaggerated effect on the value of the
mean, the sample is more accurately represented by the
median -- the value where half the sample values fall below
and half above.
The quantities most commonly used to measure the
dispersion of the values about their mean are the variance
s² and its square root, the standard deviation s. The
variance is calculated by determining the mean, subtracting
it from each of the sample values (yielding the deviation of
the samples), and then averaging the squares of these
deviations. The mean and standard deviation of the sample
are used as estimates of the corresponding characteristics of
the entire group from which the sample was drawn. They do
not, in general, completely describe the distribution (Fx) of
values within either the sample or the parent group; indeed,
different distributions may have the same mean and
standard deviation. They do, however, provide a complete
description of the Normal Distribution, in which positive and
negative deviations from the mean are equally common and
small deviations are much more common than large ones.
For a normally distributed set of values, a graph showing
the dependence of the frequency of the deviations upon
their magnitudes is a bell-shaped curve. About 68 percent of
the values will differ from the mean by less than the
standard deviation, and almost 100 percent will differ by
less than three times the standard deviation.
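Those percentages are easy to confirm numerically (a sketch using the standard library's NormalDist, Python 3.8+):

from statistics import NormalDist

z = NormalDist()                         # standard Normal distribution
print(round(z.cdf(1) - z.cdf(-1), 3))    # within 1 SD of the mean: about 0.683
print(round(z.cdf(3) - z.cdf(-3), 4))    # within 3 SD of the mean: about 0.9973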
Statistical inference refers to extending your knowledge
obtained from a random sample from the entire population
to the whole population. This is known in mathematics
as Inductive Reasoning. That is, knowledge of the whole
from a particular. Its main application is in hypotheses
testing about a given population.
Inferential statistics is concerned with making inferences
from samples about the populations from which they have
been drawn. In other words, if we find a difference between
two samples, we would like to know, is this a "real"
difference (i.e., is it present in the population) or just a
"chance" difference (i.e. it could just be the result of random
sampling error). That's what tests of statistical significance
are all about.
Statistical inference guides the selection of appropriate
statistical models. Models and data interact in statistical
work. Models are used to draw conclusions from data, while
the data are allowed to criticize, and even falsify the model
through inferential and diagnostic methods. Inference from
data can be thought of as the process of selecting a
reasonable model, including a statement in probability
language of how confident one can be about the selection.
Inferences made in statistics are of two types. The first
is estimation, which involves the determination, with a
possible error due to sampling, of the unknown value of a
population characteristic, such as the proportion having a
specific attribute or the average value μ of some numerical
measurement. To express the accuracy of the estimates of
population characteristics, one must also compute the
"standard errors" of the estimates; these are margins that
determine the possible errors arising from the fact that the
estimates are based on random samples from the entire
population and not on a complete population census. The
second type of inference is hypothesis testing. It involves
the definitions of a "hypothesis" as one set of possible
population values and an "alternative," a different set. There
are many statistical procedures for determining, on the
basis of a sample, whether the true population characteristic
belongs to the set of values in the hypothesis or the
alternative.
The statistical inference is grounded in probability, idealized
concepts of the group under study, called the population,
and the sample. The statistician may view the population as
a set of balls from which the sample is selected at random,
that is, in such a way that each ball has the same chance as
every other one for inclusion in the sample.
Notice that to be able to estimate the population
parameters, the sample size n must be greater than one.
For example, with a sample size of one the variation (s²)
within the sample is 0/1 = 0. An estimate for the variation
(σ²) within the population would be 0/0, which is an
indeterminate quantity, meaning impossible. For working
with zero correctly, visit the Web site The Zero Saga &
Confusions With Numbers.
Probability is the tool used for anticipating what the
distribution of data should look like under a given model.
Random phenomena are not haphazard: they display an
order that emerges only in the long run and is described by
a distribution. The mathematical description of variation is
central to statistics. The probability required for statistical
inference is not primarily axiomatic or combinatorial, but is
oriented toward describing data distributions.
Statistics is a tool that enables us to impose order on the
disorganized cacophony of the real world of modern society.
The business world has grown both in size and competition.
Corporations must undertake risky ventures, hence the
growth in popularity of and need for business statistics.
Business statistics has grown out of the art of constructing
charts and tables! It is a science of basing decisions on
numerical data in the face of uncertainty.
Business statistics is a scientific approach to decision making
under risk. In practicing business statistics, we search for an
insight, not the solution. Our search is for the one solution
that meets all the business's needs with the lowest level of
risk. Business statistics can take a normal business situation
and with the proper data gathering, analysis, and re-search
for a solution, turn it into an opportunity.
While business statistics cannot replace the knowledge and
experience of the decision maker, it is a valuable tool that
the manager can employ to assist in the decision making
process in order to reduce the inherent risk.
Business Statistics provides justifiable answers to the
following concerns for every consumer and producer:
1. What is your or your customer's Expectation of the
product/service you buy or that you sell? That is, what
is a good estimate for μ?
2. Given the information about your or your customer's
expectation, what is the Quality of the product/service
you buy or you sell? That is, what is a good estimate
for σ?
3. Given the information about your or your customer's
expectation, and the quality of the product/service you
buy or you sell, does the product/service Compare with
other existing similar types? That is, comparing
several μ's.
Visit also the following Web sites:
What is Statistics?
How to Study Statistics
Decision Analysis
Kinds of Lies: Lies, Damned Lies and Statistics
"There are three kinds of lies -- lies, damned lies, and
statistics." quoted in Mark Twain's autobiography.
It is already an accepted fact that "Statistical thinking will
one day be as necessary for efficient citizenship as the
ability to read and write."
The following are some examples of how statistics could be
misused in advertising, which can be described as the
science of arresting human unintelligence long enough to
get money from it. The founder of Revlon said, "In the factory
we make cosmetics; in the store we sell hope."
In most cases, the deception of advertising is achieved by
omission:
1. The Incredible Expansion Toyota: "How can it be
that an automobile that's a mere nine inches longer on
the outside gives you over two feet more room on the
inside? Maybe it's the new math!" Toyota Camry Ad.
Where is the fallacy in this statement? Taking volume
as length! For example: 3×6×4 = 72 cubic feet,
3×6×4.75 = 85.5 cubic feet. It could be even more
than 2 feet!
2. Pepsi Cola Ad.: " In recent side-by-side blind taste
tests, nationwide, more people preferred Pepsi over
Coca-Cola".
The questions are: Was it just some of the taste tests,
and what was the sample size? It does not say "In all
recent…"
3. Correlation? Consortium of Electric Companies Ad.
"96% of streets in the US are under-lit and, moreover,
88% of crimes take place on under-lit streets".
4. Dependent or Independent Events? "If the
probability of someone carrying a bomb on a plane is
.001, then the chance of two people carrying a bomb is
.000001. Therefore, I should start carrying a bomb on
every flight."
5. Paperboard Packaging Council's
concerns: "University studies show paper milk cartons
give you more vitamins to the gallon."
How was the experiment designed? The research was
sponsored by the council! Paperboard sales are
declining!
6. All the vitamins or just one? "You'd have to eat four
bowls of Raisin Bran to get the vitamin nutrition in one
bowl of Total".
7. Six Times as Safe: "Last year 35 people drowned in
boating accidents. Only 5 were wearing life jackets.
The rest were not. Always wear life jacket when
boating".
What percentage of boaters wear life jackets?
Conditional probability.
8. A Tax Accountant Firm Ad.: "One of our officers
would accompany you in the case of Audit".
This sounds like a unique selling proposition, but it
conceals the fact that the statement is a US Law.
9. Dunkin Donuts Ad.: "Free 3 muffins when you buy
three at the regular 1/2 dozen price."
References and Further Readings:
200% of Nothing, by A. Dewdney, John Wiley, New York,
1993. Based on his articles about math abuse in Scientific
American, Dewdney lists the many ways we are manipulated
with fancy mathematical footwork and faulty thinking in
print ads, the news, company reports and product labels. He
shows how to detect the full range of math abuses and
defend against them.
The Informed Citizen: Argument and Analysis for Today, by
W. Schindley, Harcourt Brace, 1996. This rhetoric/reader
explores the study and practice of writing argumentative
prose. The focus is on exploring current issues in
communities, from the classroom to cyberspace. The
"interacting in communities" theme and the high-interest
readings engage students, while helping them develop
informed opinions, effective arguments, and polished
writing.
Visit also the Web site: Glossary of Mathematical Mistakes.
Belief, Opinion, and Fact
The letters in your course number: OPRE 504, stand for
OPerations RE-search. OPRE is a science of making
decisions (based on some numerical and measurable
scales) by searching, and re-searching for a solution. I refer
you to What Is OR/MS? for a deeper understanding of what
OPRE is all about. Decision making under uncertainty must
be based on facts not on personal opinion nor on belief.
Belief, Opinion, and Fact

                  Belief          Opinion            Fact
Self says         I'm right       This is my view    This is a fact
Says to others    You're wrong    That is yours      I can prove it to you
Sensible decisions are always based on facts. We should not
confuse facts with beliefs or opinions. Beliefs are defined as
someone's own understanding or needs. In belief, "I am"
always right and "you" are wrong. There is nothing that can
be done to convince the person that what they believe in is
wrong. Opinions are slightly less extreme than beliefs. An
opinion means that a person has certain views that they
think are right. They also know that others are entitled to
their own opinions. People respect other's opinions and in
turn expect the same. Contrary to beliefs and opinions are
facts. Facts are the basis of decisions. A fact is something
that is right, and one can prove it to be true based on
evidence and logical arguments.
Examples for belief, opinion, and facts can be found in
religion, economics, and econometrics, respectively.
With respect to belief, Henri Poincaré said "Doubt everything
or believe everything: these are two equally convenient
strategies. With either we dispense with the need to think."
How to Assign Probabilities?
Probability is an instrument to measure the likelihood of the
occurrence of an event. There are three major approaches
of assigning probabilities as follows:
1. Classical Approach: Classical probability is predicated
on the condition that the outcomes of an experiment
are equally likely to happen. The classical probability
utilizes the idea that the lack of knowledge implies that
all possibilities are equally likely. The classical
probability is applied when the events have the same
chance of occurring (called equally likely events), and
the set of events are mutually exclusive and
collectively exhaustive. The classical probability is
defined as:
P(X) = Number of favorable outcomes / Total number
of possible outcomes
2. Relative Frequency Approach: Relative probability is
based on accumulated historical or experimental data.
Frequency-based probability is defined as:
P(X) = Number of times an event occurred / Total
number of opportunities for the event to occur.
Note that relative probability is based on the ideas that
what has happened in the past will hold.
3. Subjective Approach: The subjective probability is
based on personal judgment and experience. For
example, medical doctors sometimes assign subjective
probability to the length of life expectancy for a person
who has cancer.
General Laws of Probability
1. General Law of Addition: When two or more events
will happen at the same time, and the events are
not mutually exclusive, then:
P(X or Y) = P(X) + P(Y) - P(X and Y)
2. Special Law of Addition: When two or more events
will happen at the same time, and the
events are mutually exclusive, then:
P(X or Y) = P(X) + P(Y)
3. General Law of Multiplication: When two or more
events will happen at the same time, and the
events are dependent, then the general rule of
multiplicative law is used to find the joint probability:
P(X and Y) = P(X) . P(Y|X),
where P(Y|X) is a conditional probability.
4. Special Law of Multiplication: When two or more
events will happen at the same time, and the
events are independent, then the special rule of
multiplication law is used to find the joint probability:
P(X and Y) = P(X) . P(Y)
5. Conditional Probability Law: A conditional
probability is denoted by P(X|Y). This phrase is read:
the probability that X will occur given that Y is known
to have occurred.
Conditional probabilities are based on knowledge of
one of the variables. The conditional probability of an
event, such as X, occurring given that another event,
such as Y, has occurred is expressed as:
P(X|Y) = P(X and Y) / P(Y)
provided P(Y) is not zero. Note that when using the
conditional law of probability, you always divide the
joint probability by the probability of the event after
the word given. Thus, to get P(X given Y), you divide
the joint probability of X and Y by the unconditional
probability of Y. In other words, the above equation is
used to find the conditional probability for any
two dependent events.
A special case of the Bayes Theorem is:
P(X|Y) = P(Y|X). P(X) / P(Y)
If two events, such as X and Y,
are independent then:
P(X|Y) = P(X),
and
P(Y|X) = P(Y)
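A small sketch with a single fair die makes the laws above concrete (the events "even" and "greater than 3" are made-up examples, not from the text):

outcomes = set(range(1, 7))   # a fair six-sided die
X = {2, 4, 6}                 # event X: an even number
Y = {4, 5, 6}                 # event Y: a number greater than 3

def P(event):
    return len(event) / len(outcomes)

# General law of addition: P(X or Y) = P(X) + P(Y) - P(X and Y)
print(P(X | Y), P(X) + P(Y) - P(X & Y))       # both give 0.666...

# Conditional probability: P(X | Y) = P(X and Y) / P(Y)
print(P(X & Y) / P(Y))                        # 2/3

# Bayes: P(X | Y) = P(Y | X) * P(X) / P(Y)
print((P(X & Y) / P(X)) * P(X) / P(Y))        # the same 2/3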
Mutually Exclusive versus Independent Events
Mutually Exclusive (ME): Event A and B are M.E if both
cannot occur simultaneously. That is, P[A and B] = 0.
Independency (Ind.): Events A and B are independent if
having the information that B already occurred does not
change the probability that A will occur. That is P[A given B
occurred] = P[A].
If two events are ME they are also Dependent: P(A given B)
= P[A and B]/P[B], and since P[A and B] = 0 (by ME), then
P[A given B] = 0. Similarly,
if two events are Independent then they are also not ME.
If two events are Dependent then they may or may not be
ME.
If two events are not ME, then they may or may not be
Independent.
The following Figure contains all possibilities. The notations
used in this table are as follows: X means does not imply,
question mark ? means it may or may not imply, while
the check mark means it implies.
Bernstein was the first to discover that (probabilistic)
pairwise independence and mutual independence for a
collection of events A1,..., An are different notions.
Different Schools of Thought in Inferential Statistics
There are a few different schools of thought in statistics.
They are introduced sequentially in time, by necessity.
The Birth Process of a New School of Thought
The process of devising a new school of thought in any field
has always taken a natural path. Birth of new schools of
thought in statistics is not an exception. The birth process is
outlined below:
Given an already established school, one must work within
the defined framework.
A crisis appears, i.e., some inconsistencies in the framework
result from its own laws.
Response behavior:
1. Reluctance to consider the crisis.
2. Try to accommodate and explain the crisis within the
existing framework.
3. Conversion of some well-known scientists attracts
followers in the new school.
The following Figure illustrates the three major schools of
thought; namely, the Classical (attributed to Laplace),
Relative Frequency (attributed to Fisher), and Bayesian
(attributed to Savage). The arrows in this figure represent
some of the main criticisms among Objective, Frequentist,
and Subjective schools of thought. To which school do you
belong? Read the conclusion in this figure.
Bayesian, Frequentist, and Classical Methods
The problem with the Classical Approach is that what
constitutes an outcome is not objectively determined. One
person's simple event is another person's compound event.
One researcher may ask, of a newly discovered planet,
"what is the probability that life exists on the new planet?"
while another may ask "what is the probability that carbon-based life exists on it?"
Bruno de Finetti, in the introduction to his two-volume
treatise on Bayesian ideas, clearly states that "Probabilities
Do not Exist". By this he means that probabilities are not
located in coins or dice; they are not characteristics of
things like mass, density, etc.
Some Bayesian approaches consider probability theory as an
extension of deductive logic to handle uncertainty. It
purports to deduce from first principles the uniquely correct
way of representing your beliefs about the state of things,
and updating them in the light of the evidence. The laws of
probability have the same status as the laws of logic. This
Bayesian approach is explicitly "subjective" in the sense that
it deals with the plausibility which a rational agent ought to
attach to the propositions she considers, "given her current
state of knowledge and experience." By contrast, at least
some non-Bayesian approaches consider probabilities as
"objective" attributes of things (or situations) which are
really out there (availability of data).
A Bayesian and a classical statistician analyzing the same
data will generally reach the same conclusion. However, the
Bayesian is better able to quantify the true uncertainty in his
analysis, particularly when substantial prior information is
available. Bayesians are willing to assign probability
distribution function(s) to the population's parameter(s)
while frequentists are not.
From a scientist's perspective, there are good grounds to
reject Bayesian reasoning. The problem is that Bayesian
reasoning deals not with objective, but subjective
probabilities. The result is that any reasoning using a
Bayesian approach cannot be publicly checked -- something
that makes it, in effect, worthless to science, like non-replicable experiments.
Bayesian perspectives often shed a helpful light on classical
procedures. It is necessary to go into a Bayesian framework
to give confidence intervals the probabilistic interpretation
which practitioners often want to place on them. This insight
is helpful in drawing attention to the point that another prior
distribution would lead to a different interval.
A Bayesian may cheat by basing the prior distribution on the
data; a Frequentist can base the hypothesis to be tested on
the data. For example, the role of a protocol in clinical trials
is to prevent this from happening by requiring the
hypothesis to be specified before the data are collected. In
the same way, a Bayesian could be obliged to specify the
prior in a public protocol before beginning a study. In a
collective scientific study, this would be somewhat more
complex than for Frequentist hypotheses because priors
must be personal for coherence to hold.
A suitable quantity that has been proposed to measure inferential uncertainty, i.e., to handle the a priori unexpected, is the likelihood function itself.
If you perform a series of identical random experiments (e.g., coin tosses), the probability distribution that maximizes the likelihood of the outcome you observed is the empirical distribution, i.e., the distribution proportional to the observed frequencies.
This has the direct interpretation of telling how (relatively)
well each possible explanation (model), whether obtained
from the data or not, predicts the observed data. If the data
happen to be extreme ("atypical") in some way, so that the
likelihood points to a poor set of models, this will soon be
picked up in the next rounds of scientific investigation by the
scientific community. No long run frequency guarantee nor
personal opinions are required.
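As a minimal added illustration of this point (not from the original text), the following Python snippet evaluates the likelihood of an observed coin-toss outcome over a grid of candidate values of p and shows that it peaks at the observed relative frequency:

# Likelihood of observing 7 heads in 10 independent tosses, as a function of p.
heads, n = 7, 10

def likelihood(p):
    # Binomial likelihood, up to a constant factor that does not depend on p.
    return (p ** heads) * ((1 - p) ** (n - heads))

grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
print(best)        # 0.7, the observed relative frequency of heads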
There is a sense in which the Bayesian approach is oriented
toward making decisions and the frequentist hypothesis
testing approach is oriented toward science. For example,
there may not be enough evidence to show scientifically that
agent X is harmful to human beings, but one may be
justified in deciding to avoid it in one's diet.
Since the probability (or the distribution of possible
probabilities) is continuous, the probability that the
probability is any specific point estimate is really zero. This
means that in a vacuum of information, we can make no
guess about the probability. Even if we have information, we
can really only guess at a range for the probability.
Further Readings:
Lad F., Operational Subjective Statistical Methods, Wiley,
1996. Presents a systematic treatment of subjectivist
methods along with a good discussion of the historical and
philosophical backgrounds of the major approaches to
probability and statistics.
Plato, Jan von, Creating Modern Probability, Cambridge
University Press, 1994. This book provides a historical point
of view on subjectivist and objectivist probability school of
thoughts.
Weatherson B., Begging the question and
Bayesians, Studies in History and Philosophy of Science,
30(4), 687-697, 1999.
Zimmerman H., Fuzzy Set Theory, Kluwer Academic
Publishers, 1991. Fuzzy logic approaches to probability
(based on L.A. Zadeh and his followers) present a difference
between "possibility theory" and probability theory.
For more information, visit the Web sites Bayesian Inference for the Physical Sciences, Bayesians vs. Non-Bayesians, Society for Bayesian Analysis, Probability Theory As Extended Logic, and Bayesians worldwide.
Type of Data and Levels of Measurement
Information can be collected in statistics using qualitative or
quantitative data.
Qualitative data, such as the eye color of a group of individuals, are not computable by arithmetic relations. They are labels that indicate the category or class to which an individual, object, or process belongs. They are called categorical variables.
Quantitative data sets consist of measures that take
numerical values for which descriptions such as means and
standard deviations are meaningful. They can be put into an
order and further divided into two groups: discrete data or
continuous data. Discrete data are countable data, for
example, the number of defective items produced during a
day's production. Continuous data arise when the parameters (variables) are measurable and are expressed on a continuous scale; for example, the height of a person.
The first activity in statistics is to measure or count.
Measurement/counting theory is concerned with the
connection between data and reality. A set of data is a
representation (i.e., a model) of reality based on numerical and measurable scales. Data are called "primary
type" data if the analyst has been involved in collecting the
data relevant to his/her investigation. Otherwise, it is called
"secondary type" data.
Data come in the forms of Nominal, Ordinal, Interval and
Ratio (remember the French word NOIR for color black).
Data can be either continuous or discrete.
Levels of Measurement
_________________________________________
                        Nominal   Ordinal   Interval/Ratio
Ranking?                no        yes       yes
Numerical difference?   no        no        yes
Zero and unit of measurement are arbitrary in the Interval
scale. While the unit of measurement is arbitrary in Ratio
scale, its zero point is a natural attribute. The categorical
variable is measured on an ordinal or nominal scale.
Measurement theory is concerned with the connection
between data and reality. Both statistical theory and
measurement theory are necessary to make inferences
about reality.
Since statisticians live for precision, they prefer
Interval/Ratio levels of measurement.
Visit the Web site Measurement theory: Frequently Asked
Questions
Number of Class Intervals in a Histogram
Before we can construct our frequency distribution we must
determine how many classes we should use. This is purely
arbitrary, but too few classes or too many classes will not
provide as clear a picture as can be obtained with some
more nearly optimum number. An empirical relationship, known as Sturges' Rule, may be used as a guide for determining the number of classes k:
k = the smallest integer greater than or equal to 1 + 3.322 Log(n)
where k is the number of classes, Log is in base 10, and n is the total number of numerical values in the data set.
Therefore, the class width is:
(highest value - lowest value) / (1 + 3.322 Log(n))
where n is the total number of items in the data set.
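A small Python sketch of this guideline (added for illustration; the function names are ours) is:

import math

def sturges_classes(n):
    """Number of classes suggested by Sturges' Rule for a sample of size n."""
    return math.ceil(1 + 3.322 * math.log10(n))

def class_width(data):
    """Suggested class width: the range divided by the Sturges class count."""
    return (max(data) - min(data)) / sturges_classes(len(data))

print(sturges_classes(100))    # 8 classes for n = 100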
To have an "optimum" you need some measure of quality -presumably in this case, the "best" way to display whatever
information is available in the data. The sample size
contributes to this; so the usual guidelines are to use
between 5 and 15 classes, with more classes possible if you
have a larger sample. You should take into account a
preference for tidy class widths, preferably a multiple of 5 or
10, because this makes it easier to understand.
Beyond this it becomes a matter of judgement. Try out a
range of class widths, and choose the one that works best.
(This assumes you have a computer and can generate
alternative histograms fairly readily.)
There are often management issues that come into play as
well. For example, if your data is to be compared to similar
data -- such as prior studies, or from other countries -- you
are restricted to the intervals used therein.
If the histogram is very skewed, then unequal classes
should be considered. Use narrow classes where the class
frequencies are high, wide classes where they are low.
The following approaches are common:
Let n be the sample size, then the number of class intervals
could be
MIN { sqrt(n), 10 Log(n) }.
The Log is the logarithm in base 10. Thus for 200
observations you would use 14 intervals but for 2000 you
would use 33.
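A quick check of this rule (an added Python sketch) reproduces the interval counts quoted above:

import math

def num_intervals(n):
    # MIN { sqrt(n), 10 log10(n) }, rounded down to a whole number of intervals.
    return int(min(math.sqrt(n), 10 * math.log10(n)))

print(num_intervals(200))     # 14
print(num_intervals(2000))    # 33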
Alternatively,
1. Find the range (highest value - lowest value).
2. Divide the range by a reasonable interval size: 2, 3, 5,
10 or a multiple of 10.
3. Aim for no fewer than 5 intervals and no more than 15.
Visit also the Web site Histogram Applet, and Histogram
Generator
How to Construct a BoxPlot
A BoxPlot is a graphical display that has many
characteristics. It includes the presence of possible outliers.
It illustrates the range of data. It shows a measure of
dispersion such as the upper quartile, lower quartile and
interquartile range (IQR) of the data set as well as the
median as a measure of central location which is useful for
comparing sets of data. It also gives an indication of the
symmetry or skewness of the distribution. The main reason
for the popularity of boxplots is that they offer a lot of
information in a compact way.
Steps to Construct a BoxPlot:
1. Horizontal lines are drawn at the median and at the
upper and lower quartiles. These horizontal lines are
joined by vertical lines to produce the box.
2. A vertical line is drawn up from the upper quartile to the most extreme data point that is within a distance of 1.5 (IQR) of the upper quartile. A similarly defined vertical line is drawn down from the lower quartile.
3. Each data point beyond the end of the vertical lines is marked with an asterisk (*).
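The quantities needed for the plot can be computed as in the following added Python sketch (it uses Python's statistics.quantiles, whose quartile definition may differ slightly from your textbook's):

import statistics

def boxplot_summary(data):
    """Return quartiles, whisker ends and outliers for a boxplot."""
    xs = sorted(data)
    q1, q2, q3 = statistics.quantiles(xs, n=4)        # lower quartile, median, upper quartile
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme data points inside the fences.
    lower_whisker = min(x for x in xs if x >= lo_fence)
    upper_whisker = max(x for x in xs if x <= hi_fence)
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return q1, q2, q3, lower_whisker, upper_whisker, outliers

print(boxplot_summary([1, 2, 4, 7, 10, 12, 35]))      # 35 is flagged as an outlier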
Probability, Chance, Likelihood, and Odds
"Probability" has an exact technical meaning -- well, in fact
it has several, and there is still debate as to which term
ought to be used. However, for most events for which the probability is easily computed, e.g., the rolling of a die, the probability of getting a four [::], almost all agree on the actual value (1/6), if not the philosophical interpretation. A
probability is always a number between 0 [not "quite" the
same thing as impossibility: it is possible that "if" a coin
were flipped infinitely many times, it would never show
"tails", but the probability of an infinite run of heads is 0]
and 1 [again, not "quite" the same thing as certainty but
close enough].
The word "chance" or "chances" is often used as an
approximate synonym of "probability", either for variety or
to save syllables. It would be better practice to leave
"chance" for informal use, and say "probability" if that is
what is meant.
In cases where the probability of an observation is described
by a parametric model, the "likelihood" of a parameter value
given the data is defined to be the probability of the data
given the parameter. One occasionally sees "likely" and
"likelihood", however, these terms are used casually as
synonyms for "probable" and "probability".
"Odds" is a probabilistic concept related to probability. It is
the ratio of the probability (p) of an event to the probability
(1-p) that it does not happen: p/(1-p). It is often expressed
as a ratio, often of whole numbers; e.g., "odds" of 1 to 5 in
the die example above, but for technical purposes the
division may be carried out to yield a positive real number
(here 0.2). The logarithm of the odds ratio is useful for
technical purposes, as it maps the range of probabilities
onto the (extended) real numbers in a way that preserves
symmetry between the probability that an event occurs and
the probability that it does not occur.
Odds are a ratio of nonevents to events. If the event rate
for a disease is 0.1 (10 per cent), its nonevent rate is 0.9
and therefore its odds are 9:1. Note that this is not the
same expression as the inverse of event rate.
Another way to compare probabilities and odds is using
"part-whole thinking" with a binary (dichotomous) split in a
group. A probability is often a ratio of a part to a whole;
e.g., the ratio of the part [those who survived 5 years after
being diagnosed with a disease] to the whole [those who
were diagnosed with the disease]. Odds are often a ratio of
a part to a part; e.g., the odds against dying are the ratio of
the part that succeeded [those who survived 5 years after
being diagnosed with a disease] to the part that 'failed'
[those who did not survive 5 years after being diagnosed
with a disease].
Obviously, probability and odds are intimately related: Odds
= p / (1-p). Note that probability is always between zero
and one, whereas odds range from zero to infinity.
Aside from their value in betting, odds allow one to specify a
small probability (near zero) or a large probability (near
one) using large whole numbers (1,000 to 1 or a million to
one). Odds magnify small probabilities (or large
probabilities) so as to make the relative differences visible.
Consider two probabilities: 0.01 and 0.005. They are both
small. An untrained observer might not realize that one is
twice as much as the other. But if expressed as odds (99 to
1 versus 199 to 1) it may be easier to compare the two
situations by focusing on large whole numbers (199 versus
99) rather than on small ratios or fractions.
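A short Python sketch of these probability-odds conversions (an added illustration) is:

def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)

def probability(odds_value):
    """Probability corresponding to given odds in favour."""
    return odds_value / (1 + odds_value)

print(odds(1 / 6))                 # 0.2, i.e. odds of 1 to 5 for rolling a four
print(odds(0.1))                   # about 0.111, i.e. 1 to 9 in favour (9 to 1 against)
print(odds(0.01), odds(0.005))     # roughly 1/99 and 1/199, as in the text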
Visit also the Web site Counting and Combinatorial
What Is "Degrees of Freedom"
Recall that in estimating the population's variance, we used
(n-1) rather than n, in the denominator. The factor (n-1) is
called "degrees of freedom."
Estimation of the Population Variance: Variance in a
population is defined as the average of squared deviations
from the population mean. If we draw a random sample of n
cases from a population where the mean is known, we can
estimate the population variance in an intuitive way. We
sum the deviations of scores from the population mean and
divide this sum by n. This estimate is based on n
independent pieces of information and we have n degrees of
freedom. Each of the n observations, including the last one,
is unconstrained ('free' to vary).
When we do not know the population mean, we can still
estimate the population variance, but now we compute
deviations around the sample mean. This introduces an
important constraint because the sum of the deviations
around the sample mean is known to be zero. If we know
the value for the first (n-1) deviations, the last one is
known. There are only n-1 independent pieces of
information in this estimate of variance.
If you study a system with n parameters xi, i = 1,..., n, you can represent it in an n-dimensional space. Any point of this space represents a potential state of your system. If your n parameters could vary independently, then your system would be fully described in an n-dimensional hyper-volume. Now, imagine you have one constraint between the parameters (an equation relating your n parameters); then your system would be described by an (n-1)-dimensional hyper-surface. For example, in three-dimensional space, a linear relationship defines a plane, which is 2-dimensional.
In statistics, your n parameters are your n data. To evaluate
variance, you first need to infer the mean E(X). So when
you evaluate the variance, you've got one constraint on your
system (which is the expression of the mean), and it only
remains (n-1) degrees of freedom to your system.
Therefore, we divide the sum of squared deviations by n-1
rather than by n when we have sample data. On average,
deviations around the sample mean are smaller than
deviations around the population mean. This is because our
sample mean is always in the middle of our sample scores;
in fact the minimum possible sum of squared deviations for
any sample of numbers is around the mean for that sample
of numbers. Thus, if we sum the squared deviations from
the sample mean and divide by n, we have an
underestimate of the variance in the population (which is
based on deviations around the population mean).
If we divide the sum of squared deviations by n-1 instead of
n, our estimate is a bit larger, and it can be shown that this
adjustment gives us an unbiased estimate of the population
variance. However, for large n, say over 30, it does not make much of a difference whether we divide by n or by n - 1.
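A small simulation (an added sketch, not part of the original text) makes the bias visible: dividing the sum of squared deviations by n systematically underestimates the population variance, while dividing by n - 1 does not.

import random

random.seed(1)
TRUE_VAR = 4.0                       # population: Normal with standard deviation 2
n, trials = 5, 20000
biased_sum = unbiased_sum = 0.0

for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)    # sum of squared deviations
    biased_sum += ss / n                         # divide by n
    unbiased_sum += ss / (n - 1)                 # divide by n - 1

print(round(biased_sum / trials, 2))     # noticeably below 4.0 (about 3.2)
print(round(unbiased_sum / trials, 2))   # close to 4.0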
Degrees of Freedom in ANOVA: You will see the key phrase "degrees of freedom" also appearing in Analysis of Variance (ANOVA) tables. If I tell you about 4 numbers, but
don't say what they are, the average could be anything. I
have 4 degrees of freedom in the data set. If I tell you 3 of
those numbers, and the average, you can guess the fourth
number. The data set, given the average, has 3 degrees of
freedom. If I tell you the average and the standard deviation of the numbers, I have given you 2 pieces of information, reducing the degrees of freedom from 4 to 2. You only need to know 2 of the numbers' values to deduce the other 2.
In an ANOVA table, degree of freedom (df) is the divisor in
SS/df which will result in an unbiased estimate of the
variance of a population.
df = N - k, where N is the sample size, and k is a small
number, equal to the number of "constraints", the number
of "bits of information" already "used up". Degree of
freedom is an additive quantity; total amounts of it can be
"partitioned" into various components.
For example, suppose we have a sample of size 13 and
calculate its mean, and then the deviations from the mean,
only 12 of the deviations are free to vary: once one has
found 12 of the deviations, the thirteenth one is determined.
Therefore, if one is estimating a population variance from a
sample, k = 1.
In bivariate correlation or regression situations, k = 2: the
calculation of the sample means of each variable "uses up"
two bits of information, leaving N - 2 independent bits of
information.
In a one-way analysis of variance (ANOVA) with g groups,
there are three ways of using the data to estimate the
population variance. If all the data are pooled, the
conventional SST/(n-1) would provide an estimate of the
population variance.
If the treatment groups are considered separately, the
sample means can also be considered as estimates of the
population mean, and thus SSb/(g - 1) can be used as an
estimate. The remaining ("within-group", "error") variance
can be estimated from SSw/(n - g). This example
demonstrates the partitioning of df: df total = n - 1 =
df(between) + df(within) = (g - 1) + (n - g).
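A brief Python sketch (added for illustration, with made-up group data) shows this partition of the sums of squares and of their degrees of freedom:

# Three made-up treatment groups (g = 3, n = 9 observations in total).
groups = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]
data = [x for grp in groups for x in grp]
n, g = len(data), len(groups)
grand_mean = sum(data) / n

sst = sum((x - grand_mean) ** 2 for x in data)                              # total SS
ssb = sum(len(grp) * (sum(grp) / len(grp) - grand_mean) ** 2 for grp in groups)
ssw = sum((x - sum(grp) / len(grp)) ** 2 for grp in groups for x in grp)

print(round(sst, 2), round(ssb + ssw, 2))    # SST equals SSb + SSw
print(n - 1, (g - 1) + (n - g))              # df total = df between + df within
print(ssb / (g - 1), ssw / (n - g))          # the two variance estimates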
Therefore, the simple 'working definition' of df is 'sample size minus the number of estimated parameters'. A fuller answer would have to explain why there are situations in which the degrees of freedom is not an integer. After all is said, the best explanation is mathematical: we use df to obtain an unbiased estimate.
In summary, the concept of degrees of freedom is used for
the following two different purposes:
- Parameter(s) of certain distributions, such as the F- and t-distributions, are called degrees of freedom. Therefore, degrees of freedom could be positive non-integer number(s).
- Degrees of freedom is used to obtain unbiased estimates for the population parameters.
Outlier Removal
Because of the potentially large variance, outliers could be
the outcome of sampling. It's perfectly correct to have such
an observation that legitimately belongs to the study group
by definition. Lognormally distributed data (such as
international exchange rate), for instance, will frequently
exhibit such values.
Therefore, you must be very careful and cautious: before
declaring an observation "an outlier," find out why and how
such observation occurred. It could even be an error at the
data entering stage.
First, construct the BoxPlot of your data. Form the Q1, Q2, and Q3 points, which divide the sample into four equally sized groups (Q2 = median). Let IQR = Q3 - Q1. Outliers are defined as those points outside the values Q3 + k*IQR and Q1 - k*IQR. For most cases one sets k = 1.5.
Another alternative is the following algorithm
a) Compute the mean and standard deviation (sigma) of the whole sample.
b) Define a set of limits off the mean: mean + k sigma and mean - k sigma. (Allow the user to enter k; a typical value for k is 2.)
c) Remove all sample values outside the limits.
Now, iterate N times through the algorithm, each time
replacing the sample set with the reduced samples after
applying step (c).
Usually we need to iterate through this algorithm 4 times.
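A Python sketch of this iterative trimming procedure (an added illustration; the function name and the early-exit rule are ours, and k is the user-chosen multiplier mentioned above) is:

import statistics

def trim_outliers(sample, k=2.0, passes=4):
    """Repeatedly drop values further than k standard deviations from the mean."""
    for _ in range(passes):
        mean = statistics.mean(sample)
        sd = statistics.stdev(sample)
        kept = [x for x in sample if mean - k * sd <= x <= mean + k * sd]
        if len(kept) == len(sample):    # nothing removed, stop early
            break
        sample = kept
    return sample

data = [9, 10, 11, 10, 9, 10, 11, 95]   # 95 is a suspicious value
print(trim_outliers(data))              # the extreme value is removed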
As mentioned earlier, a common "standard" is any observation falling beyond 1.5 interquartile ranges (1.5 IQRs) above the third quartile or below the first quartile. The following SPSS program helps you in determining the outliers.
$SPSS/OUTPUT=LIER.OUT
TITLE
'DETERMINING IF OUTLIERS EXIST'
DATA LIST
FREE FILE='A' / X1
VAR LABELS
X1 'INPUT DATA'
LIST CASE
CASE=10/VARIABLES=X1/
CONDESCRIPTIVE
X1(ZX1)
LIST CASE
CASE=10/VARIABLES=X1,ZX1/
SORT CASES BY ZX1(A)
LIST CASE
CASE=10/VARIABLES=X1,ZX1/
FINISH
Statistical Summaries
Representative of a Sample: Measures of Central
Tendency Summaries
How do you describe the "average" or "typical" piece of
information in a set of data? Different procedures are used
to summarize the most representative information
depending on the type of question asked and the nature of
the data being summarized.
Measures of location give information about the location of
the central tendency within a group of numbers. The
measures of location presented in this unit for ungrouped
(raw) data are the mean, the median, and the mode.
Mean: The arithmetic mean (or the average or simple
mean) is computed by summing all numbers in an array of
numbers (xi) and then dividing by the number of
observations (n) in the array.
The mean uses all of the observations, and each observation
affects the mean. Even though the mean is sensitive to
extreme values, i.e., extremely large or small data can
cause the mean to be pulled toward the extreme data, it is
still the most widely used measure of location. This is due to
the fact that the mean has valuable mathematical properties
that make it convenient for use with inferential statistical
analysis. For example, the sum of the deviations of the
numbers in a set of data from the mean is zero, and the
sum of the squared deviations of the numbers in a set of
data from the mean is the minimum value.
Weighted Mean: In some cases, the data in the sample or
population should not be weighted equally, rather each
value should be weighted according to its importance.
Median: The median is the middle value in
an ordered array of observations. If there is an even
number of observations in the array, the median is
the average of the two middle numbers. If there is an odd
number of data in the array, the median is
the middle number.
The median is often used to summarize the distribution of
an outcome. If the distribution is skewed, the median and
the IQR may be better than other measures to indicate
where the observed data are concentrated.
Generally, the median provides a better measure of location
than the mean when there are some extremely large or
small observations; i.e., when the data are skewed to the
right or to the left. For this reason, median income is used
as the measure of location for the U.S. household income.
Note that if the median is less than the mean, the data set
is skewed to the right. If the median is greater than the
mean, the data set is skewed to the left.
Mode: The mode is the most frequently occurring value in a
set of observations. Why use the mode? The classic example
is the shirt/shoe manufacturer who wants to decide what
sizes to introduce. Data may have two modes. In this case,
we say the data are bimodal, and sets of observations with
more than two modes are referred to as multimodal. Note
that the mode does not have important mathematical
properties for future use. Also, the mode is not a helpful
measure of location, because there can be more than one
mode or even no mode.
Whenever, more than one mode exist, then the population
from which the sample came is a mixture of more than one
population. Almost all standard statistical analyses assume
that the population is homogeneous, meaning that its
density is unimodal.
Notice that Excel is a very limited statistical package. For example, it displays only one mode, the first one found, which can be misleading. However, you may find out if there are others by inspection, as follows:
Create a frequency distribution, invoke the menu sequence:
Tools, Data analysis, Frequency and follow instructions on
the screen. You will see the frequency distribution and then
find the mode visually. Unfortunately, Excel does not draw a
Stem and Leaf diagram. Other commercial off-the-shelf packages, such as SAS and SPSS, display a Stem and Leaf diagram, which is a frequency distribution of a given data set.
Quartiles & Percentiles: Quantiles are values that
separate a ranked data set into four equal classes. Whereas
percentiles are values that separate a ranked the data into
100 equal classes. The widely used quartiles are the 25th,
50th, and 75th percentiles.
Selecting Among the Mean, Median, and Mode
It is a common mistake to specify the wrong index for central tendency.
The first consideration is the type of data: if the variable is categorical, the mode is the single measure that best describes the data.
The second consideration in selecting the index is to ask
whether the total of all observations is of any interest. If the
answer is yes, then the mean is the proper index of central
tendency.
If the total is of no interest, then depending on whether the
histogram is symmetric or skewed one must use either
mean or median, respectively.
In all cases the histogram must be unimodal.
Suppose that four people want to get together to play
poker. They live on 1st Street, 3rd Street, 7th Street, and
15th Street. They want to select a house that involves the
minimum amount of driving for all parties concerned.
Let's suppose that they decide to minimize the absolute
amount of driving. If they met at 1st Street, the amount of
driving would be 0 + 2 + 6 + 14 = 22 blocks. If they met at
3rd Street, the amount of driving would be 2 + 0+ 4 + 12 =
18 blocks. If they met at 7th Street, 6 + 4 + 0 + 8 = 18
blocks. Finally, at 15th Street, 14 + 12 + 8 + 0 = 34 blocks.
So the two houses that would minimize the amount of
driving would be 3rd or 7th Street. Actually, if they wanted a
neutral site, any place on 4th, 5th, or 6th Street would also
work.
Note that any value between 3 and 7 could be defined as
the median of 1, 3, 7, and 15. So the median is the value
that minimizes the absolute distance to the data points.
Now the person at 15th Street is upset at always having to do more driving. So the group agrees to consider a different rule. They decide to minimize the square of the distance driven. This is the least squares principle. By squaring, we
give more weight to a single very long commute than to a
bunch of shorter commutes. With this rule, the 7th Street
house (36 + 16 + 0 + 64 = 116 square blocks) is preferred
to the 3rd Street house (4 + 0 + 16 + 144 = 164 square
blocks). If you consider any location, and not just the houses themselves, then the location at 6.5 (between 6th and 7th Street) minimizes the sum of the squared distances driven.
Find the value of x that minimizes
(1 - x)² + (3 - x)² + (7 - x)² + (15 - x)².
The value that minimizes the sum of squared values is 6.5
which is also equal to the arithmetic mean of 1, 3, 7, and
15. With calculus, it's easy to show that this holds in
general.
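Both claims are easy to verify numerically; the following Python sketch (an added illustration) evaluates the two criteria over a grid of candidate locations:

streets = [1, 3, 7, 15]
candidates = [i / 10 for i in range(0, 201)]     # candidate locations 0.0 to 20.0

def abs_cost(x):
    # Total number of blocks driven if everyone meets at location x.
    return sum(abs(s - x) for s in streets)

def sq_cost(x):
    # Least-squares criterion: sum of squared driving distances.
    return sum((s - x) ** 2 for s in streets)

print(min(candidates, key=abs_cost))   # any value in [3, 7] works; 3.0 is returned first
print(min(candidates, key=sq_cost))    # 6.5, the arithmetic mean of 1, 3, 7, 15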
For moderately asymmetrical distributions the mode, median, and mean approximately satisfy the formula: mode = 3(median) - 2(mean).
Consider a small sample of scores with an even number of
cases, for example, 1, 2, 4, 7, 10, and 12. The median is
5.5, the midpoint of the interval between the scores of 4
and 7.
As we discussed above, it is true that the median is a point around which the sum of absolute deviations is minimized. In this example the sum of absolute deviations is 22.
However, it is not a unique point. Any point in the 4 to 7
region will have the same value of 22 for the sum of the
absolute deviations.
Indeed, medians are tricky. The 50%-50% (above-below) description is not quite correct. For example, the set 1, 1, 1, 1, 1, 1, 8 has no point with half the data strictly below it and half strictly above it. The convention says the median is 1; however, about 14% of the data lie strictly above it, while 100% of the data are greater than or equal to the median. This generalizes to other percentiles.
We will make use of this idea in regression analysis. In an
analogous argument, the regression line is a unique line
which minimizes the sum of the squared deviations from it.
There is no unique line which minimizes the sum of the
absolute deviations from it.
Quality of a Sample: Measures of Dispersion
Average by itself is not a good indication of quality. You
need to know the variance to make any educated
assessment. We are reminded of the dilemma of the six-foot
tall statistician who drowned in a stream that had an
average depth of three feet.
These are statistical procedures for describing the nature
and extent of differences among the information in the
distribution. A measure of variability is generally reported
with a measure of central tendency.
Statistical measures of variation are numerical values that
indicate the variability inherent in a set of data
measurements. Note that a small value for a measure of
dispersion indicates that the data are concentrated around
the mean; therefore, the mean is a good representative of
the data set. On the other hand, a large measure of
dispersion indicates that the mean is not a good
representative of the data set. Also, measures of dispersion
can be used when we want to compare the distributions of
two or more sets of data. Quality of a data set is measured
by its variability: Larger variability indicates lower
quality. That is why high variation makes the manager very
worried. Your job, as a statistician is to measure the
variation, and if it is too high and unacceptable, then it is
the job of the technical staff, such as engineers, to fix the
process.
The decision situations with flat uncertainty have the
largest risk. For simplicity, consider the case when there are
only two outcomes one with probability of p. Then, the
variation in the outcomes is p(1-p). This variation is the
largest if we set p = 50%. That is, equal chance for each
outcome. In such a case, the quality of information is at its
lowest level. Remember, quality of information and
variation are inversely related. The larger the variation in
the data, the lower the quality of the data (i.e.,
information). Remember that the Devil is in the
Deviations.
The four most common measures of variation are
the range, variance, standard deviation,
and coefficient of variation.
Range: The range of a set of observations is the absolute
value of the difference between the largest and smallest
values in the set. It measures the size of the smallest
contiguous interval of real numbers that encompasses all of
the data values. It is not useful when extreme values are
present. It is based solely on two values, not on the entire
data set. In addition, it cannot be defined for open-ended
distributions such as Normal distribution.
The Normal distribution does not have a finite range. A student might argue: "since the tails of the normal density function never touch the x-axis, for an observation to contribute to forming such a curve, very large positive and negative values must exist." Indeed, such remote values are always possible, but increasingly improbable. This captures the asymptotic behavior of the normal density very well.
Variance: An important measure of variability is variance.
Variance is the average of the squared deviations of each
observation in the set from the arithmetic mean of all of
observations.
Variance = Σ(xi - x̄)² / (n - 1),   n ≥ 2.
The variance is a measure of spread or dispersion among
values in a data set. Therefore, the greater the variance, the
lower the quality.
The variance is not expressed in the same units as the
observations. In other words, the variance is hard to
understand because the deviations from the mean are
squared, making it too large for logical explanation. This
problem can be solved by working with the square root of
the variance, which is called the standard deviation.
Standard Deviation: Both variance and standard deviation
provide the same information; one can always be
obtained from the other. In other words, the process of
computing a standard deviation always involves computing a
variance. Since standard deviation is the square root of the
variance, it is always expressed in the same units as the
raw data:
For a large data set (more than 30, say), approximately 68% of the data will fall within one standard deviation of the mean, 95% fall within two standard deviations, and 99.7% (almost 100%) fall within three standard deviations (S) of the mean.
Standard Error: The standard error is a statistic indicating the accuracy of an estimate. That is, it tells us how different the estimate (such as the sample mean x̄) is likely to be from the population parameter (such as the population mean μ). It is therefore the standard deviation of the sampling distribution of the estimator, e.g., of the sample means x̄.
Coefficient of Variation: The Coefficient of Variation (CV) is the relative deviation with respect to size, i.e., the standard deviation expressed relative to the mean:
CV = S / x̄ (often multiplied by 100 to give a percentage)
CV is independent of the unit of measurement. In estimation
of a parameter when CV is less than say 10%, the estimate
is assumed acceptable. The inverse of CV; namely 1/CV is
called the Signal-to-noise Ratio.
The coefficient of variation is used to represent the
relationship of the standard deviation to the mean, telling
how much representative the mean is of the numbers from
which it came. It expresses the standard deviation as a
percentage of the mean; i.e., it reflects the variation in a
distribution relative to the mean.
Z Score: how many standard deviations a given point (i.e.
observations) is above or below the mean. In other words,
a Z score represents the number of standard deviations an
observation (x) is above or below the mean. The larger the
Z value, the further away a value will be from the mean.
Note that values beyond three standard deviations are very
unlikely. Note that if a Z score is negative, the observation
(x) is below the mean. If the Z score is positive, the
observation (x) is above the mean. The Z score is found as:
Z = (x - mean of X) / standard deviation of X
The Z score is a measure of the number of standard
deviations that an observation is above or below the mean.
Since the standard deviation is never negative, a positive Z
score indicates that the observation is above the mean, a
negative Z score indicates that the observation is below the
mean. Note that Z is a dimensionless value, and is therefore
a useful measure by which to compare data values from two
different populations even those measured by different
units.
Z-Transformation: Applying the formula z = (X - μ) / σ will always produce a transformed variable with a mean
of zero and a standard deviation of one. However, the shape
of the distribution will not be affected by the transformation.
If X is not normal then the transformed distribution will not
be normal either. In the following SPSS command variable x
is transformed to zx.
descriptives variables=x(zx)
You have heard the terms z value, z test, z transformation,
and z score. Do all of these terms mean the same thing?
Certainly not:
The z value refers to the critical value (a point on the horizontal axis) of the Normal(0, 1) density function, for a given area to the left of that z-value.
The z test refers to the procedures for testing the equality of the mean(s) of one (or two) population(s).
z score of a given observation x in a sample of size n, is
simply (x - average of the sample) divided by the standard
deviation of the sample.
The z transformation of a set of observations of size n is
simply (each observation - average of all observation)
divided by the standard deviation among all observations.
The aim is to produce a transformed data set with a mean of
zero and a standard deviation of one. This makes the
transformed set dimensionless and manageable with respect
to its magnitudes. It is also used in comparing several data sets measured using different scales of measurement.
Pearson coined the term "standard deviation" sometime
near 1900. The idea of using squared deviations goes back
to Laplace in the early 1800's.
Finally, notice again that transforming raw scores to z scores does NOT normalize the data.
Guess a Distribution to Fit Your Data: Skewness &
Kurtosis
A pair of statistical measures skewness and kurtosis is a
measuring tool which is used in selecting a distribution(s) to
fit your data. To make an inference with respect to the
population distribution, you may first compute skewness and
kurtosis from your random sample from the entire
population. Then, locating a point with these coordinates on
some widely used Skewness-Kurtosis Charts (available
from your instructor upon request), guess a couple of
possible distributions to fit your data. Finally, you might use
the goodness-of-fit test to rigorously come up with the best
candidate fitting your data. Removing outliers improves both
skewness and kurtosis.
Skewness: Skewness is a measure of the degree to which
the sample population deviates from symmetry with the
mean at the center.
Skewness = Σ(xi - x̄)³ / [ (n - 1) S³ ],   n ≥ 2.
Skewness will take on a value of zero when the distribution
is a symmetrical curve. A positive value indicates the
observations are clustered more to the left of the mean with
most of the extreme values to the right of the mean. A
negative skewness indicates clustering to the right. In this
case we have: Mean ≤ Median ≤ Mode. The reverse order holds for observations with positive skewness.
Kurtosis: Kurtosis is a measure of the relative peakedness
of the curve defined by the distribution of the observations.
Kurtosis = Σ(xi - x̄)⁴ / [ (n - 1) S⁴ ],   n ≥ 2.
Standard normal distribution has kurtosis of +3. A kurtosis
larger than 3 indicates the distribution is more peaked than
the standard normal distribution.
Coefficient of Excess Kurtosis = Kurtosis - 3.
A less than 3 kurtosis value means that the distribution is
flatter than the standard normal distribution.
Skewness and kurtosis can be used to check for normality via the Jarque-Bera test. For large n, under the normality condition the quantity
n [ Skewness² / 6 + (Kurtosis - 3)² / 24 ]
follows a chi-square distribution with d.f. = 2.
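As an added sketch (assuming skewness and kurtosis have already been computed as defined above), the Jarque-Bera statistic can be coded as:

def jarque_bera(skewness, kurtosis, n):
    """Jarque-Bera statistic; compare with a chi-square with 2 d.f. (5% cut-off 5.99)."""
    return n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)

# Example: a large sample whose shape is close to normal gives a small statistic.
print(jarque_bera(0.1, 3.2, 500))    # about 1.67, well below 5.99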
Further Reading:
Tabachnick B., and L. Fidell, Using Multivariate Statistics,
HarperCollins, 1996. Has a good discussion on applications
and significance tests for skewness and kurtosis.
Numerical Example & Discussions
A Numerical Example: Given the following, small (n = 4)
data set, compute the descriptive statistics: x1 = 1, x2 = 2,
x3 = 3, and x4 = 6.
i     xi    (xi - x̄)    (xi - x̄)²    (xi - x̄)³    (xi - x̄)⁴
1     1     -2           4             -8            16
2     2     -1           1             -1            1
3     3      0           0              0             0
4     6      3           9             27            81
Sum   12     0           14            18            98
The mean is 12 / 4 = 3, the variance is s² = 14 / 3 = 4.67, the standard deviation is s = (14/3)^0.5 = 2.16, the skewness is 18 / [3 (2.16)³] = 0.5952, and finally, the kurtosis is 98 / [3 (2.16)⁴] = 1.5.
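The small Python sketch below (added for checking; it follows the formulas used in this section, which divide by n - 1 rather than n) reproduces these figures:

data = [1, 2, 3, 6]
n = len(data)
mean = sum(data) / n                               # 3.0
m2 = sum((x - mean) ** 2 for x in data)            # 14
m3 = sum((x - mean) ** 3 for x in data)            # 18
m4 = sum((x - mean) ** 4 for x in data)            # 98

variance = m2 / (n - 1)                            # 4.67
s = variance ** 0.5                                # 2.16
skewness = m3 / ((n - 1) * s ** 3)                 # 0.595
kurtosis = m4 / ((n - 1) * s ** 4)                 # 1.5
print(mean, round(variance, 2), round(s, 2), round(skewness, 3), round(kurtosis, 2))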
A Short Discussion
Deviations about the mean of a distribution is the basis
for most of the statistical tests we will learn. Since we are
measuring how much a set of scores is dispersed about the
mean , we are measuring variability. We can calculate
the deviations about the mean and express it as
variance 2or standard deviation . It is very important to
have a firm grasp of this concept because it will be a
central concept throughout your statistics course.
Both variance 2 and standard deviation  measure
variability within a distribution. Standard deviation  is a
number that indicates how much on average each of the
values in the distribution deviates from the mean (or
center) of the distribution. Keep in mind that
variance 2 measures the same thing as standard
deviation  (dispersion of scores in a distribution).
Variance 2, however, is the average squared deviations
about the mean. Thus, variance 2 is the square of the
standard deviation .
The expected value and variance of x̄ are μ and σ²/n, respectively.
The expected value and variance of S² are σ² and 2σ⁴ / (n - 1), respectively.
x̄ and S² are the best estimators for μ and σ². They are Unbiased (you may update your estimate); Efficient (they have the smallest variation among other estimators); Consistent (increasing sample size provides a better estimate); and Sufficient (you do not need to have the whole data set; what you need are Σxi and Σxi² for the estimations). Note also that the above variance of S² is justified only in the case where the population distribution tends to be normal; otherwise one may use bootstrapping techniques.
In general, it is believed that the pattern of mode, median, and mean goes from lower to higher in positively skewed data sets, and follows just the opposite pattern in negatively skewed data sets. However, for example, the following 23 numbers have mean = 2.87 and median = 3, yet the data are positively skewed:
4 2 7 6 4 3 5 3 1 3 1 2 4 3 1 2 1 1 5 2 2 3 1
and the following 10 numbers have mean = median = mode = 4, but the data set is left skewed:
1 2 3 4 4 4 5 5 6 6
Note also that most commercial software do not correctly compute skewness and kurtosis. There is no easy way to
determine confidence intervals about a computed skewness
or kurtosis value from a small to medium sample. The
literature gives tables based on asymptotic methods for
sample sets larger than 100 for normal distributions only.
You may have noticed that using the above numerical
example on some computer packages such as SPSS, the
skewness and the kurtosis are different from what we have
computed. For example, the SPSS output for the skewness
is 1.190. However, for a large sample size n, the results are
identical.
Reference and Further Readings:
David H., Early Sample Measures of Variability, Statistical
Science, 13, 1998, 368-377. This article provides a good
historical accounts of statistical measures.
Groeneveld R., A class of quantile measures for
kurtosis, The American Statistician, 325, Nov. 1998.
Hosking J., Moments or L moments? An example comparing two measures of distributional shape, The American Statistician, Vol. 46, 186-189, 1992.
Parameters' Estimation and Quality of a 'Good'
Estimate
Estimation is the process by which sample data are used to
indicate the value of an unknown quantity in a population.
Results of estimation can be expressed as a single value,
known as a point estimate; or a range of values, known as a
confidence interval.
Whenever we use point estimation, we calculate the margin of error associated with that point estimate. For example, for the estimation of the population mean μ, the margin of error is calculated as: ± 1.96 SE(x̄).
In newspapers and television reports on public opinion
polls, the margin of error is the margin of "sampling error".
There are many nonsampling errors that can and do affect
the accuracy of polls. Here we talk about sampling error.
Because subgroups have larger sampling error than the whole sample, one must include the following statement:
error include but are not limited to, individuals refusing to
participate in the interview and inability to connect with the
selected number. Every feasible effort is made to obtain a
response and reduce the error, but the reader (or the
viewer) should be aware that some error is inherent in all
research."
To estimate means to esteem (to give value to). An
estimator is any quantity calculated from the sample data
which is used to give information about an unknown
quantity in the population. For example, the sample mean is
an estimator of the population mean .
Estimators of population parameters are sometimes distinguished from the true value by using the symbol 'hat'. For example, the true population standard deviation σ is estimated by σ̂, the sample-based estimate of the population standard deviation.
Example: The usual estimator of the population mean is x̄ = Σxi / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample. If the value of the estimator in a particular sample is found to be 5, then 5 is the estimate of the population mean µ.
A "Good" estimator is the one which provides an estimate
with the following qualities:
Unbiasedness: An estimate is said to be an unbiased
estimate of a given parameter when the expected value of the estimator can be shown to be equal to the parameter
being estimated. For example, the mean of a sample is an
unbiased estimate of the mean of the population from which
the sample was drawn. Unbiasedness is a good quality for
an estimate since in such a case, using weighted average of
several estimates provides a better estimate than each one
of those estimates. Therefore, unbiasedness allows us to
upgrade our estimates. For example, if your estimates of the population mean µ are, say, 10 and 11.2 from two independent samples of sizes 20 and 30 respectively, then the estimate of the population mean µ based on both samples is the weighted average [20 (10) + 30 (11.2)] / (20 + 30) = 10.72.
Consistency: The standard deviation of an estimate is
called the standard error of that estimate. A larger standard error means more error in your estimate. It is a
commonly used index of the error entailed in estimating a
population parameter based on the information in a random
sample of size n from the entire population.
An estimator is said to be "consistent" if increasing the
sample size produces an estimate with smaller standard
error. Therefore, your estimate is "consistent" with the
sample size. That is, spending more money (to obtain a
larger sample) produces a better estimate.
Efficiency: An efficient estimate is the one which has the
smallest standard error among all other estimators of equal
size.
Sufficiency: A sufficient estimator based on a statistic
contains all the information which is present in the raw data.
For example, the sum of your data (together with the sample size) is sufficient to estimate the mean of the population. You don't have to know the data set itself. This saves a lot of money if the data have to be transmitted over a telecommunication network: simply send out the total and the sample size.
A sufficient statistic t for a parameter θ is a function of the sample data x1,...,xn which contains all the information in the sample about the parameter θ. More formally, sufficiency is defined in terms of the likelihood function for θ. For a sufficient statistic t, the likelihood L(x1,...,xn | θ) can be written as
g(t | θ) * k(x1,...,xn)
Since the second term does not depend on θ, t is said to be a sufficient statistic for θ.
Another way of stating this for the usual problems is that
one could construct a random process starting from the
sufficient statistic, which will have exactly the same
distribution as the full sample for all states of nature.
To illustrate, let the observations be independent Bernoulli
trials with the same probability of success. Suppose that
there are n trials, and that person A observes which
observations are successes, and person B only finds out the
number of successes. Then if B places these successes at
random points without replication, the probability that B will
now get any given set of successes is exactly the same as
the probability that A will see that set, no matter what the
true probability of success happens to be.
The widely used estimator of the population mean µ is x̄ = Σxi / n, where n is the size of the sample and x1, x2, x3, ..., xn are the values of the sample; it has all of the above properties. Therefore, it is a "good" estimator.
If you want an estimate of central tendency as a parameter
for a test or for comparison, then small sample sizes are
unlikely to yield any stable estimate. The mean is sensible in
a symmetrical distribution, as a measure of central
tendency, but, e.g., with ten cases you will not be able to
judge whether you have a symmetrical distribution.
However, the mean estimate is useful if you are trying to
estimate a population sum, or some other function of
the expected value of the distribution. Would the median be
a better measure? In some distributions (e.g., shirt size) the
mode may be better. Box-plot will indicate outliers in the
data set. If there are outliers, median is better than mean
as a measure of the central tendency.
If you have a yes/no question you probably want to
calculate a proportion p of yeses (or noes). Under simple
random sampling, the variance of p is p(1-p)/n, ignoring the
finite population correction. Now a 95% confidence interval is p ± 1.96 [p(1-p)/n]^(1/2). A conservative interval can be calculated assuming p(1-p) takes its maximum value, which it does when p = 1/2. Replace 1.96 by 2, put p = 1/2, and you have a 95% confidence interval of p ± 1/n^(1/2). This approximation
works well as long as p is not too close to 0 or 1. This useful
approximation allows you to calculate approximate 95%
confidence intervals.
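A minimal Python sketch of these calculations (added here for illustration; the counts in the example call are made up) is:

import math

def proportion_ci(yeses, n, z=1.96):
    """Approximate 95% confidence interval for a proportion under simple random sampling."""
    p = yeses / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

def conservative_margin(n):
    """Conservative margin of error: 1/sqrt(n), obtained with z = 2 and p = 1/2."""
    return 1 / math.sqrt(n)

print(proportion_ci(520, 1000))      # roughly (0.489, 0.551)
print(conservative_margin(1000))     # about 0.032, i.e. +/- 3.2 percentage points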
Conditions Under Which Most Statistical Testing Apply
Don't just learn formulas and number-crunching: learn
about the conditions under which statistical testing
procedures apply. The following conditions are common to
almost all tests:
1. homogeneous population (see if there are more than
one mode)
2. sample must be random (to test this, perform the Runs
Test).
3. In addition to requirement No. 1, each population has
a normal distribution (perform Test for Normality)
4. homogeneity of variances. Variation in each population
is almost the same as in the others.
For 2 populations use the F-test. For 3 or more
populations, there is a practical rule known as the
"Rule of 2". In this rule one divides the highest
variance of a sample to the lowest variance of the
other sample. Given that the sample sizes are almost
the same, and the value of this division is less than 2,
then, the variations of the populations are almost the
same.
Notice: This important condition in analysis of
variance (ANOVA and the t-test for mean differences)
is commonly tested by the Levene or its modified test
known as the Brown-Forsythe test. Unfortunately, both
tests rely on the homogeneity of variances assumption!
These assumptions are crucial, not for the
method/computation, but for the testing using the resultant
statistic. Otherwise, we can do, for example, ANOVA and
regression without any assumptions, and the numbers come
out the same -- simple computations give us least-square
fits, partitions of variance, regression coefficients, and so
on. It is only for the testing that we need certain assumptions about independence and the homogeneous distribution of the error terms, known as residuals.
Homogeneous Population
Homogeneous Population: A homogeneous population is a
statistical population which has a unique mode. To
determine if a given population is homogeneous or not,
construct the histogram of a random sample from the entire
population. If there is more than one mode, then you have a mixture of populations. Know that to perform any statistical
testing, you need to make sure you are dealing with
homogeneous population.
Test for Randomness: The Runs Test
A "run" is a maximal subsequence of like elements.
Consider the following sequence (D for Defective, N for nondefective items) out of a production line: DDDNNDNDNDDD.
Number of runs is R = 7, with n1 = 8, and n2 = 4 which are
number of D's and N's (whichever).
A sequence is random if it is neither "over-mixed" nor "under-mixed". An example of an over-mixed sequence is DDDNDNDNDNDD, with R = 9, while an under-mixed one looks like DDDDDDDDNNNN, with R = 2. Thus the above sequence seems to be random.
The Runs Test, which is also known as the Wald-Wolfowitz Test, is designed to test the randomness of a given sample at the 100(1 - α)% confidence level. To conduct a runs test on a sample, perform the following steps:
Step 1: compute the mean of the sample.
Step 2: going through the sample sequence, replace any observation with + or - depending on whether it is above or below the mean. Discard any ties.
Step 3: compute R, n1, and n2.
Step 4: compute the expected mean and variance of R, as follows:
μ = 1 + 2 n1 n2 / (n1 + n2)
σ² = 2 n1 n2 (2 n1 n2 - n1 - n2) / [ (n1 + n2)² (n1 + n2 - 1) ]
Step 5: Compute z = (R - μ) / σ.
Step 6: Conclusion:
If z ≥ Zα, then there might be cyclic, seasonal behavior (over-mixing).
If z ≤ -Zα, then there might be a trend (under-mixing).
If z ≤ -Zα or z ≥ Zα, reject the randomness.
Note: This test is valid for cases for which both n1 and
n2 are large, say greater than 10. For small sample sizes
special tables must be used.
The SPSS command for the runs test:
NPAR TEST RUNS(MEAN) X (the name of the variable).
For example, suppose for a given sample of size 50, we have R = 24, n1 = 14 and n2 = 36. Test for randomness at α = 0.05.
Plugging these into the above formulas, we have μ = 1 + 2(14)(36)/50 = 21.16, σ = 2.81, and z = (24 - 21.16)/2.81 ≈ 1.01. From the Z-table, Zα = 1.645. Since -1.645 < z < 1.645, we cannot reject the randomness of the sample at this level.
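A short Python sketch of Steps 4 and 5 (an added illustration mirroring the formulas above) is:

import math

def runs_test_z(R, n1, n2):
    """z statistic of the Wald-Wolfowitz runs test for a given run count and group sizes."""
    n = n1 + n2
    mu = 1 + 2 * n1 * n2 / n
    var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / (n ** 2 * (n - 1))
    return (R - mu) / math.sqrt(var)

print(round(runs_test_z(24, 14, 36), 2))   # about 1.01, inside (-1.645, 1.645)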
Visit the Web site Test for Randomness
Lilliefors Test for Normality
The following SPSS program computes the Kolmogorov-Smirnov-Lilliefors statistic, called LS. It can easily be converted and run on any other platform.
$SPSS/OUTPUT=L.OUT
TITLE
'K-S LILLIEFORS TEST FOR NORMALITY'
DATA LIST
FREE FILE='L.DAT'/X
VAR LABELS
X 'SAMPLE VALUES'
LIST CASE
CASE=20/VARIABLES=ALL
CONDESCRIPTIVE X(ZX)
LIST CASE CASE=20/VARIABLES=X ZX/
SORT CASES BY ZX(A)
RANK VARIABLES=ZX/RFRACTION INTO CRANK/TIES=HIGH
COMPUTE Y=CDFNORM(ZX)
COMPUTE SPROB=CRANK
COMPUTE DA=Y-SPROB
COMPUTE DB=Y-LAG(SPROB,1)
COMPUTE DAABS=ABS(DA)
COMPUTE DBABS=ABS(DB)
COMPUTE LS=MAX(DAABS,DBABS)
LIST VARIABLES=X,ZX,Y,SPROB,DA,DB
LIST VARIABLES=LS
SORT CASES BY LS(D)
LIST CASES CASE=1/VARIABLES=LS
FINISH
The output is the statistic LS, which should be compared
with the following critical values after setting a significance
level  (as a function of the sample size n).
Critical Values for the Lilliefors Test
Significance Level      Critical Value
α = 0.15                0.775 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
α = 0.10                0.819 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
α = 0.05                0.895 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
α = 0.025               0.995 / ( n^(1/2) - 0.01 + 0.85 n^(-1/2) )
A normal probability plot will also help you detect a systematic departure from normality, which shows up as a curve. In SAS, do a PROC UNIVARIATE NORMAL PLOT. The Bera-Jarque test, which is widely used by econometricians, might also be applicable.
Further Reading
Statistical inference by normal probability paper, by T.
Takahashi, Computers & Industrial Engineering, Vol. 37, Iss.
1 - 2, pp 121-124, 1999.
Bonferroni Method
One may combine several t-tests by using the Bonferroni
method. It works reasonably well when there are only a few
tests, but as the number of comparisons increases above 8,
the value of 't' required to conclude that a difference exists
becomes much larger than it really needs to be and the
method becomes over conservative.
One way to make the Bonferroni t test less conservative is
to use the estimate of the population variance computed
from within the groups in the analysis of variance.
t = ( x̄1 - x̄2 ) / ( VW/n1 + VW/n2 )^(1/2)
where VW is the population variance computed from within the groups.
Chi-square Tests
The Chi-square is a distribution, as is the Normal and
others. The Normal (or Gaussian or bell-shaped) often
occurs naturally in real life. When we know the mean and
variance of a Normal then it allows us to find probabilities.
So if, for example, you knew some things about the average
height of women in the nation (including the fact that heights are distributed normally), you could measure all the
women in your extended family, find the average height,
and determine a probability associated with your result; if
the probability of getting your result, given your knowledge
of women nationwide, is high, then your family's female
height cannot be said to be different from average. If that
probability is low, then your result is rare (given the
knowledge about women nationwide), and you can say your
family is different. You've just completed a test of the
hypothesis that the average height of women in your family
is different from the overall average.
There are other (similar) tests where finding that probability
means NOT using the Normal distribution. One of these is a Chi-square test. For instance, if you tested the variance of your
family's female heights (which is analogous to your previous
test of the mean), you can't assume that the normal
distribution is appropriate to use. This should make sense,
since the Normal is bell-shaped, and variances have a lower
limit of zero. So, while a variance could be any huge
number, it gets bounded on the low side by zero. If you
were to test whether the variance of heights in your family
is different from the nation, a Chi-square test happens to be
appropriate, given our original above conditions. The
formula and procedure is in your textbook.
Crosstables: The variance is not the only thing for which you use a Chi-square test. Often it is used to test the relationship between two categorical variables, or the independence of two variables, such as cigarette smoking
and drug use. If you were to survey 1000 people on whether
or not they smoke and whether or not they use drugs, you
will get one of four answers: (no,no) (no,yes) (yes,no)
(yes,yes).
By compiling the number of people in each category, you
can ultimately test whether drug usage is independent of
cigarette smoking by using the Chi-square distribution (this
is approximate, but works well). Again, the methodology for
this is in your textbook. The degrees of freedom is equal to
(number of rows-1)(number of columns -1). That is, these
many figures needed to fill in the entire body of the
crosstable, the rest will be determined by using the rows
and columns sum figures.
Don't forget the conditions for the validity of the Chi-square
test, which require expected values greater than 5 in 80% or
more of the cells. Otherwise, one could use an "exact" test,
using either a permutation or resampling approach. Both SPSS
and SAS are capable of doing this test.
For a 2-by-2 table, you should use the Yates correction to
the chi-square. The Chi-square distribution is used as an
approximation to the binomial distribution; by applying a
continuity correction we get a better approximation to the
binomial distribution for the purposes of calculating tail
probabilities.
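A minimal sketch of the independence test in Python, assuming SciPy and a made-up 2x2 crosstable; scipy.stats.chi2_contingency applies the Yates continuity correction to 2x2 tables by default:

import numpy as np
from scipy import stats

# rows: smoker yes/no, columns: drug use yes/no (made-up counts)
table = np.array([[80, 170],
                  [40, 710]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=True)
print(chi2, p, dof)
print("all expected counts > 5:", (expected > 5).all())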
Use a relative risk measure such as the risk ratio or odds
ratio. In the 2-by-2 table with cells:
a   b
c   d
the most usual measures are:
Rate difference: a/(a+c) - b/(b+d)
Rate ratio: (a/(a+c)) / (b/(b+d))
Odds ratio: ad/bc
The rate difference and rate ratio are appropriate when you
are contrasting two groups, whose sizes (a+c and b+d) are
given. The odds ratio is for when the issue is association
rather than difference. Confidence interval methods are
available for all of these, though not as widely available in
software as they should be. If the hypothesis test is highly
significant, the confidence interval will be well away from
the null hypothesis value (0 for the rate difference, 1 for the
rate ratio or odds ratio).
The risk ratio is the ratio of the proportion (a/(a+b)) to the
proportion (c/(c+d)):
RR = (a / (a + b)) / (c / (c + d))
RR is thus a measure of how much larger the proportion in
the first row is compared to the second row. It ranges from
0 to infinity, with RR < 1.00 indicating a 'negative' association
[a/(a+b) < c/(c+d)], RR = 1.00 indicating no association [a/(a+b)
= c/(c+d)], and RR > 1.00 indicating a 'positive' association
[a/(a+b) > c/(c+d)]. The further from 1.00, the stronger the
association. Most stats packages will calculate the RR and
confidence intervals for you. A related measure is the odds
ratio (or cross product ratio) which is (a/b)/(c/d).
You could also look at the φ (phi) statistic, which is:
φ = (χ²/N)½
where χ² is the Pearson chi-square and N is the sample
size. This statistic ranges between 0 and 1 and can be
interpreted like the correlation coefficient.
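A short sketch, with hypothetical cell counts, of the measures just listed; the chi-square used for φ here is the uncorrected Pearson statistic for a 2x2 table:

import math

a, b, c, d = 80, 170, 40, 710          # cells of the 2x2 table, as laid out above
n = a + b + c + d

rate_difference = a / (a + c) - b / (b + d)
rate_ratio = (a / (a + c)) / (b / (b + d))
odds_ratio = (a * d) / (b * c)
risk_ratio = (a / (a + b)) / (c / (c + d))

chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
phi = math.sqrt(chi2 / n)              # interpretable like a correlation coefficient

print(rate_difference, rate_ratio, odds_ratio, risk_ratio, phi)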
Visit Critical Values for the Chi-square Distribution.
Visit also the Web sites Exact Unconditional Tests, and
Statistical tests.
Reference:
Fleiss J., Statistical Methods for Rates and Proportions,
Wiley, 1981.
Goodness-of-fit Test for Discrete Random Variables
There are other tests which use the Chi-square, such
as the goodness-of-fit test for discrete random variables. Again,
don't forget the conditions for the validity of the Chi-square test,
which require expected values greater than 5 in 80% or more
of the cells. Chi-square here is a statistical test that
measures "goodness-of-fit"; in other words, it measures
how much the observed or actual frequencies differ from the
expected or predicted frequencies. Using a Chi-square table
will enable you to discover how significant the difference is.
A null hypothesis in the context of the Chi-square test is the
model that you use to calculate your expected or predicted
values. If the value you get from calculating the Chi-square
statistic is sufficiently high (as compared to the values in the
Chi-square table) it tells you that your null hypothesis is
probably wrong.
Let Y1, Y2, ..., Yn be a set of independent and identically
distributed random variables. Assume that the probability
distribution of the Yi's has the density function fo(y). We can
divide the set of all possible values of Yi, i ∈ {1, 2, ..., n},
into m non-overlapping intervals D1, D2, ..., Dm. Define the
probability values p1, p2, ..., pm as:
p1 = P(Yi ∈ D1)
p2 = P(Yi ∈ D2)
:
:
pm = P(Yi ∈ Dm)
Since the union of the mutually exclusive intervals D1, D2,
..., Dm is the set of all possible values for the Yi's,
(p1 + p2 + ... + pm) = 1. Define the set of discrete random
variables X1, X2, ..., Xm, where
X1 = number of Yi's whose value ∈ D1
X2 = number of Yi's whose value ∈ D2
:
:
Xm = number of Yi's whose value ∈ Dm
and (X1 + X2 + ... + Xm) = n. Then the set of discrete
random variables X1, X2, ..., Xm will have a multinomial
probability distribution with parameters n and the set of
probabilities {p1, p2, ..., pm}. If the intervals D1, D2, ...,
Dm are chosen such that npi ≥ 5 for i = 1, 2, ..., m, then
C = Σ (Xi - npi)² / npi, where the sum is over i = 1, 2, ..., m,
is distributed as χ² with m-1 degrees of freedom.
For the goodness-of-fit sample test, we formulate the null
and alternative hypotheses as
Ho: fY(y) = fo(y)
H1: fY(y) ≠ fo(y)
At the α level of significance, Ho will be rejected in favor of
H1 if
C = Σ (Xi - npi)² / npi is greater than the critical value of the
χ² distribution with m-1 degrees of freedom.
However, it is possible that in a goodness-of-fit test, one or
more of the parameters of fo(y) are unknown. Then the
probability values p1, p2, ..., pm will have to be estimated by
assuming that Ho is true and calculating their estimated
values from the sample data. That is, another set of
probability values p'1, p'2, ..., p'm will need to be computed
so that the values (np'1, np'2, ..., np'm) are the estimated
expected values of the multinomial random variable (X1, X2,
..., Xm). In this case, the random variable C will still have a
chi-square distribution, but its degrees of freedom will be
reduced. In particular, if the density function fo(y)
has r unknown parameters,
C = Σ (Xi - np'i)² / np'i is distributed as χ² with m-1-r
degrees of freedom.
For this goodness-of-fit test, we formulate the null and
alternative hypotheses as
Ho: fY(y) = fo(y)
H1: fY(y) ≠ fo(y)
At the α level of significance, Ho will be rejected in favor of
H1 if C is greater than the critical value of the χ² distribution
with m-1-r degrees of freedom.
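As an illustration, here is a minimal Python sketch (using SciPy, with made-up die-roll counts) of the statistic C compared against the chi-square distribution; if r parameters of fo(y) had been estimated from the data, the ddof argument would reduce the degrees of freedom accordingly:

import numpy as np
from scipy import stats

observed = np.array([18, 22, 16, 25, 21, 18])     # made-up counts for a six-sided die
expected = np.full(6, observed.sum() / 6)         # fair-die model: n * p_i = 20 each

C, p_value = stats.chisquare(observed, expected)  # C = sum (X_i - n p_i)^2 / (n p_i)
print(C, p_value)                                 # compared with chi-square on m-1 = 5 df

# With r estimated parameters: stats.chisquare(observed, expected, ddof=r)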
Using chi-square in a 2x2 table requires Yates's
correction. One first subtracts 0.5 from the absolute
differences between observed and expected frequencies for
each cell before squaring, dividing by the
expected frequency, and summing. The formula for the chi-square
value in a 2x2 table can be derived from the Normal
Theory comparison of the two proportions in the table, using
the total incidence to produce the standard errors. The
rationale of the correction is a better equivalence between the
area under the normal curve and the probabilities obtained from
the discrete frequencies. In other words, the simplest
correction is to move the cut-off point for the continuous
distribution from the observed value of the discrete
distribution to midway between that and the next value in
the direction of the null hypothesis expectation. Therefore,
the correction essentially applies only to 1-df tests, where
the "square root" of the chi-square looks like a "normal/t-test"
statistic and where a direction can be attached to the 0.5
addition.
For more, visit the Web sites Chi-Square Lesson, and Exact
Unconditional Tests.
Statistics with Confidence
In practice, a confidence interval is used to express the
uncertainty in a quantity being estimated. There is
uncertainty because inferences are based on a random
sample of finite size from the entire population or process of
interest. To judge the statistical procedure we can ask what
would happen if we were to repeat the same study, over and
over, getting different data (and thus different confidence
intervals) each time.
In most studies investigators are usually interested in
determining the size of difference of a measured outcome
between groups, rather than a simple indication of whether
or not it is statistically significant. Confidence intervals
present a range of values, on the basis of the sample data,
in which the population value for such a difference may lie.
Know that a confidence interval computed from one sample
will be different from a confidence interval computed from
another sample.
Understand the relationship between sample size and width
of confidence interval.
Know that sometimes the computed confidence interval
does not contain the true mean value (that is, it is
incorrect) and understand how this coverage rate is related
to confidence level.
Just a word of interpretive caution. Let's say you compute a
95% confidence interval for a mean μ. The way to interpret
this is to imagine an infinite number of samples from the
same population; 95% of the computed intervals will contain
the population mean μ. However, it is wrong to state, "I am
95% confident that the population mean falls within the
interval."
Again, the usual definition of a 95% confidence interval is an
interval constructed by a process such that the interval will
contain the true value 95% of the time. This means that
"95%" is a property of the process, not the interval.
Is the probability of occurrence of the population mean
greater at the confidence interval's center and lowest at the
boundaries? Does the probability of occurrence of the
population mean in a confidence interval vary in a
measurable way from the center to the boundaries? In a
general sense, normality is assumed, and then the interval
between the CI limits is represented by a bell-shaped t
distribution. The expectation (E) of another value is highest
at the calculated mean value and decreases as the values
approach the CI limits.
An approximation for the single-measurement tolerance
interval is √n times the confidence interval of the mean.
Determining sample size: At the planning stage of a
statistical investigation the question of sample size (n) is
critical. The above figure also provides a practical guide to
sample size determination in the context of statistical
estimations and statistical significance tests.
The confidence level of conclusions drawn from a set of data
depends on the size of data set. The larger the sample, the
higher is the associated confidence. However, larger
samples also require more effort and resources. Thus, your
goal must be to find the smallest sample size that will
provide the desirable confidence. In the above figure,
formulas are presented for determining the sample size
required to achieve a given level of accuracy and
confidence.
In estimating the sample size, when the standard deviation
is not known, one may use one-fourth of the range, for samples
of size over 30, as a "good" estimate of the standard deviation.
It is good practice to compare the result with IQR/1.349.
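A minimal sketch of the usual sample-size formula for estimating a mean, n = (z·σ/E)², with σ roughly estimated as range/4 as suggested above (SciPy is assumed for the normal quantile; the function name and figures are illustrative):

import math
from scipy import stats

def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    # n = (z * sigma / E)^2, rounded up
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin_of_error) ** 2)

sigma_guess = 40 / 4            # range/4 rule of thumb, for a made-up range of 40
print(sample_size_for_mean(sigma_guess, margin_of_error=2))   # about 97 observations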
A Note on Multiple Comparison via the Individual
Intervals: Notice that, if the confidence intervals from two
samples do not overlap, there is a statistically significant
difference, say at 5%. However, the converse is not true:
two confidence intervals can overlap quite a lot, yet there may
still be a significant difference between them. One should examine
the confidence interval for the difference explicitly. Even if
the C.I.'s are overlapping, it is hard to find the exact overall
confidence level. However, the sum of the individual error
levels can serve as an upper limit on the overall error level.
This is evident from the fact that P(A or B) ≤ P(A) + P(B).
Further Reading
Hahn G. and W. Meeker, Statistical Intervals: A Guide for
Practitioners, Wiley, 1991.
Also visit the Web sites Confidence Interval
Applet, statpage.
Entropy Measure
Inequality coefficients used in sociology, economy,
biostatistics, ecology, physics, image analysis and
information processing are analyzed in order to shed light on
economic disparity worldwide. The variability of categorical
data is measured by the entropy function:
E = - Σ pi ln(pi)
where the sum is over all the categories and pi is the relative
frequency of the ith category. It is interesting to note that
this quantity is maximized when all the pi's are equal.
For an r×c contingency table it is
E = Σ pij ln(pij) - Σ (Σj pij) ln(Σj pij) - Σ (Σi pij) ln(Σi pij)
where the first sum is over all i and j, and the marginal sums
(over j, and over i) give the row and column totals, respectively.
Another measure is the Kullback-Leibler distance (related to
information theory):
Σ (Pi - Qi) log(Pi/Qi) = Σ Pi log(Pi/Qi) + Σ Qi log(Qi/Pi),
or the variation distance
Σ | Pi - Qi | / 2,
where Pi and Qi are the probabilities for the i-th category for
the two populations.
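A small Python sketch of these quantities for two hypothetical category distributions (NumPy only; the function names are illustrative):

import numpy as np

def entropy(p):
    # E = -sum p_i ln p_i; maximized when all p_i are equal
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                               # treat 0 * ln 0 as 0
    return -np.sum(p * np.log(p))

def symmetric_kl(p, q):
    # sum (P_i - Q_i) ln(P_i / Q_i), the symmetrized Kullback-Leibler distance
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum((p - q) * np.log(p / q))

def variation_distance(p, q):
    return 0.5 * np.sum(np.abs(np.asarray(p, float) - np.asarray(q, float)))

P = [0.2, 0.3, 0.5]
Q = [0.25, 0.25, 0.5]
print(entropy(P), symmetric_kl(P, Q), variation_distance(P, Q))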
For more on entropy visit the Web sites Entropy on
WWW, Entropy and Inequality Measures, and Biodiversity.
What Is Central Limit Theorem?
The central limit theorem (CLT) is a "limit" that is "central"
to statistical practice. For practical purposes, the main idea
of the CLT is that the average (center of data) of a sample
of observations drawn from some population is
approximately distributed as a normal distribution if certain
conditions are met. In theoretical statistics there are several
versions of the central limit theorem depending on how
these conditions are specified. These are concerned with the
types of conditions made about the distribution of the parent
population (population from which the sample is drawn) and
the actual sampling procedure.
One of the simplest versions of the theorem says that if we
take a random sample of size n from the entire population,
then the sample mean, which is a random variable defined
by Σ xi / n, has a histogram which converges to a normal
distribution shape if n is large enough (say, more than 30).
Equivalently, the sampling distribution of the sample mean
approaches a normal distribution as the sample size increases.
In applications of the central limit theorem to practical
problems in statistical inference, however, statisticians are
more interested in how closely the approximate distribution
of the sample mean follows a normal distribution for finite
sample sizes, than the limiting distribution itself. Sufficiently
close agreement with a normal distribution allows
statisticians to use normal theory for making inferences
about population parameters (such as the mean ) using the
sample mean, irrespective of the actual form of the parent
population.
It can be shown that, if the parent population has
mean μ and finite standard deviation σ, then the sample
mean distribution has the same mean μ but with smaller
standard deviation, namely σ divided by n½.
You know by now that, whatever the parent population is,
the standardized variable will have a distribution with a
mean = 0 and standard deviation =1 under random
sampling. Moreover, if the parent population is normal, then
z is distributed exactly as a standard normal variable. The
central limit theorem states the remarkable result that, even
when the parent population is non-normal, the standardized
variable is approximately normal if the sample size is large
enough. It is generally not possible to state conditions under
which the approximation given by the central limit theorem
works and what sample sizes are needed before the
approximation becomes good enough. As a general
guideline, statisticians have used the prescription that if the
parent distribution is symmetric and relatively short-tailed,
then the sample mean reaches approximate normality for
smaller samples than if the parent population is skewed or
long-tailed.
Under certain conditions, in large samples, the sampling
distribution of the sample mean can be approximated by a
normal distribution. The sample size needed for the
approximation to be adequate depends strongly on the
shape of the parent distribution. Symmetry (or lack thereof)
is particularly important.
For a symmetric parent distribution, even if very different
from the shape of a normal distribution, an adequate
approximation can be obtained with small samples (e.g., 10
or 12 for the uniform distribution). For symmetric, short-tailed
parent distributions, the sample mean reaches
approximate normality for smaller samples than if the
parent population is skewed and long-tailed. In some
extreme cases (e.g., a binomial with p very close to 0 or 1),
sample sizes far exceeding the typical guidelines (e.g., 30 or 60)
are needed for an adequate approximation. For some distributions
without first and second moments (e.g., Cauchy), the
central limit theorem does not hold.
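A quick simulation sketch (NumPy/SciPy) of this behavior for a skewed parent population: averages of n = 30 exponential observations have roughly the right mean and standard deviation, and much less skewness than the parent.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 30, 5000

parent = rng.exponential(scale=1.0, size=100_000)         # skewed parent, mean 1, sd 1
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

print(sample_means.mean())            # close to 1
print(sample_means.std(ddof=1))       # close to 1/sqrt(30), about 0.18
print(stats.skew(parent))             # about 2 for the exponential parent
print(stats.skew(sample_means))       # much closer to 0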
For some distributions, extremely large (impractical)
samples would be required to approach a normal
distribution. In manufacturing, for example, when defects
occur at a rate of less than 100 parts per million, using a
Beta distribution yields an honest CI for the total number of
defects in the population.
Review also Central Limit Theorem Applet, Sampling
Distribution Simulation, and CLT.
What Is a Sampling Distribution?
The sampling distribution describes probabilities associated
with a statistic when a random sample is drawn from the
entire population.
The sampling distribution is the probability distribution or
probability density function of the statistic.
Derivation of the sampling distribution is the first step in
calculating a confidence interval or carrying out a hypothesis
test for a parameter.
Example: Suppose that x1, ..., xn are a simple random
sample from a normally distributed population with expected
value μ and known variance σ². Then the sample mean, a
statistic used to give information about the population
parameter μ, is normally distributed with expected value μ and
variance σ²/n.
The main idea of statistical inference is to take a random
sample from the entire population and then to use the
information from the sample to make inferences about
particular population characteristics such as the
mean μ (a measure of central tendency), the standard
deviation σ (a measure of spread), or the proportion of units in
the population that have a certain characteristic. Sampling
saves money, time, and effort. Additionally, a sample can, in
some cases, provide as much or more accuracy than a
corresponding study that would attempt to investigate an
entire population: careful collection of data from a sample
will often provide better information than a less careful
study that tries to look at everything.
One must also study the behavior of the mean of sample
values from different specified populations. Because a
sample examines only part of a population, the sample
mean will not exactly equal the corresponding mean of the
population, μ. Thus, an important consideration for those
planning and interpreting sampling results is the degree to
which sample estimates, such as the sample mean, will
agree with the corresponding population characteristic.
In practice, only one sample is usually taken (in some cases
a small "pilot sample" is used to test the data-gathering
mechanisms and to get preliminary information for planning
the main sampling scheme). However, for purposes of
understanding the degree to which sample means will agree
with the corresponding population mean μ, it is useful to
consider what would happen if 10, or 50, or 100 separate
sampling studies, of the same type, were conducted. How
consistent would the results be across these different
studies? If we could see that the results from each of the
samples would be nearly the same (and nearly correct!),
then we would have confidence in the single sample that will
actually be used. On the other hand, seeing that answers
from the repeated samples were too variable for the needed
accuracy would suggest that a different sampling plan
(perhaps with a larger sample size) should be used.
A sampling distribution is used to describe the distribution of
outcomes that one would observe from replication of a
particular sampling plan.
Know that estimates computed from one sample will be
different from estimates that would be computed from
another sample.
Understand that estimates are expected to differ from the
population characteristics (parameters) that we are trying to
estimate, but that the properties of sampling distributions
allow us to quantify, based on probability, how they will
differ.
Understand that different statistics have different sampling
distributions with distribution shape depending on (a) the
specific statistic, (b) the sample size, and (c) the parent
distribution.
Understand the relationship between sample size and the
distribution of sample estimates.
Understand that the variability in a sampling distribution can
be reduced by increasing the sample size.
See that in large samples, many sampling distributions can
be approximated with a normal distribution.
To learn more, visit the Web sites Sample, and Sampling
Distribution Applet
Applications of and Conditions for Using Statistical
Tables
Some widely used applications of the popular statistical
tables can be categorized as follows:
Z - Table: Tests concerning µ for one or two populations
based on their large-size random sample(s) (say, n ≥ 30, to
invoke the Central Limit Theorem).
Tests concerning proportions, with large random sample
size n (say, n ≥ 50, to invoke a convergence theorem).
Conditions for using this table: Test for randomness of
the data is needed before using this table. Test for normality
of the sample distribution is also needed if the sample size is
small or it may not be possible to invoke the Central Limit
Theorem.
T - Table: Tests concerning µ for one or two populations
based on small random sample size(s).
Tests concerning regression coefficients (slope, and
intercepts), df = n - 2.
Notes: As you know by now, in tests of hypotheses
concerning μ, and in constructing a confidence interval for it,
we start with σ known, since the critical value (and the
p-value) from the Z-Table can be used. Considering
the more realistic situation where we don't know σ, the
T-Table is used. In both cases we need to verify the normality
of the population's distribution; however, if the sample size
n is very large, we can in fact switch back to the Z-Table by
virtue of the central limit theorem. For a perfectly normal
population, the t-distribution corrects for any errors
introduced by estimating σ with s when doing inference.
Note also that, in hypothesis testing concerning the
parameter of binomial and Poisson distributions for large
sample sizes, the standard deviation is known under the null
hypothesis. That's why you may use the normal
approximations to both of these distributions.
Conditions for using this table: Test for randomness of
the data is needed before using this table. Test for normality
of the sample distribution is also needed if the sample size is
small or it may not be possible to invoke the Central Limit
Theorem.
Chi-Square - Table: Tests concerning σ² for one population
based on a random sample from the entire population.
Contingency tables (tests for independence of categorical
data).
Goodness-of-fit test for discrete random variables.
Conditions for using this table: Tests for randomness of
the data and normality of the sample distribution are
needed before using this table.
F - Table: ANOVA: Tests concerning µ for three or more
populations based on their random samples.
Tests concerning σ² for two populations based on their
random samples.
Overall assessment in regression analysis using the F-value.
Conditions for using this table: Tests for randomness of
the data and normality of the sample distribution are
needed before using this table for ANOVA. Same conditions
must be satisfied for the residuals in regression analysis.
The following chart summarizes the application of statistical
tables with respect to tests of hypotheses and construction of
confidence intervals for the mean μ and variance σ² in one,
or a comparison of two or more, populations.
Further Reading:
Kagan. A., What students can learn from tables of basic
distributions, Int. Journal of Mathematical Education in
Science & Technology, 30(6), 1999.
Statistical Tables on the Web:
The following Web sites provide critical values useful in
statistical testing and construction of confidence intervals.
The results are identical to those given in statistic textbooks.
However, in most cases they are more extensive (therefore
more accurate).
Normal Curve Area
Normal Calculator
Normal Probability Calculation
Critical Values for the t-Distribution
Critical Values for the F-Distribution
Critical Values for the Chi- square Distribution
Read also
Kanji G., 100 Statistical Tests, Sage Publisher, 1995.
Relationships Among Distributions and Unification of Statistical
Tables
Particular attention must be paid to a first course in
statistics. When I first began studying statistics, it bothered
me that there were different tables for different tests. It
took me a while to learn that this is not as haphazard as it
appeared. Binomial, Normal, Chi-square, t, and F
distributions that you will learn about are actually closely
connected.
A problem with elementary statistical textbooks is that they
usually do not provide information of this kind, nor the
conceptual links that would permit a useful understanding of
the principles involved. If you want to
understand the connections between statistical concepts, then
you should practice making these connections. Learning
statistics by doing lends itself to active rather than passive
learning. Statistics is a highly interrelated set of concepts,
and to be successful at it, you must learn to make these
links conscious in your mind.
Students often ask: Why are T-table values with d.f. = 1 so
much larger than those for other d.f. values? Some tables
are limited; what should I do when the sample size is too
large? How can I become familiar with the tables and their
differences? Is there any integration among the tables?
Are there connections between tests of hypotheses and
confidence intervals under different scenarios, for example
testing with respect to one, two, or more than two populations?
And so on.
Further Reading:
Kagan. A., What students can learn from tables of basic
distributions, Int. Journal of Mathematical Education in
Science & Technology, 30(6), 1999.
The following two Figures demonstrate useful relationships
among distributions and a unification of statistical tables:
Unification of Common Statistical Tables, needs Acrobat to
view
Relationship Among Commonly Used Distributions in
Testing, needs Acrobat to view
Normal Distribution
Up to this point we have been concerned with how empirical
scores are distributed and how best to describe the
distribution. We have discussed several different measures,
but the mean will be the measure that we use to describe
the center of the distribution and the standard
deviation will be the measure we use to describe the
spread of the distribution. Knowing these two facts gives us
ample information to make statements about the probability
of observing a certain value within that distribution. If I
know, for example, that the average I.Q. score is 100 with a
standard deviation of σ = 20, then I know that someone
with an I.Q. of 140 is very smart. I know this because 140
deviates from the mean by twice the average amount of
the rest of the scores in the distribution. Thus, it is unlikely
to see a score as extreme as 140, because most of the I.Q.
scores are clustered around 100 and on average deviate
only 20 points from the mean μ.
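For example, a one-line check of that statement with SciPy (taking the I.Q. scores to be normal with mean 100 and standard deviation 20):

from scipy import stats

p = stats.norm.sf(140, loc=100, scale=20)   # P(I.Q. >= 140), i.e., 2 sd above the mean
print(round(p, 4))                          # about 0.023, so such scores are rare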
Many applications arise from the central limit theorem (the
average of n observations approaches a normal distribution,
irrespective of the form of the original distribution, under quite
general conditions). Consequently, it is an appropriate model for
many, but not all, physical phenomena: the distribution of
physical measurements on living organisms, intelligence test
scores, product dimensions, average temperatures, and so on.
Know that the Normal distribution satisfies seven
requirements: the graph is a bell-shaped curve; the mean,
median and mode are equal and located at the center of the
distribution; it has only one mode; it is symmetric about the mean;
it is continuous; it never touches the x-axis; and the area under
the curve equals one.
Many methods of statistical analysis presume normal
distribution.
Normal Curve Area.
What Is So Important About the Normal
Distributions?
Normal Distribution (also called Gaussian) curves, which
have a bell-shaped appearance (the distribution is sometimes
even referred to as the "bell-shaped curve"), are very important
in statistical analysis. In any normal distribution,
observations are distributed symmetrically around the
mean: 68% of all values under the curve lie within one
standard deviation of the mean and 95% lie within two
standard deviations.
There are many reasons for their popularity. The following
are the most important reasons for its applicability:
1. One reason the normal distribution is important is that
a wide variety of naturally occurring random
variables such as heights and weights of all creatures
are distributed evenly around a central value, average,
or norm (hence, the name normal distribution).
Although the distributions are only approximately
normal, they are usually quite close.
When there are many factors
influencing a random outcome, the underlying
distribution is approximately normal.
For example, the height of a tree is determined by the
"sum" of such factors as rain, soil quality, sunshine,
disease, etc.
As Francis Galton wrote in 1889, "Whenever a large
sample of chaotic elements are taken in hand and
marshaled in the order of their magnitude, an
unsuspected and most beautiful form of regularity
proves to have been latent all along."
Visit the Web sites Quincunx (with 5 influencing
factors), Central Limit Theorem (with 8
influencing factors), or BallDrop for demos.
2. Almost all statistical tables are limited by the size of
their parameters. However, when these parameters
are large enough one may use normal distribution for
calculating the critical values for these tables.
Visit Relationship Among Statistical Tables and Their
Applications (pdf version).
3. If the mean and standard deviation of a normal
distribution are known, it is easy to convert back and
forth from raw scores to percentiles.
4. It is characterized by two independent parameters:
mean and standard deviation. Therefore many
effective transformations can be applied to convert
almost any shaped distribution into a normal one.
5. The most important reason for popularity of normal
distribution is the Central Limit Theorem (CLT). The
distribution of the sample averages of a large number
of independent random variables will be approximately
normal regardless of the distributions of the individual
random variables. Visit also the Web sites Central Limit
Theorem Applet, Sampling Distribution Simulation,
and CLT, for some demos.
6. The other reason the normal distributions are so
important is that the normality condition is required by
almost all kinds of parametric statistical tests. The
CLT is a useful tool when you are dealing with a
population with an unknown distribution. Often, you may
analyze the mean (or the sum) of a sample of size n.
For example, instead of analyzing the weights of
individual items, you may analyze batches of size n,
that is, packages each containing n items.
What is a Linear Least Squares Model?
Many problems in analyzing data involve describing how
variables are related. The simplest of all models describing
the relationship between two variables is a linear, or
straight-line, model. Linear regression is always linear in the
coefficients being estimated, not necessarily linear in the
variables.
The simplest method of drawing a linear model is to "eyeball"
a line through the data on a plot, but a more elegant
and conventional method is that of least squares, which
finds the line minimizing the sum of squared vertical distances
between the observed points and the fitted line. Realize that
fitting the "best" line by eye is difficult, especially when there
is a lot of residual variability in the data.
Know that there is a simple connection between the
numerical coefficients in the regression equation and the
slope and intercept of regression line.
Know that a single summary statistic like a correlation
coefficient does not tell the whole story. A scatterplot is an
essential complement to examining the relationship between
the two variables.
Again, the regression line is a group of estimates for the
variable plotted on the Y-axis. It has the form y = a + mx,
where m is the slope of the line. The slope is the rise over run:
if a line goes up 2 for each 1 it goes over, then its slope is 2.
Formulas:
x̄ = Σ x(i)/n, the mean of the x values.
ȳ = Σ y(i)/n, the mean of the y values.
Sxx = Σ (x(i) - x̄)² = Σ x(i)² - [Σ x(i)]²/n
Syy = Σ (y(i) - ȳ)² = Σ y(i)² - [Σ y(i)]²/n
Sxy = Σ (x(i) - x̄)(y(i) - ȳ) = Σ x(i)·y(i) - [Σ x(i)][Σ y(i)]/n
Slope: m = Sxy / Sxx
Intercept: b = ȳ - m·x̄
The least squares regression line is:
y-predicted = yhat = mx + b
The regression line goes through a mean-mean point. That
is the point at the mean of the x values and the mean of the
y values. If you drew lines from the mean-mean point out to
each of the data points on the scatter plot, each of the lines
that you drew would have a slope. The regression slope is
the weighted mean of those slopes, where the weights are
the runs squared.
If you put in each x, the regression line spits out an
estimate for each y. Each estimate makes an error. Some
errors are positive and some are negative. The sum of
squares of the errors plus the sum of squares of the
estimates adds up to the sum of squares of Y. The regression
line is the line that minimizes the variance of the errors
(the mean error is zero, so this means that it minimizes the
sum of the squared errors).
The reason for finding the best line is so that you can make
reasonable predictions of what y will be if x is known (not
vice versa).
r2 is the variance of the estimates divided by the variance of
Y. r is ± the square root of r2. r is the size of the slope of
the regression line, in terms of standard deviations. In other
words, it is the slope if we use the standardized X and Y. It
is how many standard deviations of Y you would go up,
when you go one standard deviation of X to the right.
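A compact sketch (NumPy, with made-up data) of the formulas above, computing the slope, intercept, and r² directly:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # made-up data
y = np.array([2.1, 4.3, 6.2, 8.0, 9.8])

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

m = Sxy / Sxx                     # slope
b = y.mean() - m * x.mean()       # intercept: the line passes through (x-bar, y-bar)

yhat = m * x + b
r2 = np.sum((yhat - y.mean()) ** 2) / Syy     # variance of the estimates / variance of Y
print(m, b, r2)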
Visit also the Web sites Simple Regression, Linear
Regression, Putting Points
Coefficient of Determination
Another measure of the closeness of the points to the
regression line is the Coefficient of Determination.
r2 = Syhat yhat / Syy
which is the amount of the squared deviation explained by
the points on the least squares regression line.
When you have regression equations based on theory, you
should compare:
1. R squared, that is, the percentage of variance [in
fact, sum of squares] in Y accounted for by the variance in X
captured by the model.
2. When you want to compare models of different size
(different numbers of independent variables (p) and/or
different sample sizes n) you must use the Adjusted
R-Squared, because the usual R-Squared tends to grow
with the number of independent variables:
R²a = 1 - (n - 1)(1 - R²)/(n - p - 1)
3. prediction error or standard error
4. trends in error, 'observed-predicted' as a function of
control variables such as time. Systematic trends are
not uncommon
5. extrapolations to interesting extreme conditions of
theoretical significance
6. t-stats on individual parameters
7. values of the parameters and their consistency with the
theoretical (content) underpinnings.
8. The F(df1, df2) value for overall assessment, where df1
(numerator degrees of freedom) is the number of
linearly independent predictors in the assumed model
minus the number of linearly independent predictors in
the restricted model (i.e., the number of linearly
independent restrictions imposed on the assumed
model), and df2 (denominator degrees of freedom) is
the number of observations minus the number of
linearly independent predictors in the assumed model.
Homoscedasticity and Heteroscedasticity: Homoscedasticity
(homo = same, skedasis = scattering) is a word used to
describe the distribution of data points around the line of
best fit. The opposite term is heteroscedasticity. Briefly,
homoscedasticity means that data points are distributed
equally about the line of best fit; that is, constancy of
variances for/over all the levels of the factors.
Heteroscedasticity means that the data points cluster or
clump above and below the line in a non-equal pattern.
You should find a discussion of these terms in any decent
statistics text that deals with least squares regression.
See, e.g., Testing Research Hypotheses with the GLM,
by McNeil, Newman and Kelly, 1996, pages 174-176.
Finally, in statistics for business there exists an opinion that
with more than 4 parameters one can fit an elephant, so that
if one attempts to fit a curve that depends on many
parameters the result should not be regarded as very
reliable.
If m1 and m2 are the slopes of the two regressions of y on x
and of x on y, respectively, then R² = m1·m2.
Logistic regression: Standard logistic regression is a method
for modeling binary data (e.g., does a person smoke or not,
does a person survive a disease or not). Polytomous
(multinomial) logistic regression models more than two
options (e.g., does a person take the bus, drive a car, or
take the subway; does an office use WordPerfect, Word, or
another package).
Test for equality of two slopes: Let m1 represent the
regression coefficient for explanatory variable X1 in sample
1 with size n1. Let m2 represent the regression coefficient
for X1 in sample 2 with size n2. Let S1 and S2 represent the
associated standard error estimates. Then, the quantity
(m1 - m2) / SQRT(S1² + S2²)
has the t distribution with df = n1 + n2 - 4.
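A brief sketch of this test in Python, with hypothetical slopes, standard errors, and sample sizes:

from scipy import stats

m1, s1, n1 = 2.10, 0.35, 40    # slope, standard error, sample size (sample 1; made up)
m2, s2, n2 = 1.25, 0.40, 35    # slope, standard error, sample size (sample 2; made up)

t = (m1 - m2) / (s1 ** 2 + s2 ** 2) ** 0.5
df = n1 + n2 - 4
p = 2 * stats.t.sf(abs(t), df)         # two-sided p-value
print(t, df, p)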
Regression when both X and Y are in error: Simple linear
least-squares regression has among its conditions that the
data for the independent (X) variables are known without
error. In fact, the estimated results are conditioned on
whatever errors happened to be present in the independent
dataset. When the X-data have an error associated with
them, the result is to bias the slope downwards. A procedure
known as Deming regression can handle this problem quite
well; biased slope estimates (due to error in X) can be
avoided by using Deming regression.
Reference:
Cook and Weisberg, An Introduction to Regression Graphics,
Wiley, 1994
Regression Analysis: Planning, Development, and
Maintenance
I – Planning:
1. Define the problem, select response, suggest variables
2. Are the proposed variables fundamental to the
problem, and are they measurable? Can one
get a complete set of observations at the same time?
Ordinary regression analysis does not assume that the
independent variables are measured without error;
however, the results are conditioned on whatever errors
happened to be present in the independent dataset.
3. Is the problem potentially solvable?
4. Correlation matrix and first regression runs (for a
subset of the data).
Find the basic statistics and the correlation matrix.
How difficult might this problem be?
Compute the Variance Inflation Factor, VIF = 1/(1 - rij),
i, j = 1, 2, 3, ..., i ≠ j. For moderate VIF, say between 2
and 8, you might be able to come up with a 'good'
model.
Inspect the rij's; one or two must be large. If all are small,
perhaps the ranges of the X variables are too small.
5. Establish goals, prepare a budget and a timetable.
a - The final equation should have R² = 0.8 (say).
b - Coefficient of Variation of, say, less than 0.10.
c - The number of predictors should not exceed p (say, 3);
for example, for p = 3 we need at least 30 points.
d - All estimated coefficients must be significant at α =
0.05 (say).
e - No pattern in the residuals.
6. Are goals and budget acceptable?
II – Development of the Model:
1. Collect data, plot, try models, check the quality of the
data, check the assumptions.
2. Consult experts for criticism.
Plot new variables and examine the same fitted model.
Transformed predictor variables may also be used.
3. Are the goals met?
Have you found "the best" model?
III – Validation and Maintenance of the Model:
1. Are the parameters stable over the sample space?
2. Is there lack of fit?
Are the coefficients reasonable?
Are any obvious variables missing?
Is the equation usable for control or for prediction?
3. Maintenance of the Model: one needs a control chart to
check the model periodically by statistical techniques.
Predicting Market Response
As applied researchers in business and economics, faced
with the task of predicting market response, we seldom
know the functional form of the response. Perhaps market
response is a nonlinear monotonic, or even a non-monotonic
function of explanatory variables. Perhaps it is determined
by interactions of explanatory variables. Interaction is
logically independent of its components.
When we try to represent complex market relationships
within the context of a linear model, using appropriate
transformations of explanatory and response variables, we
learn how hard the work of statistics can be. Finding
reasonable models is a challenge, and justifying our choice
of models to our peers can be even more of a challenge.
Alternative specifications abound.
Modern regression methods, such as generalized additive
models, multivariate adaptive regression splines, and
regression trees, have one clear advantage: They can be
used without specifying a functional form in advance. These
data-adaptive, computer-intensive methods offer a more
flexible approach to modeling than traditional statistical
methods. How well do modern regression methods perform in
predicting market response? Some perform quite well, based
on the results of simulation studies.
How to Compare Two Correlation Coefficients?
The statistical test for Ho: ρ1 = ρ2 is the following.
Compute
t = (z1 - z2) / [ 1/(n1-3) + 1/(n2-3) ]½,    n1, n2 > 3,
where
z1 = 0.5 ln( (1+r1)/(1-r1) ),
z2 = 0.5 ln( (1+r2)/(1-r2) ),
n1 = the sample size associated with r1, and n2 = the sample
size associated with r2.
The distribution of the statistic t is approximately N(0,1).
So, you should reject Ho if |t| > 1.96 at the 95% confidence
level.
r is (positive) scale and (any) shift invariant. That is, ax + c
and by + d have the same r as x and y, for any positive a and
b.
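A short sketch of this test in Python (SciPy; the correlations and sample sizes are hypothetical):

import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    # Fisher z test for Ho: rho1 = rho2, two independent samples
    z1 = 0.5 * np.log((1 + r1) / (1 - r1))
    z2 = 0.5 * np.log((1 + r2) / (1 - r2))
    t = (z1 - z2) / np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * stats.norm.sf(abs(t))      # two-sided p-value from N(0,1)
    return t, p

print(compare_correlations(r1=0.60, n1=50, r2=0.35, n2=60))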
Procedures for Statistical Decision Making
The two most widely used measuring tools and decision
procedures in statistical decision making, are Classical and
Bayesian Approaches.
Classical Approach: the classical probability of finding this
sample statistic, or any statistic more unlikely, assuming
the null hypothesis is true. A small p-value is taken as sufficient
evidence to reject the null hypothesis and to accept the
alternative.
As indicated in the above Figure, type-I error occurs when
based on your data you reject the null hypothesis when in
fact it is true. The probability of a type I error is the level of
significance of the test of hypothesis, and is denoted by .
A type II error occurs when you do not reject the null
hypothesis when it is in fact false. The probability of a
type-II error is denoted by β. The quantity 1 - β is known
as the Power of a Test. The type-II error can be evaluated for
any specific alternative hypothesis stated in the form "Not
Equal to" as a competing hypothesis.
Bayesian Approach: Difference in expected gain (loss)
associated with taking various actions each having an
associated gain (loss) and a given Bayesian statistical
significance. This is standard Min/Max decision theory using
Bayesian strength of belief assessments in the truth of the
alternate hypothesis. One would choose the action which
minimizes expected loss or maximizes expected gain (the
risk function).
Hypothesis Testing: Rejecting a Claim
To perform a hypothesis testing, one must be very specific
about the test one wishes to perform. The null hypothesis
must be clearly stated, and the data must be collected in a
repeatable manner. Usually, the sampling design will involve
random, stratified random, or regular distribution of study
plots. If there is any subjectivity, the results are technically
not valid. All of the analyses, including the sample size,
significance level, time, and budget, must be
planned in advance, or else the user runs the risk of "data
diving".
Hypothesis testing is mathematical proof by contradiction.
For example, for a Student's t test comparing 2 groups, we
assume that the two groups come from the same population
(same means, standard deviations, and, in general, same
distributions). Then we try hard to show that this
assumption is false. Rejecting H0 means either that H0 is false, or
that a rare event (one with probability α) has occurred.
The real question in statistics is not whether a null hypothesis
is correct, but whether it is close enough to be used as an
approximation.
Selecting Statistics
In most statistical tests concerning μ, we start by assuming
that the σ² and higher moments (skewness, kurtosis) are equal.
Then we hypothesize that the μ's are equal; this is the null
hypothesis. The "null" suggests no difference between group
means, or no relationship between quantitative variables, and so on.
Then we test with a calculated t-value. For simplicity, suppose we
have a 2 sided test. If the calculated t is close to 0, we say good,
as we expected. If the calculated t is far from 0, we say, "the
chance of getting this value of t, given my assumption of equal
populations, is so small that I will not believe the assumption. We
will say that the populations are not equal, specifically the means
are not equal."
Sketch a normal distribution with mean μ1 - μ2 and
standard deviation s. If the null hypothesis is true, then the mean
is 0. We calculate the 't' value, as per the equation. We look up a
"critical" value of t. The probability of calculating a t value more
extreme (+ or -) than this, given that the null hypothesis is
true, is equal to or less than the α risk we used in pulling the
critical value from the table. Mark the calculated t and the critical
t (both sides) on the sketch of the distribution. Now, if the
calculated t is more extreme than the critical value, we say, "the
chance of getting this t, by sheer chance, when the null hypothesis
is true, is so small that I would rather say the null hypothesis is
false, and accept the alternative, that the means are not equal."
When the calculated value is less extreme than the critical value,
we say, "I could get this value of t by sheer chance, often enough
that I will not write home about it. I cannot detect a difference in
the means of the two groups at the α significance level."
In this test we need (among others) the condition that the
population variances are equal (i.e., the treatment impacts central
tendency but not variability). However, this test is robust to
violations of that condition if the n's are large and almost the same
size. A counterexample would be to try a t-test between (11, 12,
13) and (20, 30, 40). The pooled and unpooled tests both give t
statistics of 3.10, but the degrees of freedom are different: 4
(pooled) or about 2 (unpooled). Consequently the pooled test
gives p = .036 and the unpooled p = .088. We could go down to
n = 2 and get something still more extreme.
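The counterexample above is easy to reproduce with SciPy, which lets you switch between the pooled and unpooled (Welch) versions of the test:

from scipy import stats

g1, g2 = [11, 12, 13], [20, 30, 40]

print(stats.ttest_ind(g1, g2, equal_var=True))    # pooled:   |t| about 3.10, df = 4, p about .036
print(stats.ttest_ind(g1, g2, equal_var=False))   # unpooled: |t| about 3.10, df about 2, p about .088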
The Classical Approach to the Test of Hypotheses
In this treatment there are two parties: one party (or person)
sets out the null hypothesis (the claim), an alternative hypothesis
is proposed by the other party, and a significance level α and a
sample size n are agreed upon by both parties. The second step
is to compute the relevant statistic based on the null hypothesis
and the random sample of size n. Finally, one determines the
critical region (i.e., rejection region). The conclusion based on this
approach is as follows:
If the computed statistic falls within the rejection region,
then Reject the null hypothesis; otherwise, Do Not Reject the null
hypothesis (the claim).
You may ask: How does one determine the critical value (such as
the z-value) for the rejection interval, for one- and two-tailed
hypotheses? What is the rule?
First you have to choose a significance level α. Knowing that the
null hypothesis is always in "equality" form, the alternative
hypothesis has one of three possible forms: "greater-than",
"less-than", or "not equal to". The first two forms correspond to
one-tail hypotheses while the last one corresponds to a two-tail
hypothesis.
If your alternative is in the form of "greater-than", then z is
the value that gives you an area equal to α in the right tail of
the distribution.
If your alternative is in the form of "less-than", then z is the
value that gives you an area equal to α in the left tail of the
distribution.
If your alternative is in the form of "not equal to", then there
are two z values, one positive and the other negative.
The positive z is the value that gives you an α/2 area in
the right tail of the distribution, while the negative z is the
value that gives you an α/2 area in the left tail of the
distribution.
This is a general rule, and to implement this process in
determining the critical value, for any test of hypothesis, you
must first master reading the statistical tables well, because, as
you see, not all tables in your textbook are presented in a same
format.
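For the Z-table, this rule is easy to check with SciPy's normal quantile function (α = 0.05 shown):

from scipy import stats

alpha = 0.05
z_greater = stats.norm.ppf(1 - alpha)        # "greater-than" alternative:  1.645
z_less    = stats.norm.ppf(alpha)            # "less-than" alternative:    -1.645
z_two     = stats.norm.ppf(1 - alpha / 2)    # "not equal to": plus/minus 1.960
print(z_greater, z_less, z_two)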
The Meaning and Interpretation of P-values (what the data say?)
The p-value, which directly depends on a given sample, attempts
to provide a measure of the strength of the results of a test for
the null hypotheses, in contrast to a simple reject or do not reject
in the classical approach to the test of hypotheses. If the null
hypothesis is true and the chance of random variation is the only
reason for sample differences, then the p-value is a quantitative
measure to feed into the decision making process as evidence.
The following table provides a reasonable interpretation of p-values:
P-value                 Interpretation
P ≤ 0.01                very strong evidence against H0
0.01 < P ≤ 0.05         moderate evidence against H0
0.05 < P ≤ 0.10         suggestive evidence against H0
0.10 < P                little or no real evidence against H0
This interpretation is widely accepted, and many scientific
journals routinely publish papers using such an interpretation for
the result of test of hypothesis.
For the fixed sample size, when the number of realizations is
decided in advance, the distribution of p is uniform (assuming the
null hypothesis). We would express this as P(p ≤ x) = x. That
means the criterion of p ≤ 0.05 achieves α of 0.05.
Understand that the distribution of p-values under null hypothesis
H0 is uniform, and thus does not depend on a particular form of
the statistical test. In a statistical hypothesis test, the P value is
the probability of observing a test statistic at least as extreme as
the value actually observed, assuming that the null hypothesis is
true. The value of p is defined with respect to a distribution.
Therefore, we could call it "model-distributional hypothesis"
rather than "the null hypothesis".
In short, it simply means that if the null had been true, the p
value is the probability against the null in that case. The p-value
is determined by the observed value, however, this makes it
difficult to even state the inverse of p.
Reference:
Arsham H., Kuiper's P-value as a Measuring Tool and Decision
Procedure for the Goodness-of-fit Test, Journal of Applied
Statistics, Vol. 15, No.3, 131-135, 1988.
Blending the Classical and the P-value Based Approaches
in Test of Hypotheses
A p-value is a measure of how much evidence you have against
the null hypothesis. Notice that the null hypothesis is always in "="
form, and does not contain any form of inequality. The smaller
the p-value, the more evidence you have. In this setting the
p-value is based on the null hypothesis and has nothing to do with
the alternative hypothesis and therefore with the rejection region. In
recent years, some authors have tried to use a mixture of the classical
approach (which is based on the critical value obtained from a given α
and on the computed statistic) and the p-value approach.
This is a blend of two different schools of thought. In this setting,
some textbooks compare the p-value with the significance level to
make a decision on a given test of hypothesis. The larger the p-value is
when compared with α (in a one-sided alternative hypothesis,
and α/2 for two-sided alternative hypotheses), the less evidence
we have for rejecting the null hypothesis. In such a comparison, if
the p-value is less than some threshold (usually 0.05, sometimes
a bit larger like 0.1 or a bit smaller like 0.01) then you reject the
null hypothesis. The following paragraphs deal with such a
combined approach.
Use of the P-value and α: In this setting, we must also consider the
alternative hypothesis in drawing the rejection interval (region).
There is only one p-value to compare with α (or α/2). Know that,
for any test of hypothesis, there is only one p-value. The
following outlines the computation of the p-value and the decision
process involved in a given test of hypothesis:
1. P-value for one-sided alternative hypotheses: The p-value is
defined as the area in the right tail of the distribution if the
rejection region is on the right tail; if the rejection region is
on the left tail, then the p-value is the area in the left tail.
2. P-value for two-sided alternative hypotheses: If the
alternative hypothesis is two-sided (that is, the rejection
regions are on both the left and the right tails), then the
p-value is the area in the right or the left tail of the
distribution, depending on whether the computed statistic is
closer to the right rejection region or the left rejection region.
For symmetric densities (such as t) the left and right tail
p-values are the same. However, for non-symmetric densities
(such as Chi-square), use the smaller of the two (this
makes the test more conservative). Notice that, for two-sided
alternative hypotheses, the p-value is never
greater than 0.5.
3. After finding the p-value as defined here, you compare it
with a preset α value for one-sided tests, and with α/2 for
two-sided tests. The larger the p-value is when compared
with α (in a one-sided alternative hypothesis, and α/2 for the
two-sided alternative hypotheses), the less evidence we have for
rejecting the null hypothesis.
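A small sketch of these steps for a t statistic (SciPy; the computed statistic and degrees of freedom are hypothetical), following the convention above of comparing the tail p-value with α/2 for a two-sided test:

from scipy import stats

t_calc, df, alpha = 2.30, 15, 0.05        # made-up computed statistic and df

p_right = stats.t.sf(t_calc, df)          # one-sided, rejection region on the right
p_left  = stats.t.cdf(t_calc, df)         # one-sided, rejection region on the left
p_tail  = stats.t.sf(abs(t_calc), df)     # tail area nearer the rejection region

print(p_right < alpha)                    # one-sided decision
print(p_tail < alpha / 2)                 # two-sided decision (compare with alpha/2)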
To avoid looking-up the p-values from the limited statistical
tables given in your textbook, most professional statistical
packages such as SPSS provide the two-tail p-value. Based on
where the rejection region is, you must find out what p-value to
use.
Unfortunately, some textbooks have many misleading statements
about the p-value and its applications; for example, in many textbooks
you find the authors doubling the p-value to compare it
with α when dealing with the two-sided test of hypotheses.
One wonders how they do it in the case when "their" p-value
exceeds 0.5. Notice that, while it is correct to compare the p-value
with α for one-sided tests of hypotheses, for two-sided hypotheses
one must compare the p-value with α/2,
NOT α with 2 times the p-value, as unfortunately some textbooks
advise. While the decision is the same, there is a clear
distinction here and an important difference which the careful
reader will note.
When Should We Pool Variance Estimates?
Variance estimates should be pooled only if there is a good
reason for doing so, and then (depending on that reason) the
conclusions might have to be made explicitly conditional on the
validity of the equal-variance model. There are several different
good reasons for pooling:
(a) to get a single stable estimate from several relatively small
samples, where variance fluctuations seem not to be systematic;
or
(b) for convenience, when all the variance estimates are near
enough to equality; or
(c) when there is no choice but to model variance (as in simple
linear regression with no replicated X values), and deviations
from the constant-variance model do not seem systematic; or
(d) when group sizes are large and nearly equal, so that there is
essentially no difference between the pooled and unpooled
estimates of standard errors of pairwise contrasts, and degrees of
freedom are nearly asymptotic.
Note that this last rationale can fall apart for contrasts other than
pairwise ones. One is not really pooling variance in case (d),
rather one is merely taking a shortcut in the computation of
standard errors of pairwise contrasts.
If you calculate the test without the assumption, you have to
determine the degrees of freedom (or let the statistics package
do it). The formula works in such a way that df will be less if the
larger sample variance is in the group with the smaller number of
observations. This is the case in which the two tests will differ
considerably. A study of the formula for the df is most
enlightening and one must understand the correspondence
between the unfortunate design (having the most observations in
the group with little variance) and the low df and accompanying
large t-value.
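A quick sketch of the Welch-Satterthwaite approximation usually used for that df (the sample figures are made up), showing how putting most of the observations in the low-variance group drives the df down:

def welch_df(s1, n1, s2, n2):
    # approximate df for the unpooled two-sample t test
    # (s1, s2 are the sample standard deviations)
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# most observations in the group with little variance -> low df
print(welch_df(s1=1.0, n1=30, s2=10.0, n2=5))   # about 4, far below (30-1)+(5-1) = 33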
Example: When doing t tests for differences in means of
populations (a classic independent samples case):
1. Use the standard error formula for differences in means that
does not make any assumption about equality of population
variances [i.e., (VAR1/n1 + VAR2/n2)½].
2. Use the "regular" way to calculate df in a t test: (n1 - 1) + (n2 - 1), for n1, n2 ≥ 2.
3. If the total N is less than 50, one sample is half the size of the other (or less), and the smaller sample has a standard deviation at least twice as large as the other sample's, then replace #2 with the formula for adjusting the df value. Otherwise, don't worry about the problem of the actual α level being much different from the one you have set.
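For concreteness, here is a rough Python/SciPy sketch (hypothetical samples, not the author's recipe) comparing the pooled test with the unequal-variance (Welch) test; SciPy's equal_var switch applies the adjusted-df formula automatically.

from scipy import stats

group1 = [12.1, 14.3, 11.8, 13.5, 12.9]                  # hypothetical small sample
group2 = [10.2, 15.8, 9.1, 16.4, 8.7, 14.9, 11.3, 17.2]  # hypothetical, more variable sample
pooled = stats.ttest_ind(group1, group2, equal_var=True)   # assumes equal population variances
welch = stats.ttest_ind(group1, group2, equal_var=False)   # unpooled SE with adjusted df
print("pooled:", pooled)
print("Welch :", welch)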
In the Statistics With Confidence section we are concerned with the construction of confidence intervals, where the equality-of-variances condition is an important issue.
Visit also the Web sites Statistics, Statistical tests.
Remember that in the t tests for differences in means there is a condition of equal population variances that must be examined. One way to test for possible differences in variances is to do an F test. However, the F test is very sensitive to violations of the normality condition; i.e., if the populations appear not to be normal, then the F test will tend to over-reject the null hypothesis of no difference in population variances.
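For concreteness, here is a small Python sketch (hypothetical data, not the author's code) of the variance-ratio F test; because of its sensitivity to non-normality, treat it only as a rough check.

import numpy as np
from scipy import stats

x = np.array([12.1, 14.3, 11.8, 13.5, 12.9])      # hypothetical sample 1
y = np.array([10.2, 15.8, 9.1, 16.4, 8.7, 14.9])  # hypothetical sample 2
f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)    # ratio of the sample variances
dfn, dfd = len(x) - 1, len(y) - 1
p_two_sided = 2 * min(stats.f.sf(f_stat, dfn, dfd), stats.f.cdf(f_stat, dfn, dfd))
print(f_stat, p_two_sided)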
SPSS program for T-test, Two-Population Independent
Means:
$SPSS/OUTPUT=CH2DRUG.OUT
TITLE
' T-TEST, TWO INDEPENDENT MEANS '
DATA LIST
FREE FILE='A.IN'/drug walk
VAR LABELS
DRUG 'DRUG OR PLACEBO'
WALK 'DIFFERENCE IN TWO WALKS'
VALUE LABELS
DRUG 1 'DRUG' 2 'PLACEBO'
T-TEST GROUPS=DRUG(1,2)/VARIABLES=WALK
NPAR TESTS
M-W=WALK BY DRUG(1,2)/
NPAR TESTS
K-S=WALK BY DRUG(1,2)/
NPAR TESTS
K-W=WALK BY DRUG(1,2)/
SAMPLE 10 FROM 20
CONDESCRIPTIVES DRUG(ZDRUG),WALK(ZWALK)
LIST CASE
CASE =10/VARIABLES=DRUG,ZDRUG,WALK,ZWALK
FINISH
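For readers without SPSS, a roughly equivalent analysis can be sketched in Python/SciPy as follows; the WALK differences for the drug and placebo groups are hypothetical values, not the data read by the SPSS job.

from scipy import stats

drug = [1.2, 0.8, 1.5, 0.9, 1.1]        # hypothetical WALK differences, drug group
placebo = [0.4, 0.6, 0.2, 0.7, 0.5]     # hypothetical WALK differences, placebo group
print(stats.ttest_ind(drug, placebo, equal_var=False))  # two independent means
print(stats.mannwhitneyu(drug, placebo))                # Mann-Whitney (M-W)
print(stats.ks_2samp(drug, placebo))                    # Kolmogorov-Smirnov (K-S)
print(stats.kruskal(drug, placebo))                     # Kruskal-Wallis (K-W)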
SPSS program for T-test, Two-Population Dependent
Means:
$ SPSS/OUTPUT=A.OUT
TITLE
' T-TEST, 2 DEPENDENT MEANS'
FILE HANDLE
MC/NAME='A.IN'
DATA LIST
FILE=MC/YEAR1,YEAR2,(F4.1,1X,F4.1)
VAR LABELS
YEAR1 'AVERAGE LENGTH OF STAY IN YEAR 1'
YEAR2 'AVERAGE LENGTH OF STAY IN YEAR 2'
LIST CASE
CASE=11/VARIABLES=ALL/
T-TEST PAIRS=YEAR1 YEAR2
NONPAR COR YEAR1,YEAR2
NPAR TESTS WILCOXON=YEAR1,YEAR2/
NPAR TESTS SIGN=YEAR1,YEAR2/
NPAR TESTS KENDALL=YEAR1,YEAR2/
FINISH
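Again for readers without SPSS, a comparable paired analysis can be sketched in Python/SciPy; the year-1 and year-2 lengths of stay below are hypothetical.

from scipy import stats

year1 = [5.2, 6.1, 4.8, 7.0, 5.5, 6.3]   # hypothetical average length of stay, year 1
year2 = [4.9, 5.8, 4.5, 6.4, 5.6, 5.9]   # hypothetical average length of stay, year 2
print(stats.ttest_rel(year1, year2))     # t-test for two dependent means
print(stats.wilcoxon(year1, year2))      # Wilcoxon matched-pairs signed-ranks test
print(stats.spearmanr(year1, year2))     # nonparametric (rank) correlation
print(stats.kendalltau(year1, year2))    # Kendall's tau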
Visit also the Web site Statistical tests.
Analysis of Variance (ANOVA)
The tests we have learned up to this point allow us to test
hypotheses that examine the difference between only two means.
Analysis of Variance or ANOVA will allow us to test the difference
between 2 or more means. ANOVA does this by examining the
ratio of variability between two conditions and variability within
each condition. For example, say we give a drug that we believe
will improve memory to a group of people and give a placebo to
another group of people. We might measure memory
performance by the number of words recalled from a list we ask
everyone to memorize. A t-test would compare the likelihood of
observing the difference in the mean number of words recalled
for each group. An ANOVA test, on the other hand, would
compare the variability that we observe between the two
conditions to the variability observed within each condition. Recall
that we measure variability as the sum of the squared differences of each score from the mean. When we actually calculate an ANOVA we will use a short-cut formula.
Thus, when the variability that we predict (between the two
groups) is much greater than the variability we don't predict
(within each group), then we will conclude that our treatments
produce different results.
An Illustrative Numerical Example for ANOVA
ANOVA is introduced here in its simplest form by a numerical illustration.
Example: Consider the following small, integer-valued random samples (kept small to save space) from three different populations.
With the null hypothesis H0: µ1 = µ2 = µ3 and the alternative Ha: at least two of the means are not equal, at the significance level α = 0.05 the critical value from the F-table is F(0.05; 2, 12) = 3.89.
        Sample 1   Sample 2   Sample 3
           2          3          5
           3          4          5
           1          3          5
           3          5          3
           1          0          2
SUM       10         15         20
Mean       2          3          4
Demonstrate that SST = SSB + SSW.
Computation of the sample SST: With the grand mean = 3, first take the difference between each observation and the grand mean, and then square it for each data point.
        Sample 1   Sample 2   Sample 3
           1          0          4
           0          1          4
           4          0          4
           0          4          0
           4          9          1
SUM        9         14         13
Therefore SST=36 with d.f = 15-1 = 14
Computation of sample SSB:
Second, let all the data in each sample have the same value as
the mean in that sample. This removes any variation WITHIN.
Compute SS differences from the grand mean.
        Sample 1   Sample 2   Sample 3
           1          0          1
           1          0          1
           1          0          1
           1          0          1
           1          0          1
SUM        5          0          5
Therefore SSB = 10, with d.f = 3-1 = 2
Computation of sample SSW:
Third, compute the SS difference within each sample using their
own sample means. This provides SS deviation WITHIN all
samples.
        Sample 1   Sample 2   Sample 3
           0          0          1
           1          1          1
           1          0          1
           1          4          1
           1          9          4
SUM        4         14          8
SSW = 26 with d.f = 3(5-1) = 12
The results are: SST = SSB + SSW, and d.f.(SST) = d.f.(SSB) + d.f.(SSW), as expected.
Now, construct the ANOVA table for this numerical example by
plugging the results of your computation in the ANOVA Table.
The ANOVA Table

Sources of Variation   Sum of Squares   Degrees of Freedom   Mean Squares   F-Statistic
Between Samples              10                  2                5             2.30
Within Samples               26                 12                2.17
Total                        36                 14
Conclusion: Since the F-statistic 2.30 is less than the critical value F(0.05; 2, 12) = 3.89, there is not enough evidence to reject the null hypothesis H0.
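The worked example can be checked with a one-way ANOVA routine; the following Python/SciPy sketch (not part of the original text) uses the three samples from the table above.

from scipy import stats

sample1 = [2, 3, 1, 3, 1]
sample2 = [3, 4, 3, 5, 0]
sample3 = [5, 5, 5, 3, 2]
f_stat, p_value = stats.f_oneway(sample1, sample2, sample3)
print(f_stat, p_value)   # F is about 2.31, below the critical value 3.89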
Logic Behind ANOVA: First, let us try to explain the logic and then illustrate it with a simple example. In performing the ANOVA test, we are trying to determine whether a certain number of population means are equal. To do that, we measure the differences among the sample means and compare them to the variability within the sample observations. That is why the test statistic is the ratio of the between-sample variation (MST) and the within-sample variation (MSE). If this ratio is close to 1, there is evidence that the population means are equal.
Here's a hypothetical example: many people believe that men get paid more in the business world than women, simply because they are male. To justify or reject such a claim, you could look at the variation within each group (one group being women's salaries and the other being men's salaries) and compare that to the variation between the means of randomly selected samples of each population. If the variation within the women's salaries is much larger than the variation between the men's and women's mean salaries, one could argue that, because the variation within the women's group is so large, this may not be a gender-related problem.
Now, getting back to our numerical example, we notice that:
given the test conclusion and the ANOVA test's conditions, we
may conclude that these three populations are in fact the same
population. Therefore, the ANOVA technique could be used as a
measuring tool and statistical routine for quality control as
described below using our numerical example.
Construction of the Control Chart for the Sample Means: Under the null hypothesis the ANOVA concludes that µ1 = µ2 = µ3; that is, we have a "hypothetical parent population." The question is, what is its variance? The estimated variance is 36 / 14 = 2.57. Thus, the estimated standard deviation is 1.60, and the estimated standard deviation for the means is 1.6/√5 = 0.71.
Under the conditions of ANOVA, we can construct a control chart with the warning limits = 3 ± 2(0.71) and the action limits = 3 ± 3(0.71). The following figure depicts the control chart.
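The control-limit arithmetic can be reproduced with a few lines of Python (a sketch, not the original figure):

import math

variance = 36 / 14                 # estimated variance of the hypothetical parent population
sd = math.sqrt(variance)           # about 1.60
se_mean = sd / math.sqrt(5)        # about 0.71, for samples of size 5
warning_limits = (3 - 2 * se_mean, 3 + 2 * se_mean)
action_limits = (3 - 3 * se_mean, 3 + 3 * se_mean)
print(warning_limits, action_limits)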
Visit also the Web site Statistical tests.
Bartlett's Test: The Analysis of Variance requires certain
conditions be met if the statistical tests are to be valid. One of
the conditions we make is that the errors (residuals) all come
from the same normal distribution. Thus we have to test not only
for normality, but we must also test homogeneity of the
variances. We can do this by subdividing the data into
appropriate groups, computing the variances in each of the
groups and testing that they are consistent with being sampled
from a Normal distribution. The statistical test for homogeneity of
variance is due to Bartlett; it is a modification of the Neyman-Pearson likelihood ratio test.
Bartlett's Test of Homogeneity of Variances for r Independent
Samples is a test to check for equal variances between
independent samples of data. The subgroup sizes do not have to be equal. This test assumes that each sample was randomly and independently drawn from a normal population.
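Bartlett's test is available in most packages; the following Python/SciPy sketch applies it to three hypothetical independent samples of unequal size (illustrative data, not from the text).

from scipy import stats

g1 = [25, 17, 8, 18, 5, 10, 21]          # hypothetical sample 1
g2 = [25, 21, 10, 13, 22]                # hypothetical sample 2
g3 = [20, 17, 14, 6, 25, 19, 16, 16]     # hypothetical sample 3
stat, p_value = stats.bartlett(g1, g2, g3)
print(stat, p_value)    # a small p-value suggests unequal variances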
SPSS program for ANOVA: More Than Two Independent
Means:
$SPSS/OUTPUT=4-1.OUT1
TITLE
'ANALYSIS OF VARIANCE - 1st ITERATION'
DATA LIST
FREE FILE='A.IN'/GP Y
ONEWAY Y BY GP(1,5)/RANGES=DUNCAN
/STATISTICS DESCRIPTIVES HOMOGENEITY
STATISTICS 1
MANOVA Y BY GP(1,5)/PRINT=HOMOGENEITY(BARTLETT)/
NPAR TESTS K-W Y BY GP(1,5)/
FINISH
ANOVA, like the two-population t-test, can go wrong when the equality-of-variances condition is not met.
Homogeneity of Variance: Checking the equality of variances. For 3 or more populations, there is a practical rule known as the "Rule of 2". According to this rule, divide the highest sample variance by the lowest sample variance. If the sample sizes are almost the same and the value of this ratio is less than 2, then the variations of the populations can be treated as almost the same.
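A minimal sketch of this check in Python (a hypothetical helper and hypothetical samples, not part of the original text):

import numpy as np

def rule_of_two(*samples):
    variances = [np.var(s, ddof=1) for s in samples]
    return max(variances) / min(variances) < 2   # True: variances may be treated as equal

print(rule_of_two([24, 20, 22, 25, 21], [14, 8, 11, 16, 9], [30, 12, 25, 7, 19]))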
Example: Consider the following three random samples from three populations, P1, P2, and P3.
P1    P2    P3
25    25    20
17    21    17
 8    10    14
18    13     6
 5    22    25
10    25    19
21    15    16
24    23    16
12    14     6
16    13     6
The summary statistics and the ANOVA table are computed to be:
Variable     N     Mean    St.Dev    SE Mean
P1          10    16.90      7.87       2.49
P2          10    19.80      3.52       1.11
P3          10    11.50      3.81       1.20

Analysis of Variance
Source     DF        SS       MS       F    p-value
Factor      2     79.40    39.70    4.38      0.023
Error      27    244.90     9.07
Total      29    324.30
With F = 4.38 and a p-value of 0.023, we reject the null hypothesis at α = 0.05. This is not good news, however, since ANOVA, like the two-sample t-test, can go wrong when the equality-of-variances condition is not met, and here the Rule of 2 is violated (the largest sample standard deviation, 7.87, is more than twice the smallest, 3.52).
Visit also the Web site Statistical tests.
SPSS program for ANOVA: More Than Two Independent
Means:
$SPSS/OUTPUT=A.OUT
TITLE
'ANALYSIS OF VARIANCE - 1st ITERATION'
DATA LIST
FREE FILE='A.IN'/GP Y
ONEWAY Y BY GP(1,5)/RANGES=DUNCAN
STATISTICS 1
MANOVA Y BY GP(1,5)/PRINT=HOMOGENEITY(BARTLETT)/
NPAR TESTS K-W Y BY GP(1,5)/
FINISH
Chi-Square Test: Dependency
$SPSS/OUTPUT=A.OUT
TITLE
'PROBLEM 4.2 CHI SQUARE; TABLE 4.18'
DATA LIST
FREE FILE='A.IN'/FREQ SAMPLE NOM
WEIGHT BY FREQ
VARIABLE LABELS
SAMPLE
'SAMPLE 1 TO 4'
NOM
'LESS OR MORE THAN 8'
VALUE LABELS
SAMPLE 1 'SAMPLE1' 2 'SAMPLE2' 3 'SAMPLE3' 4 'SAMPLE4'/
NOM
1 'LESS THAN 8' 2 'GT/EQ TO 8'/
CROSSTABS TABLES=NOM BY SAMPLE/
STATISTIC 1
FINISH
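For readers without SPSS, a chi-square test of independence can be sketched in Python/SciPy as follows; the 2-by-4 table of counts is hypothetical, not the data read by the SPSS job.

import numpy as np
from scipy import stats

table = np.array([[12, 15, 9, 14],     # LESS THAN 8, samples 1 to 4 (hypothetical counts)
                  [18, 10, 21, 16]])   # GT/EQ TO 8, samples 1 to 4 (hypothetical counts)
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)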
Non-parametric ANOVA:
$SPSS/OUTPUT=A.OUT
DATA LIST
FREE FILE='A.IN'/GP Y
NPAR TESTS K-W Y BY GP(1,4)
FINISH
Power of a Test
Power of a test is the probability of correctly rejecting a false null
hypothesis. This probability is inversely related to the probability
of making a Type II error. Recall that we choose the probability of
making a Type I error when we set . If we decrease the
probability of making a Type I error we increase the probability of
making a Type II error. Therefore, there are basically two possible errors when conducting a statistical analysis, Types I and II:
• Type I error - the risk (i.e., probability) of rejecting the null hypothesis when it is in fact true
• Type II error - the risk of not rejecting the null hypothesis when it is in fact false
Power and Alpha ()
Thus, the probability of correctly retaining a true null has the
same relationship to Type I errors as the probability of correctly
rejecting an untrue null does to Type II error. Yet, as I mentioned
if we decrease the odds of making one type of error we increase
the odds of making the other type of error. What is the
relationship between Type I and Type II errors? For a fixed
sample size, decreasing one type of error increases the size of
the other one.
Power and the True Difference Between Population Means
Anytime we test whether a sample differs from a population, or whether two samples come from two separate populations, there is the condition that each of the populations we are comparing has its own mean and standard deviation (even if we do not know them). The distance between the two population means will affect the power of our test.
Power as a Function of Sample Size and Variance σ²: Anything that affects the extent to which the two distributions share common values will increase β (the likelihood of making a Type II error).
Four factors influence power:
• effect size (for example, the difference between the means)
• standard error σ
• significance level α
• number of observations, or the sample size n
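To see how these four factors enter a power calculation, here is a sketch for an assumed one-sided z-test setting (not a formula from the text): power is the probability of rejecting H0 when the stated effect is real.

from scipy import stats

def power_one_sided_z(effect, sigma, alpha, n):
    z_crit = stats.norm.ppf(1 - alpha)                        # critical value at level alpha
    return stats.norm.sf(z_crit - effect * n**0.5 / sigma)    # P(reject | true effect size)

print(power_one_sided_z(effect=0.5, sigma=1.0, alpha=0.05, n=30))   # hypothetical inputs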
A Numerical Example: The following Figure provides an
illustrative numerical example:
Not rejecting the null hypothesis when it is false is defined as a Type II error, and its probability is denoted by β. In the above Figure this region lies to the left of the critical value. In the configuration shown in the Figure, β falls to the left of the critical value (and below the statistic's density under the alternative hypothesis Ha). β is also defined as the probability of incorrectly not rejecting a false null hypothesis, also called a miss. Related to the value of β is the power of a test. The power is defined as the probability of rejecting the null hypothesis given that a specific alternative is true, and is computed as (1 - β).
A Short Discussion: Consider testing a simple null versus a simple alternative. In the Neyman-Pearson setup, an upper bound α is set for the probability of a Type I error, and then it is desirable to find tests with a low probability β of a Type II error given this constraint. The usual justification is that "we are more concerned about a Type I error, so we set an upper limit on the α we can tolerate." I have seen this sort of reasoning in elementary texts and also in some advanced ones. It does not seem to make much sense: when the sample size is large, for most standard tests, the ratio β/α tends to 0. If we care more about Type I error than Type II error, why should this concern dissipate with increasing sample size?
This is indeed a drawback of the classical theory of testing
statistical hypotheses. A second drawback is that the choice lies
between only two test decisions: reject the null or accept the null.
It is worth considering approaches that overcome these
deficiencies. This can be done, for example, by the concept of profile tests at a 'level' a. Neither the Type I nor the Type II error rate is considered separately; rather, the decision is based on their ratio. For example, we accept the alternative hypothesis Ha and reject the null H0 if an event is observed which is at least a-times more probable under Ha than under H0. Conversely, we accept H0 and reject Ha if an event is observed which is at least a-times more probable under H0 than under Ha. This is a symmetric concept which is formulated within the classical approach. Furthermore, more than two decisions can also be formulated.
Visit also, the Web site Sample Size Calculations
Parametric vs. Non-Parametric vs. Distribution-free Tests
One must use a statistical technique called nonparametric if it
satisfies at least one of the following five types of criteria:
1. The data entering the analysis are enumerative - that is,
count data representing the number of observations in each
category or cross-category.
2. The data are measured and/or analyzed using a nominal
scale of measurement.
3. The data are measured and/or analyzed using an ordinal
scale of measurement.
4. The inference does not concern a parameter in the
population distribution - as, for example, the hypothesis that
a time-ordered set of observations exhibits a random
pattern.
5. The probability distribution of the statistic upon which the
analysis is based is not dependent upon specific information
or assumptions about the population(s) from which the
sample(s) are drawn, but only on general assumptions, such
as a continuous and/or symmetric population distribution.
By this definition, the distinction of nonparametric is accorded
either because of the level of measurement used or required for
the analysis, as in types 1 through 3; the type of inference, as in
type 4, or the generality of the assumptions made about the
population distribution, as in type 5.
For example, one may use the Mann-Whitney rank test as a nonparametric alternative to Student's t-test when one does not have normally distributed data.
Mann-Whitney: To be used with two independent groups
(analogous to the independent groups t-test)
Wilcoxon: To be used with two related (i.e., matched or repeated)
groups (analogous to the related samples t-test)
Kruskal-Wallis: To be used with two or more independent groups (analogous to the single-factor between-subjects ANOVA)
Friedman: To be used with two or more related groups
(analogous to the single-factor within-subjects ANOVA)
Non-parametric vs. Distribution-free Tests:
Non-parametric tests are those used when some specific
conditions for the ordinary tests are violated.
Distribution-free tests are those for which the procedure is valid for all the different shapes of the population distribution.
For example, the chi-square test concerning the variance of a
given population is parametric since this test requires that the
population distribution be normal. The chi-square test of
independence does not assume normality, or even that the data
are numerical. The Kolmogorov-Smirnov goodness-of-fit test is a distribution-free test which can be applied to test any distribution.
Pearson's and Spearman's Correlations
There are measures that describe the degree to which two variables are linearly related. For the majority of these measures, the correlation is expressed as a coefficient that ranges from 1.00, indicating a perfect linear relationship such that knowing the value of one variable allows perfect prediction of the value of the related variable, down to 0.00, indicating no predictability by a linear model. Negative values indicate that when the value of one variable is high, the other is low (and vice versa), and positive values indicate that when the value of one variable is high, so is the other (and vice versa). Correlation has an interpretation similar to that of the derivative you learned in your calculus course (a deterministic setting).
Pearson's product-moment correlation is an index of the linear relationship between two variables.
Formulas:
x̄ = Σxi / n, the mean of the x values.
ȳ = Σyi / n, the mean of the y values.
Sxx = Σ(xi - x̄)² = Σxi² - (Σxi)²/n
Syy = Σ(yi - ȳ)² = Σyi² - (Σyi)²/n
Sxy = Σ(xi - x̄)(yi - ȳ) = Σxi·yi - (Σxi)(Σyi)/n
The Pearson correlation is r = Sxy / (Sxx·Syy)^0.5
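As a quick illustration (hypothetical data, not from the text), the following Python sketch computes r directly from these formulas and checks it against SciPy:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical x values
y = np.array([2.1, 2.9, 3.6, 4.8, 5.1])      # hypothetical y values
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
r = sxy / (sxx * syy) ** 0.5
print(r, stats.pearsonr(x, y)[0])            # the two values agree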
If there is a positive relationship, an individual who has a score on variable x that is above the mean of variable x is likely to have a score on variable y that is above the mean of variable y, and vice versa. A negative relationship would pair an x score above the mean of x with a y score below the mean of y. The correlation coefficient is a measure of the relationship between variables and an index of the proportion of individual differences in one variable that can be associated with the individual differences in another variable. In essence, the product-moment correlation coefficient is the mean of the cross-products of standardized scores. If you have three values of r of .40, .60, and .80, you cannot say that the difference between r = .40 and r = .60 is the same as the difference between r = .60 and r = .80, or that r = .80 is twice as large as r = .40, because the scale of values for the correlation coefficient is not interval or ratio, but ordinal. Therefore, all you can say is that, for example, a correlation coefficient of +.80 indicates a high positive linear relationship and a correlation coefficient of +.40 indicates a somewhat lower positive linear relationship. The correlation can tell us how much of the total variance of one variable can be associated with the variance of another variable: the square of the correlation coefficient equals the proportion of the total variance in Y that can be associated with the variance in X.
However, in engineering/manufacturing/development, an r of 0.7 is often considered weak, and +0.9 is desirable. When the correlation coefficient is around +0.9, it is time to make a prediction and run confirmation trial(s). Note that the correlation coefficient usually measures only linear association. If the data form a symmetric quadratic hump, a linear correlation of x and y will produce an r of 0! So one must be careful and look at the data.
The Spearman rank-order correlation coefficient is used as a nonparametric version of Pearson's. It is expressed as:
rs = 1 - (6Σd²) / [n(n² - 1)],
where d is the difference in ranks between each X and Y pair.
The Spearman correlation coefficient can be algebraically derived from the Pearson correlation formula by making use of sums of series. The Pearson formula contains the expressions Σx(i), Σy(i), Σx(i)², and Σy(i)².
In the Spearman case, the x(i)'s and y(i)'s are ranks, and so the sums of the ranks, and the sums of the squared ranks, are entirely determined by the number of cases (when there are no ties):
Σi = N(N + 1)/2,   Σi² = N(N + 1)(2N + 1)/6.
The Spearman formula is then equal to:
[12P - 3N(N + 1)²] / [N(N² - 1)],
where P is the sum of the products of each pair of ranks, Σx(i)y(i). This reduces to:
rs = 1 - (6Σd²) / [n(n² - 1)],
where d is the difference in ranks between each x(i) and y(i) pair.
An important consequence of this is that if you enter ranks into a
Pearson formula, you get precisely the same numerical value as
that obtained by entering the ranks into the Spearman formula.
This comes as a bit of a shock to those who like to adopt
simplistic slogans such as "Pearson is for interval data, Spearman
is for ranked data". Spearman doesn't work too well if there are
lots of tied ranks. That's because the formula for calculating the
sums of squared ranks no longer holds true. If one has lots of
tied ranks, use the Pearson formula.
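The point about ranks can be verified directly; the following Python/SciPy sketch (hypothetical data with no ties) shows that feeding ranks into the Pearson formula reproduces the Spearman coefficient.

from scipy import stats

x = [3.1, 1.2, 5.4, 2.2, 4.8]     # hypothetical data, no ties
y = [10.0, 4.0, 12.0, 6.0, 9.0]
rho_spearman = stats.spearmanr(x, y)[0]
r_on_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))[0]
print(rho_spearman, r_on_ranks)   # identical up to floating-point error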
Visit also the Web sites: Correlation Pearsons r, Spearman's Rank
Correlation
Independence vs. Correlated
In the sense that it is used in statistics, i.e., as an assumption in applying a statistical test, a random sample from the entire population provides a set of random variables X1, ..., Xn that are identically distributed and mutually independent (mutual independence is stronger than pairwise independence). The
random variables are mutually independent if their joint
distribution is equal to the product of their marginal distributions.
In the case of joint normality, independence is equivalent to zero
correlation but not in general. Independence will imply zero
correlation (if the random variables have second moments) but
not conversely. Note that not all random variables have a first moment, let alone a second moment, and hence there may not be a correlation coefficient.
However if the correlation coefficient of two random variables
(theoretical) is not zero then the random variables are not
independent.
Correlation and Level of Significance
It is intuitive that with very few data points, a high correlation may not be statistically significant. You may see statements such as "the correlation between x and y is significant at the α = .005 level" and "the correlation is significant at the α = .05 level." The question is how these numbers are determined.
For simple correlation, you can look at the test as a test on r². For a simple correlation, the formula for F, where F is the square of the t-statistic, becomes
F = (n - 2) r² / (1 - r²), n ≥ 2.
As you may see, this is monotonic in r² and in n. If the degrees of freedom (n - 2) is large, then the F-test is very closely approximated by the chi-square, so that a value of 3.84 is what is needed to reach the α = 5% level. The cutoff value of F changes little enough that the same value, 3.84, gives a pretty good estimate even when n is small. You can look up an F-table or chi-square table to see the cutoff values needed for other α levels.
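For example, the significance of an observed correlation can be checked with this F statistic; the r and n below are hypothetical values.

from scipy import stats

r, n = 0.45, 30                          # hypothetical correlation and sample size
F = (n - 2) * r**2 / (1 - r**2)          # F = (n - 2) r^2 / (1 - r^2)
p_value = stats.f.sf(F, 1, n - 2)        # upper tail of F with (1, n - 2) df
print(F, p_value)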
Resampling Techniques: Jackknifing and Bootstrapping
These are statistical inference techniques that do not require distributional assumptions about the statistics involved. These modern nonparametric methods use large amounts of computation to explore the empirical variability of a statistic, rather than making a priori assumptions about this variability, as is done in the traditional parametric t- and z-tests. Monte Carlo simulation allows the evaluation of the behavior of a statistic when its mathematical analysis is intractable. Bootstrapping and jackknifing allow inferences to be made from a sample when traditional parametric inference fails. These techniques are especially useful for dealing with statistical problems such as small sample sizes, statistics with no well-developed distributional theory, and violations of parametric inference conditions. Both are computer intensive. Bootstrapping involves taking repeated samples, with replacement, from the original sample.
Jackknifing involves systematically doing n steps, of omitting 1
case from a sample at a time, or, more generally, n/k steps of
omitting k cases; computations that compare "included" vs.
"omitted" can be used (especially) to reduce the bias of
estimation.
Bootstrapping means you take repeated samples from a sample and then make statements about the population. Bootstrapping entails sampling with replacement from a sample. Both techniques have applications in reducing bias in estimation.
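A minimal bootstrap sketch in Python (hypothetical data, not from the text): resample the sample with replacement many times and examine the empirical variability of the mean.

import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])   # hypothetical sample
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean() for _ in range(5000)]
print(np.std(boot_means, ddof=1))               # bootstrap estimate of the SE of the mean
print(np.percentile(boot_means, [2.5, 97.5]))   # simple percentile confidence interval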
Resampling -- including the bootstrap, permutation, and other non-parametric tests -- is a method for hypothesis tests, confidence limits, and other applied problems in statistics and probability. It involves no formulas or tables, and the resampling procedure is essentially the same for all tests.
Following the first publication of the general technique (and the bootstrap) in 1969 by Julian Simon and its subsequent independent development by Bradley Efron, resampling has become an alternative approach for testing hypotheses.
There are other findings, "The bootstrap started out as a good
notion in that it presented, in theory, an elegant statistical
procedure that was free of distributional conditions.
Unfortunately, it doesn't work very well, and the attempts to
modify it make it more complicated and more confusing than the
parametric procedures that it was meant to replace."
For the pros and cons of the bootstrap, read:
Young G., Bootstrap: More than a Stab in the Dark?, Statistical Science, 9, 382-395, 1994.
Visit also the Web sites Resampling and Bootstrapping with SAS.
Sampling Methods
From the food you eat to the TV you watch, from political
elections to school board actions, much of your life is regulated
by the results of sample surveys. In the information age of today
and tomorrow, it is increasingly important that sample survey
design and analysis be understood by many so as to produce
good data for decision making and to recognize questionable data
when it arises. Relevant topics are: Simple Random Sampling,
Stratified Random Sampling, Cluster Sampling, Systematic
Sampling, Ratio and Regression Estimation, Estimating a
Population Size, Sampling a Continuum of Time, Area or Volume,
Questionnaire Design, Errors in Surveys.
A sample is a group of units selected from a larger group (the
population). By studying the sample it is hoped to draw valid
conclusions about the larger group.
A sample is generally selected for study because the population is
too large to study in its entirety. The sample should be
representative of the general population. This is often best
achieved by random sampling. Also, before collecting the sample,
it is important that the researcher carefully and completely
defines the population, including a description of the members to
be included.
Random sampling of size n from a population of size N: an unbiased estimate of the variance of x̄ is Var(x̄) = S²(1 - n/N)/n, where n/N is the sampling fraction. For a sampling fraction of less than 10%, the finite population correction factor (N - n)/(N - 1) is almost 1. The total T is estimated by N·x̄; its variance is N²Var(x̄).
For 0-1 (binary) variables, the variance of p̄ is estimated by S² = p̄(1 - p̄)(1 - n/N)/(n - 1).
For the ratio r = Σxi / Σyi = x̄/ȳ, the variance of r is
[(N - n)(r²S²x + S²y - 2r Cov(x, y))] / [n(N - 1)].
Stratified Sampling: The stratified sample mean is x̄s = Σ Wt·x̄t over t = 1, 2, ..., L (strata), where x̄t = Σ Xit/nt.
Its variance is: Σ W²t (Nt - nt) S²t / [nt(Nt - 1)].
The population total T is estimated by N·x̄s; its variance is Σ N²t (Nt - nt) S²t / [nt(Nt - 1)].
Since a survey usually measures several attributes for each population member, it is impossible to find an allocation that is simultaneously optimal for each of those variables. Therefore, in such a case we use the popular method of proportional allocation, which uses the same sampling fraction in each stratum. This yields the optimal allocation when the variations within the strata are all the same.
Determination of the sample size n for binary data: n is the smallest integer greater than or equal to
[t² N p(1 - p)] / [t² p(1 - p) + ε²(N - 1)],
with N being the total number of cases, n the sample size, ε the acceptable error, t the value taken from the t distribution corresponding to a certain confidence level, and p the probability of an event.
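A small Python sketch of this sample-size formula, with hypothetical inputs (ε written as eps):

import math

def sample_size(N, p, t, eps):
    # smallest integer >= [t^2 N p(1-p)] / [t^2 p(1-p) + eps^2 (N-1)]
    return math.ceil((t**2 * N * p * (1 - p)) / (t**2 * p * (1 - p) + eps**2 * (N - 1)))

print(sample_size(N=5000, p=0.5, t=1.96, eps=0.05))   # hypothetical population and error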
Cross-Sectional Sampling: A cross-sectional study is the observation of a defined population at a single point in time or over a single time interval. Exposure and outcome are determined simultaneously.
For more information on sampling methods, visit the Web sites :
Sampling
Sampling In Research
Sampling, Questionnaire Distribution and Interviewing
SRMSNET: An Electronic Bulletin Board for Survey Researchers
Sampling and Surveying Handbook
Warranties: Statistical Planning and Analysis
In today's marketplace, a warranty has become an increasingly important component of a product package, and most consumer and industrial products are sold with a warranty. The warranty serves many purposes. It provides protection for both buyer and manufacturer. For a manufacturer, a warranty also serves to communicate information about product quality and, as such, may be used as a very effective marketing tool.
Warranty decisions involve both technical and commercial
considerations. Because of the possible financial consequences of
these decisions, effective warranty management is critical for the
financial success of a manufacturing firm. This requires that
management at all levels be aware of the concept, role, uses and
cost and design implications of warranty.
The aim is to understand: the concept of warranty and its uses; warranty policy alternatives; the consumer/manufacturer perspectives with regard to warranties; the commercial/technical aspects of warranty and their interaction; strategic warranty management; methods for warranty cost prediction; and warranty administration.
References and Further Readings:
Brennan J., Warranties: Planning, Analysis, and Implementation,
McGraw Hill, New York, 1994.
Factor Analysis
Factor analysis is a technique for data reduction, that is, for explaining the variation in a collection of continuous variables by a smaller number of underlying dimensions (called factors). Common factor analysis can also be used to form index numbers or factor scores by using the correlation or covariance matrix. The main problem with the factor analysis concept is that it is very subjective in the interpretation of the results.
Delphi Analysis
Delphi Analysis is used in decision making process, in particular in
forecasting. Several "experts" sit together and try to compromise
on something they cannot agree on.
Reference:
Delbecq, A., Group Techniques for Program Planning, Scott
Foresman, 1975.
Binomial Distribution
Application: Gives probability of exact number of successes in n
independent trials, when probability of success p on single trial is
a constant. Used frequently in quality control, reliability, survey
sampling, and other industrial problems.
Example: What is the probability of 7 or more "heads" in 10
tosses of a fair coin?
Know that the binomial distribution must satisfy the following five requirements: each trial can have only two outcomes, or outcomes that can be reduced to two categories (called pass and fail); there must be a fixed number of trials; the outcome of each trial must be independent; the probability of success must remain constant; and the outcome of interest is the number of successes.
Comments: Can sometimes be approximated by normal or by
Poisson distribution.
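The example above can be answered directly from the binomial distribution; a short Python/SciPy sketch:

from scipy import stats

p_seven_or_more = stats.binom.sf(6, n=10, p=0.5)   # P(X >= 7) = 1 - P(X <= 6)
print(p_seven_or_more)                             # about 0.172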
Poisson Distribution
Application: Gives the probability of exactly x occurrences during a given period of time if events take place independently and at a constant rate. May also represent the number of occurrences over constant areas or volumes. Used frequently in quality control, reliability, queuing theory, and so on.
Example: Used to represent distribution of number of defects in a
piece of material, customer arrivals, insurance claims, incoming
telephone calls, alpha particles emitted, and so on.
Comments: Frequently used as approximation to binomial
distribution.
Exponential Distribution
Application: Gives distribution of time between independent
events occurring at a constant rate. Equivalently, probability
distribution of life, presuming constant conditional failure (or
hazard) rate. Consequently, applicable in many, but not all
reliability situations.
Example: Distribution of time between arrival of particles at a
counter. Also life distribution of complex non redundant systems,
and usage life of some components - in particular, when these
are exposed to initial burn-in, and preventive maintenance
eliminates parts before wear-out.
Comments: Special case of both Weibull and gamma
distributions.
Uniform Distribution
Application: Gives probability that observation will occur within a
particular interval when probability of occurrence within that
interval is directly proportional to interval length.
Example: Used to generate random values.
Comments: Special case of beta distribution.
The density of the geometric mean of n independent Uniform(0,1) random variables is:
f(x) = n x^(n-1) (Log[1/x^n])^(n-1) / (n - 1)!.
zλ = [U^λ - (1 - U)^λ] / λ is said to have Tukey's symmetric lambda distribution.
Student's t-Distributions
The t distributions were discovered in 1908 by William Gosset, a chemist and statistician employed by the Guinness brewing company. He considered himself a student still learning statistics, so he signed his papers with the pseudonym "Student". Or perhaps he used a pseudonym because of "trade secrets" restrictions imposed by Guinness.
Note that there is not a single t distribution; it is a class of distributions. When we speak of a specific t distribution, we have
to specify the degrees of freedom. The t density curves are
symmetric and bell-shaped like the normal distribution and have
their peak at 0. However, the spread is more than that of the
standard normal distribution. The larger the degrees of freedom,
the closer the t-density is to the normal density.
Critical Values for the t-Distribution
Annotated Review of Statistical Tools on the Internet
Visit also the Web site Computational Tools and Demos on the
Internet
Introduction: Modern, web-based learning and computing
provides the means for fundamentally changing the way in which
statistical instruction is delivered to students. Multimedia learning
resources combined with CD-ROMs and workbooks attempt to
explore the essential concepts of a course by using the full
pedagogical power of multimedia. Many Web sites have nice
features such as interactive examples, animation, video,
narrative, and written text. These web sites are designed to
provide students with a "self-help" learning resource to
complement the traditional textbook.
In a few pilot studies, [Mann, B. (1997) Evaluation of
Presentation modalities in a hypermedia system, Computers &
Education, 28, 133-143. Ward M. and D. Newlands (1998) Use of
the Web in undergraduate teaching, Computers & Education, 31,
171-184.] compared the relative effectiveness of three versions
of hypermedia systems, namely, Text, Sound/Text, and Sound.
The results indicate that those working with Sound could focus
their attention on the critical information. Those working with the
Text and Sound/Text version however, did not learn as much and
stated their displeasure with reading so much text from the
screen. Based on this study, it is clear at least at this time that
such web-based innovations cannot serve as an adequate
substitute for face-to-face live instruction [See also Mcintyre D.,
and F. Wolff, An experiment with WWW interactive learning in
university education, Computers & Education, 31, 255-264,
1998].
Online learning education does for knowledge what just-in-time
delivery does for manufacturing: It delivers the right tools and
parts when you need them.
The Java applets are probably the most phenomenal way of
simplifying various concepts by way of interactive processes.
These applets help bring into life every concept from central limit
theorem to interactive random games and multimedia
applications.
The Flashlight Project develops survey items, interview plans,
cost analysis methods, and other procedures that institutions can
use to monitor the success of educational strategies that use
technology.
Read also: Critical notice: are we blessed with the emergence of the WWW?, edited by B. Khan and R. Goodfellow, Computers and Education, 30(1-2), 131-136, 1998.
The following compilation summarizes currently available public
domain web sites offering statistical instructional material. While
some sites may have been missed, I feel that this listing is fully
representative. I would welcome information regarding any
further sites for inclusion, E-mail.
Academic Assistance Access It is a free tutoring service designed
to offer assistance to your statistics questions.
Basic Definitions, by V. Easton and J. McColl, Contains glossary of
basic terms and concepts.
Basic principles of statistical analysis, by Bob Baker, Basics
concepts of statistical models, Mixed model, Choosing between
fixed and random effects, Estimating variances and covariance,
Estimating fixed effects, Predicting random effects, Inference
space, Conclusions, Some references.
Briefbook of Data Analysis, has many contributors. The most
comprehensive dictionary of statistics. Includes ANOVA, Analysis
of Variance, Attenuation, Average, Bayes Theorem, Bayesian
Statistics, Beta Distribution, Bias, Binomial Distribution, Bivariate
Normal Distribution, Bootstrap, Cauchy Distribution, Central Limit
Theorem, Bootstrap, Chi-square Distribution, Composite
Hypothesis, Confidence Level, Correlation Coefficient, Covariance,
Cramer-Rao Inequality, Cramer-Smirnov-Von Mises Test, Degrees
of Freedom, Discriminant Analysis, Estimator, Exponential
Distribution, F-Distribution, F-test, Factor Analysis, Fitting,
Geometric Mean, Goodness-of-fit Test, Histogram, Importance
Sampling, Jackknife, Kolmogorov Test, Kurtosis, Least Squares,
Likelihood, Linear Regression, Maximum Likelihood Method,
Mean, Median, Mode, Moment, Monte Carlo Methods, Multinomial
Distribution, Multivariate Normal, Distribution Normal
Distribution, Outlier, Poisson Distribution, Principal Component
Analysis, Probability, Probability Calculus, Random Numbers,
Random Variable, Regression Analysis, Residuals, Runs Test,
Sample Mean, Sample Variance, Sampling from a Probability
Density Function, Scatter Diagram, Significance of Test,
Skewness, Standard Deviation, Stratified Sampling, Student's t
Distribution, Student's test, Training Sample, Transformation of
Random Variables, Trimming, Truly Random Numbers, Uniform
Distribution, Validation Sample Variance, Weighted Mean, etc.,
References, and Index.
Calculus Applied to Probability and Statistics for Liberal Arts and
Business Majors, by Stefan Waner and Steven Costenoble,
contains: Continuous Random Variables and Histograms;
Probability Density Functions; Mean, Median, Variance and
Standard Deviation.
Computing Studio, by John Behrens, Each page is a data entry
form that will allow you to type data in and will write a page that
walks you through the steps of computing your statistic: Mean,
Median, Quartiles, Variance of a population, Sample variance for
estimating a population variance, Standard-deviation of a
population, Sample standard-deviation used to estimate a
population standard-deviation, Covariance for a sample, Pearson
Product-Moment Correlation Coefficient (r), Slope of a regression
line, Sums-of-squares for simple regression.
CTI Statistics, by Stuart Young, CTI Statistics is a statistical
resource center. Here you will find software reviews and articles,
a searchable guide to software for teaching, a diary of
forthcoming statistical events worldwide, a CBL software
developers' forum, mailing list information, contact addresses,
and links to a wealth of statistical resources worldwide.
Data and Story Library, It is an online library of datafiles and
stories that illustrate the use of basic statistics methods.
DAU Stat Refresher, has many contributors. Tutorial, Tests,
Probability, Random Variables, Expectations, Distributions, Data
Analysis, Linear Regression, Multiple Regression, Moving
Averages, Exponential Smoothing, Clustering Algorithms, etc.
Descriptive Statistics Computation, Enter a column of your data
so that the mean, standard deviation, etc. will be calculated.
Elementary Statistics Interactive, by Wlodzimierz Bryc,
Interactive exercises, including links to further reading materials,
includes on-line tests.
Elementary Statistics, by J. McDowell. Frequency distributions,
Statistical moments, Standard scores and the standard normal
distribution, Correlation and regression, Probability, Sampling
Theory, Inference: One Sample, Two Samples.
Evaluation of Intelligent Systems, by Paul Cohen (Editor-in-Chief), covers: Exploratory data analysis, Hypothesis testing,
Modeling, and Statistical terminology. It also serves as
community-building function.
First Bayes, by Tony O'Hagan, First Bayes is a teaching package
for elementary Bayesian Statistics.
Fisher's Exact Test, by Øyvind Langsrud, To categorical variables
with two levels.
Gallery of Statistics Jokes, by Gary Ramseyer, Collection of
Statistical Joks.
Glossary of Statistical Terms, by D. Hoffman, Glossary of major
keywords and phrases in suggested learning order is provided.
Graphing Studio, Data entry forms to produce plots for two-dimensional and three-dimensional scatterplots.
HyperStat Online, by David Lane. It is an introductory-level
statistics book.
Interactive Statistics, Contains some nice Java applets: guessing
correlations, scatterplots, Data Applet, etc.
Interactive Statistics Page, by John Pezzullo, Web pages that
perform mostly needed statistical calculations. A complete
collection on: Calculators, Tables, Descriptives, Comparisons,
Cross-Tabs, Regression, Other Tests, Power&Size, Specialized,
Textbooks, Other Stats Pages.
Internet Glossary of Statistical Terms, by By H. Hoffman, The
contents are arranged in suggested learning order and
alphabetical order, from Alpha to Z score.
Internet Project, by Neil Weiss, Helps students understand
statistics by analyzing real data and interacting with graphical
demonstrations of statistical concepts.
Introduction to Descriptive Statistics, by Jay Hill, Provides
everyday's applications of Mode, Median, Mean, Central
Tendency, Variation, Range, Variance, and Standard Deviation.
Introduction to Quantitative Methods, by Gene Glass. A basic
statistics course in the College of Education at Arizona State
University.
Introductory Statistics Demonstrations, Topics such as Variance
and Standard Deviation, Z-Scores, Z-Scores and Probability,
Sampling Distributions, Standard Error, Standard Error and Z-score Hypothesis Testing, Confidence Intervals, and Power.
Introductory Statistics: Concepts, Models, and Applications, by
David Stockburger. It represents over twenty years of experience
in teaching the material contained therein by the author. The high
price of textbooks and a desire to customize course material for
his own needs caused him to write this material. It contains
projects, interactive exercises, animated examples of the use of
statistical packages, and inclusion of statistical packages.
The Introductory Statistics Course: A New Approach, by D.
Macnaughton. Students frequently view statistics as the worst
course taken in college. To address that problem, this paper
proposes five concepts for discussion at the beginning of an
introductory course: (1) entities, (2) properties of entities, (3) a
goal of science: to predict and control the values of properties of
entities, (4) relationships between properties of entities as a key
to prediction and control, and (5) statistical techniques for
studying relationships between properties of entities as a means
to prediction and control. It is argued that the proposed approach
gives students a lasting appreciation of the vital role of the field
of statistics in scientific research. Successful testing of the
approach in three courses is summarized.
Java Applets, by many contributors. Distributions (Histograms,
Normal Approximation to Binomial, Normal Density, The T
distribution, Area Under Normal Curves, Z Scores & the Normal
Distribution. Probability & Stochastic Processes (Binomial
Probabilities, Brownian Motion, Central Limit Theorem, A Gamma
Process, Let's Make a Deal Game. Statistics (Guide to basic stats
labs, ANOVA, Confidence Intervals, Regression, Spearman's rank
correlation, T-test, Simple Least-Squares Regression, and
Discriminant Analysis.
The Knowledge Base, by Bill Trochim, The Knowledge Base is an
online textbook for an introductory course in research methods.
Lies, Damn Lies, and Psychology, by David Howell, This is the
homepage for a course modeled after the Chance course.
Math Titles: Full List of Math Lesson Titles, by University of
Illinois, Lessons on Statistics and Probability topics among others.
Nonparametric Statistical Methods, by Anthony Rossini, almost all
widely used nonparametric tests are presented.
On-Line Statistics, by Ronny Richardson, contains the contents of
his lecture notes on: Descriptive Statistics, Probability, Random
Variables, The Normal Distribution, Create Your Own Normal
Table, Sampling and Sampling Distributions, Confidence
Intervals, Hypothesis Testing, Linear Regression Correlation Using
Excel.
Online Statistical Textbooks, by Haiko Lüpsen.
Power Analysis for ANOVA Designs, by Michael Friendly, It runs a
SAS program that calculates power or sample size needed to
attain a given power for one effect in a factorial ANOVA design.
The program is based on specifying Effect Size in terms of the
range of treatment means, and calculating the minimum power,
or maximum required sample size.
Practice Questions for Business Statistics, by Brian Schott, Over
800 statistics quiz questions for introduction to business
statistics.
Prentice Hall Statistics, This site contains full description of the
materials covers in the following books coauthored by Prof.
McClave: A First Course In Statistics, Statistics, Statistics For
Business And Economics, A First Course In Business Statistics.
Probability Lessons, Interactive probability lessons for problem solving and activities.
Probability Theory: The logic of Science, by E. Jaynes. Plausible
Reasoning, The Cox Theorems, Elementary Sampling Theory,
Elementary Hypothesis Testing, Queer Uses for Probability
Theory, Elementary Parameter Estimation, The Central Gaussian,
or Normal, Distribution, Sufficiency, Ancillarity, and All That,
Repetitive Experiments: Probability and Frequency, Physics of
``Random Experiments'', The Entropy Principle, Ignorance Priors
-- Transformation Groups, Decision Theory: Historical Survey,
Simple Applications of Decision Theory, Paradoxes of Probability
Theory, Orthodox Statistics: Historical Background, Principles and
Pathology of Orthodox Statistics, The Ap Distribution and Rule of Succession, Physical Measurements, Regression and Linear
Models, Estimation with Cauchy and t--Distributions, Time Series
Analysis and Auto regressive Models, Spectrum / Shape Analysis,
Model Comparison and Robustness, Image Reconstruction,
Nationalization Theory, Communication Theory, Optimal Antenna
and Filter Design, Statistical Mechanics, Conclusions Other
Approaches to Probability Theory, Formalities and Mathematical
Style, Convolutions and Cumulants, Circlet Integrals and Generating Functions, The Binomial -- Gaussian Hierarchy of Distributions, Fourier Analysis, Infinite Series, Matrix Analysis and
Computation, Computer Programs.
Probability and Statistics, by Beth Chance. Covers the introductory materials supporting the textbook by Moore and McCabe, Introduction to the Practice of Statistics, W. H. Freeman, 1999.
Rice Virtual Lab in Statistics, by David Lane et al., An introductory
statistics course which uses Java script Monte Carlo.
Sampling distribution demo, by David Lane, Applet estimates and
plots the sampling distribution of various statistics given
population distribution, sample size, and statistic.
Selecting Statistics, Cornell University. Answer the questions
therein correctly, then Selecting Statistics leads you to an
appropriate statistical test for your data.
Simple Regression, Enter pairs of data so that a line can be fit to
the data.
Scatterplot, by John Behrens, Provides a two-dimensional
scatterplot.
Selecting Statistics, by Bill Trochim, An expert system for
statistical procedures selection.
Some experimental pages for teaching statistics, by Juha
Puranen, contains some - different methods for visualizing
statistical phenomena, such as Power and Box-Cox
transformations.
Statlets: Download Academic Version (Free), Contains Java
Applets for Plots, Summarize, One and two-Sample Analysis,
Analysis of Variance, Regression Analysis, Time Series Analysis,
Rates and Proportions, and Quality Control.
Statistical Analysis Tools, Part of Computation Tools of Hyperstat.
Statistical Demos and Monte Carlo, Provides demos for Sampling
Distribution Simulation, Normal Approximation to the Binomial
Distribution, and A "Small" Effect Size Can Make a Large
Difference.
Statistical Education Resource Kit, by Laura Simon, This web page
contains a collection of resources used by faculty in Penn State's
Department of Statistics in teaching a broad range of statistics
courses.
Statistical Instruction Internet Palette, For teaching and learning
statistics, with extensive computational capability.
Statistical Terms, by The Animated Software Company,
Definitions for terms via a standard alphabetical listing.
Statiscope, by Mikael Bonnier, Interactive environment (Java
applet) for summarizing data and descriptive statistical charts.
Statistical Calculators, Presided at UCLA, Material here includes:
Power Calculator, Statistical Tables, Regression and GLM
Calculator, Two Sample Test Calculator, Correlation and
Regression Calculator, and CDF/PDF Calculators.
Statistical Home Page, by David C. Howell, This is a Home Page
containing statistical material covered in the author's textbooks
(Statistical Methods for Psychology and Fundamental Statistics for
the Behavioral Sciences), but it will be useful to others not using those books. It is always under construction.
Statistics Page, by Berrie, Movies to illustrate some statistical
concepts.
Statistical Procedures, by Phillip Ingram, Descriptions of various
statistical procedures applicable to the Earth Sciences: Data
Manipulation, One and Two Variable Measures, Time Series
Analysis, Analysis of Variance, Measures of Similarity, Multivariate Procedures, Multiple Regression, and Geostatistical
Analysis.
Statistical Tests, Contains Probability Distributions (Binomial,
Gaussian, Student-t, Chi-Square), One-Sample and Matched-Pairs tests, Two-Sample tests, Regression and correlation, and
Test for categorical data.
Statistical Tools, Pointers for demos on Binomial and Normal
distributions, Normal approximation, Sample distribution, Sample
mean, Confidence intervals, Correlation, Regression, Leverage
points and Chisquare.
Statistics, This server will perform some elementary statistical
tests on your data. Test included are Sign Test, McNemar's Test,
Wilcoxon Matched-Pairs Signed-Ranks Test, Student-t test for one
sample, Two-Sample tests, Median Test, Binomial proportions,
Wilcoxon Test, Student-t test for two samples, Multiple-Sample
tests, Friedman Test, Correlations, Rank Correlation coefficient,
Correlation coefficient, Comparing Correlation coefficients,
Categorical data (Chi-square tests), Chi-square test for known
distributions, Chi-square test for equality of distributions.
Statistics Homepage, by StatSoft Co., Complete coverage of
almost all topics
Statistics: The Study of Stability in Variation, Editor: Jan de
Leeuw. It has components which can be used on all levels of
statistics teaching. It is disguised as an introductory textbook,
perhaps, but many parts are completely unsuitable for
introductory teaching. Its contents are Introduction, Analysis of a
Single Variable, Analysis of a Pair of Variables, and Analysis of
Multi-variables.
Statistics Every Writer Should Know, by Robert Niles and Laurie
Niles. Treatment of elementary concepts.
Statistics Glossary, by V. Easton and J. McColl, Alphabetical index
of all major keywords and phrases
Statistics Network A Web-based resource for almost all statistical
kinds of information.
Statistics Online A good collection of links on: Statistics to Use,
Confidence Intervals, Hypothesis Testing, Probability
Distributions, One-Sample and Matched-Pairs Tests, Two-Sample
Tests, Correlations, Categorical Data, and Statistical Tables.
Statistics on the Web, by Clay Helberg, Just as the Web itself
seems to have unlimited resources, Statistics on the web must
have hundreds of sites listing such statistical areas as:
Professional Organizations, Institutes and Consulting Groups,
Educational Resources, Web courses, and others too numerous to
mention. One could literally shop all day finding the joys and
treasures of Statistics!
Statistics To Use, by T. Kirkman, Among others it contains
computations on: Mean, Standard Deviation, etc., Student's t-Tests, chi-square distribution test, contingency tables, Fisher
Exact Test, ANOVA, Ordinary Least Squares, Ordinary Least
Squares with Plot option, Beyond Ordinary Least Squares, and Fit
to data with errors in both coordinates.
Stat Refresher, This module is an interactive tutorial which gives
a comprehensive view of Probability and Statistics. This
interactive module covers basic probability, random variables,
moments, distributions, data analysis including regression,
moving averages, exponential smoothing, and clustering.
Tables, by William Knight, Tables for: Confidence Intervals for the
Median, Binomial Coefficients, Normal, T, Chi-Square, F, and
other distributions.
Two-Population T-test
SURFSTAT Australia, by Keith Dear. Summarizing and Presenting
Data, Producing Data, Variation and Probability, Statistical
Inference, Control Charts.
UCLA Statistics, by Jan de Leeuw, On-line introductory textbook
with datasets, Lispstat archive, datasets, and live on-line
calculators for most distributions and equations.
VassarStats, by Richard Lowry, On-line elementary statistical
computation.
Web Interface for Statistics Education, by Dale Berger, Sampling
Distribution of the Means, Central Limit Theorem, Introduction to
Hypothesis Testing, t-test tutorial. Collection of links for Online
Tutorials, Glossaries, Statistics Links, On-line Journals, Online
Discussions, Statistics Applets.
WebStat, by Webster West. Offers many interactive test
procedures, graphics, such as Summary Statistics, Z tests (one
and two sample) for population means, T tests (one and two
sample) for population means, a chi-square test for population
variance, a F test for comparing population variances, Regression,
Histograms, Stem and Leaf plots, Box plots, Dot plots, Parallel
Coordinate plots, Means plots, Scatter plots, QQ plots, and Time
Series Plots.
WWW Resources for Teaching Statistics, by Robin Lock.
Interesting and Useful Sites
Selected Reciprocal Web Sites
| ABCentral | Bulletin Board Libraries |Business Problem
Solving |Business Math |Casebook |Chance |CTI
Statistics |Cursos de estadística |Demos for Learning
Statistics |Electronic texts and statistical tables |Epidemiology
and Biostatistics |Financial and Economic Links | Hyperstat |Intro.
to Stat. |Java Applets |Lecturesonline |Lecture
summaries | Maths & Stats Links|
| Online Statistical Textbooks and Courses |Probability
Tutorial | Research Methods & Statistics Resources | Statistical
Demos and Calculations |Statistical Education Resource Kit|
|Statistical Resources |Statistical Resources on the
Web |Statistical tests |Statistical Training on the Web |Statistics
Education-I |Statistics Education-II |
| Statistics Network |Statistics on the Web |Statistics, Statistical
Computing, and Mathematics |Statoo |Stats
Links |st@tserv |StatSoft | StatsNet |
| StudyWeb | SurfStat |Using Excel |Virtual Library |WebEc | Web
Tutorial Links |Yahoo:Statistics|
More reciprocal sites may be found by clicking on the following
search engines:
GoTo| HotBot| InfoSeek| LookSmart| Lycos|
General References
| The MBA Page | What is OPRE? | Desk Reference| Another Desk
Reference | Spreadsheets | All Topics on the Web | Contacts to
Statisticians | Statistics Departments (by country)|
| ABCentral | Syllabits | World Lecture Hall | Others Selected
Links | Virtual Library
| Argus Clearinghouse | TILE.NET | CataList | Maths and
Computing Lists
Statistics References
| Careers in Statistics | Conferences | | Statistical List
Subscription | Statistics Mailing Lists | Edstat-L | Mailbase
Lists | Stat-L | Stats-Discuss | Stat Discussion
Group | StatsNet | List Servers|
| Math Forum Search|
| Statistics Journals | Books and Journal | Main Journals | Journal
Web Sites|
Statistical Societies & Organizations
American Statistical Association (ASA)
ASA D.C. Chapter
Applied Probability Trust
Bernoulli Society
Biomathematics and Statistics Scotland
Biometric Society
Center for Applied Probability at Columbia
Center for Applied Probability at Georgia Tech
Center for Statistical and Mathematical Computing
Classification Society of North America
CTI Statistics
Dublin Applied Probability Group
Institute of Mathematical Statistics
International Association for Statistical Computing
International Biometric Society
International Environmetric Society
International Society for Bayesian Analysis
International Statistical Institute
National Institute of Statistical Sciences
RAND Statistics Group
Royal Statistical Society
Social Statistics
Statistical Engineering Division
Statistical Society of Australia
Statistical Society of Canada
Statistics Resources
| Statistics Main Resources | Statistics and OPRE
Resources | Statistics Links | STATS | StatsNet | Resources | UK
Statistical Resources|
| Mathematics Internet Resources | Mathematical and
Quantitative Methods |Stat Index | StatServ | Study
WEB | Ordination Methods for Ecologists|
WWW Resources | StatLib: Statistics Library | Guide for
Statisticians|
| Stat Links | Use and Abuse of Statistics | Statistics Links|
| Statistical Links | Statistics Handouts | Statistics Related
Links | Statistics Resources |OnLine Text Books|
Probability Resources
|Probability Tutorial |Probability | Probability & Statistics |Theory
of Probability | Virtual Laboratories in Probability and Statistics
|Let's Make a Deal Game |Central Limit Theorem | The Probability
Web | Probability Abstracts
| Coin Flipping |Java Applets on Probability | Uncertainty in
AI |Normal Curve Area | Topics in Probability | PQRS Probability
Plot | The Birthday Problem|
Data and Data Analysis
|Histograms | Statistical Data Analysis | Exploring Data | Data
Mining |Books on Statistical Data Analysis|
| Evaluation of Intelligent Systems | AI and Statistics|
Statistical Software
| Statistical Software Providers | SPLUS | WebStat | QDStat | Statistical Calculators on
Web | MODSTAT | The AssiStat|
| Statistical Software | Mathematical and Statistical
Software | NCSS Statistical Software|
| Choosing a Statistical Analysis Package | Statistical Software
Review| Descriptive Statistics by Spreadsheet | Statistics with
Microsoft Excel|
Learning Statistics
| How to Study Statistics | Statistics Education | Web and
Statistical Education | Statistics & Decision
Sciences | Statistics | Statistical Education through Problem
Solving|
| Exam, tests samples | INFORMS Education and Students
Affairs | CHANCE Magazine | Chance Web Index|
| Statistics Education Bibliography | Teacher
Network | Computers in Teaching Statistics|
Glossary Collections
The following sites provide a wide range of keywords & phrases.
Visit them frequently to learn the language of statisticians.
|Data Analysis Briefbook | Glossary of Statistical Terms |Glossary
of Terms |Glossary of Statistics |Internet Glossary of Statistical
Terms |Lexicon|Selecting Statistics Glossary |Statistics
Glossary | SurfStat glossary|
Selected Topics
|ANOVA |Confidence Intervals |Regression
| Kolmogorov-Smirnov Test | Topics in Statistics-I | Topics in
Statistics-II | Statistical Topics | Resampling | Pattern
Recognition | Statistical Sites by Applications | Statistics and
Computing|
| Biostatistics | Biomathematics and Statistics | Introduction to
Biostatistics Bartlett Corrections|
| Statistical Planning | Regression Analysis | AI-Geostats | Total Quality | Analysis of Variance and Covariance|
| Significance Testing | Hypothesis Testing | Two-Tailed
Hypothesis Testing | Commentaries on Significance
Testing | Bayesian | Philosophy of Testing|
Questionnaire Design, Surveys Sampling and
Analysis
|Questionnaire Design and Statistical Data Analysis |Summary of
Survey Analysis Software |Sample Size in Surveys
Sampling |Survey Samplings|
|Multilevel Statistical Models | Write more effective survey
questions|
| Sampling In Research | Sampling, Questionnaire Distribution
and Interviewing | SRMSNET: An Electronic Bulletin Board for
Survey|
| Sampling and Surveying Handbook |Surveys Sampling
Routines |Survey Software |Multilevel Models Project|
Econometric and Forecasting
| Time Series Analysis for Official Statisticians | Time Series and
Forecasting | Business Forecasting | International Association of
Business Forecasting |Institute of Business Forecasting |Principles
of Forecasting|
| Financial Statistics | Econometric-Research | Econometric
Links | Economists | RFE: Resources for Economists | Business &
Economics Scout Reports|
| A Business Forecasting Course | A Forecasting Course | Time
Series Data Library | Journal of Forecasting|
| Economics and Teaching |Box-Jenkins Methodology |
Statistical Tables
The following Web sites provide critical values useful in statistical
testing and construction of confidence intervals. The results are
identical to those given in almost all textbooks. However, in most
cases they are more extensive (and therefore more accurate).
|Normal Curve Area |Normal Calculator |Normal Probability
Calculation |Critical Values for the t-Distribution | Critical Values
for the F-Distribution |Critical Values for the Chi-square
Distribution|
A selection of:
Academic Info: Business, AOL: Science and Technology, Biz/ed: Business
and Economics, BUBL Catalogue, Business & Economics: Scout
Report, Business & Finance, Business & Industrial,
Business Nation, Dogpile: Statistics, HotBot Directory:
Statistics, IFORS, LookSmart: Statistics, LookSmart: Data &
Statistics, MathForum: Business,McGraw-Hill: Business Statistics, NEEDS:
The National Engineering Education Delivery System, Netscape:
Statistics, NetFirst,
SavvySearch Guide: Statistics, Small Business, Social Science Information
Gateway, WebEc, and Yahoo.
The Copyright Statement: The fair use, according to the 1996 Fair
Use Guidelines for Educational Multimedia, of materials presented
on this Web site is permitted for noncommercial and classroom
purposes.
This site may be mirrored intact (including these notices) on any
server with public access, and it may be linked to from any other
Web page.
Kindly e-mail me your comments, suggestions, and concerns.
Thank you.
Professor Hossein Arsham
Estimation theory
From Wikipedia, the free encyclopedia
For other uses, see Estimation (disambiguation).
"Parameter estimation" redirects here. It is not to be confused with Point estimation or Interval
estimation.
Estimation theory is a branch of statistics that deals with estimating the values of parameters
based on measured/empirical data that has a random component. The parameters describe an
underlying physical setting in such a way that their value affects the distribution of the measured
data. An estimator attempts to approximate the unknown parameters using the measurements.
For example, it is desired to estimate the proportion of a population of voters who will vote for a
particular candidate. That proportion is the parameter sought; the estimate is based on a small
random sample of voters.
Or, for example, in radar the goal is to estimate the range of objects (airplanes, boats, etc.) by
analyzing the two-way transit timing of received echoes of transmitted pulses. Since the reflected
pulses are unavoidably embedded in electrical noise, their measured values are randomly
distributed, so that the transit time must be estimated.
In estimation theory, two approaches are generally considered. [1]

The probabilistic approach (described in this article) assumes that the measured data is
random with probability distribution dependent on the parameters of interest

The set-membership approach assumes that the measured data vector belongs to a set
which depends on the parameter vector.
For example, in electrical communication theory, the measurements which contain information
regarding the parameters of interest are often associated with a noisy signal. Without
randomness, or noise, the problem would be deterministic and estimation would not be needed.
Contents
1 Basics
2 Estimators
3 Examples
  3.1 Unknown constant in additive white Gaussian noise
    3.1.1 Maximum likelihood
    3.1.2 Cramér–Rao lower bound
  3.2 Maximum of a uniform distribution
4 Applications
5 See also
6 Notes
7 References
Basics
To build a model, several statistical "ingredients" need to be known. These are needed to ensure
the estimator has some mathematical tractability.
The first is a set of statistical samples taken from a random vector (RV) of size N. Put into
a vector,

    x = [ x[0], x[1], ..., x[N-1] ]^T.

Secondly, there are the corresponding M parameters,

    θ = [ θ_1, θ_2, ..., θ_M ]^T,

which need to be established with their continuous probability density function (pdf) or its
discrete counterpart, the probability mass function (pmf),

    p(x; θ).

It is also possible for the parameters themselves to have a probability distribution
(e.g., Bayesian statistics). It is then necessary to define the Bayesian probability

    π(θ).

After the model is formed, the goal is to estimate the parameters, commonly
denoted θ̂, where the "hat" indicates the estimate.
One common estimator is the minimum mean squared error (MMSE) estimator, which
utilizes the error between the estimated parameters and the actual value of the
parameters,

    e = θ̂ - θ,

as the basis for optimality. This error term is then squared, and the expected value of the
squared error is minimized for the MMSE estimator.
Estimators
Main article: Estimator
Commonly used estimators and estimation methods, and topics related to them:
Maximum likelihood estimators
Bayes estimators
Method of moments estimators
Cramér–Rao bound
Minimum mean squared error (MMSE), also known as Bayes least squared error (BLSE)
Maximum a posteriori (MAP)
Minimum variance unbiased estimator (MVUE)
Nonlinear system identification
Best linear unbiased estimator (BLUE)
Unbiased estimators (see estimator bias)
Particle filter
Markov chain Monte Carlo (MCMC)
Kalman filter, and its various derivatives
Wiener filter
Examples
Unknown constant in additive white Gaussian noise
Consider a received discrete signal, x[n], of N independent samples that consists of an
unknown constant A with additive white Gaussian noise (AWGN) w[n] with known
variance σ². Since the variance is known, the only unknown parameter is A.
The model for the signal is then

    x[n] = A + w[n],   n = 0, 1, ..., N-1.

Two possible (of many) estimators for the parameter A are:

    Â₁ = x[0]
    Â₂ = (1/N) Σ x[n], which is the sample mean.

Both of these estimators have a mean of A, which can be shown by taking the expected
value of each estimator:

    E[Â₁] = E[x[0]] = A
and
    E[Â₂] = E[(1/N) Σ x[n]] = (1/N) Σ E[x[n]] = A.

At this point, these two estimators would appear to perform the same. However, the
difference between them becomes apparent when comparing the variances:

    var(Â₁) = var(x[0]) = σ²
and
    var(Â₂) = var((1/N) Σ x[n]) = (1/N²) Σ var(x[n]) = σ²/N.

It would seem that the sample mean is a better
estimator since its variance is lower for every N > 1.
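This comparison is easy to verify numerically. The following is a minimal NumPy sketch; the values of A, σ, N and the number of trials are illustrative choices, not values from the text:

import numpy as np

# Simulate x[n] = A + w[n] many times and compare the two estimators discussed above.
# A, sigma, N and trials are illustrative; any values show the same pattern.
rng = np.random.default_rng(0)
A, sigma, N, trials = 5.0, 2.0, 50, 10_000

x = A + sigma * rng.standard_normal((trials, N))

A_hat_1 = x[:, 0]          # estimator 1: the first sample
A_hat_2 = x.mean(axis=1)   # estimator 2: the sample mean

print("mean of estimator 1:", A_hat_1.mean())       # both means are close to A = 5
print("mean of estimator 2:", A_hat_2.mean())
print("variance of estimator 1:", A_hat_1.var())    # close to sigma^2 = 4
print("variance of estimator 2:", A_hat_2.var())    # close to sigma^2 / N = 0.08

Both estimators come out unbiased in the simulation, but the sample mean's variance is smaller by roughly a factor of N, as the analysis predicts.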
Maximum likelihood
Main article: Maximum likelihood
Continuing the example using the maximum likelihood estimator, the probability density
function (pdf) of the noise for one sample w[n] is

    p(w[n]) = (1 / (σ√(2π))) exp( -w[n]² / (2σ²) ),

and the probability of x[n] becomes (x[n] can be thought of as being distributed N(A, σ²))

    p(x[n]; A) = (1 / (σ√(2π))) exp( -(x[n] - A)² / (2σ²) ).

By independence, the probability of the whole vector x becomes

    p(x; A) = Π p(x[n]; A) = (1 / (σ√(2π)))^N exp( -(1 / (2σ²)) Σ (x[n] - A)² ).

Taking the natural logarithm of the pdf,

    ln p(x; A) = -N ln(σ√(2π)) - (1 / (2σ²)) Σ (x[n] - A)²,

the maximum likelihood estimator is

    Â = arg max ln p(x; A).

Taking the first derivative of the log-likelihood function,

    ∂/∂A ln p(x; A) = (1/σ²) Σ (x[n] - A) = (1/σ²) ( Σ x[n] - N A ),

and setting it to zero results in the maximum likelihood estimator

    Â = (1/N) Σ x[n],

which is simply the sample mean. From this example, it was found that the sample mean is
the maximum likelihood estimator for N samples of a fixed, unknown parameter corrupted
by AWGN.
Cramér–Rao lower bound
For more details on this topic, see Cramér–Rao bound.
To find the Cramér–Rao lower bound (CRLB) of the sample mean estimator, it is first
necessary to find the Fisher information number

    I(A) = E[ ( ∂/∂A ln p(x; A) )² ] = -E[ ∂²/∂A² ln p(x; A) ],

and, copying from above,

    ∂/∂A ln p(x; A) = (1/σ²) ( Σ x[n] - N A ).

Taking the second derivative,

    ∂²/∂A² ln p(x; A) = -N/σ²,

and finding the negative expected value is trivial, since it is now a deterministic constant:

    -E[ ∂²/∂A² ln p(x; A) ] = N/σ².

Finally, putting the Fisher information into

    var(Â) ≥ 1 / I(A)

results in

    var(Â) ≥ σ²/N.

Comparing this to the variance of the sample mean (determined previously) shows that the
sample mean is equal to the Cramér–Rao lower bound for all values of N and σ². In other
words, the sample mean is the (necessarily unique) efficient estimator, and thus also the
minimum variance unbiased estimator (MVUE), in addition to being the maximum likelihood
estimator.
Maximum of a uniform distribution
Main article: German tank problem
One of the simplest non-trivial examples of estimation is the estimation of the maximum of
a uniform distribution. It is used as a hands-on classroom exercise and to illustrate basic
principles of estimation theory. Further, in the case of estimation based on a single sample,
it demonstrates philosophical issues and possible misunderstandings in the use of maximum
likelihood estimators and likelihood functions.
Given a discrete uniform distribution 1, 2, ..., N with unknown maximum N, the UMVU
estimator for the maximum is given by

    N̂ = ((k + 1)/k) m - 1 = m + m/k - 1,

where m is the sample maximum and k is the sample size, sampling without
replacement.[2][3] This problem is commonly known as the German tank problem, due to the
application of maximum estimation to estimates of German tank production during World
War II.
The formula may be understood intuitively as
"The sample maximum plus the average gap between observations in the sample",
the gap being added to compensate for the negative bias of the sample maximum as an
estimator for the population maximum.[note 1]
This has a variance of [2]

    (1/k) (N - k)(N + 1)/(k + 2) ≈ N²/k²  for small samples k << N,

so a standard deviation of approximately N/k, the (population) average size of a gap
between samples; compare m/k above. This can be seen as a very simple case of maximum
spacing estimation.
The sample maximum is the maximum likelihood estimator for the population maximum,
but, as discussed above, it is biased.
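A quick simulation illustrates why the UMVU estimator above is preferred to the raw sample maximum. This is a minimal sketch; the true maximum N, the sample size k and the number of trials are arbitrary illustrative choices:

import numpy as np

# German tank problem: compare the biased sample maximum with the UMVU estimate m + m/k - 1.
rng = np.random.default_rng(1)
N_true, k, trials = 1000, 15, 20_000

sample_max, umvu = [], []
for _ in range(trials):
    sample = rng.choice(np.arange(1, N_true + 1), size=k, replace=False)  # sampling without replacement
    m = sample.max()
    sample_max.append(m)
    umvu.append(m + m / k - 1)

print("average sample maximum:", np.mean(sample_max))  # noticeably below 1000 (biased)
print("average UMVU estimate :", np.mean(umvu))        # close to 1000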
Applications
Numerous fields require the use of estimation theory. Some of these fields include (but are
by no means limited to):
Interpretation of scientific experiments
Signal processing
Clinical trials
Opinion polls
Quality control
Telecommunications
Project management
Software engineering
Control theory (in particular, adaptive control)
Network intrusion detection systems
Orbit determination
Measured data are likely to be subject to noise or uncertainty, and it is through statistical
probability that optimal solutions are sought to extract as much information from the data
as possible.
Estimation in Statistics
In statistics, estimation refers to the process by which one makes inferences about a population,
based on information obtained from a sample.
Point Estimate vs. Interval Estimate
Statisticians use sample statistics to estimate population parameters. For example, sample means are
used to estimate population means; sample proportions, to estimate population proportions.
An estimate of a population parameter may be expressed in two ways:
 Point estimate. A point estimate of a population parameter is a single value of a statistic. For
example, the sample mean x is a point estimate of the population mean μ. Similarly, the
sample proportion p is a point estimate of the population proportion P.
 Interval estimate. An interval estimate is defined by two numbers, between which a
population parameter is said to lie. For example, a < x < b is an interval estimate of the
population mean μ. It indicates that the population mean is greater than a but less than b.
Confidence Intervals
Statisticians use a confidence interval to express the precision and uncertainty associated with a
particular sampling method. A confidence interval consists of three parts:
 A confidence level.
 A statistic.
 A margin of error.
The confidence level describes the uncertainty of a sampling method. The statistic and the margin of
error define an interval estimate that describes the precision of the method. The interval estimate of a
confidence interval is defined by the sample statistic ± margin of error.
For example, suppose we compute an interval estimate of a population parameter. We might describe
this interval estimate as a 95% confidence interval. This means that if we used the same sampling
method to select different samples and compute different interval estimates, the true population
parameter would fall within a range defined by the sample statistic ± margin of error 95% of the time.
Confidence intervals are preferred to point estimates, because confidence intervals indicate (a) the
precision of the estimate and (b) the uncertainty of the estimate.
Confidence Level
The probability part of a confidence interval is called a confidence level. The confidence level
describes the likelihood that a particular sampling method will produce a confidence interval that
includes the true population parameter.
Here is how to interpret a confidence level. Suppose we collected all possible samples from a given
population, and computed confidence intervals for each sample. Some confidence intervals would
include the true population parameter; others would not. A 95% confidence level means that 95% of
the intervals contain the true population parameter; a 90% confidence level means that 90% of the
intervals contain the population parameter; and so on.
Margin of Error
In a confidence interval, the range of values above and below the sample statistic is called the margin
of error.
For example, suppose the local newspaper conducts an election survey and reports that the
independent candidate will receive 30% of the vote. The newspaper states that the survey had a 5%
margin of error and a confidence level of 95%. These findings result in the following confidence
interval: We are 95% confident that the independent candidate will receive between 25% and 35% of
the vote.
Note: Many public opinion surveys report interval estimates, but not confidence intervals. They
provide the margin of error, but not the confidence level. To clearly interpret survey results you need
to know both! We are much more likely to accept survey findings if the confidence level is high (say,
95%) than if it is low (say, 50%).
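To connect the pieces above numerically, here is a minimal sketch computing an approximate 95% confidence interval for a proportion as statistic ± margin of error. The 30% figure comes from the newspaper example; the sample size n is an assumed value, not one given in the text:

import math

# Approximate 95% CI for a proportion: p_hat ± z * sqrt(p_hat * (1 - p_hat) / n).
p_hat = 0.30   # reported proportion from the example
n = 350        # assumed sample size (not stated in the text)
z = 1.96       # critical value for 95% confidence

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"margin of error: {margin:.3f}")                         # about 0.05 with this n
print(f"95% CI: ({p_hat - margin:.3f}, {p_hat + margin:.3f})")  # roughly 25% to 35%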
Test Your Understanding
Problem 1
Which of the following statements is true?
I. When the margin of error is small, the confidence level is high.
II. When the margin of error is small, the confidence level is low.
III. A confidence interval is a type of point estimate.
IV. A population mean is an example of a point estimate.
(A) I only
(B) II only
(C) III only
(D) IV only.
(E) None of the above.
Solution
The correct answer is (E). The confidence level is not affected by the margin of error. When the margin
of error is small, the confidence level can be low or high or anything in between. A confidence interval is
a type of interval estimate, not a type of point estimate. A population mean is not an example of a
point estimate; a sample mean is an example of a point estimate.
Software Estimation Techniques - Common
Test Estimation Techniques used in SDLC
For a software project to succeed and its tasks to be executed properly, estimation
techniques play a vital role in the software development life cycle. A technique used to
calculate the time required to accomplish a particular task is called an estimation
technique, and different software estimation techniques can be used to obtain a better
estimate.
Before moving forward, let's ask some basic questions: What is estimation used for? Why is
it needed? Who does it? This article addresses these questions about estimation.
What is Estimation?
"Estimation is the process of finding an estimate, or approximation, which is a value
that is usable for some purpose even if input data may be incomplete, uncertain,
or unstable." [Wiki Definition]
An estimate is a prediction, or a rough idea, of how much effort it would take to complete a
defined task, where the effort could be time or cost. In particular, an estimate is an
approximate computation of the probable cost of a piece of work.
The calculation of test estimates is based on:
- Past data / past experience
- Available documents / knowledge
- Assumptions
- Calculated risks

Before starting, one common question that arises in a tester's mind is "Why do we
estimate?" The answer is simple: we estimate tasks in order to avoid exceeding
timescales and overshooting budgets for testing activities.
A few points need to be considered before estimating testing activities:
- Check whether all requirements are finalized or not; if they are not, how frequently are
they likely to change?
- Make sure all responsibilities and dependencies are clear.
- Check whether the required infrastructure is ready for testing.
- Check that all assumptions and risks are documented before estimating the task.
Software Estimation Techniques
There are different Software Testing Estimation Techniques which can be used for
estimating a task.
1) Delphi Technique
2) Work Breakdown Structure (WBS)
3) Three Point Estimation
4) Functional Point Method
1) Delphi Technique:
The Delphi technique is one of the most widely used software testing estimation techniques.
It is based on surveys and collects information from participants who are experts. In this
technique each task is assigned to a team member, and surveys are conducted over multiple
rounds until a final estimate for the task is agreed upon. In each round the thoughts about
the task are gathered and feedback is provided. This method yields both quantitative and
qualitative results.
Among the techniques discussed here, the Delphi technique gives good confidence in the
estimate. It can be used in combination with the other techniques.
2) Work Breakdown Structure (WBS):
A big project is made manageable by first breaking it down into individual components in
a hierarchical structure, known as the work breakdown structure, or WBS.
The WBS helps the project manager and the team create the task schedule and a detailed
cost estimate for the project. By going through the WBS, the project manager and team
will have a pretty good idea of whether they have captured all the necessary tasks,
based on the project requirements, that need to happen to get the job done.
In this technique the complex project is divided into smaller pieces. The modules are
divided into smaller sub-modules, each sub-module is further divided into functionality,
and each functionality can be divided into sub-functionalities. After breaking down the
work, all functionality should be reviewed to check whether every piece is covered in the
WBS.
Using the WBS you can easily figure out all the tasks that need to be completed, and
because they are broken down into detailed tasks, estimating each detailed task is much
easier than estimating the overall complex project in one shot.
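As a small illustration of this idea, a WBS can be represented as a nested structure whose leaf tasks carry estimates that roll up to a project total. The module names and hour figures below are invented for the example, not taken from the text:

# Hypothetical WBS: modules -> sub-modules -> task: estimated hours (all values invented).
wbs = {
    "Login module": {
        "UI": {"design login form": 6, "validation messages": 4},
        "API": {"authentication endpoint": 8, "session handling": 6},
    },
    "Reports module": {
        "Export": {"CSV export": 5, "PDF export": 9},
    },
}

def total_hours(node):
    # Recursively sum the leaf estimates of the breakdown.
    if isinstance(node, dict):
        return sum(total_hours(child) for child in node.values())
    return node

print("Total estimated hours:", total_hours(wbs))  # 38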
Work Breakdown Structure has four key benefits:
- Work Breakdown Structure forces the team to create detailed steps. In the WBS, all steps
required to build or deliver the service are divided into detailed tasks by the project
manager, team and customer. This helps bring out assumptions and ambiguities, narrow the
scope of the project, and raise critical issues early on.
- Work Breakdown Structure helps to improve the schedule and budget. The WBS enables
you to make an effective schedule and good budget plans. Because all tasks are already
listed, it helps in generating a meaningful schedule and makes planning a reliable budget
easier.
- Work Breakdown Structure creates accountability. The level of detail in the task breakdown
makes it possible to assign a particular module or task to an individual, which makes it
easier to hold that person accountable for completing it. With detailed tasks in the WBS,
people cannot hide under the "cover of broadness."
- Work Breakdown Structure creation breeds commitment. The process of developing and
completing a WBS breeds excitement and commitment. Although the project manager will
often develop the high-level WBS, he will seek the participation of his core team to flesh
out the extreme detail of the WBS. This participation sparks involvement in the project.
3) Three Point Estimation:
Three point estimation is an estimation method based on statistical data. It is very similar
to the WBS technique: tasks are broken down into subtasks, and three types of estimates
are made for each sub-piece.
Optimistic Estimate (best case scenario, in which nothing goes wrong and all conditions
are optimal) = A
Most Likely Estimate (most likely duration; there may be some problems, but most things
will go right) = M
Pessimistic Estimate (worst case scenario, in which everything goes wrong) = B
Formula to find the value of the estimate: E = (A + 4M + B) / 6
Standard Deviation: SD = (B - A) / 6
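A quick sketch of these two formulas; the A, M and B values used here are made-up example numbers:

# Three point (PERT-style) estimation: E = (A + 4M + B) / 6, SD = (B - A) / 6.
def three_point_estimate(a, m, b):
    estimate = (a + 4 * m + b) / 6
    std_dev = (b - a) / 6
    return estimate, std_dev

# Example in hours: optimistic 10, most likely 16, pessimistic 28 (illustrative values).
e, sd = three_point_estimate(10, 16, 28)
print(f"estimate = {e:.1f} hours, standard deviation = {sd:.1f} hours")  # 17.0 and 3.0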
Nowadays, planning poker and Delphi estimation are the most popular test estimation
techniques.
4) Functional Point Method:
Function points are measured from a functional, or user, point of view. The measure is
independent of the computer language, capability, technology or development methodology
of the team. It is based on available documents such as the SRS, the design documents, etc.
In this FP technique we give a weightage to each functional point. Before starting the actual
estimation, the functional points are divided into three groups: Complex, Medium and
Simple. Based on similar projects and organization standards, we define an estimate per
function point.
Total Effort Estimate = Total Function Points * Estimate defined per Functional Point
Let's take a simple example to make this clearer:
Group      Weightage   Function Points   Total
Complex    5           5                 25
Medium     3           20                60
Simple     1           35                35

Total Function Points: 120
Estimate defined per point: 4.15
Total Estimated Effort (Person Hours): 498
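The arithmetic in the table can be reproduced with a few lines; the weightages, counts and the 4.15 estimate per point are taken directly from the example above:

# Function point estimation from the example table: total FP = sum(weightage * count).
groups = {"Complex": (5, 5), "Medium": (3, 20), "Simple": (1, 35)}  # name: (weightage, function points)

total_fp = sum(weight * count for weight, count in groups.values())
estimate_per_point = 4.15
total_effort = total_fp * estimate_per_point

print("Total function points:", total_fp)                       # 120
print("Total estimated effort (person hours):", total_effort)   # 498.0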
Advantages of the Functional Point Method:
- Estimates can be prepared at the pre-project stage.
- Because it is based on requirement specification documents, the method's reliability is
relatively high.
Disadvantages of Software Estimation Techniques:
- Estimates can be over- or under-stated due to hidden factors.
- They are not perfectly accurate.
- They are based on judgment and assumptions.
- Risk is involved.
- They may give false results.
- They are prone to error.
- Sometimes the estimate cannot be trusted.
Software Estimation Techniques Conclusion:
There may be other methods that can also be used effectively for project test estimation;
in this article we have seen the most popular software estimation techniques used in
project estimation. There cannot be a single hard and fast rule for estimating the testing
effort for a project. It is recommended to keep adding to your knowledge base of test
estimation methods and to revise estimation templates constantly based on new findings.
Source: http://www.softwaretestingclass.com/software-estimation-techniques/
16.1. What is the difference between a statistic and a parameter?
A statistic is a numerical characteristic of a sample, and a parameter is a numerical
characteristic of a population.
16.2. What is the symbol for the population mean?
The symbol is the Greek letter mu (i.e., µ).
16.3. What is the symbol for the population correlation coefficient?
The symbol is the Greek letter rho (i.e., ρ ).
16.4. What is the definition of a sampling distribution?
The sampling distribution is the theoretical probability distribution of the values of a statistic
that results when all possible random samples of a particular size are drawn from a
population.
16.5. How does the idea of repeated sampling relate to the concept of a sampling
distribution?
Repeated sampling involves drawing many or all possible samples from a population.
16.6. Which of the two types of estimation do you like the most, and why?
This is an opinion question.
 Point estimation is nice because it provides an exact point estimate of the population
value. It provides you with the single best guess of the value of the population
parameter.
 Interval estimation is nice because it allows you to make statements of confidence
that an interval will include the true population value.
16.7. What are the advantages of using interval estimation rather than point
estimation?
The problem with using a point estimate is that although it is the single best guess you can
make about the value of a population parameter, it is also usually wrong.
 Take a look at the sampling distribution of the mean on page 468 and note that in that
case, if you had guessed $50,000 as the correct value (and this WAS the
correct value in this case), you would be wrong most of the time.
 A major advantage of using interval estimation is that you provide a range of values
with a known probability of capturing the population parameter (e.g., if you obtain
from SPSS a 95% confidence interval, you can claim to have 95% confidence that it
will include the true population parameter).
 An interval estimate (i.e., a confidence interval) also helps one not to be so confident
that the population value is exactly equal to the point estimate. That is, it makes us
more careful in how we interpret our data and helps keep us in proper perspective.
 Actually, perhaps the best thing of all to do is to provide both the point estimate and
the interval estimate. For example, our best estimate of the population mean is the
value $32,640 (the point estimate) and our 95% confidence interval is $30,913.71 to
$34,366.29.
 By the way, note that the bigger your sample size, the more narrow the confidence
interval will be.
 If you want narrow (i.e., very precise) confidence intervals, then remember to include
a lot of participants in your research study.
16.8 What is a null hypothesis?
A null hypothesis is a statement about a population parameter. It usually predicts no
difference or no relationship in the population. The null hypothesis is the “status quo,” the
“nothing new,” or the “business as usual” hypothesis. It is the hypothesis that is directly
tested in hypothesis testing.
16.9. To whom is the researcher similar to in hypothesis testing: the defense attorney or
the prosecuting attorney? Why?
The researcher is similar to the prosecuting attorney in the sense that the researcher brings the
null hypothesis "to trial" when she believes there is probably strong evidence against the
null.
 Just as the prosecutor usually believes that the person on trial is not innocent, the
researcher usually believes that the null hypothesis is not true.
 In the court system the jury must assume (by law) that the person is innocent until the
evidence clearly calls this assumption into question; analogously, in hypothesis
testing the researcher must assume (in order to use hypothesis testing) that the null
hypothesis is true until the evidence calls this assumption into question.
16.10. What is the difference between a probability value and the significance level?
Basically in hypothesis testing the goal is to see if the probability value is less than or equal
to the significance level (i.e., is p ≤ alpha).
 The probability value (also called the p-value) is the probability of the result found in
your research study occurring (or an even more extreme result occurring), under
the assumption that the null hypothesis is true.
 That is, you assume that the null hypothesis is true and then see how often your
finding would occur if this assumption were true.
 The significance level (also called the alpha level) is the cutoff value the researcher
selects and then uses to decide when to reject the null hypothesis.
 Most researchers select the significance or alpha level of .05 to use in their research;
hence, they reject the null hypothesis when the p-value (which is obtained from the
computer printout) is less than or equal to .05.
16.11. Why do educational researchers usually use .05 as their significance level?
It has become part of the statistical hypothesis testing culture.
 It is a convention.
 It reflects a concern over making type I errors (i.e., wanting to avoid the situation
where you reject the null when it is true, that is, wanting to avoid “false positive”
errors).
 If you set the significance level at .05, then you will only reject a true null hypothesis
5% of the time (i.e., you will only make a type I error 5% of the time) in the long run.
16.12. State the two decision making rules of hypothesis testing.
 Rule one: If the p-value is less than or equal to the significance level then reject the
null hypothesis and conclude that the research finding is statistically significant.
 Rule two: If the p-value is greater than the significance level then you “fail to reject”
the null hypothesis and conclude that the finding is not statistically significant.
16.13. Do the following statements sound like typical null or alternative hypotheses? (A)
The coin is fair. (B) There is no difference between male and female incomes in the
population. (C) There is no correlation in the population. (D) The patient is not sick
(i.e., is well). (E) The defendant is innocent.
All of these sound like null hypotheses (i.e., the "nothing new" or "status quo"
hypothesis). We usually assume that a coin is fair in games of chance; when testing the
difference between male and female incomes in hypothesis testing we assume the null of no
difference; when testing the statistical significance of a correlation coefficient using
hypothesis testing, we assume that the correlation in the population is zero; in medical testing
we assume the person does not have the illness until the medical tests suggest otherwise; and
in our system of jurisprudence we assume that a defendant is innocent until the evidence
strongly suggests otherwise.
16.14. What is a Type I error? What is a Type II error? How can you minimize the risk
of both of these types of errors?
In hypothesis testing there are two possible errors we can make: Type I and Type II errors.
 A Type I error occurs when you reject a true null hypothesis (remember that when
the null hypothesis is true you hope to retain it).
 A Type II error occurs when you fail to reject a false null hypothesis (remember that
when the null hypothesis is false you hope to reject it).
 The best way to allow yourself to set a low alpha level (i.e., to have a small chance of
making a Type I error) and to have a good chance of rejecting the null when it is false
(i.e., to have a small chance of making a Type II error) is to increase the sample size.
 The key in hypothesis testing is to use a large sample in your research study rather
than a small sample!
 If you do reject your null hypothesis, then it is also essential that you determine
whether the size of the relationship is practically significant (see the next question).
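To make the meaning of the significance level concrete (see 16.11 above), here is a minimal simulation sketch. The population values and sample size are made-up; the point is only that, when the null hypothesis is true, a test at alpha = .05 rejects it in roughly 5% of samples:

import numpy as np
from scipy import stats

# Draw many samples from a population where the null hypothesis (mu = 50) is TRUE,
# and count how often a one-sample t-test at alpha = .05 rejects it (Type I errors).
rng = np.random.default_rng(42)
alpha, trials, n = 0.05, 10_000, 30

rejections = 0
for _ in range(trials):
    sample = rng.normal(loc=50, scale=10, size=n)        # null is true: population mean really is 50
    t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
    if p_value <= alpha:
        rejections += 1

print("Observed Type I error rate:", rejections / trials)  # close to 0.05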
16.15. If a finding is statistically significant, why is it also important to consider
practical significance?
When your finding is statistically significant all you know is that your result would be
unlikely if the null hypothesis were true and that you therefore have decided to reject your
null hypothesis and to go with your alternative hypothesis. Unfortunately, this does not tell
you anything about how big of an effect is present or how important the effect would be for
practical purposes. That’s why once you determine that a finding is statistically significant
you must next use one of the effect size indicators to tell you how strong the relationship is.
Think about this effect size and the nature of your variables (e.g., is the IV easily manipulated
in the real world? Will the amount of change relative to the costs in bringing this about be
reasonable?).
 Once you consider these additional issues beyond statistical significance, you will be
ready to make a decision about the practical significance of your study results.
16.16. How do you write the null and alternative hypotheses for each of the following:
(A) The t-test for independent samples, (B) One-way analysis of variance, (C) The t-test
for correlation coefficients?, (D) The t-test for a regression coefficient.
In each of these, the null hypothesis says there is no relationship and the alternative
hypothesis says that there is a relationship.
(A)
In this case the null hypothesis says that the two population means (i.e., mu
one and mu two) are equal; the alternative hypothesis says that they are not
equal.
(B)
In this case the null hypothesis says that all of the population means are equal;
the alternative hypothesis says that at least two of the means are not equal.
(C)
In this case the null hypothesis says that the population correlation (i.e., rho)
is zero; the alternative hypothesis says that it is not equal to zero.
(D)
In this case the null hypothesis says that the population regression coefficient
(beta) is zero, and the alternative says that it is not equal to zero.
You can see examples of these null and alternative hypotheses written out in symbolic form for
cases A, B, C, and D in the following table.
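The table itself is not reproduced in this transcript; the standard symbolic forms, consistent with the descriptions in (A) through (D) above, are:

\begin{aligned}
\text{(A) Independent-samples } t\text{-test:} \quad & H_0: \mu_1 = \mu_2, \quad H_1: \mu_1 \neq \mu_2 \\
\text{(B) One-way ANOVA:} \quad & H_0: \mu_1 = \mu_2 = \cdots = \mu_k, \quad H_1: \text{at least two means differ} \\
\text{(C) } t\text{-test for a correlation:} \quad & H_0: \rho = 0, \quad H_1: \rho \neq 0 \\
\text{(D) } t\text{-test for a regression coefficient:} \quad & H_0: \beta = 0, \quad H_1: \beta \neq 0
\end{aligned}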
Hypothesis Testing for Means & Proportions
Introduction
This is the first of three modules that will address the second area of statistical inference,
hypothesis testing, in which a specific statement or hypothesis is generated about a
population parameter, and sample statistics are used to assess the likelihood that the
hypothesis is true. The hypothesis is based on available information and the investigator's
belief about the population parameters.
The process of hypothesis testing involves setting up two competing hypotheses, the null
hypothesis and the alternate hypothesis. One selects a random sample (or multiple samples
when there are more comparison groups), computes summary statistics and then assesses
the likelihood that the sample data support the research or alternative hypothesis. Similar to
estimation, the process of hypothesis testing is based on probability theory and the Central
Limit Theorem.
This module will focus on hypothesis testing for means and proportions. The next two
modules in this series will address analysis of variance and chi-squared tests.
Learning Objectives
After completing this module, the student will be able to:
1. Define null and research hypothesis, test statistic, level of significance and decision
rule
2. Distinguish between Type I and Type II errors and discuss the implications of each
3. Explain the difference between one and two sided tests of hypothesis
4. Estimate and interpret p-values
5. Explain the relationship between confidence interval estimates and p-values in
drawing inferences
6. Differentiate hypothesis testing procedures based on type of outcome variable and
number of samples
Introduction to Hypothesis Testing
Techniques for Hypothesis Testing
The techniques for hypothesis testing depend on
 the type of outcome variable being analyzed (continuous, dichotomous, discrete),
 the number of comparison groups in the investigation, and
 whether the comparison groups are independent (i.e., physically separate, such as
men versus women) or dependent (i.e., matched or paired, such as pre- and
post-assessments on the same participants).
In estimation we focused explicitly on techniques for one and two samples and discussed
estimation for a specific parameter (e.g., the mean or proportion of a population), for
differences (e.g., difference in means, the risk difference) and ratios (e.g., the relative risk
and odds ratio). Here we will focus on procedures for one and two samples when the
outcome is either continuous (and we focus on means) or dichotomous (and we focus on
proportions).
General Approach: A Simple Example
The Centers for Disease Control (CDC) reported on trends in weight, height and body mass
index from the 1960s through 2002 [1]. The general trend was that Americans were much
heavier and slightly taller in 2002 as compared to 1960; both men and women gained
approximately 24 pounds, on average, between 1960 and 2002. In 2002, the mean weight
for men was reported at 191 pounds. Suppose that an investigator hypothesizes that weights
are even higher in 2006 (i.e., that the trend continued over the subsequent 4 years).
The research hypothesis is that the mean weight in men in 2006 is more than 191 pounds.
The null hypothesis is that there is no change in weight, and therefore the mean weight is
still 191 pounds in 2006.
Null Hypothesis
H0: μ= 191
(no change)
Research Hypothesis
H1: μ> 191
(investigator's belief)
In order to test the hypotheses, we select a random sample of American males in 2006 and
measure their weights. Suppose we have resources available to recruit n=100 men into our
sample. We weigh each participant and compute summary statistics on the sample data.
Suppose in the sample we determine the following: n = 100, sample mean = 197.1 pounds,
and sample standard deviation s = 25.6 pounds.
Do the sample data support the null or research hypothesis? The sample mean of 197.1 is
numerically higher than 191. However, is this difference more than would be expected by
chance? In hypothesis testing, we assume that the null hypothesis holds until proven
otherwise. We therefore need to determine the likelihood of observing a sample mean of
197.1 or higher when the true population mean is 191 (i.e., if the null hypothesis is true or
under the null hypothesis). We can compute this probability using the Central Limit Theorem.
Specifically,

Z = (197.1 - 191) / (25.6 / √100) = 6.1 / 2.56 = 2.38, and P(Z > 2.38) = 0.0087.

(Notice that we use the sample standard deviation in computing the Z score. This is
generally an appropriate substitution as long as the sample size is large, n > 30.) Thus, there
is less than a 1% probability of observing a sample mean as large as 197.1 when the true
population mean is 191. Do you think that the null hypothesis is likely true? Based on how
unlikely it is to observe a sample mean of 197.1 under the null hypothesis (i.e., <1%
probability), we might infer, from our data, that the null hypothesis is probably not true.
Suppose that the sample data had turned out differently. Suppose that we instead observed
the following in 2006: n = 100, sample mean = 192.1 pounds, and s = 25.6 pounds.
How likely is it to observe a sample mean of 192.1 or higher when the true population mean
is 191 (i.e., if the null hypothesis is true)? We can again compute this probability using the
Central Limit Theorem. Specifically,

Z = (192.1 - 191) / (25.6 / √100) = 1.1 / 2.56 = 0.43, and P(Z > 0.43) = 0.334.
There is a 33.4% probability of observing a sample mean as large as 192.1 when the true
population mean is 191. Do you think that the null hypothesis is likely true?
Neither of the sample means that we obtained allows us to know with certainty whether the
null hypothesis is true or not. However, our computations suggest that, if the null hypothesis
were true, the probability of observing a sample mean >197.1 is less than 1%. In contrast, if
the null hypothesis were true, the probability of observing a sample mean >192.1 is about
33%. We can't know whether the null hypothesis is true, but the sample that provided a
mean value of 197.1 provides much stronger evidence in favor of rejecting the null
hypothesis, than the sample that provided a mean value of 192.1. Note that this does not
mean that a sample mean of 192.1 indicates that the null hypothesis is true; it just doesn't
provide compelling evidence to reject it.
In essence, hypothesis testing is a procedure to compute a probability that reflects the
strength of the evidence (based on a given sample) for rejecting the null hypothesis. In
hypothesis testing, we determine a threshold or cut-off point (called the critical value) to
decide when to believe the null hypothesis and when to believe the research hypothesis. It is
important to note that it is possible to observe any sample mean when the null hypothesis
is true (in this example, when the true population mean is 191), but some sample means
are very unlikely.
Based on the two samples above it would seem reasonable to believe the research
hypothesis when the sample mean is 197.1, but to believe the null hypothesis when the
sample mean is 192.1. What we need is a threshold value such that if the sample mean is
above that threshold then we believe that H1 is true, and if the sample mean is below that
threshold then we believe that H0 is true. The difficulty in determining a threshold for the
sample mean is that it depends on the scale of measurement. In this
example, the threshold, sometimes called the critical value, might be 195 (i.e., if the sample
mean is 195 or more then we believe that H1 is true and if the sample mean is less than 195
then we believe that H0 is true). Suppose we are interested in assessing an increase in
blood pressure over time, the critical value will be different because blood pressures are
measured in millimeters of mercury (mmHg) as opposed to in pounds. In the following we
will explain how the critical value is determined and how we handle the issue of scale.
First, to address the issue of scale in determining the critical value, we convert our sample
data (in particular the sample mean) into a Z score. We know from the module on probability
that the center of the Z distribution is zero and extreme values are those that exceed 2 or fall
below -2. Z scores above 2 and below -2 represent approximately 5% of all Z values. If the
observed sample mean is close to the mean specified in H0 (here μ = 191), then Z will be
close to zero. If the observed sample mean is much larger than the mean specified in H0,
then Z will be large.
In hypothesis testing, we select a critical value from the Z distribution. This is done by first
determining what is called the level of significance, denoted α ("alpha"). What we are doing
here is drawing a line at extreme values. The level of significance is the probability that we
reject the null hypothesis (in favor of the alternative) when it is actually true and is also called
the Type I error rate.
α = Level of significance = P(Type I error) = P(Reject H0 | H0 is true).
Because α is a probability, it ranges between 0 and 1. The most commonly used value in the
medical literature for α is 0.05, or 5%. Thus, if an investigator selects α=0.05, then they are
allowing a 5% probability of incorrectly rejecting the null hypothesis in favor of the alternative
when the null is in fact true. Depending on the circumstances, one might choose to use a
level of significance of 1% or 10%. For example, if an investigator wanted to reject the null
only if there were even stronger evidence than that ensured with α=0.05, they could choose
α = 0.01 as their level of significance. The typical values for α are 0.01, 0.05 and 0.10, with
α=0.05 the most commonly used value.
Suppose in our weight study we select α=0.05. We need to determine the value of Z that
holds 5% of the values above it (see below).
The critical value of Z for α =0.05 is Z = 1.645 (i.e., 5% of the distribution is above Z=1.645).
With this value we can set up what is called our decision rule for the test. The rule is to reject
H0 if the Z score is 1.645 or more.
With the first sample we have Z = (197.1 - 191) / (25.6 / √100) = 2.38.
Because 2.38 > 1.645, we reject the null hypothesis. (The same conclusion can be drawn by
comparing the 0.0087 probability of observing a sample mean as extreme as 197.1 to the
level of significance of 0.05. If the observed probability is smaller than the level of
significance we reject H0). Because the Z score exceeds the critical value, we conclude that
the mean weight for men in 2006 is more than 191 pounds, the value reported in 2002. If we
observed the second sample (i.e., sample mean =192.1), we would not be able to reject the
null hypothesis because the Z score is 0.43 which is not in the rejection region (i.e., the
region in the tail end of the curve above 1.645). With the second sample we do not have
sufficient evidence (because we set our level of significance at 5%) to conclude that weights
have increased. Again, the same conclusion can be reached by comparing probabilities. The
probability of observing a sample mean as extreme as 192.1 is 33.4% which is not below our
5% level of significance.
Hypothesis Testing: Upper-, Lower-, and Two-Tailed
Tests
The procedure for hypothesis testing is based on the ideas described above. Specifically, we
set up competing hypotheses, select a random sample from the population of interest and
compute summary statistics. We then determine whether the sample data supports the null
or alternative hypotheses. The procedure can be broken down into the following five steps.

Step 1. Set up hypotheses and select the level of significance α.
H0: Null hypothesis (no change, no difference); H1: Research hypothesis
(investigator's belief); α =0.05
Upper-tailed, Lower-tailed, Two-tailed Tests
The research or alternative hypothesis can take one of three forms. An investigator might believe that the parameter has
increased, decreased or changed. For example, an investigator might hypothesize:
1. H1: μ > μ0, where μ0 is the comparator or null value (e.g., μ0 = 191 in our example about weight in men) and an
increase is hypothesized; this type of test is called an upper-tailed test;
2. H1: μ < μ0, where a decrease is hypothesized; this is called a lower-tailed test; or
3. H1: μ ≠ μ0, where a difference is hypothesized; this is called a two-tailed test.
The exact form of the research hypothesis depends on the investigator's belief about the parameter of interest and whether
it has possibly increased, decreased or is different from the null value. The research hypothesis is set up by the investigator
before any data are collected.

Step 2. Select the appropriate test statistic.
The test statistic is a single number that summarizes the sample information. An example
of a test statistic is the Z statistic, computed as follows:

Z = (x̄ - μ0) / (s / √n).
When the sample size is small, we will use t statistics (just as we did when constructing
confidence intervals for small samples). As we present each scenario, alternative test
statistics are provided along with conditions for their appropriate use.

Step 3. Set up decision rule.
The decision rule is a statement that tells under what circumstances to reject the null
hypothesis. The decision rule is based on specific values of the test statistic (e.g., reject H0 if
Z > 1.645). The decision rule for a specific test depends on 3 factors: the research or
alternative hypothesis, the test statistic and the level of significance. Each is discussed
below.
1. The decision rule depends on whether an upper-tailed, lower-tailed, or two-tailed test
is proposed. In an upper-tailed test the decision rule has investigators reject H0 if the
test statistic is larger than the critical value. In a lower-tailed test the decision rule has
investigators reject H0 if the test statistic is smaller than the critical value. In a two-tailed test the decision rule has investigators reject H0 if the test statistic is extreme,
either larger than an upper critical value or smaller than a lower critical value.
2. The exact form of the test statistic is also important in determining the decision rule. If
the test statistic follows the standard normal distribution (Z), then the decision rule
will be based on the standard normal distribution. If the test statistic follows the t
distribution, then the decision rule will be based on the t distribution. The appropriate
critical value will be selected from the t distribution again depending on the specific
alternative hypothesis and the level of significance.
3. The third factor is the level of significance. The level of significance which is selected
in Step 1 (e.g., α =0.05) dictates the critical value. For example, in an upper tailed Z
test, if α =0.05 then the critical value is Z=1.645.
The following figures illustrate the rejection regions defined by the decision rule for upper-,
lower- and two-tailed Z tests with α=0.05. Notice that the rejection regions are in the upper,
lower and both tails of the curves, respectively. The decision rules are written below each
figure.
Rejection Region for Upper-Tailed Z Test (H1: μ > μ0) with α = 0.05
The decision rule is: Reject H0 if Z > 1.645.
Critical values of Z for upper-tailed tests:
α        Z
0.10     1.282
0.05     1.645
0.025    1.960
0.010    2.326
0.005    2.576
0.001    3.090
0.0001   3.719

Rejection Region for Lower-Tailed Z Test (H1: μ < μ0) with α = 0.05
The decision rule is: Reject H0 if Z < -1.645.
Critical values of Z for lower-tailed tests:
α        Z
0.10     -1.282
0.05     -1.645
0.025    -1.960
0.010    -2.326
0.005    -2.576
0.001    -3.090
0.0001   -3.719

Rejection Region for Two-Tailed Z Test (H1: μ ≠ μ0) with α = 0.05
The decision rule is: Reject H0 if Z < -1.960 or if Z > 1.960.
Critical values of Z for two-tailed tests:
α        Z
0.20     1.282
0.10     1.645
0.05     1.960
0.010    2.576
0.001    3.291
0.0001   3.819
The complete table of critical values of Z for upper, lower and two-tailed tests can be found
in the table of Z values to the right in "Other Resources."
Critical values of t for upper, lower and two-tailed tests can be found in the table of t values
in "Other Resources."

Step 4. Compute the test statistic.
Here we compute the test statistic by substituting the observed sample data into the test
statistic identified in Step 2.

Step 5. Conclusion.
The final conclusion is made by comparing the test statistic (which is a summary of the
information observed in the sample) to the decision rule. The final conclusion will be either to
reject the null hypothesis (because the sample data are very unlikely if the null hypothesis is
true) or not to reject the null hypothesis (because the sample data are not very unlikely).
If the null hypothesis is rejected, then an exact significance level is computed to describe the
likelihood of observing the sample data assuming that the null hypothesis is true. The exact
level of significance is called the p-value and it will be less than the chosen level of
significance if we reject H0.
Statistical computing packages provide exact p-values as part of their standard output for
hypothesis tests. In fact, when using a statistical computing package, the steps outlined
above can be abbreviated. The hypotheses (step 1) should always be set up in advance of
any analysis and the significance criterion should also be determined (e.g., α =0.05).
Statistical computing packages will produce the test statistic (usually reporting the test
statistic as t) and a p-value. The investigator can then determine statistical significance using
the following: If p < α then reject H0.
Things to Remember When Interpreting P-Values
1. P-values summarize statistical significance and do not address clinical significance. There are instances where results
are both statistically and clinically significant, and others where they are one or the other but not both. This is because
p-values depend upon both the magnitude of the effect and the precision of the estimate (the sample size). When the
sample size is large, results can reach statistical significance even when the effect is small and clinically unimportant.
Conversely, with small sample sizes, results can fail to reach statistical significance yet the effect is large and potentially
important. It is therefore essential to assess both the statistical and the clinical significance of results.
2. Statistical tests allow us to draw conclusions of significance or not based on a comparison of the p-value to the selected
level of significance. The conclusion is based on the selected level of significance (α) and could change with a different
level of significance. Statistically significant results should also be examined for clinical importance.
3. When conducting any statistical analysis, there is always a possibility of an incorrect conclusion. With multiple tests,
that possibility is increased. Investigators should only conduct the statistical analyses (e.g., tests) of interest and not
all possible tests.
4. Many investigators inappropriately believe that the p-value represents the probability that the null hypothesis is true.
P-values are computed assuming that the null hypothesis is true. The p-value is the probability that the data could
deviate from the null hypothesis as much as they did or more; it measures the compatibility of the data with the null
hypothesis, not the probability that the null hypothesis is correct.
5. Statistical significance does not take into account the possibility of bias or confounding; these issues must always be
investigated.
6. Evidence-based decision making is important in public health and in medicine, but decisions are rarely made on the
basis of a single study. It is always important to build a body of evidence to support findings.
We now use the five-step procedure to test the research hypothesis that the mean weight in
men in 2006 is more than 191 pounds. We will assume the sample data are as follows:
n=100, x̄ = 197.1 and s = 25.6.

Step 1. Set up hypotheses and determine level of significance
H0: μ = 191 H1: μ > 191
α =0.05
The research hypothesis is that weights have increased, and therefore an upper tailed test is
used.

Step 2. Select the appropriate test statistic.
Because the sample size is large (n>30) the appropriate test statistic is
Z = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
In this example, we are performing an upper tailed test (H1: μ> 191), with a Z test statistic
and selected α =0.05. Reject H0 if Z > 1.645.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic identified in Step 2:
Z = (197.1 - 191) / (25.6/√100) = 6.1/2.56 = 2.38.
Step 5. Conclusion.
We reject H0 because 2.38 > 1.645. We have statistically significant evidence at α=0.05 to
show that the mean weight in men in 2006 is more than 191 pounds. Because we rejected
the null hypothesis, we now approximate the p-value which is the likelihood of observing the
sample data if the null hypothesis is true. An alternative definition of the p-value is the
smallest level of significance where we can still reject H0. In this example, we observed
Z=2.38 and for α=0.05, the critical value was 1.645. Because 2.38 exceeded 1.645 we
rejected H0. In our conclusion we reported a statistically significant increase in mean weight
at a 5% level of significance. Using the table of critical values for upper tailed tests, we can
approximate the p-value. If we select α=0.025, the critical value is 1.96, and we still reject
H0 because 2.38 > 1.960. If we select α=0.010 the critical value is 2.326, and we still reject
H0 because 2.38 > 2.326. However, if we select α=0.005, the critical value is 2.576, and we
cannot reject H0 because 2.38 < 2.576. Therefore, the smallest α where we still reject H0 is
0.010. This is the p-value. A statistical computing package would produce a more precise p-
value which would be in between 0.005 and 0.010. Here we are approximating the p-value
and would report p < 0.010.
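For reference, the exact upper-tailed p-value for Z = 2.38 can be read off the standard normal distribution; a quick check in Python, assuming SciPy is available:

```python
from scipy.stats import norm

p_value = norm.sf(2.38)     # upper-tail area above Z = 2.38
print(round(p_value, 4))    # 0.0087 -- between 0.005 and 0.010, so we report p < 0.010
```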
Type I and Type II Errors
In all tests of hypothesis, there are two types of errors that can be committed. The first is
called a Type I error and refers to the situation where we incorrectly reject H0 when in fact it
is true. This is also called a false positive result (as we incorrectly conclude that the research
hypothesis is true when in fact it is not). When we run a test of hypothesis and decide to
reject H0 (e.g., because the test statistic exceeds the critical value in an upper tailed test)
then either we make a correct decision because the research hypothesis is true or we
commit a Type I error. The different conclusions are summarized in the table below. Note
that we will never know whether the null hypothesis is really true or false (i.e., we will never
know which row of the following table reflects reality).
Conclusion in Test of Hypothesis
                 Do Not Reject H0      Reject H0
H0 is True       Correct Decision      Type I Error
H0 is False      Type II Error         Correct Decision
In the first step of the hypothesis test, we select a level of significance, α, and α= P(Type I
error). Because we purposely select a small value for α, we control the probability of
committing a Type I error. For example, if we select α=0.05, and our test tells us to reject H0,
then there is a 5% probability that we commit a Type I error. Most investigators are very
comfortable with this and are confident when rejecting H0 that the research hypothesis is
true (as it is the more likely scenario when we reject H0).
When we run a test of hypothesis and decide not to reject H0 (e.g., because the test statistic
is below the critical value in an upper tailed test) then either we make a correct decision
because the null hypothesis is true or we commit a Type II error. Beta (β) represents the
probability of a Type II error and is defined as follows: β=P(Type II error) = P(Do not Reject
H0 | H0 is false). Unfortunately, we cannot choose β to be small (e.g., 0.05) to control the
probability of committing a Type II error because β depends on several factors including the
sample size, α, and the research hypothesis. When we do not reject H0, it may be very likely
that we are committing a Type II error (i.e., failing to reject H0 when in fact it is false).
Therefore, when tests are run and the null hypothesis is not rejected we often make a weak
concluding statement allowing for the possibility that we might be committing a Type II error.
If we do not reject H0, we conclude that we do not have significant evidence to show that
H1 is true. We do not conclude that H0 is true.
The most common reason for
a Type II error is a small
sample size.
Tests with One Sample, Continuous Outcome
Hypothesis testing applications with a continuous outcome variable in a single population are
performed according to the five-step procedure outlined above. A key component is setting
up the null and research hypotheses. The objective is to compare the mean in a single
population to a known mean (μ0). The known value is generally derived from another study or
report, for example a study in a similar, but not identical, population or a study performed
some years ago. The latter is called a historical control. It is important in setting up the
hypotheses in a one sample test that the mean specified in the null hypothesis is a fair and
reasonable comparator. This will be discussed in the examples that follow.
In one sample tests for a continuous outcome, we set up our hypotheses against an
appropriate comparator. We select a sample and compute descriptive statistics on the
sample data - including the sample size (n), the sample mean (x̄), and the sample
standard deviation (s). We then determine the appropriate test statistic (Step 2) for the
hypothesis test. The formulas for test statistics depend on the sample size and are given
below.
Test Statistics for Testing H0: μ = μ0
Z = (x̄ - μ0) / (s/√n)    if n > 30
t = (x̄ - μ0) / (s/√n)    if n < 30, where df = n-1
Note that statistical computing packages will use the t statistic exclusively and make the
necessary adjustments for comparing the test statistic to appropriate values from probability
tables to produce a p-value.
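To illustrate how the formulas above translate into code, here is a small sketch; the helper name and interface are our own, not part of the module:

```python
import math

def one_sample_test_statistic(xbar, mu0, s, n):
    """Return (statistic, df) for testing H0: mu = mu0.

    The same formula (xbar - mu0) / (s / sqrt(n)) is used in both cases; it is
    compared to Z critical values when n > 30 and to t critical values with
    df = n - 1 otherwise.
    """
    statistic = (xbar - mu0) / (s / math.sqrt(n))
    return statistic, n - 1

# Weight example from above: n=100, xbar=197.1, s=25.6, mu0=191
z, _ = one_sample_test_statistic(197.1, 191, 25.6, 100)
print(round(z, 2))  # 2.38
```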
Example:
The National Center for Health Statistics (NCHS) published a report in 2005 entitled Health,
United States, containing extensive information on major trends in the health of Americans.
Data are provided for the US population as a whole and for specific ages, sexes and
races. The NCHS report indicated that in 2002 Americans paid an average of $3,302 per
year on health care and prescription drugs. An investigator hypothesizes that in 2005
expenditures have decreased primarily due to the availability of generic drugs. To test the
hypothesis, a sample of 100 Americans are selected and their expenditures on health care
and prescription drugs in 2005 are measured. The sample data are summarized as follows:
n=100, x̄ = $3,190 and s = $890. Is there statistical evidence of a reduction in expenditures
on health care and prescription drugs in 2005? Is the sample mean of $3,190 evidence of a
true reduction in the mean or is it within chance fluctuation? We will run the test using the
five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ = 3,302 H1: μ < 3,302
α =0.05
The research hypothesis is that expenditures have decreased, and therefore a
lower-tailed test is used.

Step 2. Select the appropriate test statistic.
Because the sample size is large (n>30) the appropriate test statistic is
Z = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
This is a lower tailed test, using a Z statistic and a 5% level of
significance. Reject H0 if Z < -1.645.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
Z = (3,190 - 3,302) / (890/√100) = -112/89 = -1.26.

Step 5. Conclusion.
We do not reject H0 because -1.26 > -1.645. We do not have statistically
significant evidence at α=0.05 to show that the mean expenditures on health
care and prescription drugs are lower in 2005 than the mean of $3,302 reported
in 2002.
Recall that when we fail to reject H0 in a test of hypothesis, either the null hypothesis is
true (here the mean expenditures in 2005 are the same as those in 2002 and equal to
$3,302) or we committed a Type II error (i.e., we failed to reject H0 when in fact it is false). In
summarizing this test, we conclude that we do not have sufficient evidence to reject H0. We
do not conclude that H0 is true, because there may be a moderate to high probability that we
committed a Type II error. It is possible that the sample size is not large enough to detect a
difference in mean expenditures.
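As a quick arithmetic check of this example, here is a sketch assuming SciPy is available for the exact lower-tailed p-value:

```python
import math
from scipy.stats import norm

z = (3190 - 3302) / (890 / math.sqrt(100))
print(round(z, 2))            # -1.26
print(round(norm.cdf(z), 3))  # lower-tailed p-value of about 0.104, well above 0.05
```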
Example.
The NCHS reported that the mean total cholesterol level in 2002 for all adults was 203. Total
cholesterol levels in participants who attended the seventh examination of the Offspring in
the Framingham Heart Study are summarized as follows: n=3,310, x̄ = 200.3, and s = 36.8.
Is there statistical evidence of a difference in mean cholesterol levels in the Framingham
Offspring?
Here we want to assess whether the sample mean of 200.3 in the Framingham sample is
statistically significantly different from 203 (i.e., beyond what we would expect by chance).
We will run the test using the five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ= 203 H1: μ≠ 203
α=0.05
The research hypothesis is that cholesterol levels are different in the
Framingham Offspring, and therefore a two-tailed test is used.

Step 2. Select the appropriate test statistic.
Because the sample size is large (n>30) the appropriate test statistic is
Z = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
This is a two-tailed test, using a Z statistic and a 5% level of significance. Reject
H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
Z = (200.3 - 203) / (36.8/√3,310) = -2.7/0.64 = -4.22.

Step 5. Conclusion.
We reject H0 because -4.22 < -1.960. We have statistically significant evidence
at α=0.05 to show that the mean total cholesterol level in the Framingham
Offspring is different from the national average of 203 reported in
2002. Because we reject H0, we also approximate a p-value. Using the two-sided significance levels, p < 0.0001.
Statistical Significance versus Clinical (Practical) Significance
This example raises an important concept of statistical versus clinical or practical
significance. From a statistical standpoint, the total cholesterol levels in the Framingham
sample are highly statistically significantly different from the national average with p < 0.0001
(i.e., there is less than a 0.01% chance that we are incorrectly rejecting the null hypothesis).
However, the sample mean in the Framingham Offspring study is 200.3, less than 3 units
different from the national mean of 203. The reason that the data are so highly statistically
significant is due to the very large sample size. It is always important to assess both
statistical and clinical significance of data. This is particularly relevant when the sample size
is large. Is a 3 unit difference in total cholesterol a meaningful difference?
Example
Consider again the NCHS-reported mean total cholesterol level in 2002 for all adults of 203.
Suppose a new drug is proposed to lower total cholesterol. A study is designed to evaluate
the efficacy of the drug in lowering cholesterol. Fifteen patients are enrolled in the study
and asked to take the new drug for 6 weeks. At the end of 6 weeks, each patient's total
cholesterol level is measured and the sample statistics are as follows: n=15, x̄ = 195.9 and
s=28.7. Is there statistical evidence of a reduction in mean total cholesterol in patients after
using the new drug for 6 weeks? We will run the test using the five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ= 203 H1: μ< 203

α=0.05
Step 2. Select the appropriate test statistic.
Because the sample size is small (n<30) the appropriate test statistic is
t = (x̄ - μ0) / (s/√n).

Step 3. Set up decision rule.
This is a lower tailed test, using a t statistic and a 5% level of significance. In
order to determine the critical value of t, we need degrees of freedom, df,
defined as df=n-1. In this example df=15-1=14. The critical value for a lower
tailed test with df=14 and α=0.05 is -2.145 and the decision rule is as
follows: Reject H0 if t < -2.145.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
t = (195.9 - 203) / (28.7/√15) = -7.1/7.41 = -0.96.

Step 5. Conclusion.
We do not reject H0 because -0.96 > -2.145. We do not have statistically
significant evidence at α=0.05 to show that the mean total cholesterol level is
lower than the national mean in patients taking the new drug for 6 weeks. Again,
because we failed to reject the null hypothesis we make a weaker concluding
statement allowing for the possibility that we may have committed a Type II error
(i.e., failed to reject H0 when in fact the drug is efficacious).
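As a quick arithmetic check, the test statistic for this small-sample example can be reproduced as follows (a sketch; only the statistic is computed, and the -2.145 it is compared against is the critical value quoted from the module's t table):

```python
import math

# One-sample t statistic: t = (xbar - mu0) / (s / sqrt(n)) with n = 15
t_stat = (195.9 - 203) / (28.7 / math.sqrt(15))
print(round(t_stat, 2))  # -0.96, which does not fall below the module's critical value of -2.145
```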
This example raises an important issue in terms of study design. In this example we assumed in the
null hypothesis that the mean cholesterol level is 203. This is taken to be the mean cholesterol level in
patients without treatment. Is this an appropriate comparator? Alternative and potentially more efficient
study designs to evaluate the effect of the new drug could involve two treatment groups, where one
group receives the new drug and the other does not, or we could measure each patient's baseline or
pre-treatment cholesterol level and then assess changes from baseline to 6 weeks post-treatment.
These designs are discussed in the sections that follow.
Tests with One Sample, Dichotomous Outcome
Hypothesis testing applications with a dichotomous outcome variable in a single population
are also performed according to the five-step procedure. Similar to tests for means, a key
component is setting up the null and research hypotheses. The objective is to compare the
proportion of successes in a single population to a known proportion (p0). That known
proportion is generally derived from another study or report and is sometimes called a
historical control. It is important in setting up the hypotheses in a one sample test that the
proportion specified in the null hypothesis is a fair and reasonable comparator.
In one sample tests for a dichotomous outcome, we set up our hypotheses against an
appropriate comparator. We select a sample and compute descriptive statistics on the
sample data. Specifically, we compute the sample size (n) and the sample proportion which
is computed by taking the ratio of the number of successes (x) to the sample size, p̂ = x/n.
We then determine the appropriate test statistic (Step 2) for the hypothesis test. The formula
for the test statistic is given below.
Test Statistic for Testing H0: p = p0
Z = (p̂ - p0) / √( p0(1-p0)/n )    if min(np0, n(1-p0)) > 5
The formula above is appropriate for large samples, defined when the smaller of np0 and
n(1-p0) is at least 5. This is similar, but not identical, to the condition required for appropriate
use of the confidence interval formula for a population proportion, namely min(np̂, n(1-p̂)) > 5.
Here we use the proportion specified in the null hypothesis as the true proportion of
successes rather than the sample proportion. If we fail to satisfy the condition, then
alternative procedures, called exact methods must be used to test the hypothesis about the
population proportion.
Example
The NCHS report indicated that in 2002 the prevalence of cigarette smoking among
American adults was 21.1%. Data on prevalent smoking in n=3,536 participants who
attended the seventh examination of the Offspring in the Framingham Heart Study indicated
that 482/3,536 = 13.6% of the respondents were currently smoking at the time of the exam.
Suppose we want to assess whether the prevalence of smoking is lower in the Framingham
Offspring sample given the focus on cardiovascular health in that community. Is there
evidence of a statistically lower prevalence of smoking in the Framingham Offspring study as
compared to the prevalence among all Americans?

Step 1. Set up hypotheses and determine level of significance
H0: p = 0.211 H1: p < 0.211

α=0.05
Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to
check min(np0, n(1-p0)) = min( 3,536(0.211), 3,536(1-0.211))=min(746,
2790)=746. The sample size is more than adequate so the following formula can
be used:
Z = (p̂ - p0) / √( p0(1-p0)/n ).

Step 3. Set up decision rule.
This is a lower tailed test, using a Z statistic and a 5% level of significance.
Reject H0 if Z < -1.645.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. The sample proportion is p̂ = 482/3,536 = 0.136, and
Z = (0.136 - 0.211) / √( 0.211(1 - 0.211)/3,536 ) = -10.93.

Step 5. Conclusion.
We reject H0 because -10.93 < -1.645. We have statistically significant evidence
at α=0.05 to show that the prevalence of smoking in the Framingham Offspring
is lower than the prevalence nationally (21.1%). Here, p < 0.0001.
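As a numerical check of this example, here is a sketch assuming SciPy is available; the sample proportion is rounded to 0.136 as in the text:

```python
import math
from scipy.stats import norm

n, p0 = 3536, 0.211
p_hat = 0.136                      # 482/3,536, rounded as in the text

# Large-sample condition: min(n*p0, n*(1-p0)) = min(746, 2790), which exceeds 5
assert min(n * p0, n * (1 - p0)) > 5

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
print(round(z, 2))            # -10.93
print(norm.cdf(z) < 0.0001)   # True: the lower-tailed p-value is far smaller than 0.0001
```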
The NCHS report indicated that in 2002, 75% of children aged 2 to 17 saw a dentist in the past year. An
investigator wants to assess whether use of dental services is similar in children living in the city of Boston. A
sample of 125 children aged 2 to 17 living in Boston are surveyed and 64 reported seeing a dentist over the
past 12 months. Is there a significant difference in use of dental services between children living in Boston and
the national data?
Calculate this on your own before checking the answer.
Tests with Two Independent Samples, Continuous
Outcome
There are many applications where it is of interest to compare two independent groups with
respect to their mean scores on a continuous outcome. Here we compare means between
groups, but rather than generating an estimate of the difference, we will test whether the
observed difference (increase, decrease or difference) is statistically significant or not.
Remember, that hypothesis testing gives an assessment of statistical significance, whereas
estimation gives an estimate of effect and both are important.
Here we discuss the comparison of means when the two comparison groups are
independent or physically separate. The two groups might be determined by a particular
attribute (e.g., sex, diagnosis of cardiovascular disease) or might be set up by the
investigator (e.g., participants assigned to receive an experimental treatment or placebo).
The first step in the analysis involves computing descriptive statistics on each of the two
samples. Specifically, we compute the sample size, mean and standard deviation in each
sample and we denote these summary statistics as follows:
n1, x̄1 and s1 for sample 1, and n2, x̄2 and s2 for sample 2.
The designation of sample 1 and sample 2 is arbitrary. In a clinical trial setting the
convention is to call the treatment group 1 and the control group 2. However, when
comparing men and women, for example, either group can be 1 or 2.
In the two independent samples application with a continuous outcome, the parameter of
interest in the test of hypothesis is the difference in population means, μ1-μ2. The null
hypothesis is always that there is no difference between groups with respect to means, i.e.,
H0: μ1 - μ2 = 0.
The null hypothesis can also be written as follows: H0: μ1 = μ2. In the research hypothesis,
an investigator can hypothesize that the first mean is larger than the second (H1: μ1 > μ2 ),
that the first mean is smaller than the second (H1: μ1 < μ2 ), or that the means are different
(H1: μ1 ≠ μ2 ). The three different alternatives represent upper-, lower-, and two-tailed tests,
respectively. The following test statistics are used to test these hypotheses.
Test Statistics for Testing H0: μ1 = μ2
Z = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2))    if n1 > 30 and n2 > 30
t = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2))    if n1 < 30 or n2 < 30, where df = n1 + n2 - 2.
NOTE: The formulas above assume equal variability in the two populations (i.e., the
population variances are equal, or σ1² = σ2²). This means that the outcome is equally variable
in each of the comparison populations. For analysis, we have samples from each of the
comparison populations. If the sample variances are similar, then the assumption about
variability in the populations is probably reasonable. As a guideline, if the ratio of the sample
variances, s1²/s2², is between 0.5 and 2 (i.e., if one variance is no more than double the
other), then the formulas above are appropriate. If the ratio of the sample variances is
greater than 2 or less than 0.5 then alternative formulas must be used to account for the
heterogeneity in variances.
The test statistics include Sp, which is the pooled estimate of the common standard
deviation (again assuming that the variances in the populations are similar) computed as the
weighted average of the standard deviations in the samples as follows:
Sp = √[ ((n1-1)s1² + (n2-1)s2²) / (n1+n2-2) ]
Because we are assuming equal variances between groups, we pool the information on
variability (sample variances) to generate an estimate of the variability in the
population. (Note: Because Sp is a weighted average of the standard deviations in the
sample, Sp will always be in between s1 and s2.)
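A minimal sketch of Sp and of the two-independent-samples test statistic in Python; the helper names and the summary statistics in the usage line are our own illustrations, not data from the module:

```python
import math

def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate of the common standard deviation, Sp."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def two_sample_statistic(n1, xbar1, s1, n2, xbar2, s2):
    """Statistic for H0: mu1 = mu2 (treated as Z if both samples are large,
    as t with df = n1 + n2 - 2 otherwise)."""
    sp = pooled_sd(n1, s1, n2, s2)
    return (xbar1 - xbar2) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Made-up summary statistics, purely to show the call:
print(round(two_sample_statistic(40, 12.3, 4.1, 45, 10.8, 3.9), 2))  # 1.73
```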
Example
Data measured on n=3,539 participants who attended the seventh examination of the
Offspring in the Framingham Heart Study are shown below.
Men
Characteristic                 n        x̄        s
Systolic Blood Pressure        1,623    128.2    17.5
Diastolic Blood Pressure       1,622    75.6     9.8
Total Serum Cholesterol        1,544    192.4    35.2
Weight                         1,612    194.0    33.8
Height                         1,545    68.9     2.7
Body Mass Index                1,545    28.8     4.6
Suppose we now wish to assess whether there is a statistically significant difference in mean
systolic blood pressures between men and women using a 5% level of significance.

Step 1. Set up hypotheses and determine level of significance
H0: μ1 = μ2 H1: μ1 ≠ μ2

α=0.05
Step 2. Select the appropriate test statistic.
Because both samples are large (> 30), we can use the Z test statistic as
opposed to t. Note that statistical computing packages use t throughout. Before
implementing the formula, we first check whether the assumption of equality of
population variances is reasonable. The guideline suggests investigating the
ratio of the sample variances, s12/s22. Suppose we call the men group 1 and the
women group 2. Again, this is arbitrary; it only needs to be noted when
interpreting the results. The ratio of the sample variances is 17.5²/20.1² = 0.76,
which falls between 0.5 and 2 suggesting that the assumption of equality of
population variances is reasonable. The appropriate test statistic is
Z = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2)).

Step 3. Set up decision rule.
This is a two-tailed test, using a Z statistic and a 5% level of significance. Reject
H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. Before substituting, we will first compute Sp, the pooled
estimate of the common standard deviation.
Notice that the pooled estimate of the common standard deviation, Sp, falls in
between the standard deviations in the comparison groups (i.e., 17.5 and 20.1).
Sp is slightly closer in value to the standard deviation in the women (20.1) as
there were slightly more women in the sample. Recall, Sp is a weighted average
of the standard deviations in the comparison groups, weighted by the respective
sample sizes.
Now the test statistic:

Step 5. Conclusion.
We reject H0 because 2.66 > 1.960. We have statistically significant evidence at
α=0.05 to show that there is a difference in mean systolic blood pressures
between men and women. The p-value is p < 0.010.
Here again we find that there is a statistically significant difference in mean systolic blood
pressures between men and women at p < 0.010. Notice that there is a very small difference
in the sample means (128.2-126.5 = 1.7 units), but this difference is beyond what would be
expected by chance. Is this a clinically meaningful difference? The large sample size in this
example is driving the statistical significance. A 95% confidence interval for the difference in
mean systolic blood pressures is: 1.7 ± 1.26 or (0.44, 2.96). The confidence interval provides
an assessment of the magnitude of the difference between means whereas the test of
hypothesis and p-value provide an assessment of the statistical significance of the
difference.
Above we performed a study to evaluate a new drug designed to lower total cholesterol. The
study involved one sample of patients, each patient took the new drug for 6 weeks and had
their cholesterol measured. As a means of evaluating the efficacy of the new drug, the mean
total cholesterol following 6 weeks of treatment was compared to the NCHS-reported mean
total cholesterol level in 2002 for all adults of 203. At the end of the example, we discussed
the appropriateness of the fixed comparator as well as an alternative study design to
evaluate the effect of the new drug involving two treatment groups, where one group
receives the new drug and the other does not. Here, we revisit the example with a
concurrent or parallel control group, which is very typical in randomized controlled trials or
clinical trials (refer to the EP713 module on Clinical Trials).
Example
A new drug is proposed to lower total cholesterol. A randomized controlled trial is designed
to evaluate the efficacy of the medication in lowering cholesterol. Thirty participants are
enrolled in the trial and are randomly assigned to receive either the new drug or a placebo.
The participants do not know which treatment they are assigned. Each participant is asked
to take the assigned treatment for 6 weeks. At the end of 6 weeks, each patient's total
cholesterol level is measured and the sample statistics are as follows.
Treatment
Sample Size
Mean
Standard Deviation
New Drug
15
195.9
28.7
Placebo
15
217.4
30.3
Is there statistical evidence of a reduction in mean total cholesterol in patients taking the new
drug for 6 weeks as compared to participants taking placebo? We will run the test using the
five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μ1 = μ2 H1: μ1 < μ2

α=0.05
Step 2. Select the appropriate test statistic.
Because both samples are small (< 30), we use the t test statistic. Before
implementing the formula, we first check whether the assumption of equality of
population variances is reasonable. The ratio of the sample variances,
s1²/s2² = 28.7²/30.3² = 0.90, which falls between 0.5 and 2, suggesting that the
assumption of equality of population variances is reasonable. The appropriate
test statistic is:
t = (x̄1 - x̄2) / (Sp √(1/n1 + 1/n2)).

Step 3. Set up decision rule.
This is a lower-tailed test, using a t statistic and a 5% level of significance. The
appropriate critical value can be found in the t Table (in More Resources to the
right). In order to determine the critical value of t we need degrees of freedom,
df, defined as df=n1+n2-2 = 15+15-2=28. The critical value for a lower tailed test
with df=28 and α=0.05 is -2.048 and the decision rule is: Reject H0 if t < -2.048.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. Before substituting, we will first compute Sp, the pooled
estimate of the common standard deviation.
Now the test statistic,

Step 5. Conclusion.
We reject H0 because -2.92 < -2.048. We have statistically significant evidence
at α=0.05 to show that the mean total cholesterol level is lower in patients taking
the new drug for 6 weeks as compared to patients taking placebo, p < 0.005.
The clinical trial in this example finds a statistically significant reduction in total cholesterol,
whereas in the previous example where we had a historical control (as opposed to a parallel
control group) we did not demonstrate efficacy of the new drug. Notice that the mean total
cholesterol level in patients taking placebo is 217.4 which is very different from the mean
cholesterol reported among all Americans in 2002 of 203 and used as the comparator in the
prior example. The historical control value may not have been the most appropriate
comparator as cholesterol levels have been increasing over time. In the next section, we
present another design that can be used to assess the efficacy of the new drug.
Tests with Matched Samples, Continuous Outcome
In the previous section we compared two groups with respect to their mean scores on a
continuous outcome. An alternative study design is to compare matched or paired samples.
The two comparison groups are said to be dependent, and the data can arise from a single
sample of participants where each participant is measured twice (possibly before and after
an intervention) or from two samples that are matched on specific characteristics (e.g.,
siblings). When the samples are dependent, we focus on difference scores in each
participant or between members of a pair and the test of hypothesis is based on the mean
difference, μd. The null hypothesis again reflects "no difference" and is stated as H0: μd =0 .
Note that there are some instances where it is of interest to test whether there is a difference
of a particular magnitude (e.g., μd =5) but in most instances the null hypothesis reflects no
difference (i.e., μd=0).
The appropriate formula for the test of hypothesis depends on the sample size. The formulas
are shown below and are identical to those we presented for estimating the mean of a single
sample presented (e.g., when comparing against an external or historical control), except
here we focus on difference scores.
Test Statistics for Testing H0: μd = 0
Z = x̄d / (sd/√n)    if n > 30
t = x̄d / (sd/√n)    if n < 30, where df = n-1
Example
A new drug is proposed to lower total cholesterol and a study is designed to evaluate the
efficacy of the drug in lowering cholesterol. Fifteen patients agree to participate in the study
and each is asked to take the new drug for 6 weeks. However, before starting the treatment,
each patient's total cholesterol level is measured. The initial measurement is a pre-treatment
or baseline value. After taking the drug for 6 weeks, each patient's total cholesterol level is
measured again and the data are shown below. The rightmost column contains difference
scores for each patient, computed by subtracting the 6 week cholesterol level from the
baseline level. The differences represent the reduction in total cholesterol over 6 weeks.
(The differences could have been computed by subtracting the baseline total cholesterol
level from the level measured at 6 weeks. The way in which the differences are computed
does not affect the outcome of the analysis only the interpretation.)
Subject Identification Number    Baseline    6 Weeks
1                                215         205
2                                190         156
3                                230         190
4                                220         180
5                                214         201
6                                240         227
7                                210         197
8                                193         173
9                                210         204
10                               230         217
11                               180         142
12                               260         262
13                               210         207
14                               190         184
15                               200         193
Because the differences are computed by subtracting the cholesterols measured at 6 weeks
from the baseline values, positive differences indicate reductions and negative differences
indicate increases (e.g., participant 12 increases by 2 units over 6 weeks). The goal here is
to test whether there is a statistically significant reduction in cholesterol. Because of the way
in which we computed the differences, we want to look for an increase in the mean
difference (i.e., a positive reduction). In order to conduct the test, we need to summarize the
differences. In this sample, we have n = 15, x̄d = 16.9 and sd = 14.2.
The calculations are shown below.
Subject Identification Number    Difference    Difference²
1                                10            100
2                                34            1156
3                                40            1600
4                                40            1600
5                                13            169
6                                13            169
7                                13            169
8                                20            400
9                                6             36
10                               13            169
11                               38            1444
12                               -2            4
13                               3             9
14                               6             36
15                               7             49
Total                            254           7110
Is there statistical evidence of a reduction in mean total cholesterol in patients after using the
new medication for 6 weeks? We will run the test using the five-step approach.

Step 1. Set up hypotheses and determine level of significance
H0: μd = 0 H1: μd > 0
α=0.05
NOTE: If we had computed differences by subtracting the baseline level from
the level measured at 6 weeks then negative differences would have reflected
reductions and the research hypothesis would have been H1: μd < 0.

Step 2. Select the appropriate test statistic.
Because the sample size is small (n<30) the appropriate test statistic is
t = x̄d / (sd/√n).

Step 3. Set up decision rule.
This is an upper-tailed test, using a t statistic and a 5% level of significance. The
appropriate critical value can be found in the t Table at the right, with df=15-1=14. The critical value for an upper-tailed test with df=14 and α=0.05 is 2.145
and the decision rule is Reject H0 if t > 2.145.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2:
t = 16.9 / (14.2/√15) = 4.61.

Step 5. Conclusion.
We reject H0 because 4.61 > 2.145. We have statistically significant evidence at
α=0.05 to show that there is a reduction in cholesterol levels over 6 weeks.
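The difference-score calculations in this example can be reproduced directly from the baseline and 6-week values; a sketch in Python (carrying full precision gives t ≈ 4.63, in line with the 4.61 obtained above from the rounded summaries 16.9 and 14.2):

```python
import math

baseline = [215, 190, 230, 220, 214, 240, 210, 193, 210, 230, 180, 260, 210, 190, 200]
week6    = [205, 156, 190, 180, 201, 227, 197, 173, 204, 217, 142, 262, 207, 184, 193]

diffs = [b - w for b, w in zip(baseline, week6)]   # baseline minus 6 weeks, so positive = reduction
n = len(diffs)
mean_d = sum(diffs) / n                            # 254 / 15, about 16.9
sd_d = math.sqrt((sum(d**2 for d in diffs) - sum(diffs)**2 / n) / (n - 1))   # about 14.2
t_stat = mean_d / (sd_d / math.sqrt(n))
print(round(mean_d, 1), round(sd_d, 1), round(t_stat, 2))   # 16.9 14.2 4.63
```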
Here we illustrate the use of a matched design to test the efficacy of a new drug to lower
total cholesterol. We also considered a parallel design (randomized clinical trial) and a study
using a historical comparator. It is extremely important to design studies that are best suited
to detect a meaningful difference when one exists. There are often several alternatives and
investigators work with biostatisticians to determine the best design for each application. It is
worth noting that the matched design used here can be problematic in that observed
differences may only reflect a "placebo" effect. All participants took the assigned medication,
but is the observed reduction attributable to the medication or a result of their
participation in a study?
Tests with Two Independent Samples, Dichotomous
Outcome
Here we consider the situation where there are two independent comparison groups and the
outcome of interest is dichotomous (e.g., success/failure). The goal of the analysis is to
compare proportions of successes between the two groups. The relevant sample data are
the sample sizes in each comparison group (n1 and n2) and the sample proportions (p̂1
and p̂2), which are computed by taking the ratios of the numbers of successes to the
sample sizes in each group, i.e., p̂1 = x1/n1 and p̂2 = x2/n2.
There are several approaches that can be used to test hypotheses concerning two
independent proportions. Here we present one approach - the chi-square test of
independence is an alternative, equivalent, and perhaps more popular approach to the same
analysis. Hypothesis testing with the chi-square test is addressed in the third module in this
series: BS704_HypothesisTesting-ChiSquare.
In tests of hypothesis comparing proportions between two independent groups, one test is
performed and results can be interpreted to apply to a risk difference, relative risk or odds
ratio. As a reminder, the risk difference is computed by taking the difference in proportions
between comparison groups, the risk ratio is computed by taking the ratio of proportions, and
the odds ratio is computed by taking the ratio of the odds of success in the comparison
groups. Because the null values for the risk difference, the risk ratio and the odds ratio are
different, the hypotheses in tests of hypothesis look slightly different depending on which
measure is used. When performing tests of hypothesis for the risk difference, relative risk or
odds ratio, the convention is to label the exposed or treated group 1 and the unexposed or
control group 2.
For example, suppose a study is designed to assess whether there is a significant difference
in proportions in two independent comparison groups. The test of interest is as follows:
H0: p1 = p2 versus H1: p1 ≠ p2.
The following are the hypothesis for testing for a difference in proportions using the risk
difference, the risk ratio and the odds ratio. First, the hypotheses above are equivalent to the
following:


For the risk difference, H0: p1 - p2 = 0 versus H1: p1 - p2 ≠ 0 which are, by definition,
equal to H0: RD = 0 versus H1: RD ≠ 0.
If an investigator wants to focus on the risk ratio, the equivalent hypotheses are H0:
RR = 1 versus H1: RR ≠ 1.

If the investigator wants to focus on the odds ratio, the equivalent hypotheses are H0:
OR = 1 versus H1: OR ≠ 1.
Suppose a test is performed to test H0: RD = 0 versus H1: RD ≠ 0 and the test rejects H0 at
α=0.05. Based on this test we can conclude that there is significant evidence, α=0.05, of a
difference in proportions, significant evidence that the risk difference is not zero, significant
evidence that the risk ratio and odds ratio are not one. The risk difference is analogous to
the difference in means when the outcome is continuous. Here the parameter of interest is
the difference in proportions in the population, RD = p1-p2 and the null value for the risk
difference is zero. In a test of hypothesis for the risk difference, the null hypothesis is always
H0: RD = 0. This is equivalent to H0: RR = 1 and H0: OR = 1. In the research hypothesis, an
investigator can hypothesize that the first proportion is larger than the second (H1: p 1 > p 2 ,
which is equivalent to H1: RD > 0, H1: RR > 1 and H1: OR > 1), that the first proportion is
smaller than the second (H1: p 1 < p 2 , which is equivalent to H1: RD < 0, H1: RR < 1 and H1:
OR < 1), or that the proportions are different (H1: p 1 ≠ p 2 , which is equivalent to H1: RD ≠ 0,
H1: RR ≠ 1 and H1: OR ≠ 1). The three different alternatives represent upper-, lower- and
two-tailed tests, respectively.
The formula for the test of hypothesis for the difference in proportions is given below.
Test Statistic for Testing H0: p1 = p2
Z = (p̂1 - p̂2) / √[ p̂(1 - p̂)(1/n1 + 1/n2) ]
where p̂1 is the proportion of successes in sample 1, p̂2 is the proportion of successes in
sample 2, and p̂ is the proportion of successes in the pooled sample. p̂ is computed by
summing all of the successes and dividing by the total sample size, p̂ = (x1 + x2)/(n1 + n2)
(this is similar to the pooled estimate of the standard deviation, Sp, used in two independent
samples tests with a continuous outcome; just as Sp is in between s1 and s2, p̂ will be in
between p̂1 and p̂2).
The formula above is appropriate for large samples, defined as at least 5 successes (np̂ > 5)
and at least 5 failures (n(1-p̂) > 5) in each of the two samples. If there are fewer than 5
successes or failures in either comparison group, then alternative procedures, called exact
methods must be used to estimate the difference in population proportions.
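A minimal sketch of this test statistic in Python; the function name and the counts in the usage line are our own illustrations, not data from the module:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 = p2, using the pooled proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

# Made-up counts: 30/80 successes in group 1 versus 18/75 in group 2
print(round(two_proportion_z(30, 80, 18, 75), 2))  # 1.82 for these illustrative counts
```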
Example
The following table summarizes data from n=3,799 participants who attended the fifth
examination of the Offspring in the Framingham Heart Study. The outcome of interest is
prevalent CVD and we want to test whether the prevalence of CVD is significantly higher in
smokers as compared to non-smokers.
                    Free of CVD    History of CVD    Total
Non-Smoker          2,757          298               3,055
Current Smoker      663            81                744
Total               3,420          379               3,799
The prevalence of CVD (or proportion of participants with prevalent CVD) among non-smokers is 298/3,055 = 0.0975 and the prevalence of CVD among current smokers is
81/744 = 0.1089. Here smoking status defines the comparison groups and we will call the
current smokers group 1 (exposed) and the non-smokers (unexposed) group 2. The test of
hypothesis is conducted below using the five step approach.

Step 1. Set up hypotheses and determine level of significance
H0: p1 = p2

H1: p1 ≠ p2
α=0.05
Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to
ensure that we have at least 5 successes and 5 failures in each comparison
group. In this example, we have more than enough successes (cases of
prevalent CVD) and failures (persons free of CVD) in each comparison group.
The sample size is more than adequate so the following formula can be used:
Z = (p̂1 - p̂2) / √[ p̂(1 - p̂)(1/n1 + 1/n2) ].

Step 3. Set up decision rule.
Reject H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. We first compute the overall proportion of successes:
p̂ = (81 + 298) / (744 + 3,055) = 379/3,799 = 0.0998.
We now substitute to compute the test statistic:
Z = (0.1089 - 0.0975) / √[ 0.0998(1 - 0.0998)(1/744 + 1/3,055) ] = 0.0114/0.0123 = 0.927.

Step 5. Conclusion.
We do not reject H0 because -1.960 < 0.927 < 1.960. We do not have
statistically significant evidence at α=0.05 to show that there is a difference in
prevalent CVD between smokers and non-smokers.
A 95% confidence interval for the difference in prevalent CVD (or risk difference) between
smokers and non-smokers is 0.0114 ± 0.0247, or between -0.0133 and 0.0361. Because
the 95% confidence interval for the risk difference includes zero, we again conclude that
there is no statistically significant difference in prevalent CVD between smokers and non-smokers.
Smoking has been shown over and over to be a risk factor for cardiovascular disease. What
might explain the fact that we did not observe a statistically significant difference using data
from the Framingham Heart Study? HINT: Here we consider prevalent CVD, would the
results have been different if we considered incident CVD?
Example
A randomized trial is designed to evaluate the effectiveness of a newly developed pain
reliever designed to reduce pain in patients following joint replacement surgery. The trial
compares the new pain reliever to the pain reliever currently in use (called the standard of
care). A total of 100 patients undergoing joint replacement surgery agreed to participate in
the trial. Patients were randomly assigned to receive either the new pain reliever or the
standard pain reliever following surgery and were blind to the treatment assignment. Before
receiving the assigned treatment, patients were asked to rate their pain on a scale of 0-10
with higher scores indicative of more pain. Each patient was then given the assigned
treatment and after 30 minutes was again asked to rate their pain on the same scale. The
primary outcome was a reduction in pain of 3 or more scale points (defined by clinicians as a
clinically meaningful reduction). The following data were observed in the trial.
Treatment Group
n
Number with Reduction
of 3+ Points
New Pain Reliever
50
23
Standard Pain Reliever
50
11
We now test whether there is a statistically significant difference in the proportions of
patients reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) using
the five step approach.

Step 1. Set up hypotheses and determine level of significance
H0: p1 = p2
H1: p1 ≠ p2
α=0.05
Here the new or experimental pain reliever is group 1 and the standard pain
reliever is group 2.

Step 2. Select the appropriate test statistic.
We must first check that the sample size is adequate. Specifically, we need to
ensure that we have at least 5 successes and 5 failures in each comparison
group, i.e., min(n1p̂1, n1(1-p̂1), n2p̂2, n2(1-p̂2)) > 5.
In this example, we have min(50(0.46), 50(1-0.46), 50(0.22), 50(1-0.22)) =
min(23, 27, 11, 39) = 11. The sample size is adequate so the following formula
can be used:
Z = (p̂1 - p̂2) / √[ p̂(1 - p̂)(1/n1 + 1/n2) ].

Step 3. Set up decision rule.
Reject H0 if Z < -1.960 or if Z > 1.960.

Step 4. Compute the test statistic.
We now substitute the sample data into the formula for the test statistic
identified in Step 2. We first compute the overall proportion of successes:
p̂ = (23 + 11) / (50 + 50) = 34/100 = 0.34.
We now substitute to compute the test statistic:
Z = (0.46 - 0.22) / √[ 0.34(1 - 0.34)(1/50 + 1/50) ] = 0.24/0.095 = 2.526.

Step 5. Conclusion.
We reject H0 because 2.526 > 1.960. We have statistically significant evidence at
α=0.05 to show that there is a difference in the proportions of patients on the
new pain reliever reporting a meaningful reduction (i.e., a reduction of 3 or more
scale points) as compared to patients on the standard pain reliever.
A 95% confidence interval for the difference in proportions of patients on the new pain
reliever reporting a meaningful reduction (i.e., a reduction of 3 or more scale points) as
compared to patients on the standard pain reliever is 0.24 ± 0.18, or between 0.06 and 0.42.
Because the 95% confidence interval does not include zero we concluded that there was a
statistically significant difference in proportions which is consistent with the test of hypothesis
result.
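As a quick numerical check of this example, here is a sketch in Python; the test statistic uses the pooled proportion, and the confidence interval uses the unpooled standard error, which reproduces the ± 0.18 margin reported above:

```python
import math

p1, p2, n1, n2 = 23 / 50, 11 / 50, 50, 50
p_pooled = (23 + 11) / (n1 + n2)

z = (p1 - p2) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
print(round(z, 2))   # about 2.53 (the module, carrying rounded intermediates, reports 2.526)

# 95% CI for the risk difference, using the unpooled standard error
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(round(p1 - p2 - 1.96 * se, 2), round(p1 - p2 + 1.96 * se, 2))   # 0.06 0.42
```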
Again, the procedures discussed here apply to applications where there are two independent
comparison groups and a dichotomous outcome. There are other applications in which it is
of interest to compare a dichotomous outcome in matched or paired samples. For example,
in a clinical trial we might wish to test the effectiveness of a new antibiotic eye drop for the
treatment of bacterial conjunctivitis. Participants use the new antibiotic eye drop in one eye
and a comparator (placebo or active control treatment) in the other. The success of the
treatment (yes/no) is recorded for each participant for each eye. Because the two
assessments (success or failure) are paired, we cannot use the procedures discussed here.
The appropriate test is called McNemar's test (sometimes called McNemar's test for
dependent proportions).
Summary
Here we presented hypothesis testing techniques for means and proportions in one and two
sample situations. Tests of hypothesis involve several steps, including specifying the null
and alternative or research hypothesis, selecting and computing an appropriate test statistic,
setting up a decision rule and drawing a conclusion. There are many details to consider in
hypothesis testing. The first is to determine the appropriate test. We discussed Z and t tests
here for different applications. The appropriate test depends on the distribution of the
outcome variable (continuous or dichotomous), the number of comparison groups (one, two)
and whether the comparison groups are independent or dependent. The following table
summarizes the different tests of hypothesis discussed here.
Outcome Variable, Number of Groups: Null Hypothesis; Test Statistic
Continuous Outcome, One Sample: H0: μ = μ0; Z = (x̄ - μ0)/(s/√n) if n > 30, t with df = n-1 if n < 30
Continuous Outcome, Two Independent Samples: H0: μ1 = μ2; Z or t = (x̄1 - x̄2)/(Sp√(1/n1 + 1/n2)), with df = n1+n2-2 for t
Continuous Outcome, Two Matched Samples: H0: μd = 0; Z or t = x̄d/(sd/√n), with df = n-1 for t
Dichotomous Outcome, One Sample: H0: p = p0; Z = (p̂ - p0)/√(p0(1-p0)/n)
Dichotomous Outcome, Two Independent Samples: H0: p1 = p2 (RD=0, RR=1, OR=1); Z = (p̂1 - p̂2)/√(p̂(1-p̂)(1/n1 + 1/n2))
Once the type of test is determined, the details of the test must be specified. Specifically, the
null and alternative hypotheses must be clearly stated. The null hypothesis always reflects
the "no change" or "no difference" situation. The alternative or research hypothesis reflects
the investigator's belief. The investigator might hypothesize that a parameter (e.g., a mean,
proportion, difference in means or proportions) will increase, will decrease or will be different
under specific conditions (sometimes the conditions are different experimental conditions
and other times the conditions are simply different groups of participants). Once the
hypotheses are specified, data are collected and summarized. The appropriate test is then
conducted according to the five step approach. If the test leads to rejection of the null
hypothesis, an approximate p-value is computed to summarize the significance of the
findings. When tests of hypothesis are conducted using statistical computing packages,
exact p-values are computed. Because the statistical tables in this textbook are limited, we
can only approximate p-values. If the test fails to reject the null hypothesis, then a weaker
concluding statement is made for the following reason.
In hypothesis testing, there are two types of errors that can be committed. A Type I error
occurs when a test incorrectly rejects the null hypothesis. This is referred to as a false
positive result, and the probability that this occurs is equal to the level of significance, α. The
investigator chooses the level of significance in Step 1, and purposely chooses a small value
such as α=0.05 to control the probability of committing a Type I error. A Type II error occurs
when a test fails to reject the null hypothesis when in fact it is false. The probability that this
occurs is equal to β. Unfortunately, the investigator cannot specify β at the outset because it
depends on several factors including the sample size (smaller samples have higher β), the
level of significance (β decreases as α increases), and the difference in the parameter under
the null and alternative hypothesis.
We noted in several examples in this chapter the relationship between confidence intervals
and tests of hypothesis. The approaches are different, yet related. It is possible to draw a
conclusion about statistical significance by examining a confidence interval. For example, if a
95% confidence interval does not contain the null value (e.g., zero when analyzing a mean
difference or risk difference, one when analyzing relative risks or odds ratios), then one can
conclude that a two-sided test of hypothesis would reject the null at α=0.05. It is important to
note that the correspondence between a confidence interval and test of hypothesis relates to
a two-sided test and that the confidence level corresponds to a specific level of significance
(e.g., 95% to α=0.05, 90% to α=0.10 and so on). The exact significance of the test, the p-value, can only be determined using the hypothesis testing approach, and the p-value
provides an assessment of the strength of the evidence and not an estimate of the effect.
Standard Error of the Mean (2 of 2)
A graph of the effect of sample size on the standard error for a standard deviation
of 10 is shown below:
As you can see, the function levels off. Increasing the sample size by a few
subjects makes a big difference when the sample size is small but makes much
less of a difference when the sample size is large. Notice that the graph is
consistent with the formulas. If σM is 10 for a sample size of 1, then σM should be
equal to 10/√25 = 2 for a sample size of 25. When s is used as an estimate of σ, the
estimated standard error of the mean is sM = s/√n. The standard error of the
mean is used in the computation of confidence intervals and significance tests for
the mean.
Fundamentals of Statistics 3: Sampling :: The standard error of the mean
We saw with the sampling distribution of the mean that every sample
we take to estimate the unknown population parameter will
overestimate or underestimate the mean by some amount. But what's
interesting is that the distribution of all these sample means will itself
be normally distributed, even if the population is not normally
distributed. The central limit theorem states that the mean of the
sampling distribution of the mean will be the unknown population
mean. The standard deviation of the sampling distribution of the mean
is called the standard error. In fact, it is just another standard
deviation, we just call it the standard error so we know we're talking
about the standard deviation of the sample means instead of the
standard deviation of the raw data. The standard deviation of data is
the average distance values are from the mean.
Ok, so, the variability of the sample means is called the standard
error of the mean or the standard deviation of the mean (these
terms will be used interchangeably since they mean the same thing)
and it looks like this.
Standard Error of the Mean (SEM) = σ / √n
The symbol σ (sigma) represents the population standard deviation and
n is the sample size. Population parameters are symbolized using
Greek symbols and we almost never know the population parameters.
That is also the case with the standard error. Just like we estimated
the population standard deviation using the sample standard
deviation, we can estimate the population standard error using the
sample standard deviation.
When we repeatedly sample from a population, the mean of each
sample will vary far less than any individual value. For example, when
we take random samples of women's heights, while any individual
height will vary by as much as 12 inches (a woman who is 5'10 and
one who is 4'10), the mean will only vary by a few inches.
The distribution of sample means varies far less than the individual values
in a sample. If we know the population mean height of women is 65 inches
then it would be extremely rare to have a sample mean of 30 women at 74
inches.
In fact, if we took a sample of 30 women and found an average height
of 6'1, then we would wonder whether these were really from the total
population of women. Perhaps it was a population of Olympic
Volleyball players. It is possible that a random sample of women from
the general population could be 6'1 but it is extremely rare (like
winning the lottery).
The standard deviation tells us how much variation we can expect in a
population. We know from the empirical rule that 95% of values will
fall within 2 standard deviations of the mean. Since the standard error
is just the standard deviation of the distribution of sample means, we
can also use this rule.
So how much variation in the standard error of the mean should we
expect from chance alone? Using the empirical rule we'd expect 68%
of our sample means to fall within 1 standard error of the true
unknown population mean. 95% would fall within 2 standard errors
and about 99.7% of the sample means will be within 3 standard errors
of the population mean. Just as z-scores can be used to understand
the probability of obtaining a raw value given the mean and standard
deviation, we can do the same thing with sample means.
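To make this concrete, here is a small numerical sketch; the population standard deviation of 3 inches is an assumed value for illustration, not taken from the text:

```python
import math

sigma = 3.0     # assumed population SD of women's heights, in inches (illustrative)
mu = 65.0       # population mean height from the text
n = 30

sem = sigma / math.sqrt(n)
print(round(sem, 2))   # about 0.55 inches

# How unusual is a sample mean of 73 inches (6'1") for 30 women?
z = (73 - mu) / sem
print(round(z, 1))     # about 14.6 standard errors above the mean, essentially impossible by chance
```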
Sampling Distribution of Difference Between Means
Author(s)
David M. Lane
Prerequisites
Sampling Distributions, Sampling Distribution of the
Mean, Variance Sum Law I
Learning Objectives
1. State the mean and variance of the sampling distribution of
the difference between means
2. Compute the standard error of the difference between
means
3. Compute the probability of a difference between means
being above a specified value
Statistical analyses are very often concerned with the difference
between means. A typical example is an experiment designed to
compare the mean of a control group with the mean of an
experimental group. Inferential statistics used in the analysis of
this type of experiment depend on the sampling distribution of
the difference between means.
The sampling distribution of the difference between means
can be thought of as the distribution that would result if we
repeated the following three steps over and over again: (1)
sample n1 scores from Population 1 and n2 scores from Population
2, (2) compute the means of the two samples (M1 and M2), and
(3) compute the difference between means, M1 - M2. The
distribution of the differences between means is the sampling
distribution of the difference between means.
As you might expect, the mean of the sampling distribution of
the difference between means is:
μM1-M2 = μ1 - μ2
which says that the mean of the distribution of differences
between sample means is equal to the difference between
population means. For example, say that the mean test score of
all 12-year-olds in a population is 34 and the mean of 10-year-olds is 25. If numerous samples were taken from each age group
and the mean difference computed each time, the mean of these
numerous differences between sample means would be 34 - 25 =
9.
From the variance sum law, we know that:
σ²M1-M2 = σ²M1 + σ²M2
which says that the variance of the sampling distribution of the
difference between means is equal to the variance of the
sampling distribution of the mean for Population 1 plus the
variance of the sampling distribution of the mean for Population
2. Recall the formula for the variance of the sampling distribution
of the mean:
σ²M = σ² / n
Since we have two populations and two sample sizes, we
need to distinguish between the two variances and sample sizes.
We do this by using the subscripts 1 and 2. Using this convention,
we can write the formula for the variance of the sampling
distribution of the difference between means as:
σ²M1-M2 = σ1²/n1 + σ2²/n2
Since the standard error of a sampling distribution is the standard
deviation of the sampling distribution, the standard error of the
difference between means is:
σM1-M2 = √(σ1²/n1 + σ2²/n2)
Just to review the notation, the symbol on the left contains a
sigma (σ), which means it is a standard deviation. The subscripts
M1 - M2 indicate that it is the standard deviation of the sampling
distribution of M1 - M2.
Now let's look at an application of this formula. Assume there
are two species of green beings on Mars. The mean height of
Species 1 is 32 while the mean height of Species 2 is 22. The
variances of the two species are 60 and 70, respectively and the
heights of both species are normally distributed. You randomly
sample 10 members of Species 1 and 14 members of Species 2.
What is the probability that the mean of the 10 members of
Species 1 will exceed the mean of the 14 members of Species 2
by 5 or more? Without doing any calculations, you probably know
that the probability is pretty high since the difference in
population means is 10. But what exactly is the probability?
First, let's determine the sampling distribution of the
difference between means. Using the formulas above, the mean is
μM1-M2 = 32 - 22 = 10.
The standard error is:
σM1-M2 = √(60/10 + 70/14) = √11 = 3.317.
The sampling distribution is shown in Figure 1. Notice that it is
normally distributed with a mean of 10 and a standard deviation
of 3.317. The area above 5 is shaded blue.
Figure 1. The sampling distribution of the difference between
means.
The last step is to determine the area that is shaded blue. Using
either a Z table or the normal calculator, the area can be
determined to be 0.934. Thus the probability that the mean of
the sample from Species 1 will exceed the mean of the sample
from Species 2 by 5 or more is 0.934.
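As a quick check of this result with software, here is a minimal Python sketch of the same calculation, assuming scipy is available; the mean and standard error are the ones computed above.

```python
# Mars example: mean difference 32 - 22 = 10, standard error
# sqrt(60/10 + 70/14) = sqrt(11) ~ 3.317, probability the sample
# mean difference exceeds 5.
from math import sqrt
from scipy.stats import norm

mean_diff = 32 - 22                      # 10
se_diff = sqrt(60 / 10 + 70 / 14)        # about 3.317
p = norm.sf(5, loc=mean_diff, scale=se_diff)
print(round(se_diff, 3), round(p, 3))    # 3.317, 0.934
```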
As shown below, the formula for the standard error of the
difference between means is much simpler if the sample sizes
and the population variances are equal. When the variances and sample sizes are the same, there is no need to use the subscripts 1 and 2 to differentiate these terms, and the standard error reduces to
σM1-M2 = √(2σ²/n).
This simplified version of the formula can be used for the
following problem: The mean height of 15-year-old boys (in cm)
is 175 and the variance is 64. For girls, the mean is 165 and the
variance is 64. If eight boys and eight girls were sampled, what is
the probability that the mean height of the sample of girls would
be higher than the mean height of the sample of boys? In other
words, what is the probability that the mean height of girls minus
the mean height of boys is greater than 0?
As before, the problem can be solved in terms of the sampling
distribution of the difference between means (girls - boys). The
mean of the distribution is 165 - 175 = -10. The standard deviation of the distribution is:
σM1-M2 = √(2 × 64/8) = 4.
A graph of the distribution is shown in Figure 2. It is clear that
it is unlikely that the mean height for girls would be higher than
the mean height for boys since in the population boys are quite a
bit taller. Nonetheless it is not inconceivable that the girls' mean
could be higher than the boys' mean.
Figure 2. Sampling distribution of the difference between mean
heights.
A difference between means of 0 or higher is a difference of
10/4 = 2.5 standard deviations above the mean of -10. The
probability of a score 2.5 or more standard deviations above the
mean is 0.0062.
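The same calculation can be sketched in Python with scipy, using the mean of -10 and standard deviation of 4 derived above.

```python
# Height example: girls minus boys has mean 165 - 175 = -10 and
# standard deviation sqrt(2 * 64 / 8) = 4; we want P(difference > 0).
from math import sqrt
from scipy.stats import norm

mean_diff = 165 - 175                 # -10
sd_diff = sqrt(2 * 64 / 8)            # 4.0
p = norm.sf(0, loc=mean_diff, scale=sd_diff)
print(sd_diff, round(p, 4))           # 4.0, 0.0062
```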
Question 1 out of 4.
Population 1 has a mean of 20 and a variance of 100. Population
2 has a mean of 15 and a variance of 64. You sample 20 scores
from Pop 1 and 16 scores from Pop 2. What is the mean of the sampling distribution of the difference between means (Pop 1 - Pop 2)?
Test of significance for small samples
So far we have discussed problems belonging to large samples. When a small sample (size < 30) is considered, the above tests are inapplicable because the assumptions we made for large-sample tests do not hold for small samples. In the case of small samples it is not possible to assume (i) that the random sampling distribution of a statistic is normal, and (ii) that the sample values are sufficiently close to the population values to calculate the S.E. of the estimate.
Thus an entirely new approach is required to deal with problems of small samples.
But one should note that the methods and theory of small samples are applicable to large samples, while the converse is not true.
Degrees of freedom (df): By degrees of freedom we mean the number of classes to which values can be assigned arbitrarily, or at will, without violating the restrictions or limitations placed on them.
For example, suppose we are asked to choose any 4 numbers whose total is 50. Clearly we are free to choose any 3 numbers, say 10, 23 and 7, but the fourth number is then fixed at 10, since the total must be 50 [50 - (10 + 23 + 7) = 10]. Thus we are given one restriction, and hence the freedom of selection is 4 - 1 = 3.
The degrees of freedom (df) are denoted by ν (nu) or df and given by ν = n - k, where n = number of classes and k = number of independent constraints (or restrictions).
In general, for a Binomial distribution, ν = n - 1.
For a Poisson distribution, ν = n - 2 (since we use the total frequency and the arithmetic mean).
For a normal distribution, ν = n - 3 (since we use the total frequency, the mean and the standard deviation), etc.
Student's t-distribution
This concept was introduced by W. S. Gosset (1876 - 1937), who adopted the pen name "Student." The distribution is therefore known as 'Student's t-distribution'. It is used to establish confidence limits and to test hypotheses when the population variance is not known and the sample size is small (n < 30).
If a random sample x1, x2, ......., xn of n values is drawn from a normal population with mean μ and standard deviation σ, then the mean of the sample is
x̄ = (x1 + x2 + ... + xn)/n.
Estimate of the variance: Let s² be the estimate of the variance of the sample; then s² is given by
s² = Σ(xi - x̄)²/(n - 1),
with (n - 1) as the denominator in place of 'n'.
(I) The statistic 't' is defined as
t = (x̄ - μ)/(s/√n) = (x̄ - μ)√n/s,
where x̄ = sample mean, μ = actual or hypothetical mean of the population, n = sample size, s = standard deviation of the sample,
with s = √[Σ(xi - x̄)²/(n - 1)].
Note: 't' follows Student's t-distribution with (n - 1) degrees of freedom (df).
(II) 1) The variable 't' ranges from minus infinity to plus infinity.
2) Like the standard normal distribution, it is symmetrical and has mean zero.
3) The variance σ² of the t-distribution is greater than 1, but approaches 1 as the df increase, that is, as the sample size becomes large. Thus the variance of the t-distribution approaches the variance of the normal distribution as the sample size increases; for ν (df) = ∞, the t-distribution matches the normal distribution (observe the adjoining figure).
Also note that the t-distribution is lower at the mean and higher at the tails than the
normal distribution. The t-distribution has proportionally greater area at its tails
than the normal distribution.
(III) 1) If | t | exceeds t0.05, then the difference between x̄ and μ is significant at the 0.05 level of significance.
2) If | t | exceeds t0.01, then the difference is said to be highly significant at the 0.01 level of significance.
3) If | t | < t0.05, we conclude that the difference between x̄ and μ is not significant and the sample might have been drawn from a population with mean = μ, i.e. the data are consistent with the hypothesis.
(IV) Fiducial limits of the population mean: the 95% limits are given by x̄ ± t0.05 · s/√n, with s as defined above.
Example A random sample of 16 values from a normal population is found to have a mean of 41.5 and standard deviation of 2.795. On this basis is there any
reason to reject the hypothesis that the population mean μ = 43? Also find the confidence limits for μ.
Solution: Here n = 16, so df = n - 1 = 15, x̄ = 41.5, σ = 2.795 and μ = 43.
Null hypothesis Ho: μ = 43; alternative hypothesis H1: μ ≠ 43.
Now t = (x̄ - μ)√(n - 1)/σ = (41.5 - 43)√15/2.795, so | t | = 2.078.
From the t-table for 15 degrees of freedom, at the 0.05 probability level the value of t is 2.13. Since 2.078 < 2.13, the difference between x̄ and μ is not significant.
Thus there is no reason to reject Ho. To find the limits: 41.5 ± 2.13 × 2.795/√15 = 41.5 ± 1.54, i.e. approximately 39.96 to 43.04.
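As a software check, here is a minimal Python sketch of this example, assuming scipy is available. It mirrors the calculation above, where the given standard deviation 2.795 is paired with √(n - 1).

```python
# One-sample t-test sketch: x̄ = 41.5, σ = 2.795, n = 16, testing μ = 43
# with 15 degrees of freedom.
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 16, 41.5, 2.795, 43
t_stat = (xbar - mu0) * sqrt(n - 1) / s          # about -2.08 (|t| = 2.078 above)
t_crit = t.ppf(0.975, df=n - 1)                  # two-sided 5% point, about 2.13
print(round(t_stat, 3), round(t_crit, 2))

# |t| < 2.13, so Ho: μ = 43 is not rejected; approximate 95% limits:
half_width = t_crit * s / sqrt(n - 1)
print(round(xbar - half_width, 2), round(xbar + half_width, 2))  # ~39.96, ~43.04
```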
Example Ten individuals are chosen at random from the population and their heights (in inches) are found to be 63, 63, 64, 65, 66, 69, 69, 70, 70, 71. Discuss the suggestion that the mean height in the universe is 65 inches, given that for 9 degrees of freedom the value of Student's 't' at the 0.05 level of significance is 2.262.
Solution: xi = 63, 63, 64, 65, 66, 69, 69, 70, 70, 71 and
n = 10
The difference is not significant at the 0.05 level; thus Ho is accepted and we conclude that the mean height is 65 inches.
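The same conclusion can be sketched with scipy's one-sample t-test, which uses the (n - 1) divisor internally; this is not part of the original worked solution, just a quick check.

```python
# Ten-heights example with scipy's one-sample t-test.
from scipy.stats import ttest_1samp

heights = [63, 63, 64, 65, 66, 69, 69, 70, 70, 71]
t_stat, p_value = ttest_1samp(heights, popmean=65)
print(round(t_stat, 2), round(p_value, 3))
# t is about 2.02 < 2.262 (and the two-sided p is above 0.05),
# so Ho: mean = 65 is not rejected, as concluded above.
```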
Example Nine items of a sample have the following values: 45, 47, 52, 48, 47, 49, 53, 51, 50. Does the mean of the 9 items differ significantly from the assumed population mean of 47.5?
Given that for 8 degrees of freedom, P = 0.945 for t = 1.8 and P = 0.953 for t = 1.9.
Solution:
Σxi = 45 + 47 + 52 + 48 + 47 + 49 + 53 + 51 + 50 = 442
n = 9
Therefore for a difference in t of 0.043, the difference in P is 0.0034. Hence for t = 1.843, P = 0.9484. Therefore the probability of getting a value of t > 1.843 is (1 - 0.9484) = 0.0516, which for a two-sided test is 2 × 0.0516 = 0.103, greater than 0.05. Thus Ho is accepted, i.e. the mean of the 9 items does not differ significantly from the assumed population mean.
Example A certain stimulus administered to each of 12 patients resulted in the
following increments in 'Blood pressure' 5, 2, 8, -1, 3, 0, 6, -2, 1, 5, 0, 4. Can it be
concluded that the stimulus will in general be accompanied by an increase in blood
pressure, given that for 11 df the value of t0.05 = 2.201?
Solution:
The null hypothesis is Ho: μ = 0, i.e. assuming that the stimulus will not be accompanied
by an increase in blood pressure (or the mean increase in blood pressure for the
population is zero).
Now, computing t for these 12 increments gives t = 2.924.
The table value t0.05 for ν = 11 is 2.201.
Therefore, 2.924 > 2.201.
Thus the null hypothesis Ho is rejected i.e. we find that our assumption is wrong
and we say that as a result of the stimulus the blood pressure will increase.
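A quick scipy sketch of this example follows; it is not part of the original solution. The computed t comes out close to, though not exactly equal to, the 2.924 obtained above, presumably because of rounding in the hand calculation, and the conclusion is unchanged.

```python
# Blood-pressure example: test Ho: μ = 0 against the one-sided alternative
# that the stimulus increases blood pressure.
from scipy.stats import ttest_1samp

increments = [5, 2, 8, -1, 3, 0, 6, -2, 1, 5, 0, 4]
t_stat, p_two_sided = ttest_1samp(increments, popmean=0)
print(round(t_stat, 2), round(p_two_sided / 2, 4))
# t is about 2.9 > 2.201, so Ho is rejected; the one-sided p is well below 0.05.
```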
Statistics – Textbook
Nonparametric Statistics
Last revised: 5/8/2015
Contents
Nonparametric Statistics: How to Analyze Data with Low Quality or Small Samples
• General Purpose
• Brief Overview of Nonparametric Procedures
• When to Use Which Method
• Nonparametric Correlations
General Purpose
Brief review of the idea of significance testing. To understand the idea of nonparametric statistics (the term nonparametric was first used by Wolfowitz, 1942) first requires a basic understanding of parametric statistics. Elementary Concepts introduces the concept of statistical significance testing based on the sampling distribution of a particular statistic (you
may want to review that topic before reading on). In short, if we have a basic knowledge of the
underlying distribution of a variable, then we can make predictions about how, in repeated samples
of equal size, this particular statistic will "behave," that is, how it is distributed. For example, if we
draw 100 random samples of 100 adults each from the general population, and compute the mean
height in each sample, then the distribution of the standardized means across samples will likely
approximate the normal distribution (to be precise, Student's t distribution with 99 degrees of
freedom; see below). Now imagine that we take an additional sample in a particular city
("Tallburg") where we suspect that people are taller than the average population. If the mean
height in that sample falls outside the upper 95% tail area of the t distribution then we conclude
that, indeed, the people of Tallburg are taller than the average population.
Are most variables normally distributed? In the above example we relied on our knowledge that,
in repeated samples of equal size, the standardized means (for height) will be distributed following
the t distribution (with a particular mean and variance). However, this will only be true if in the
population the variable of interest (height in our example) is normally distributed, that is, if the
distribution of people of particular heights follows the normal distribution (the bell-shape
distribution).
For many variables of interest, we simply do not know for sure that this is the case. For example, is
income distributed normally in the population? -- probably not. The incidence rates of rare diseases
are not normally distributed in the population, the number of car accidents is also not normally
distributed, and neither are very many other variables in which a researcher might be interested.
For more information on the normal distribution, see Elementary Concepts; for information on tests
of normality, see Normality tests.
Sample size. Another factor that often limits the applicability of tests based on the assumption
that the sampling distribution is normal is the size of the sample of data available for the analysis
(sample size; n). We can assume that the sampling distribution is normal even if we are not sure
that the distribution of the variable in the population is normal, as long as our sample is large
enough (e.g., 100 or more observations). However, if our sample is very small, then those tests can
be used only if we are sure that the variable is normally distributed, and there is no way to test this
assumption if the sample is small.
Problems in measurement. Applications of tests that are based on the normality assumptions are
further limited by a lack of precise measurement. For example, let us consider a study where grade
point average (GPA) is measured as the major variable of interest. Is an A average twice as good as
a C average? Is the difference between a B and an A average comparable to the difference between
a D and a C average? In reality, the GPA is a crude measure of scholastic accomplishments that only
allows us to establish a rank ordering of students from "good" students to "poor" students. This
general measurement issue is usually discussed in statistics textbooks in terms of types of
measurement or scale of measurement. Without going into too much detail, most common
statistical techniques such as analysis of variance (and t- tests), regression, etc., assume that the
underlying measurements are at least of interval, meaning that equally spaced intervals on the
scale can be compared in a meaningful manner (e.g., B minus A is equal to D minus C). However, as in our example, this assumption is very often not tenable, and the data represent a rank ordering of observations (ordinal) rather than precise measurements.
Parametric and nonparametric methods. Hopefully, after this somewhat lengthy introduction, the
need is evident for statistical procedures that enable us to process data of "low quality," from small
samples, on variables about which nothing is known (concerning their distribution). Specifically,
nonparametric methods were developed to be used in cases when the researcher knows nothing
about the parameters of the variable of interest in the population (hence the
name nonparametric). In more technical terms, nonparametric methods do not rely on the
estimation of parameters (such as the mean or the standard deviation) describing the distribution
of the variable of interest in the population. Therefore, these methods are also sometimes (and
more appropriately) called parameter-free methods or distribution-free methods.
Brief Overview of Nonparametric Methods
Basically, there is at least one nonparametric equivalent for each parametric general type of test.
In general, these tests fall into the following categories:
• Tests of differences between groups (independent samples);
• Tests of differences between variables (dependent samples);
• Tests of relationships between variables.
Differences between independent groups. Usually, when we have two samples that we want to
compare concerning their mean value for some variable of interest, we would use the t-test for independent samples; nonparametric alternatives for this test are the Wald-Wolfowitz runs test, the Mann-Whitney U test, and the Kolmogorov-Smirnov two-sample test. If we have multiple groups, we would use analysis of variance (see ANOVA/MANOVA); the nonparametric equivalents to this method are the Kruskal-Wallis analysis of ranks and the Median test.
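As a small illustration of one of these alternatives, the sketch below runs a Mann-Whitney U test with scipy; the two groups are made-up values used only to show the call.

```python
# Nonparametric comparison of two independent groups with the
# Mann-Whitney U test (illustrative, made-up data).
from scipy.stats import mannwhitneyu

group_a = [12, 15, 14, 10, 18, 11, 13]
group_b = [22, 19, 24, 17, 21, 20, 25]
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(u_stat, round(p_value, 4))
```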
Differences between dependent groups. If we want to compare two variables measured in the
same sample we would customarily use the t-test for dependent samples (in Basic Statistics for
example, if we wanted to compare students' math skills at the beginning of the semester with their
skills at the end of the semester). Nonparametric alternatives to this test are the Sign test
and Wilcoxon's matched pairs test. If the variables of interest are dichotomous in nature (i.e.,
"pass" vs. "no pass") then McNemar's Chi-square test is appropriate. If there are more than two
variables that were measured in the same sample, then we would customarily use repeated
measures ANOVA. Nonparametric alternatives to this method are Friedman's two-way analysis of
variance and Cochran Q test (if the variable was measured in terms of categories, e.g., "passed" vs.
"failed"). Cochran Q is particularly useful for measuring changes in frequencies (proportions) across
time.
Relationships between variables. To express a relationship between two variables one usually
computes the correlation coefficient. Nonparametric equivalents to the standard correlation
coefficient are Spearman R, Kendall Tau, and coefficient Gamma (see Nonparametric correlations).
If the two variables of interest are categorical in nature (e.g., "passed" vs. "failed" by "male" vs.
"female") appropriate nonparametric statistics for testing the relationship between the two
variables are the Chi-square test, the Phi coefficient, and the Fisher exact test. In addition, a
simultaneous test for relationships between multiple cases is available: Kendall coefficient of
concordance. This test is often used for expressing inter-rater agreement among independent
judges who are rating (ranking) the same stimuli.
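For the categorical case just mentioned, here is a minimal scipy sketch of the chi-square and Fisher exact tests on a made-up 2x2 "passed/failed by male/female" table; the counts are invented for illustration.

```python
# Relationship between two categorical variables via a 2x2 frequency table.
from scipy.stats import chi2_contingency, fisher_exact

table = [[30, 10],   # males: passed, failed (illustrative counts)
         [20, 20]]   # females: passed, failed
chi2, p_chi2, dof, expected = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)
print(round(chi2, 2), round(p_chi2, 3), round(p_fisher, 3))
```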
Descriptive statistics. When one's data are not normally distributed, and the measurements at best
contain rank order information, then computing the standard descriptive statistics (e.g., mean,
standard deviation) is sometimes not the most informative way to summarize the data. For
example, in the area of psychometrics it is well known that the rated intensity of a stimulus (e.g.,
perceived brightness of a light) is often a logarithmic function of the actual intensity of the
stimulus (brightness as measured in objective units of Lux). In this example, the simple mean rating
(sum of ratings divided by the number of stimuli) is not an adequate summary of the average actual
intensity of the stimuli. (In this example, one would probably rather compute the geometric mean.)
Nonparametrics and Distributions will compute a wide variety of measures of location
(mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.) to
provide the "complete picture" of one's data.
When to Use Which Method
It is not easy to give simple advice concerning the use of nonparametric procedures. Each
nonparametric procedure has its peculiar sensitivities and blind spots. For example, the
Kolmogorov-Smirnov two-sample test is not only sensitive to differences in the location of
distributions (for example, differences in means) but is also greatly affected by differences in their
shapes. The Wilcoxon matched pairs test assumes that one can rank order the magnitude of
differences in matched observations in a meaningful manner. If this is not the case, one should
rather use the Sign test. In general, if the result of a study is important (e.g., does a very expensive
and painful drug therapy help people get better?), then it is always advisable to run different
nonparametric tests; should discrepancies in the results occur contingent upon which test is used,
one should try to understand why some tests give different results. On the other hand,
nonparametric statistics are less statistically powerful (sensitive) than their parametric
counterparts, and if it is important to detect even small effects (e.g., is this food additive harmful
to people?) one should be very careful in the choice of a test statistic.
Large data sets and nonparametric methods. Nonparametric methods are most appropriate when
the sample sizes are small. When the data set is large (e.g., n > 100) it often makes little sense to
use nonparametric statistics at all. Elementary Concepts briefly discusses the idea of the central
limit theorem. In a nutshell, when the samples become very large, then the sample means will
follow the normal distribution even if the respective variable is not normally distributed in the
population, or is not measured very well. Thus, parametric methods, which are usually much more
sensitive (i.e., have more statistical power) are in most cases appropriate for large samples.
However, the tests of significance of many of the nonparametric statistics described here are based
on asymptotic (large sample) theory; therefore, meaningful tests can often not be performed if the
sample sizes become too small. Please refer to the descriptions of the specific tests to learn more
about their power and efficiency.
Nonparametric Correlations
The following are three types of commonly used nonparametric correlation coefficients (Spearman
R, Kendall Tau, and Gamma coefficients). Note that the chi-square statistic computed for two-way frequency tables also provides a careful measure of a relation between the two (tabulated)
variables, and unlike the correlation measures listed below, it can be used for variables that are
measured on a simple nominal scale.
Spearman R. Spearman R (Siegel & Castellan, 1988) assumes that the variables under consideration
were measured on at least an ordinal (rank order) scale, that is, that the individual observations
can be ranked into two ordered series. Spearman R can be thought of as the regular Pearson
product moment correlation coefficient, that is, in terms of proportion of variability accounted for,
except that Spearman R is computed from ranks.
Kendall tau. Kendall tau is equivalent to Spearman R with regard to the underlying assumptions. It
is also comparable in terms of its statistical power. However, Spearman R and Kendall tau are
usually not identical in magnitude because their underlying logic as well as their computational
formulas are very different. Siegel and Castellan (1988) express the relationship of the two measures in terms of the inequality:
-1 ≤ 3 × Kendall tau - 2 × Spearman R ≤ 1
More importantly, Kendall tau and Spearman R imply different interpretations: Spearman R can be thought of as the regular Pearson product moment correlation coefficient, that is, in terms of proportion of variability accounted for, except that Spearman R is computed from ranks. Kendall tau, on the other hand, represents a probability, that is, it is the difference between the probability that in the observed data the two variables are in the same order versus the probability that the two variables are in different orders.
Gamma. The Gamma statistic (Siegel & Castellan, 1988) is preferable to Spearman R or Kendall tau
when the data contain many tied observations. In terms of the underlying assumptions, Gamma is
equivalent to Spearman R or Kendall tau; in terms of its interpretation and computation it is more
similar to Kendall tau than Spearman R. In short, Gamma is also a probability; specifically, it is
computed as the difference between the probability that the rank ordering of the two variables
agree minus the probability that they disagree, divided by 1 minus the probability of ties. Thus,
Gamma is basically equivalent to Kendall tau, except that ties are explicitly taken into account.
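A short scipy sketch of the first two coefficients is given below on made-up ranked data; the Goodman and Kruskal Gamma statistic is not shown here.

```python
# Rank-based correlation of two made-up ordinal variables.
from scipy.stats import spearmanr, kendalltau

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]
rho, p_rho = spearmanr(x, y)
tau, p_tau = kendalltau(x, y)
print(round(rho, 3), round(tau, 3))
```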
Introduction to Statistical Decision
Theory
By John Pratt, Howard Raiffa and Robert Schlaifer
Overview
The Bayesian revolution in statistics—where statistics is integrated with decision making
in areas such as management, public policy, engineering, and clinical medicine—is here
to stay. Introduction to Statistical Decision Theory states the case and, in a self-contained, comprehensive way, shows how the approach is operational and relevant for
real-world decision making under uncertainty.
Starting with an extensive account of the foundations of decision theory, the authors
develop the intertwining concepts of subjective probability and utility. They then
systematically and comprehensively examine the Bernoulli, Poisson, and Normal
(univariate and multivariate) data generating processes. For each process they consider
how prior judgments about the uncertain parameters of the process are modified given
the results of statistical sampling, and they investigate typical decision problems in
which the main sources of uncertainty are the population parameters. They also discuss
the value of sampling information and optimal sample sizes given sampling costs and the
economics of the terminal decision problems.
Unlike most introductory texts in statistics, Introduction to Statistical Decision
Theory integrates statistical inference with decision making and discusses real-world
actions involving economic payoffs and risks. After developing the rationale and
demonstrating the power and relevance of the subjective, decision approach, the text
also examines and critiques the limitations of the objective, classical approach.
Reviews
“An excellent introduction to Bayesian statistical theory.”—Frank Windmeijer, Times
Higher Education Supplement
“This book is a classic.... The strengths of this text are twofold. First, it gives a general
and well-motivated introduction to the principles of Bayesian decision theory that
should be accessible to anyone with a good mathematical statistics background. Second,
it provides a good introduction to Bayesian inference in general with particular emphasis
on the use of subjective information to choose prior distributions.”—Mark J. Schervish, Journal of the American Statistical Association
“This is the authoritative introductory treatise (almost 900 pages) on applied Bayesian
statistical theory. It is self-contained and well-presented, developed with great care and
obvious affection by the founders of the subject.”—James M. Dickey, Mathematical
Reviews
Decision Trees for Decision Making
John F. Magee
From the July 1964 issue of Harvard Business Review
The management of a company that I shall call Stygian Chemical Industries, Ltd., must
decide whether to build a small plant or a large one to manufacture a new product with an
expected market life of ten years. The decision hinges on what size the market for the product
will be.
Possibly demand will be high during the initial two years but, if many initial users find the
product unsatisfactory, will fall to a low level thereafter. Or high initial demand might
indicate the possibility of a sustained high-volume market. If demand is high and the
company does not expand within the first two years, competitive products will surely be
introduced.
If the company builds a big plant, it must live with it whatever the size of market demand. If
it builds a small plant, management has the option of expanding the plant in two years in the
event that demand is high during the introductory period; while in the event that demand is
low during the introductory period, the company will maintain operations in the small plant
and make a tidy profit on the low volume.
Management is uncertain what to do. The company grew rapidly during the 1950’s; it kept
pace with the chemical industry generally. The new product, if the market turns out to be
large, offers the present management a chance to push the company into a new period of
profitable growth. The development department, particularly the development project
engineer, is pushing to build the large-scale plant to exploit the first major product
development the department has produced in some years.
The chairman, a principal stockholder, is wary of the possibility of large unneeded plant
capacity. He favors a smaller plant commitment, but recognizes that later expansion to meet
high-volume demand would require more investment and be less efficient to operate. The
chairman also recognizes that unless the company moves promptly to fill the demand which
develops, competitors will be tempted to move in with equivalent products.
The Stygian Chemical problem, oversimplified as it is, illustrates the uncertainties and issues
that business management must resolve in making investment decisions. (I use the term
“investment” in a broad sense, referring to outlays not only for new plants and equipment but
also for large, risky orders, special marketing facilities, research programs, and other
purposes.) These decisions are growing more important at the same time that they are
increasing in complexity. Countless executives want to make them better—but how?
In this article I shall present one recently developed concept called the “decision tree,” which
has tremendous potential as a decision-making tool. The decision tree can clarify for
management, as can no other analytical tool that I know of, the choices, risks, objectives,
monetary gains, and information needs involved in an investment problem. We shall be
hearing a great deal about decision trees in the years ahead. Although a novelty to most
businessmen today, they will surely be in common management parlance before many more
years have passed.
Later in this article we shall return to the problem facing Stygian Chemical and see how
management can proceed to solve it by using decision trees. First, however, a simpler
example will illustrate some characteristics of the decision-tree approach.
Displaying Alternatives
Let us suppose it is a rather overcast Saturday morning, and you have 75 people coming for
cocktails in the afternoon. You have a pleasant garden and your house is not too large; so if
the weather permits, you would like to set up the refreshments in the garden and have the
party there. It would be more pleasant, and your guests would be more comfortable. On the
other hand, if you set up the party for the garden and after all the guests are assembled it
begins to rain, the refreshments will be ruined, your guests will get damp, and you will
heartily wish you had decided to have the party in the house. (We could complicate this
problem by considering the possibility of a partial commitment to one course or another and
opportunities to adjust estimates of the weather as the day goes on, but the simple problem is
all we need.)
This particular decision can be represented in the form of a “payoff” table:
Much more complex decision questions can be portrayed in payoff table form. However,
particularly for complex investment decisions, a different representation of the information
pertinent to the problem—the decision tree—is useful to show the routes by which the
various possible outcomes are achieved. Pierre Massé, Commissioner General of the National
Agency for Productivity and Equipment Planning in France, notes:
“The decision problem is not posed in terms of an isolated decision (because today’s decision
depends on the one we shall make tomorrow) nor yet in terms of a sequence of decisions
(because under uncertainty, decisions taken in the future will be influenced by what we have
learned in the meanwhile). The problem is posed in terms of a tree of decisions.”1
Exhibit I illustrates a decision tree for the cocktail party problem. This tree is a different way
of displaying the same information shown in the payoff table. However, as later examples
will show, in complex decisions the decision tree is frequently a much more lucid means of
presenting the relevant information than is a payoff table.
Exhibit I. Decision Tree for Cocktail Party
The tree is made up of a series of nodes and branches. At the first node on the left, the host
has the choice of having the party inside or outside. Each branch represents an alternative
course of action or decision. At the end of each branch or alternative course is another node
representing a chance event—whether or not it will rain. Each subsequent alternative course
to the right represents an alternative outcome of this chance event. Associated with each
complete alternative course through the tree is a payoff, shown at the end of the rightmost or
terminal branch of the course.
When I am drawing decision trees, I like to indicate the action or decision forks with square
nodes and the chance-event forks with round ones. Other symbols may be used instead, such
as single-line and double-line branches, special letters, or colors. It does not matter so much
which method of distinguishing you use so long as you do employ one or another. A decision
tree of any size will always combine (a) action choices with (b) different possible events or results of action which are partially affected by chance or other uncontrollable circumstances.
Decision-event chains
The previous example, though involving only a single stage of decision, illustrates the
elementary principles on which larger, more complex decision trees are built. Let us take a
slightly more complicated situation:
You are trying to decide whether to approve a development budget for an improved product.
You are urged to do so on the grounds that the development, if successful, will give you a
competitive edge, but if you do not develop the product, your competitor may—and may
seriously damage your market share. You sketch out a decision tree that looks something like
the one in Exhibit II.
Exhibit II. Decision Tree with Chains of Actions and Events
Your initial decision is shown at the left. Following a decision to proceed with the project, if
development is successful, is a second stage of decision at Point A. Assuming no important
change in the situation between now and the time of Point A, you decide now what
alternatives will be important to you at that time. At the right of the tree are the outcomes of
different sequences of decisions and events. These outcomes, too, are based on your present
information. In effect you say, “If what I know now is true then, this is what will happen.”
Of course, you do not try to identify all the events that can happen or all the decisions you
will have to make on a subject under analysis. In the decision tree you lay out only those
decisions and events or results that are important to you and have consequences you wish to
compare. (For more illustrations, see the Appendix.)
Appendix
For readers interested in further examples of decision-tree structure, I shall describe in this
appendix two representative situations with which I am familiar and show the trees that might
be drawn to analyze management’s decision-making alternatives. We shall not concern
ourselves here with costs, yields, probabilities, or expected values.
New Facility
The choice of alternatives in building a plant depends upon market forecasts. The alternative
chosen will, in turn, affect the market outcome. For example, the military products division of
a diversified firm, after some period of low profits due to intense competition, has won a
contract to produce a new type of military engine suitable for Army transport vehicles. The
division has a contract to build productive capacity and to produce at a specified contract
level over a period of three years.
Figure A illustrates the situation. The dotted line shows the contract rate. The solid line
shows the proposed buildup of production for the military. Some other possibilities are
portrayed by dashed lines. The company is not sure whether the contract will be continued at
a relatively high rate after the third year, as shown by Line A, or whether the military will
turn to another newer development, as indicated by Line B. The company has no guarantee of
compensation after the third year. There is also the possibility, indicated by Line C, of a large
additional commercial market for the product, this possibility being somewhat dependent on
the cost at which the product can be made and sold.
If this commercial market could be tapped, it would represent a major new business for the
company and a substantial improvement in the profitability of the division and its importance
to the company.
Management wants to explore three ways of producing the product as follows:
1. It might subcontract all fabrication and set up a simple assembly with limited need for
investment in plant and equipment; the costs would tend to be relatively high and the
company’s investment and profit opportunity would be limited, but the company assets which
are at risk would also be limited.
2. It might undertake the major part of the fabrication itself but use general-purpose machine
tools in a plant of general-purpose construction. The division would have a chance to retain
more of the most profitable operations itself, exploiting some technical developments it has
made (on the basis of which it got the contract). While the cost of production would still be
relatively high, the nature of the investment in plant and equipment would be such that it
could probably be turned to other uses or liquidated if the business disappeared.
3. The company could build a highly mechanized plant with specialized fabrication and
assembly equipment, entailing the largest investment but yielding a substantially lower unit
manufacturing cost if manufacturing volume were adequate. Following this plan would
improve the chances for a continuation of the military contract and penetration into the
commercial market and would improve the profitability of whatever business might be
obtained in these markets. Failure to sustain either the military or the commercial market,
however, would cause substantial financial loss.
Either of the first two alternatives would be better adapted to low-volume production than
would the third.
Some major uncertainties are: the cost-volume relationships under the alternative
manufacturing methods; the size and structure of the future market—this depends in part on
cost, but the degree and extent of dependence are unknown; and the possibilities of
competitive developments which would render the product competitively or technologically
obsolete.
How would this situation be shown in decision-tree form? (Before going further you might
want to draw a tree for the problem yourself.) Figure B shows my version of a tree. Note that
in this case the chance alternatives are somewhat influenced by the decision made. A
decision, for example, to build a more efficient plant will open possibilities for an expanded
market.
Plant Modernization
A company management is faced with a decision on a proposal by its engineering staff
which, after three years of study, wants to install a computer-based control system in the
company’s major plant. The expected cost of the control system is some $30 million. The
claimed advantages of the system will be a reduction in labor cost and an improved product
yield. These benefits depend on the level of product throughput, which is likely to rise over
the next decade. It is thought that the installation program will take about two years and will
cost a substantial amount over and above the cost of equipment. The engineers calculate that
the automation project will yield a 20% return on investment, after taxes; the projection is
based on a ten-year forecast of product demand by the market research department, and an
assumption of an eight-year life for the process control system.
What would this investment yield? Will actual product sales be higher or lower than forecast?
Will the process work? Will it achieve the economies expected? Will competitors follow if
the company is successful? Are they going to mechanize anyway? Will new products or
processes make the basic plant obsolete before the investment can be recovered? Will the
controls last eight years? Will something better come along sooner?
The initial decision alternatives are (a) to install the proposed control system, (b) postpone
action until trends in the market and/or competition become clearer, or (c) initiate more
investigation or an independent evaluation. Each alternative will be followed by resolution of
some uncertain aspect, in part dependent on the action taken. This resolution will lead in turn
to a new decision. The dotted lines at the right of Figure C indicate that the decision tree
continues indefinitely, though the decision alternatives do tend to become repetitive. In the
case of postponement or further study, the decisions are to install, postpone, or restudy; in the
case of installation, the decisions are to continue operation or abandon.
An immediate decision is often one of a sequence. It may be one of a number of sequences.
The impact of the present decision in narrowing down future alternatives and the effect of
future alternatives in affecting the value of the present choice must both be considered.
Adding Financial Data
Now we can return to the problems faced by the Stygian Chemical management. A decision
tree characterizing the investment problem as outlined in the introduction is shown in Exhibit
III. At Decision #1 the company must decide between a large and a small plant. This is all
that must be decided now. But if the company chooses to build a small plant and then finds
demand high during the initial period, it can in two years—at Decision #2—choose to expand
its plant.
Exhibit III. Decisions and Events for Stygian Chemical Industries, Ltd.
But let us go beyond a bare outline of alternatives. In making decisions, executives must take
account of the probabilities, costs, and returns which appear likely. On the basis of the data
now available to them, and assuming no important change in the company’s situation, they
reason as follows:
Marketing estimates indicate a 60% chance of a large market in the long run and a 40% chance of a low demand, developing initially as follows: high initial demand sustained at a high level, 60%; high initial demand followed by low long-run demand, 10%; low initial demand continuing low, 30%.
Therefore, the chance that demand initially will be high is 70% (60 + 10). If demand is high
initially, the company estimates that the chance it will continue at a high level is 86% (60 ÷
70). Comparing 86% to 60%, it is apparent that a high initial level of sales changes the
estimated chance of high sales in the subsequent periods. Similarly, if sales in the initial
period are low, the chances are 100% (30 ÷ 30) that sales in the subsequent periods will be
low. Thus the level of sales in the initial period is expected to be a rather accurate indicator of
the level of sales in the subsequent periods.
Estimates of annual income are made under the assumption of each alternative outcome:
1. A large plant with high volume would yield $1,000,000 annually in cash flow.
2. A large plant with low volume would yield only $100,000 because of high fixed costs and
inefficiencies.
3. A small plant with low demand would be economical and would yield annual cash income
of $400,000.
4. A small plant, during an initial period of high demand, would yield $450,000 per year, but
this would drop to $300,000 yearly in the long run because of competition. (The market
would be larger than under Alternative 3, but would be divided up among more competitors.)
5. If the small plant were expanded to meet sustained high demand, it would yield $700,000 cash flow annually, and so would be less efficient than a large plant built initially.
6. If the small plant were expanded but high demand were not sustained, estimated annual
cash flow would be $50,000.
It is estimated further that a large plant would cost $3 million to put into operation, a small
plant would cost $1.3 million, and the expansion of the small plant would cost an
additional $2.2 million.
When the foregoing data are incorporated, we have the decision tree shown in Exhibit IV.
Bear in mind that nothing is shown here which Stygian Chemical’s executives did not know
before; no numbers have been pulled out of hats. However, we are beginning to see dramatic
evidence of the value of decision trees in laying out what management knows in a way that
enables more systematic analysis and leads to better decisions. To sum up the requirements of
making a decision tree, management must:
1. Identify the points of decision and alternatives available at each point.
2. Identify the points of uncertainty and the type or range of alternative outcomes at each
point.
3. Estimate the values needed to make the analysis, especially the probabilities of different
events or results of action and the costs and gains of various events and actions.
4. Analyze the alternative values to choose a course.
Exhibit IV. Decision Tree with Financial Data
Choosing Course of Action
We are now ready for the next step in the analysis—to compare the consequences of different
courses of action. A decision tree does not give management the answer to an investment
problem; rather, it helps management determine which alternative at any particular choice
point will yield the greatest expected monetary gain, given the information and alternatives
pertinent to the decision.
Of course, the gains must be viewed with the risks. At Stygian Chemical, as at many
corporations, managers have different points of view toward risk; hence they will draw
different conclusions in the circumstances described by the decision tree shown in Exhibit IV.
The many people participating in a decision—those supplying capital, ideas, data, or
decisions, and having different values at risk—will see the uncertainty surrounding the
decision in different ways. Unless these differences are recognized and dealt with, those who
must make the decision, pay for it, supply data and analyses to it, and live with it will judge
the issue, relevance of data, need for analysis, and criterion of success in different and
conflicting ways.
For example, company stockholders may treat a particular investment as one of a series of
possibilities, some of which will work out, others of which will fail. A major investment may
pose risks to a middle manager—to his job and career—no matter what decision is made.
Another participant may have a lot to gain from success, but little to lose from failure of the
project. The nature of the risk—as each individual sees it—will affect not only the
assumptions he is willing to make but also the strategy he will follow in dealing with the risk.
The existence of multiple, unstated, and conflicting objectives will certainly contribute to the
“politics” of Stygian Chemical’s decision, and one can be certain that the political element
exists whenever the lives and ambitions of people are affected. Here, as in similar cases, it is
not a bad exercise to think through who the parties to an investment decision are and to try to
make these assessments:
What is at risk? Is it profit or equity value, survival of the business, maintenance of a job, opportunity for a major career?
Who is bearing the risk? The stockholder is usually bearing risk in one form. Management, employees, the community—all may be bearing different risks.
What is the character of the risk that each person bears? Is it, in his terms, unique, once-in-a-lifetime, sequential, insurable? Does it affect the economy, the industry, the company, or a portion of the company?
Considerations such as the foregoing will surely enter into top management’s thinking, and
the decision tree in Exhibit IV will not eliminate them. But the tree will show management
what decision today will contribute most to its long-term goals. The tool for this next step in
the analysis is the concept of “rollback.”
“Rollback” concept
Here is how rollback works in the situation described. At the time of making Decision #1 (see
Exhibit IV), management does not have to make Decision #2 and does not even know if it
will have the occasion to do so. But if it were to have the option at Decision #2, the company
would expand the plant, in view of its current knowledge. The analysis is shown in Exhibit V.
(I shall ignore for the moment the question of discounting future profits; that is introduced
later.) We see that the total expected value of the expansion alternative is $160,000 greater
than the no-expansion alternative, over the eight-year life remaining. Hence that is the
alternative management would choose if faced with Decision #2 with its existing information
(and thinking only of monetary gain as a standard of choice).
Exhibit V. Analysis of Possible Decision #2 (Using Maximum Expected Total Cash Flow as
Criterion)
Readers may wonder why we started with Decision #2 when today’s problem is Decision #1.
The reason is the following: We need to be able to put a monetary value on Decision #2 in
order to “roll back” to Decision #1 and compare the gain from taking the lower branch
(“Build Small Plant”) with the gain from taking the upper branch (“Build Big Plant”). Let us
call that monetary value for Decision #2 its position value. The position value of a decision is
the expected value of the preferred branch (in this case, the plant-expansion fork). The
expected value is simply a kind of average of the results you would expect if you were to
repeat the situation over and over—getting a $5,600 thousand yield 86% of the time and
a $400 thousand yield 14% of the time.
Stated in another way, it is worth $2,672 thousand to Stygian Chemical to get to the position
where it can make Decision #2. The question is: Given this value and the other data shown in
Exhibit IV, what now appears to be the best action at Decision #1?
Turn now to Exhibit VI. At the right of the branches in the top half we see the yields for
various events if a big plant is built (these are simply the figures in Exhibit IV multiplied
out). In the bottom half we see the small plant figures, including Decision #2 position value
plus the yield for the two years prior to Decision #2. If we reduce all these yields by their
probabilities, we get the following comparison:
Build big plant: ($10 × .60) + ($2.8 × .10) + ($1 × .30) – $3 = $3,600 thousand
Build small plant: ($3.6 × .70) + ($4 × .30) – $1.3 = $2,400 thousand
Exhibit VI. Cash Flow Analysis for Decision #1
The choice which maximizes expected total cash yield at Decision #1, therefore, is to build
the big plant initially.
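The rollback arithmetic just described can be sketched in a few lines of Python (all figures in thousands of dollars). The expansion-branch figures and the Decision #1 inputs are the ones quoted above; the no-expansion branch totals (eight years at $300 thousand or $400 thousand per year) are reconstructed from the annual cash flows listed earlier and happen to reproduce the $160 thousand margin, so treat them as an assumption consistent with the text rather than a quotation of Exhibit V.

```python
# Rollback sketch for the Stygian Chemical example (thousands of dollars).

# Decision #2: expand vs. do not expand the small plant, 8 years remaining.
expand = 0.86 * 5_600 + 0.14 * 400 - 2_200         # 4872 - 2200 = 2672 (position value)
no_expand = 0.86 * (8 * 300) + 0.14 * (8 * 400)    # reconstructed branch totals = 2512
print(expand, no_expand, expand - no_expand)       # margin of about 160 in favor of expanding

# Decision #1: build big vs. build small, using the figures quoted in the text.
build_big = 0.60 * 10_000 + 0.10 * 2_800 + 0.30 * 1_000 - 3_000   # 3580, rounded to 3600 above
build_small = 0.70 * 3_600 + 0.30 * 4_000 - 1_300                 # 2420, rounded to 2400 above
print(build_big, build_small)                      # the big plant maximizes expected cash yield
```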
Accounting for Time
What about taking differences in the time of future earnings into account? The time between
successive decision stages on a decision tree may be substantial. At any stage, we may have
to weigh differences in immediate cost or revenue against differences in value at the next
stage. Whatever standard of choice is applied, we can put the two alternatives on a
comparable basis if we discount the value assigned to the next stage by an appropriate
percentage. The discount percentage is, in effect, an allowance for the cost of capital and is
similar to the use of a discount rate in the present value or discounted cash flow techniques
already well known to businessmen.
When decision trees are used, the discounting procedure can be applied one stage at a time.
Both cash flows and position values are discounted.
For simplicity, let us assume that a discount rate of 10% per year for all stages is decided on
by Stygian Chemical’s management. Applying the rollback principle, we again begin with
Decision #2. Taking the same figures used in previous exhibits and discounting the cash
flows at 10%, we get the data shown in Part A of Exhibit VII. Note particularly that these are
the present values as of the time Decision #2 is made.
Exhibit VII. Analysis of Decision #2 with Discounting
Note: For simplicity, the first year cash flow is not discounted, the second year cash flow is discounted one year, and so on.
Now we want to go through the same procedure used in Exhibit V when we obtained
expected values, only this time using the discounted yield figures and obtaining a discounted
expected value. The results are shown in Part B of Exhibit VII. Since the discounted expected
value of the no-expansion alternative is higher, that figure becomes the position value of
Decision #2 this time.
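A small Python sketch of this stage-by-stage discounting is given below, using the 10% rate and the convention noted in Exhibit VII (the first-year cash flow is not discounted) together with the annual cash flows listed earlier. The printed values are my own computation and are meant only to illustrate the method, not to reproduce the exhibit's exact figures, although they give the same qualitative result: the no-expansion branch comes out higher and becomes the position value.

```python
# Discounted expected values for Decision #2 (thousands of dollars).
def present_value(annual_cash_flow, years, rate=0.10):
    """Discount a level annual cash flow; year 1 is not discounted."""
    return sum(annual_cash_flow / (1 + rate) ** year for year in range(years))

# Expansion: 700/yr if demand stays high (86%), 50/yr otherwise (14%),
# less the 2,200 expansion cost.
expand = 0.86 * present_value(700, 8) + 0.14 * present_value(50, 8) - 2_200
# No expansion: 300/yr if demand stays high, 400/yr otherwise.
no_expand = 0.86 * present_value(300, 8) + 0.14 * present_value(400, 8)
print(round(expand), round(no_expand))   # no-expansion is higher, so it is the position value
```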
Having done this, we go back to work through Decision #1 again, repeating the same
analytical procedure as before only with discounting. The calculations are shown in Exhibit
VIII. Note that the Decision #2 position value is treated at the time of Decision #1 as if it
were a lump sum received at the end of the two years.
Exhibit VIII. Analysis of Decision #1
The large-plant alternative is again the preferred one on the basis of discounted expected cash
flow. But the margin of difference over the small-plant alternative ($290 thousand) is smaller
than it was without discounting.
Uncertainty Alternatives
In illustrating the decision-tree concept, I have treated uncertainty alternatives as if they were
discrete, well-defined possibilities. For my examples I have made use of uncertain situations
depending basically on a single variable, such as the level of demand or the success or failure
of a development project. I have sought to avoid unnecessary complication while putting
emphasis on the key interrelationships among the present decision, future choices, and the
intervening uncertainties.
In many cases, the uncertain elements do take the form of discrete, single-variable
alternatives. In others, however, the possibilities for cash flow during a stage may range
through a whole spectrum and may depend on a number of independent or partially related
variables subject to chance influences—cost, demand, yield, economic climate, and so forth.
In these cases, we have found that the range of variability or the likelihood of the cash flow
falling in a given range during a stage can be calculated readily from knowledge of the key
variables and the uncertainties surrounding them. Then the range of cash-flow possibilities
during the stage can be broken down into two, three, or more “subsets,” which can be used as
discrete chance alternatives.
Conclusion
Peter F. Drucker has succinctly expressed the relation between present planning and future
events: “Long-range planning does not deal with future decisions. It deals with the futurity of
present decisions.”2 Today’s decision should be made in light of the anticipated effect it and
the outcome of uncertain events will have on future values and decisions. Since today’s
decision sets the stage for tomorrow’s decision, today’s decision must balance economy with
flexibility; it must balance the need to capitalize on profit opportunities that may exist with
the capacity to react to future circumstances and needs.
The unique feature of the decision tree is that it allows management to combine analytical
techniques such as discounted cash flow and present value methods with a clear portrayal of
the impact of future decision alternatives and events. Using the decision tree, management
can consider various courses of action with greater ease and clarity. The interactions between
present decision alternatives, uncertain events, and future choices and their results become
more visible.
Of course, there are many practical aspects of decision trees in addition to those that could be
covered in the space of just one article. When these other aspects are discussed in subsequent
articles,3 the whole range of possible gains for management will be seen in greater detail.
Surely the decision-tree concept does not offer final answers to managements making
investment decisions in the face of uncertainty. We have not reached that stage, and perhaps
we never will. Nevertheless, the concept is valuable for illustrating the structure of
investment decisions, and it can likewise provide excellent help in the evaluation of capital
investment opportunities.
1. Optimal Investment Decisions: Rules for Action and Criteria for Choice (Englewood
Cliffs, New Jersey, Prentice-Hall, Inc., 1962), p. 250.
2. “Long-Range Planning,” Management Science, April 1959, p. 239.
3. We are expecting another article by Mr. Magee in a forthcoming issue.—The Editors
A version of this article appeared in the July 1964 issue of Harvard Business Review.
John F. Magee is chairman of the board of directors of Arthur D. Little, Inc. Over the past
three decades, his professional consulting assignments have taken him to Europe frequently.
Decision Making Tools
Good managers do not simply make decisions. Instead, they use tools to determine the best course of action, making it possible for the manager to make an informed decision. That does not
mean that good managers always make the right decisions, but they certainly are making
decisions that are more informed than they would be based purely on guesswork.
There are many different tools managers use to make decisions, but the ones that we see the
most are the decision tree, payback analysis and simulation. While there are more, these are
the three we see the most, and it's important to understand them and know how to use them
before you start making decisions.
The Decision Tree
Example of a decision tree
The first tool we will look at is the decision tree. This tool has us write down an issue or
problem, and then, as we think through the problem, we draw solutions or steps that branch out
from the original issue. You start your decision tree by taking a piece of paper and drawing a
small square to represent the decision you need to make. It could look something like this:
'Should we continue to produce dress shoes only, or should we look at making sneakers as well?'
This first block represents the issue that requires you to make a decision. From there, just like
the name 'decision tree' implies, branches start to sprout out with your thoughts for different
solutions or directions for this issue.
You can make any number of branches, and there is no set pattern to the decision tree - it is
defined more by its functionality than its form. Using a decision tree, you can capture your
thoughts, review them and, if needed, add more branches, hopefully continuing on until you find
your answer. Each decision you make leads you to another decision (or a 'would/could' choice),
and that choice leads you to yet another.
Payback Analysis
I'm happy to say that payback analysis is much easier and much more clear-cut than the decision
tree. This tool helps you analyze financial investments. When we use payback analysis, we
look at an investment and the anticipated savings or added income that will result from it.
Then, we use a calculation that tells us how long it will take to make back the money spent on
the initial investment.
For example, let's say we are going to invest in energy-efficient lighting. We know that the initial
investment would be $15,000, but we also know that our energy cost savings would be $5,000 a
year. We can simply use some basic math to understand how long it would take for us to earn
back our initial investment of $15,000. Did he say math? Nooo! Don't worry though - it's pretty
simple.
We can divide $15,000 (our investment) by the amount we would save each year ($5,000), and
from that basic calculation, we can see that it would take 3 years to make back our investment. If
that time frame works for you, you make the investment. If not, see whether another investment
or additional savings would change the calculation. Either way, you will have used the payback
analysis tool to make a well-informed decision.
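For readers who want the arithmetic spelled out, here is a minimal Python sketch of the payback calculation described above; the figures come from the lighting example, and the function name is just illustrative.

```python
def payback_period(initial_investment, annual_savings):
    """Years needed to earn back the initial investment from annual savings."""
    return initial_investment / annual_savings

# Figures from the lighting example above.
investment = 15_000        # up-front cost of the energy-efficient lighting
savings_per_year = 5_000   # expected yearly energy cost savings

print(payback_period(investment, savings_per_year))  # -> 3.0 years
```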
Decision Tree Definition
A decision tree is a graphical representation of possible solutions to a decision based on certain
conditions. It's called a decision tree because it starts with a single box (or root), which then
branches off into a number of solutions, just like a tree.
Decision trees are helpful, not only because they are graphics that help you 'see' what you are
thinking, but also because making a decision tree requires a systematic, documented thought
process. Often, the biggest limitation of our decision making is that we can only select from the
known alternatives. Decision trees help formalize the brainstorming process so we can identify
more potential solutions.
Decision Tree Example
Applied in real life, decision trees can be very complex and end up including pages of options.
But, regardless of the complexity, decision trees are all based on the same principles. Here is a
basic example of a decision tree:
You are making your weekend plans and find out that your parents might come to town. You'd
like to have plans in place, but there are a few unknown factors that will determine what you can,
and can't, do. Time for a decision tree.
First, you draw your decision box. This is the box that includes the event that starts your decision
tree. In this case it is your parents coming to town. Out of that box, you have a branch for each
possible outcome. In our example, it's easy: yes or no - either your parents come or they don't.
Your parents love the movies, so if they come to town, you'll go to the cinema. Since the goal of
the decision tree is to decide your weekend plans, you have an answer. But, what about if your
parents don't come to town? We can go back up to the 'no branch' from the decision box and
finish our decision tree.
If your parents don't come to town, you need to decide what you are going to do. As you think of
options, you realize the weather is an important factor. Weather becomes your next box. Since
it's springtime, you know it will be rainy, sunny, or windy. Those three possibilities become
your branches.
If it's sunny or rainy, you know what you'll do - play tennis or stay in, respectively. But, what if it's
windy? If it's windy, you want to get out of the house, but you probably won't be able to play
tennis. You could either go to the movies or go shopping. What will determine if you go shopping
or go see a movie? Money.
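One simple way to capture a small tree like this in code is as nested dictionaries, with each question mapping its branches to either another question or a final plan. The sketch below is only an illustration of the weekend example; the lesson does not say whether having money points to shopping or to the movies, so that mapping is an assumption.

```python
# A minimal sketch of the weekend-plans decision tree as nested dictionaries.
# Keys are the questions at each node; values map each branch to the next
# node (a dict) or to a final plan (a string).
weekend_tree = {
    "Do your parents come to town?": {
        "yes": "Go to the cinema",
        "no": {
            "What is the weather?": {
                "sunny": "Play tennis",
                "rainy": "Stay in",
                "windy": {
                    "Do you have spending money?": {
                        # Assumed mapping: the lesson leaves this branch open.
                        "yes": "Go shopping",
                        "no": "See a movie",
                    }
                },
            }
        },
    }
}

def decide(node, answers):
    """Walk the tree using a dict of answers until a plan (string) is reached."""
    while isinstance(node, dict):
        question, branches = next(iter(node.items()))
        node = branches[answers[question]]
    return node

print(decide(weekend_tree, {
    "Do your parents come to town?": "no",
    "What is the weather?": "windy",
    "Do you have spending money?": "no",
}))  # -> "See a movie"
```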
Decision Trees for Decision Making
by John F. Magee
The management of a company that I shall call Stygian Chemical Industries, Ltd., must decide whether to build a small plant or a
large one to manufacture a new product with an expected market life of ten years. The decision hinges on what size the market
for the product will be.
Possibly demand will be high during the initial two years but, if many initial users find the product unsatisfactory, will fall to a low
level thereafter. Or high initial demand might indicate the possibility of a sustained high-volume market. If demand is high and the
company does not expand within the first two years, competitive products will surely be introduced.
If the company builds a big plant, it must live with it whatever the size of market demand. If it builds a small plant, management
has the option of expanding the plant in two years in the event that demand is high during the introductory period; while in the
event that demand is low during the introductory period, the company will maintain operations in the small plant and make a tidy
profit on the low volume.
Management is uncertain what to do. The company grew rapidly during the 1950’s; it kept pace with the chemical industry
generally. The new product, if the market turns out to be large, offers the present management a chance to push the company
into a new period of profitable growth. The development department, particularly the development project engineer, is pushing to
build the large-scale plant to exploit the first major product development the department has produced in some years.
The chairman, a principal stockholder, is wary of the possibility of large unneeded plant capacity. He favors a smaller plant
commitment, but recognizes that later expansion to meet high-volume demand would require more investment and be less
efficient to operate. The chairman also recognizes that unless the company moves promptly to fill the demand which develops,
competitors will be tempted to move in with equivalent products.
The Stygian Chemical problem, oversimplified as it is, illustrates the uncertainties and issues that business management must
resolve in making investment decisions. (I use the term “investment” in a broad sense, referring to outlays not only for new plants
and equipment but also for large, risky orders, special marketing facilities, research programs, and other purposes.) These
decisions are growing more important at the same time that they are increasing in complexity. Countless executives want to make
them better—but how?
In this article I shall present one recently developed concept called the “decision tree,” which has tremendous potential as a
decision-making tool. The decision tree can clarify for management, as can no other analytical tool that I know of, the choices,
risks, objectives, monetary gains, and information needs involved in an investment problem. We shall be hearing a great deal
about decision trees in the years ahead. Although a novelty to most businessmen today, they will surely be in common
management parlance before many more years have passed.
Later in this article we shall return to the problem facing Stygian Chemical and see how management can proceed to solve it by
using decision trees. First, however, a simpler example will illustrate some characteristics of the decision-tree approach.
Displaying Alternatives
Let us suppose it is a rather overcast Saturday morning, and you have 75 people coming for cocktails in the afternoon. You have
a pleasant garden and your house is not too large; so if the weather permits, you would like to set up the refreshments in the
garden and have the party there. It would be more pleasant, and your guests would be more comfortable. On the other hand, if
you set up the party for the garden and after all the guests are assembled it begins to rain, the refreshments will be ruined, your
guests will get damp, and you will heartily wish you had decided to have the party in the house. (We could complicate this
problem by considering the possibility of a partial commitment to one course or another and opportunities to adjust estimates of
the weather as the day goes on, but the simple problem is all we need.)
This particular decision can be represented in the form of a “payoff” table:
Much more complex decision questions can be portrayed in payoff table form. However, particularly for complex investment
decisions, a different representation of the information pertinent to the problem—the decision tree—is useful to show the routes by
which the various possible outcomes are achieved. Pierre Massé, Commissioner General of the National Agency for Productivity
and Equipment Planning in France, notes:
“The decision problem is not posed in terms of an isolated decision (because today’s decision depends on the one we shall make
tomorrow) nor yet in terms of a sequence of decisions (because under uncertainty, decisions taken in the future will be influenced
by what we have learned in the meanwhile). The problem is posed in terms of a tree of decisions.”1
Exhibit I illustrates a decision tree for the cocktail party problem. This tree is a different way of displaying the same information
shown in the payoff table. However, as later examples will show, in complex decisions the decision tree is frequently a much more
lucid means of presenting the relevant information than is a payoff table.
The tree is made up of a series of nodes and branches. At the first node on the left, the host has the choice of having the party
inside or outside. Each branch represents an alternative course of action or decision. At the end of each branch or alternative
course is another node representing a chance event—whether or not it will rain. Each subsequent alternative course to the right
represents an alternative outcome of this chance event. Associated with each complete alternative course through the tree is a
payoff, shown at the end of the rightmost or terminal branch of the course.
When I am drawing decision trees, I like to indicate the action or decision forks with square nodes and the chance-event forks with
round ones. Other symbols may be used instead, such as single-line and double-line branches, special letters, or colors. It does
not matter so much which method of distinguishing you use so long as you do employ one or another. A decision tree of any size
will always combine (a) action choices with (b) different possible events or results of action which are partially affected by chance
or other uncontrollable circumstances.
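As a rough illustration of this structure (not code from the article), the sketch below represents decision forks and chance forks as two small Python classes and evaluates a tree by picking the best branch at each decision fork and averaging over probabilities at each chance fork, which is the "rollback" idea the article develops later. The cocktail-party payoffs and the 40% chance of rain are placeholders, not figures from Exhibit I.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """A square (action) fork: the decision maker picks the best branch."""
    branches: dict   # label -> Decision, Chance, or numeric payoff

@dataclass
class Chance:
    """A round (chance) fork: each branch occurs with a given probability."""
    branches: list   # list of (probability, Decision, Chance, or numeric payoff)

def rollback(node):
    """Expected value of a node, choosing the best action at each decision fork."""
    if isinstance(node, Decision):
        return max(rollback(child) for child in node.branches.values())
    if isinstance(node, Chance):
        return sum(p * rollback(child) for p, child in node.branches)
    return node  # a leaf is simply its payoff

# Cocktail-party tree with placeholder payoffs and an assumed 40% chance of rain.
party = Decision({
    "hold the party outdoors": Chance([(0.4, -20.0), (0.6, 100.0)]),  # rain ruins it / ideal
    "hold the party indoors":  Chance([(0.4, 60.0),  (0.6, 50.0)]),   # dry but crowded
})
print(rollback(party))  # expected payoff of the better choice (54.0 here)
```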
Decision-event chains
The previous example, though involving only a single stage of decision, illustrates the elementary principles on which larger, more
complex decision trees are built. Let us take a slightly more complicated situation:
You are trying to decide whether to approve a development budget for an improved product. You are urged to do so on the
grounds that the development, if successful, will give you a competitive edge, but if you do not develop the product, your
competitor may—and may seriously damage your market share. You sketch out a decision tree that looks something like the one
in Exhibit II.
Your initial decision is shown at the left. Following a decision to proceed with the project, if development is successful, is a second
stage of decision at Point A. Assuming no important change in the situation between now and the time of Point A, you decide now
what alternatives will be important to you at that time. At the right of the tree are the outcomes of different sequences of decisions
and events. These outcomes, too, are based on your present information. In effect you say, “If what I know now is true then, this
is what will happen.”
Of course, you do not try to identify all the events that can happen or all the decisions you will have to make on a subject under
analysis. In the decision tree you lay out only those decisions and events or results that are important to you and have
consequences you wish to compare. (For more illustrations, see the Appendix.)
Appendix (Located at the end of this article)
Adding Financial Data
Now we can return to the problems faced by the Stygian Chemical management. A decision tree characterizing the investment
problem as outlined in the introduction is shown in Exhibit III. At Decision #1 the company must decide between a large and a
small plant. This is all that must be decided now. But if the company chooses to build a small plant and then finds demand high
during the initial period, it can in two years—at Decision #2—choose to expand its plant.
But let us go beyond a bare outline of alternatives. In making decisions, executives must take account of the probabilities, costs,
and returns which appear likely. On the basis of the data now available to them, and assuming no important change in the
company’s situation, they reason as follows:
• Marketing estimates indicate a 60% chance of a large market in the long run and a 40% chance of a low demand, developing
initially as follows:
High initial demand followed by sustained high demand: 60%
High initial demand followed by low subsequent demand: 10%
Low initial demand (remaining low): 30%
• Therefore, the chance that demand initially will be high is 70% (60 + 10). If demand is high initially, the company estimates that
the chance it will continue at a high level is 86% (60 ÷ 70). Comparing 86% to 60%, it is apparent that a high initial level of sales
changes the estimated chance of high sales in the subsequent periods. Similarly, if sales in the initial period are low, the chances
are 100% (30 ÷ 30) that sales in the subsequent periods will be low. Thus the level of sales in the initial period is expected to be a
rather accurate indicator of the level of sales in the subsequent periods. (A short numerical check of these figures appears after this list.)
• Estimates of annual income are made under the assumption of each alternative outcome:
1. A large plant with high volume would yield $1,000,000 annually in cash flow.
2. A large plant with low volume would yield only $100,000 because of high fixed costs and inefficiencies.
3. A small plant with low demand would be economical and would yield annual cash income of $400,000.
4. A small plant, during an initial period of high demand, would yield $450,000 per year, but this would drop to $300,000 yearly in
the long run because of competition. (The market would be larger than under Alternative 3, but would be divided up among more
competitors.)
5. If the small plant were expanded to meet sustained high demand, it would yield $700,000 cash flow annually, and so would be
less efficient than a large plant built initially.
6. If the small plant were expanded but high demand were not sustained, estimated annual cash flow would be $50,000.
• It is estimated further that a large plant would cost $3 million to put into operation, a small plant would cost $1.3 million, and the
expansion of the small plant would cost an additional $2.2 million.
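The conditional probabilities quoted in the first bullet follow directly from the three joint probabilities listed above. A minimal check in Python (the variable names are mine, not the article's):

```python
# Joint probabilities from the marketing estimates above.
p_high_then_high = 0.60   # demand high initially and sustained
p_high_then_low  = 0.10   # demand high initially, low thereafter
p_low_throughout = 0.30   # demand low from the start (and stays low)

p_initial_high = p_high_then_high + p_high_then_low            # 0.70
p_sustained_given_high = p_high_then_high / p_initial_high     # ~0.857, i.e. 86%
p_low_given_low_start = p_low_throughout / p_low_throughout    # 1.00, i.e. 100%

print(round(p_initial_high, 2),
      round(p_sustained_given_high, 2),
      round(p_low_given_low_start, 2))  # 0.7 0.86 1.0
```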
When the foregoing data are incorporated, we have the decision tree shown in Exhibit IV. Bear in mind that nothing is shown here
which Stygian Chemical’s executives did not know before; no numbers have been pulled out of hats. However, we are beginning
to see dramatic evidence of the value of decision trees in laying out what management knows in a way that enables more
systematic analysis and leads to better decisions. To sum up the requirements of making a decision tree, management must:
1. Identify the points of decision and alternatives available at each point.
2. Identify the points of uncertainty and the type or range of alternative outcomes at each point.
3. Estimate the values needed to make the analysis, especially the probabilities of different events or results of action and the
costs and gains of various events and actions.
4. Analyze the alternative values to choose a course.
Choosing Course of Action
We are now ready for the next step in the analysis—to compare the consequences of different courses of action. A decision tree
does not give management the answer to an investment problem; rather, it helps management determine which alternative at any
particular choice point will yield the greatest expected monetary gain, given the information and alternatives pertinent to the
decision.
Of course, the gains must be viewed with the risks. At Stygian Chemical, as at many corporations, managers have different points
of view toward risk; hence they will draw different conclusions in the circumstances described by the decision tree shown in
Exhibit IV. The many people participating in a decision—those supplying capital, ideas, data, or decisions, and having different
values at risk—will see the uncertainty surrounding the decision in different ways. Unless these differences are recognized and
dealt with, those who must make the decision, pay for it, supply data and analyses to it, and live with it will judge the issue,
relevance of data, need for analysis, and criterion of success in different and conflicting ways.
For example, company stockholders may treat a particular investment as one of a series of possibilities, some of which will work
out, others of which will fail. A major investment may pose risks to a middle manager—to his job and career—no matter what
decision is made. Another participant may have a lot to gain from success, but little to lose from failure of the project. The nature
of the risk—as each individual sees it—will affect not only the assumptions he is willing to make but also the strategy he will follow
in dealing with the risk.
The existence of multiple, unstated, and conflicting objectives will certainly contribute to the “politics” of Stygian Chemical’s
decision, and one can be certain that the political element exists whenever the lives and ambitions of people are affected. Here,
as in similar cases, it is not a bad exercise to think through who the parties to an investment decision are and to try to make these
assessments:
• What is at risk? Is it profit or equity value, survival of the business, maintenance of a job, opportunity for a major career?
• Who is bearing the risk? The stockholder is usually bearing risk in one form. Management, employees, the community—all may be
bearing different risks.
• What is the character of the risk that each person bears? Is it, in his terms, unique, once-in-a-lifetime, sequential, insurable? Does
it affect the economy, the industry, the company, or a portion of the company?
Considerations such as the foregoing will surely enter into top management’s thinking, and the decision tree in Exhibit IV will not
eliminate them. But the tree will show management what decision today will contribute most to its long-term goals. The tool for this
next step in the analysis is the concept of “rollback.”
“Rollback” concept
Here is how rollback works in the situation described. At the time of making Decision #1 (see Exhibit IV), management does not
have to make Decision #2 and does not even know if it will have the occasion to do so. But if it were to have the option at
Decision #2, the company would expand the plant, in view of its current knowledge. The analysis is shown in Exhibit V. (I shall
ignore for the moment the question of discounting future profits; that is introduced later.) We see that the total expected value of
the expansion alternative is $160,000 greater than the no-expansion alternative, over the eight-year life remaining. Hence that is
the alternative management would choose if faced with Decision #2 with its existing information (and thinking only of monetary
gain as a standard of choice).
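Using only the estimates listed earlier (eight remaining years, an 86%/14% split between sustained high and low demand after a high start, and the $2.2 million expansion cost), the Decision #2 comparison can be reproduced with a short, undiscounted calculation. This is a sketch of the arithmetic behind Exhibit V, not the exhibit itself; all figures are in thousands of dollars.

```python
YEARS_REMAINING = 8
P_HIGH, P_LOW = 0.86, 0.14   # chances of sustained high / low demand after a high start

# Expand the small plant: $700k/yr if high demand is sustained, $50k/yr if it is not,
# less the $2,200k cost of the expansion.
expand = (P_HIGH * 700 + P_LOW * 50) * YEARS_REMAINING - 2_200

# Do not expand: $300k/yr under sustained high demand (the market is shared),
# $400k/yr if demand falls back to a low level.
no_expand = (P_HIGH * 300 + P_LOW * 400) * YEARS_REMAINING

print(round(expand), round(no_expand), round(expand - no_expand))  # 2672 2512 160
```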
Readers may wonder why we started with Decision #2 when today’s problem is Decision #1. The reason is the following: We
need to be able to put a monetary value on Decision #2 in order to “roll back” to Decision #1 and compare the gain from taking the
lower branch (“Build Small Plant”) with the gain from taking the upper branch (“Build Big Plant”). Let us call that monetary value
for Decision #2 its position value. The position value of a decision is the expected value of the preferred branch (in this case, the
plant-expansion fork). The expected value is simply a kind of average of the results you would expect if you were to repeat the
situation over and over—getting a $5,600 thousand yield 86% of the time and a $400 thousand yield 14% of the time.
Stated in another way, it is worth $2,672 thousand to Stygian Chemical to get to the position where it can make Decision #2. The
question is: Given this value and the other data shown in Exhibit IV, what now appears to be the best action at Decision #1?
Turn now to Exhibit VI. At the right of the branches in the top half we see the yields for various events if a big plant is built (these
are simply the figures in Exhibit IV multiplied out). In the bottom half we see the small plant figures, including Decision #2 position
value plus the yield for the two years prior to Decision #2. If we reduce all these yields by their probabilities, we get the following
comparison:
Build big plant: ($10 × .60) + ($2.8 × .10) + ($1 × .30) – $3 = $3,600 thousand
Build small plant: ($3.6 × .70) + ($4 × .30) – $1.3 = $2,400 thousand
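The same roll-back arithmetic for Decision #1, written as a sketch using the figures quoted above (all amounts in thousands of dollars). The small difference from the article's rounded totals, such as 3,580 versus $3,600 thousand for the big plant, is only rounding.

```python
# Position value of Decision #2: expected value of the preferred (expansion) branch,
# i.e. eight years of expanded-plant cash flow less the $2,200k expansion cost.
position_value = 0.86 * 5_600 + 0.14 * 400 - 2_200          # about 2,672

# Build big plant: ten years of cash flow under each demand pattern, less the $3,000k cost.
big_plant = (0.60 * 10_000      # high demand sustained: $1,000k/yr for 10 years
             + 0.10 * 2_800     # high start, then low: 2 yrs at $1,000k + 8 yrs at $100k
             + 0.30 * 1_000     # low throughout: $100k/yr for 10 years
             - 3_000)

# Build small plant: two introductory years of cash flow plus, if demand starts high,
# the Decision #2 position value, all less the $1,300k plant cost.
small_plant = (0.70 * (2 * 450 + position_value)   # high start
               + 0.30 * 4_000                      # low throughout: $400k/yr for 10 years
               - 1_300)

print(round(position_value), round(big_plant), round(small_plant))  # 2672 3580 2400
```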
The choice which maximizes expected total cash yield at Decision #1, therefore, is to build the big plant initially.
Accounting for Time
What about taking differences in the time of future earnings into account? The time between successive decision stages on a
decision tree may be substantial. At any stage, we may have to weigh differences in immediate cost or revenue against
differences in value at the next stage. Whatever standard of choice is applied, we can put the two alternatives on a comparable
basis if we discount the value assigned to the next stage by an appropriate percentage. The discount percentage is, in effect, an
allowance for the cost of capital and is similar to the use of a discount rate in the present value or discounted cash flow
techniques already well known to businessmen.
When decision trees are used, the discounting procedure can be applied one stage at a time. Both cash flows and position values
are discounted.
For simplicity, let us assume that a discount rate of 10% per year for all stages is decided on by Stygian Chemical’s management.
Applying the rollback principle, we again begin with Decision #2. Taking the same figures used in previous exhibits and
discounting the cash flows at 10%, we get the data shown in Part A of Exhibit VII. Note particularly that these are the present
values as of the time Decision #2 is made.
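The discounting step itself is just the ordinary present-value formula applied year by year. A generic sketch follows; the cash-flow stream shown is a hypothetical illustration, not the Exhibit VII figures.

```python
def present_value(cash_flows, rate):
    """Discount a list of year-end cash flows back to the start of the stage."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))

# Hypothetical stream: $700k per year for eight years, discounted at 10%.
stream = [700] * 8                         # thousands of dollars, years 1 through 8
print(round(present_value(stream, 0.10)))  # about 3,734 (thousands)
```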
Now we want to go through the same procedure used in Exhibit V when we obtained expected values, only this time using the
discounted yield figures and obtaining a discounted expected value. The results are shown in Part B of Exhibit VII. Since the
discounted expected value of the no-expansion alternative is higher, that figure becomes the position value of Decision #2 this
time.
Having done this, we go back to work through Decision #1 again, repeating the same analytical procedure as before only with
discounting. The calculations are shown in Exhibit VIII. Note that the Decision #2 position value is treated at the time of Decision
#1 as if it were a lump sum received at the end of the two years.
The large-plant alternative is again the preferred one on the basis of discounted expected cash flow. But the margin of difference
over the small-plant alternative ($290 thousand) is smaller than it was without discounting.
Uncertainty Alternatives
In illustrating the decision-tree concept, I have treated uncertainty alternatives as if they were discrete, well-defined possibilities.
For my examples I have made use of uncertain situations depending basically on a single variable, such as the level of demand or
the success or failure of a development project. I have sought to avoid unnecessary complication while putting emphasis on the
key interrelationships among the present decision, future choices, and the intervening uncertainties.
In many cases, the uncertain elements do take the form of discrete, single-variable alternatives. In others, however, the
possibilities for cash flow during a stage may range through a whole spectrum and may depend on a number of independent or
partially related variables subject to chance influences—cost, demand, yield, economic climate, and so forth. In these cases, we
have found that the range of variability or the likelihood of the cash flow falling in a given range during a stage can be calculated
readily from knowledge of the key variables and the uncertainties surrounding them. Then the range of cash-flow possibilities
during the stage can be broken down into two, three, or more “subsets,” which can be used as discrete chance alternatives.
Conclusion
Peter F. Drucker has succinctly expressed the relation between present planning and future events: “Long-range planning does
not deal with future decisions. It deals with the futurity of present decisions.”2 Today’s decision should be made in light of the
anticipated effect it and the outcome of uncertain events will have on future values and decisions. Since today’s decision sets the
stage for tomorrow’s decision, today’s decision must balance economy with flexibility; it must balance the need to capitalize on
profit opportunities that may exist with the capacity to react to future circumstances and needs.
The unique feature of the decision tree is that it allows management to combine analytical techniques such as discounted cash
flow and present value methods with a clear portrayal of the impact of future decision alternatives and events. Using the decision
tree, management can consider various courses of action with greater ease and clarity. The interactions between present decision
alternatives, uncertain events, and future choices and their results become more visible.
Of course, there are many practical aspects of decision trees in addition to those that could be covered in the space of just one
article. When these other aspects are discussed in subsequent articles,3 the whole range of possible gains for management will
be seen in greater detail.
Surely the decision-tree concept does not offer final answers to managements making investment decisions in the face of
uncertainty. We have not reached that stage, and perhaps we never will. Nevertheless, the concept is valuable for illustrating the
structure of investment decisions, and it can likewise provide excellent help in the evaluation of capital investment opportunities.
1. Optimal Investment Decisions: Rules for Action and Criteria for Choice (Englewood Cliffs, New Jersey, Prentice-Hall, Inc., 1962), p. 250.
2. “Long-Range Planning,” Management Science, April 1959, p. 239.
3. We are expecting another article by Mr. Magee in a forthcoming issue.—The Editors
Appendix
For readers interested in further examples of decision-tree structure, I shall describe in this appendix two representative situations
with which I am familiar and show the trees that might be drawn to analyze management’s decision-making alternatives. We shall
not concern ourselves here with costs, yields, probabilities, or expected values.
New Facility
The choice of alternatives in building a plant depends upon market forecasts. The alternative chosen will, in turn, affect the market
outcome. For example, the military products division of a diversified firm, after some period of low profits due to intense
competition, has won a contract to produce a new type of military engine suitable for Army transport vehicles. The division has a
contract to build productive capacity and to produce at a specified contract level over a period of three years.
Figure A illustrates the situation. The dotted line shows the contract rate. The solid line shows the proposed buildup of production
for the military. Some other possibilities are portrayed by dashed lines. The company is not sure whether the contract will be
continued at a relatively high rate after the third year, as shown by Line A, or whether the military will turn to another newer
development, as indicated by Line B. The company has no guarantee of compensation after the third year. There is also the
possibility, indicated by Line C, of a large additional commercial market for the product, this possibility being somewhat dependent
on the cost at which the product can be made and sold.
If this commercial market could be tapped, it would represent a major new business for the company and a substantial
improvement in the profitability of the division and its importance to the company.
Management wants to explore three ways of producing the product as follows:
1. It might subcontract all fabrication and set up a simple assembly with limited need for investment in plant and equipment; the
costs would tend to be relatively high and the company’s investment and profit opportunity would be limited, but the company
assets which are at risk would also be limited.
2. It might undertake the major part of the fabrication itself but use general-purpose machine tools in a plant of general-purpose
construction. The division would have a chance to retain more of the most profitable operations itself, exploiting some technical
developments it has made (on the basis of which it got the contract). While the cost of production would still be relatively high, the
nature of the investment in plant and equipment would be such that it could probably be turned to other uses or liquidated if the
business disappeared.
3. The company could build a highly mechanized plant with specialized fabrication and assembly equipment, entailing the largest
investment but yielding a substantially lower unit manufacturing cost if manufacturing volume were adequate. Following this plan
would improve the chances for a continuation of the military contract and penetration into the commercial market and would
improve the profitability of whatever business might be obtained in these markets. Failure to sustain either the military or the
commercial market, however, would cause substantial financial loss.
Either of the first two alternatives would be better adapted to low-volume production than would the third.
Some major uncertainties are: the cost-volume relationships under the alternative manufacturing methods; the size and structure
of the future market—this depends in part on cost, but the degree and extent of dependence are unknown; and the possibilities of
competitive developments which would render the product competitively or technologically obsolete.
How would this situation be shown in decision-tree form? (Before going further you might want to draw a tree for the problem
yourself.) Figure B shows my version of a tree. Note that in this case the chance alternatives are somewhat influenced by the
decision made. A decision, for example, to build a more efficient plant will open possibilities for an expanded market.
Plant Modernization
A company management is faced with a decision on a proposal by its engineering staff which, after three years of study, wants to
install a computer-based control system in the company’s major plant. The expected cost of the control system is some $30
million. The claimed advantages of the system will be a reduction in labor cost and an improved product yield. These benefits
depend on the level of product throughput, which is likely to rise over the next decade. It is thought that the installation program
will take about two years and will cost a substantial amount over and above the cost of equipment. The engineers calculate that
the automation project will yield a 20% return on investment, after taxes; the projection is based on a ten-year forecast of product
demand by the market research department, and an assumption of an eight-year life for the process control system.
What would this investment yield? Will actual product sales be higher or lower than forecast? Will the process work? Will it
achieve the economies expected? Will competitors follow if the company is successful? Are they going to mechanize anyway?
Will new products or processes make the basic plant obsolete before the investment can be recovered? Will the controls last eight
years? Will something better come along sooner?
The initial decision alternatives are (a) to install the proposed control system, (b) postpone action until trends in the market and/or
competition become clearer, or (c) initiate more investigation or an independent evaluation. Each alternative will be followed by
resolution of some uncertain aspect, in part dependent on the action taken. This resolution will lead in turn to a new decision. The
dotted lines at the right of Figure C indicate that the decision tree continues indefinitely, though the decision alternatives do tend
to become repetitive. In the case of postponement or further study, the decisions are to install, postpone, or restudy; in the case of
installation, the decisions are to continue operation or abandon.
An immediate decision is often one of a sequence. It may be one of a number of sequences. The impact of the present decision in
narrowing down future alternatives and the effect of future alternatives in affecting the value of the present choice must both be
considered