Chapter 5 – The normal distribution

5.1 Probability distributions of continuous random variables

A random variable X is called continuous if it can assume any value in some interval, i.e. the number of possible values is infinite. In this case the definition of a discrete random variable (a list of possible values with their corresponding probabilities) cannot be used, since it is not possible to draw up a list of infinitely many values. For this reason the probability associated with any individual value of a continuous random variable X is taken as 0.

The clustering pattern of the values of X over the interval is described by a mathematical function f(x) called the probability density function. A high (low) clustering of values results in high (low) values of this function. For a continuous random variable X, only probabilities associated with ranges of values (e.g. an interval of values from a to b) are calculated. The probability that the value of X falls between a and b is given by the area between a and b under the curve describing the probability density function f(x). For any probability density function the total area under the graph of f(x) is 1.

5.2 Normal distribution

A continuous random variable X is normally distributed (follows a normal distribution) if the probability density function of X is given by

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$ for $-\infty < x < \infty$.

The constants µ and σ can be shown to be the mean and standard deviation respectively of X. These constants completely specify the density function. A graph of the curve describing the probability density function (known as the normal curve) for the case µ = 0 and σ = 1 is shown below.

[Figure: graph of the standard normal distribution, p(z) plotted against z for z from -4 to 4]

5.2.1 Properties of the normal distribution

The graph of the function defined above has a symmetric, bell-shaped appearance.
The mean µ is located on the horizontal axis where the graph reaches its maximum value. At the two ends of the scale the curve describing the function gets closer and closer to the horizontal axis without actually touching it. Many quantities measured in everyday life have a distribution which closely matches that of a normal random variable (e.g. marks in an exam, weights of products, heights of a male population). The parameter µ shows where the distribution is centrally located and σ the spread of the values around µ. A shorthand way of referring to a random variable X which follows a normal distribution with mean µ and variance σ² is by writing X ~ N(µ, σ²).

The next diagram shows graphs of normal distributions for various values of µ and σ². An increase (decrease) in the mean µ results in a shift of the graph to the right (left), e.g. the curve of the distribution with a mean of -2 is moved 2 units to the left. An increase (decrease) in the standard deviation σ results in the graph becoming more (less) spread out, e.g. compare the curves of the distributions with σ² = 0.2, 0.5, 1 and 5.

5.2.2 Empirical example – The normal distribution and the histogram

Consider the scores obtained by 4 500 candidates in a matric mathematics examination. The histogram of the marks has an appearance that can be described by a normal curve, i.e. it has a symmetric, bell-shaped appearance. The mean of the marks is 59.95 and the standard deviation 10. The histogram below shows the distribution of the marks.

[Figure: histogram of the marks, frequency (freq) plotted against mark]

5.3 The Standard Normal Distribution

To find probabilities for a normally distributed random variable, we need to be able to calculate the areas under the graph of the normal distribution. Such areas are obtained from a table showing the cumulative distribution of the normal distribution (see appendix).
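As a cross-check on table lookups, areas under a normal curve can also be computed numerically. The sketch below assumes Python's standard-library `statistics.NormalDist` (not something the notes themselves use) and applies it to the marks distribution of section 5.2.2; the helper name `area_between` and the chosen range are illustrative only.

```python
from statistics import NormalDist

# Marks distribution from the empirical example: mean 59.95, standard deviation 10.
marks = NormalDist(mu=59.95, sigma=10)

def area_between(dist, a, b):
    """Area under the density of `dist` between a and b, i.e. P(a <= X <= b)."""
    return dist.cdf(b) - dist.cdf(a)

# Illustrative range (not from the notes): within one standard deviation of the mean.
within_one_sd = area_between(marks, 49.95, 69.95)
print(round(within_one_sd, 4))
```

The cumulative distribution function `cdf` plays the role of the table here: it returns the area to the left of a value, and the difference of two such areas gives the area between them.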
Since the normal distribution is specified by the mean (µ) and standard deviation (σ), there are many possible normal distributions. It is impossible to construct a table for each possible mean and standard deviation. This problem is overcome by transforming X, the normal random variable of interest [X ~ N(µ, σ²)], to a standardized normal random variable $Z = \frac{X-\mu}{\sigma}$. It can be shown that the transformed random variable Z ~ N(0, 1). The random variable Z can be transformed back to X by using the formula $X = \mu + \sigma Z$.

The normal distribution with mean µ = 0 and standard deviation σ = 1 is called the standard normal distribution. The symbol Z is reserved for a random variable with this distribution. The graph of the standard normal distribution appears below.

[Figure: graph of the standard normal distribution with various areas under the curve marked]

Various areas under the above normal curve are shown. The standard normal table gives the area under the curve to the left of the value z. Other types of areas can be found by combining several of the areas as shown in the next examples.

5.4 Calculating probabilities using the standard normal table

The first few lines of the standard normal table are shown below.

Z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.7   0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
-3.6   0.0002  0.0002  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001  0.0001
.
-0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641
0.1    0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
.
3.7    0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999  0.9999

The areas shown in the table are those under the standard normal curve to the left of the value of z looked up, i.e. P(Z ≤ z), e.g. P(Z ≤ 0.14) = 0.5557 (row 0.1, column 0.04).

Note
1 For negative values of z less than the minimum value (-3.79) in the table, the probabilities are taken as 0, i.e. P(Z ≤ z) = 0 for z < -3.79.
2 For positive values of z greater than the maximum value (3.79) in the table, the probabilities are taken as 1, i.e. P(Z ≤ z) = 1 for z > 3.79.

Examples
In all the examples that follow, Z ~ N(0, 1).
1 P(Z < 1.35) = 0.9115
2 P(Z > -0.47) = 1 – P(Z ≤ -0.47) = 1 – 0.3192 = 0.6808
3 P(-0.47 < Z < 1.35) = P(Z < 1.35) – P(Z < -0.47) = 0.9115 – 0.3192 = 0.5923
4 P(Z > 0.76) = 1 – P(Z < 0.76) = 1 – 0.7764 = 0.2236
5 P(0.95 ≤ Z ≤ 1.36) = P(Z ≤ 1.36) – P(Z ≤ 0.95) = 0.9131 – 0.8289 = 0.0842
6 P(-1.96 ≤ Z ≤ 1.96) = P(Z ≤ 1.96) – P(Z ≤ -1.96) = 0.9750 – 0.0250 = 0.95

In all the above examples an area was found for a given value of z. It is also possible to find a value of z when the area to its left is given. This can be written as $P(Z \le z_\alpha) = \alpha$ (α is the Greek letter for a and is pronounced “alpha”). In this case $z_\alpha$ has to be found, where α is the area to its left.

Examples
1 Find the value of z that has an area of 0.0344 to its left. Search the body of the table for the required area (0.0344) and then read off the value of z corresponding to this area. In this case $z_{0.0344} = -1.82$.
2 Find the value of z that has an area of 0.975 to its left. Finding 0.975 in the body of the table and reading off the z value gives $z_{0.975} = 1.96$.
3 Find the values of z that have areas of 0.95 and 0.05 to their left. When searching the body of the table for 0.95 this value is not found. The z value corresponding to 0.95 can be estimated from the following information obtained from the table.

z              1.64     ?      1.65
area to left   0.9495   0.95   0.9505

Since the required area (0.95) is halfway between the 2 areas obtained from the table, the required z can be taken as the value halfway between the two z values that were obtained from the table, i.e. $z = \frac{1.64 + 1.65}{2} = 1.645$.

Exercise: Using the same approach as above, verify that the z value corresponding to an area of 0.05 to its left is -1.645.

At the bottom of the standard normal table selected percentiles $z_\alpha$ are given for different values of α.
This means that the area under the normal curve to the left of $z_\alpha$ is α.

Examples:
1 α = 0.900, $z_\alpha$ = 1.282 means P(Z < 1.282) = 0.900.
2 α = 0.995, $z_\alpha$ = 2.576 means P(Z < 2.576) = 0.995.
3 α = 0.005, $z_\alpha$ = -2.576 means P(Z < -2.576) = 0.005.

The standard normal distribution is symmetric about its mean of 0. From this it follows that the area under the normal curve to the right of a positive z entry in the standard normal table is the same as the area to the left of the associated negative entry (-z), i.e. P(Z ≥ z) = P(Z ≤ -z). E.g. P(Z ≥ 1.96) = 1 – 0.975 = 0.025 = P(Z ≤ -1.96).

5.5 Calculating probabilities for any normal random variable

Let X be a N(µ, σ²) random variable and Z a N(0, 1) random variable. Then
1 $P(X \le x) = P\left(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma}\right) = P\left(Z \le \frac{x-\mu}{\sigma}\right)$.
2 $P(a \le X \le b) = P\left(\frac{a-\mu}{\sigma} \le \frac{X-\mu}{\sigma} \le \frac{b-\mu}{\sigma}\right) = P\left(\frac{a-\mu}{\sigma} \le Z \le \frac{b-\mu}{\sigma}\right)$.

Examples:
1 The height H (in inches) of a population of women is approximately normally distributed with a mean of 63.5 and a standard deviation of 2.75 inches. To calculate the probability that a woman is less than 63 inches tall, we first find the z-score for 63 inches, $z = \frac{63 - 63.5}{2.75} = -0.18$, and then use P(H ≤ 63) = P(Z ≤ -0.18) = 0.4286. This means that 42.86% (a proportion of 0.4286) of women are less than 63 inches tall.
2 The length X (in inches) of sardines is a N(4.62, 0.0529) random variable, so that σ = 0.23. What proportion of sardines is
(a) longer than 5 inches?
(b) between 4.35 and 4.85 inches?

(a) $P(X > 5) = P\left(Z > \frac{5 - 4.62}{0.23}\right)$ = P(Z > 1.65) = 1 – P(Z ≤ 1.65) = 1 – 0.9505 = 0.0495.
(b) $P(4.35 \le X \le 4.85) = P\left(\frac{4.35 - 4.62}{0.23} \le Z \le \frac{4.85 - 4.62}{0.23}\right)$ = P(-1.17 ≤ Z ≤ 1) = P(Z ≤ 1) – P(Z ≤ -1.17) = 0.8413 – 0.1210 = 0.7203.

5.6 Finding percentiles by using the standard normal table

The standard normal table can be used to find percentiles for random variables which are normally distributed.

Example
The scores S obtained in a mathematics entrance examination are normally distributed with µ = 514 and σ = 113. Find the score which marks the 80th percentile.
From the standard normal table, the z-score which is closest to an entry of 0.80 in the body of the table is 0.84 (the actual area to its left is 0.7995). The score which corresponds to a z-score of 0.84 can be found by solving $0.84 = \frac{s - 514}{113}$ for s. This yields s = 608.92, i.e. a score of approximately 609 is better than 80% of all other exam scores.

Exercises: All these exercises refer to the normal distribution above.
1 Find $P_{35}$.
2 If a person scores in the top 5% of test scores, what is the minimum score they could have received?
3 If a person scores in the bottom 10% of test scores, what is the maximum score they could have received?

5.7 Computer output

Excel has built-in functions that can be used to find areas under the normal curve for a given z-score or to calculate a z-score that has a given area under the normal curve to its left.

1 The table below shows areas under the standard normal curve to the left of various z-scores.

z-score   area
-2.5      0.0062
-2        0.0228
-1.5      0.0668
-1        0.1587
-0.5      0.3085
0         0.5
0.5       0.6915
1         0.8413
1.5       0.9332
2         0.9772
2.5       0.9938

2 The table below shows z-scores for certain areas under the standard normal curve to their left.

area    z-score
0.005   -2.5758
0.01    -2.3263
0.025   -1.96
0.05    -1.6449
0.1     -1.2816
0.2     -0.8416
0.8     0.8416
0.9     1.2816
0.95    1.6449
0.975   1.96
0.99    2.3263
0.995   2.5758

Chapter 6 – Sampling distributions

6.1 Definitions

A sampling distribution arises when repeated samples are drawn from a particular population (distribution) and a statistic (a numerical measure describing the sample data) is calculated for each sample. The interest is then focused on the probability distribution (called the sampling distribution) of the statistic.

Sampling distributions arise in the context of statistical inference, i.e. when statements are made about a population on the basis of random samples drawn from it.
Example
Suppose all possible samples of size 2 are drawn with replacement from a population with sample space S = {2, 4, 6, 8} and the mean is calculated for each sample. The different values that can be obtained and their corresponding means are shown in the table below.

1st value / 2nd value   2   4   6   8
2                       2   3   4   5
4                       3   4   5   6
6                       4   5   6   7
8                       5   6   7   8

In the above table the row and column entries indicate the two values in the sample (16 possibilities when combining rows and columns). The mean is located in the cell corresponding to these entries, e.g. 1st value = 4, 2nd value = 6 has a mean entry of $\frac{4 + 6}{2} = 5$.

Assuming that random sampling is used, all 16 cells in the above table are equally likely. Under this assumption the following distribution can be constructed for these mean values.

$\bar{x}$   count   $P(\bar{X} = \bar{x})$
2           1       1/16
3           2       1/8
4           3       3/16
5           4       1/4
6           3       3/16
7           2       1/8
8           1       1/16
sum         16      1

The above distribution is referred to as the sampling distribution of the mean for random samples of size 2 drawn from this population.

The mean of the population from which these samples are drawn is µ = 5 and the variance is $\sigma^2 = \left[\sum x^2 - (\sum x)^2/N\right]/N$ = (2² + 4² + 6² + 8² – 20²/4)/4 = 5.

The sampling distribution of the mean has mean $\mu_{\bar{X}} = 5$ and variance $\sigma^2_{\bar{X}} = \sum \bar{x}^2 P(\bar{X} = \bar{x}) - \mu^2 = \frac{440}{16} - 5^2 = 2.5$ (verify this result).

Note that $\mu_{\bar{X}} = 5 = \mu$ and that $\sigma^2_{\bar{X}} = \sigma^2/2 = 5/2 = 2.5$.

Consider a population with mean µ and variance σ². It can be shown that the mean and variance of the sampling distribution of the mean, based on a random sample of size n, are given by $\mu_{\bar{X}} = \mu$ and $\sigma^2_{\bar{X}} = \sigma^2/n$. The quantity $\sigma_{\bar{X}} = \sigma/\sqrt{n}$ is known as the standard error. In the preceding example n = 2.

Sampling distributions can involve different statistics (e.g. sample mean, sample proportion, sample variance) calculated from different sample sizes drawn from different distributions. Some of the important results from statistical theory concerning sampling distributions are summarized in the sections that follow.
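The enumeration in the example above can be checked with a short script. This is a minimal sketch in Python using only the standard library; the variable names are illustrative and exact fractions are used so that the results match the hand calculation without rounding.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]
n = 2  # sample size

# Population mean and variance (divide by N, as in the text).
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / N

# All 16 equally likely ordered samples of size 2, drawn with replacement.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

# Sampling distribution of the mean: P(mean = value).
distribution = {}
for m in sample_means:
    distribution[m] = distribution.get(m, 0) + Fraction(1, len(sample_means))

mean_of_means = sum(m * p for m, p in distribution.items())
var_of_means = sum(m ** 2 * p for m, p in distribution.items()) - mean_of_means ** 2

print(mu, sigma2)                   # population mean and variance
print(mean_of_means, var_of_means)  # mean and variance of the sample mean
print(var_of_means == sigma2 / n)   # checks that the variance equals sigma^2 / n
```

Changing `n` to 3 or 4 enumerates larger samples and gives a quick way to see the variance of the sample mean shrink as σ²/n.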
6.2 The Central Limit Theorem

The following result is known as the Central Limit Theorem.

Let X1, X2, . . . , Xn be a random sample of size n drawn from a distribution with mean µ and variance σ² (σ² should be finite). Then for sufficiently large n the mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ is approximately normally distributed with mean $\mu_{\bar{X}} = \mu$ and variance $\sigma^2_{\bar{X}} = \sigma^2/n$. This result can be written as $\bar{X}$ ~ N(µ, σ²/n).

Note:
1 The random variable $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ ~ N(0, 1).
2 The value of n for which this theorem is valid depends on the distribution from which the sample is drawn. If the sample is drawn from a normal population, the theorem is valid for all n. If the distribution from which the sample is drawn is fairly close to being normal, a value of n > 30 will suffice for the theorem to be valid. If the distribution from which the sample is drawn is substantially different from a normal distribution, e.g. positively or negatively skewed, a value of n much larger than 30 will be needed for the theorem to be valid.
3 There are various versions of the central limit theorem. The only other central limit theorem result that will be used here is the following one.

If the population from which the sample is drawn is a Bernoulli distribution (consists of only values of 0 or 1, with probability p of drawing a 1 and probability q = 1 – p of drawing a 0), then $S = \sum_{i=1}^{n} X_i$ follows a binomial distribution with mean $\mu_S = np$ and variance $\sigma^2_S = npq$.

According to the central limit theorem, $\hat{P} = S/n = \sum_{i=1}^{n} X_i / n$ approximately follows a normal distribution with mean $\mu(\hat{P}) = \mu_S/n = np/n = p$ and variance $\sigma^2(\hat{P}) = \sigma^2_S/n^2 = npq/n^2 = pq/n$ when n is sufficiently large. $\hat{P}$ is the proportion of 1’s in the sample and can be seen as an estimate of p, the proportion of 1’s in the population (the distribution from which the sample is drawn).

Using the central limit theorem, it follows that $Z = \frac{\hat{P} - \mu(\hat{P})}{\sigma(\hat{P})} = \frac{\hat{P} - p}{\sqrt{pq/n}}$ ~ N(0, 1).

Example:
An electric firm manufactures light bulbs whose lifetime (in hours) follows a normal distribution with mean 800 and variance 1600. A random sample of 10 light bulbs is drawn and the lifetime recorded for each light bulb. Calculate the probability that the mean of this sample
(a) differs from the actual mean lifetime of 800 by not more than 16 hours.
(b) differs from the actual mean lifetime of 800 by more than 16 hours.
(c) is greater than 820 hours.
(d) is less than 785 hours.

(a) $P(-16 \le \bar{X} - 800 \le 16) = P(|\bar{X} - 800| \le 16) = P(|Z| \le 16/\sqrt{1600/10})$ = P(|Z| ≤ 1.265) = P(Z ≤ 1.265) – P(Z ≤ -1.265) = 0.8971 – 0.1029 = 0.7942
(b) $P(|\bar{X} - 800| > 16) = 1 - P(|\bar{X} - 800| \le 16)$ = 1 – 0.7942 = 0.2058
(c) $P(\bar{X} > 820) = P\left(Z > \frac{820 - 800}{\sqrt{1600/10}}\right)$ = P(Z > 1.58) = 1 – 0.9429 = 0.0571
(d) $P(\bar{X} < 785) = P\left(Z < \frac{785 - 800}{\sqrt{1600/10}}\right)$ = P(Z < -1.19) = 0.1170

6.3 The t-distribution (Student’s t-distribution)

The central limit theorem states that the statistic $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows a standard normal distribution. If σ is not known, it would be logical to replace σ (in the formula for Z) by its sample estimate S. For small values of the sample size n, the statistic $t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ does not follow a normal distribution. If it is assumed that sampling is done from a population that is approximately normal, the statistic t follows a t-distribution. This distribution changes with the degrees of freedom ν = df = n – 1, i.e. for each value of the degrees of freedom a different distribution is defined.

The t-distribution was first proposed in a paper by William Gosset in 1908, who wrote the paper under the pseudonym “Student”.

The t-distribution has the following properties.
1. The Student t-distribution is symmetric and bell-shaped, but for smaller sample sizes it shows increased variability when compared to the standard normal distribution (its curve has a flatter appearance than that of the standard normal distribution).
In other words, the distribution is less peaked than a standard normal distribution and has thicker tails. As the sample size increases, the distribution approaches a standard normal distribution. For n > 30, the differences are negligible.
2. The mean is zero (like the standard normal distribution).
3. The distribution is symmetrical about the mean.
4. The variance is greater than one, but approaches one from above as the sample size increases (σ² = 1 for the standard normal distribution).

The graph below shows how the t-distribution changes for different values of r (the degrees of freedom).

[Figure: t-distribution curves for different degrees of freedom]

Tables for the t-distribution

The layout of the t-tables is as follows.

ν = df / α   0.900   0.95    ..   0.995
1            3.078   6.314   ..   63.66
2            1.886   2.920   ..   9.925
.
∞            1.282   1.645   ..   2.576

The row entry is the degrees of freedom (df) and the column entry (α) the area under the t-curve to the left of the value that appears in the table at the intersection of the row and column entry.

When a t-value that has an area less than 0.5 to its left is to be looked up, the fact that the t-distribution is symmetrical around 0 is used, i.e. $P(t \le t_\alpha) = P(t \le -t_{1-\alpha}) = P(t \ge t_{1-\alpha})$ for α ≤ 0.5 (using symmetry). This means that $t_\alpha = -t_{1-\alpha}$.

Examples
1 For df = 2 and α = 0.995 the entry is 9.925. This means that for the t-distribution with 2 degrees of freedom P(t ≤ 9.925) = 0.995.
2 For df = ∞ and α = 0.95 the entry is 1.645. This means that for the t-distribution with ∞ degrees of freedom P(t ≤ 1.645) = 0.95.
3 For df = ν = 10 and α = 0.10 the value of $t_{0.10}$ such that $P(t \le t_{0.10}) = 0.10$ is found from $t_{0.10} = -t_{1-0.10} = -t_{0.90} = -1.372$.

Note that the percentile values in the last row of the t-distribution table are identical to the corresponding percentile entries in the standard normal table. Since the t-distribution approaches the standard normal distribution as the sample size (degrees of freedom) becomes large, their percentiles should be the same.
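The agreement between the last row of the t-table and the standard normal percentiles can be checked numerically. The sketch below assumes Python's standard-library `statistics.NormalDist` for the normal percentiles; the three t-values are simply copied from the table layout above, and the dictionary name is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Last row (df = infinity) of the t-table layout shown above.
t_table_last_row = {0.900: 1.282, 0.95: 1.645, 0.995: 2.576}

for alpha, t_value in t_table_last_row.items():
    z_alpha = z.inv_cdf(alpha)  # value with area alpha to its left
    print(alpha, t_value, round(z_alpha, 3))

# Symmetry: the percentile with area 0.10 to its left is minus the 0.90 percentile.
print(round(z.inv_cdf(0.10), 3), round(-z.inv_cdf(0.90), 3))
```

The standard library has no t-distribution, so the finite-df rows of the table are not reproduced here; a statistics package would be needed for those.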
6.4 The chi-square (χ²) distribution

The chi-square distribution arises in a number of sampling situations. These include the ones described below.

1 Drawing repeated samples of size n from an approximately normal distribution with variance σ² and calculating the variance (S²) for each sample. It can be shown that the quantity $\chi^2 = \frac{(n-1)S^2}{\sigma^2}$ follows a chi-square distribution with degrees of freedom = n – 1.

2 When comparing sequences of observed and expected frequencies as shown in the table below. The observed frequencies (referring to the number of times values of some variable of interest occur) are obtained from an experiment, while the expected ones arise from some pattern believed to be true.

observed frequency   f1   f2   ..   fk
expected frequency   e1   e2   ..   ek

The quantity $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$ can be shown to follow a chi-square distribution with k – 1 degrees of freedom. The purpose of calculating this χ² is to make an assessment as to how well the observed and expected frequencies correspond.

The chi-square curve is different for each value of degrees of freedom. The graph below shows how the chi-square distribution changes for different values of ν (the degrees of freedom).

[Figure: chi-square curves for different degrees of freedom]

Unlike the normal and t-distributions, the chi-square distribution is only defined for positive values and is not a symmetrical distribution. As the degrees of freedom increase, the chi-square distribution becomes more and more symmetrical. For a sufficiently large value of degrees of freedom the chi-square distribution approaches the normal distribution.

Tables for the chi-square distribution

The layout of the chi-square tables is as follows.

ν = df / α   0.005      0.01       ..   0.99    0.995
1            0.000039   0.000157   ..   6.63    7.88
2            0.010025   0.020101   ..   9.21    10.60
.
30           13.79      14.95      ..   50.89   53.67

The row entry is the degrees of freedom (df) and the column entry (α) the area under the chi-square curve to the left of the value that appears in the table at the intersection of the row and column entry.
Examples:
1 For df = 30 and α = 0.01 the entry is 14.95. This means that for the chi-square distribution with 30 degrees of freedom P(χ² ≤ 14.95) = 0.01.
2 For df = 30 and α = 0.995 the entry is 53.67. This means that for the chi-square distribution with 30 degrees of freedom P(χ² ≤ 53.67) = 0.995.
3 For df = 6 and α = 0.95 the entry is 12.59. This means that for the chi-square distribution with 6 degrees of freedom P(χ² ≤ 12.59) = 0.95 or P(χ² > 12.59) = 0.05.

This probability statement is illustrated in the next graph.

[Figure: chi-square curve with 6 degrees of freedom, area of 0.05 to the right of 12.59]

6.5 The F-distribution

Random samples of sizes n1 and n2 are drawn from normally distributed populations that are labeled 1 and 2 respectively. Denote the variances calculated from these samples by $S_1^2$ and $S_2^2$ respectively and their corresponding population variances by $\sigma_1^2$ and $\sigma_2^2$ respectively.

The ratio $F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$ is distributed according to an F-distribution (named after the famous statistician R.A. Fisher) with degrees of freedom $df_1 = n_1 - 1$ (called the numerator degrees of freedom) and $df_2 = n_2 - 1$ (called the denominator degrees of freedom). When $\sigma_1^2 = \sigma_2^2$ the F-ratio is $F = \frac{S_1^2}{S_2^2}$.

The F-distribution is positively skewed, and the F-values can only be positive. The graph below shows plots for a number of F-distributions (F-curves) with $\sigma_1^2 = \sigma_2^2$. These plots are referred to by $F(df_1, df_2)$, e.g. F(33,10) refers to an F-distribution with 33 degrees of freedom associated with the numerator and 10 degrees of freedom associated with the denominator.

[Figure: F-curves for several combinations of degrees of freedom]

For each combination of $df_1$ and $df_2$ there is a different F-distribution. Three other important distributions are related to special cases of the F-distribution. The square of a standard normal variable follows an F(1, ∞) distribution, the square of a t variable with $n_2$ degrees of freedom follows an F(1, $n_2$) distribution, and a chi-square variable with $n_1$ degrees of freedom divided by $n_1$ follows an F($n_1$, ∞) distribution.

Tables for the F-distribution

The layout of the F-distribution tables with $\sigma_1^2 = \sigma_2^2$ is as follows.

df2 / df1   1       2       ...   ∞
1           161.5   199.5   ...   254.3
2           18.51   19.0    ...   19.5
.
∞           3.85    3.0     ...   1.01
The entry in the table corresponding to a pair of ($df_1$, $df_2$) values has an area of α under the $F(df_1, df_2)$ curve to its right.

Examples
1 F(3,26) = 2.98 has an area (under the F(3,26) curve) of 0.05 to its right (see graph below).
2 F(4,32) = 2.67 has an area (under the F(4,32) curve) of 0.05 to its right (see graph below).

[Figures: F(3,26) and F(4,32) curves, each with an area of 0.05 shaded in the right tail]

For each different value of α a different F-table is used to read off a value that has an area of α to its right, i.e. a percentage of 100(1 – α) to its left. The F-tables that are used and their α and 100(1 – α) values are summarized in the table below.

α       Percentage point = 100(1 – α)
0.05    95%
0.025   97.5%
0.01    99%

In the above table the percentage point refers to the percentage of the area under the F-curve to the left of the F-value read off, and α to the proportion of the area under the F-curve to the right of this F-value.

Examples:
1 For $df_1 = 7$, $df_2 = 5$ the value read from the 95% F-distribution table is 4.88. This means that for this F-distribution 95% of the area under the F-curve is to the left of 4.88 (a proportion of 0.05 to the right of 4.88).
P(F ≤ 4.88) = 0.95
P(F > 4.88) = 0.05
2 For $df_1 = 7$, $df_2 = 5$ the value read from the 97.5% F-distribution table is 6.85. This means that for this F-distribution 97.5% of the area under the F-curve is to the left of 6.85 (a proportion of 0.025 to the right of 6.85).
P(F ≤ 6.85) = 0.975
P(F > 6.85) = 0.025
3 For $df_1 = 10$, $df_2 = 17$ the value read from the 99% F-distribution table is 3.59. This means that for this F-distribution 99% of the area under the F-curve is to the left of 3.59 (a proportion of 0.01 to the right of 3.59).
P(F ≤ 3.59) = 0.99
P(F > 3.59) = 0.01

Lower tail values from the F-distribution

Only upper tail values (areas of 5%, 2.5% and 1% above) can be read off from the F-tables. Lower tail values can be calculated from the formula

$F(df_1, df_2; \alpha) = \frac{1}{F(df_2, df_1; 1-\alpha)}$

i.e. F value with an area α under the F-curve to its left = 1/(F value with an area 1 – α under the F-curve to its left with numerator and denominator degrees of freedom interchanged).

Examples
1 Find the value such that 2.5% of the area under the F(7,5) curve is to the left of it. In the above formula $df_1 = 7$, $df_2 = 5$ and α = 0.025. Then $F(7,5; 0.025) = \frac{1}{F(5,7; 0.975)} = \frac{1}{5.29} = 0.189$.
2 Find the value such that 1% of the area under the F(10,17) curve is to the left of it. In the above formula $df_1 = 10$, $df_2 = 17$ and α = 0.01. Then $F(10,17; 0.01) = \frac{1}{F(17,10; 0.99)} = \frac{1}{4.49} = 0.223$.

6.6 Computer output

In Excel, values from the t, chi-square and F-distributions that have a given area under the curve above them can be found by using the TINV(area, df), CHIINV(area, df) and FINV(area, df1, df2) functions respectively.

Examples
1 TINV(0.05, 15) = 2.13145. The area under the t(15) curve to the right of 2.13145 is 0.025 and to the left of -2.13145 is 0.025. Thus the total tail area is 0.05.
2 CHIINV(0.01, 14) = 29.14124. The area under the chi-square(14) curve to the right of 29.14124 is 0.01.
3 FINV(0.05, 10, 8) = 3.347163. The area under the F(10, 8) curve to the right of 3.347163 is 0.05.

Chapter 7 – Statistical Inference: Estimation for one sample case

7.1 Statistical inference

Statistical inference (inferential statistics) refers to the methodology used to draw conclusions (expressed in the language of probability) about population parameters on the basis of samples drawn from the population.

Examples
1 The government of a country wants to estimate the proportion of voters (p) in the country that approve of their economic policies.
2 A manufacturer of car batteries wishes to estimate the average lifetime (µ) of their batteries.
3 A paint company is interested in estimating the variability (as measured by the variance, σ²) in the drying time of their paints.

The quantities p, µ and σ² that are to be estimated are called population parameters.
A sample estimate of a population parameter is called a statistic. The table below gives examples of some commonly used parameters together with their statistics.

Parameter   Statistic
p           $\hat{p}$
µ           $\bar{x}$
σ²          S²

7.2 Point and interval estimation

A point estimate of a parameter is a single value (point) that estimates a parameter.

An interval estimate of a parameter is a range of values from L (lower value) to U (upper value) that estimates a parameter. Associated with this range of values is a probability or percentage chance that this range of values will contain the parameter that is being estimated.

Examples
Suppose the mean time it takes to serve customers at a supermarket checkout counter is to be estimated.
1 The mean service time of 100 customers of (say) $\bar{x}$ = 2.283 minutes is an example of a point estimate of the parameter µ.
2 If it is stated that the probability is 0.95 (95% chance) that the mean service time will be from 1.637 minutes to 4.009 minutes, the interval of values (1.637, 4.009) is an interval estimate of the parameter µ.

The estimation approaches discussed will focus mainly on the interval estimate approach.

7.3 Confidence intervals terminology

A confidence interval is a range of values from L (lower value) to U (upper value) that estimates a population parameter θ with 100(1 – α)% confidence.

θ is pronounced “theta”.
L is the lower confidence limit.
U is the upper confidence limit.
The interval (L, U) is called the confidence interval.
1 – α is called the confidence coefficient. It is the probability that the confidence interval will contain the parameter θ that is being estimated.
100(1 – α) is called the confidence percentage.

Example
Consider example 2 of the previous section.
θ, the parameter that is being estimated, is the population mean µ.
L = 1.637, U = 4.009
The confidence interval is the interval (1.637, 4.009).
α = 0.05
The confidence coefficient is 1 – α = 0.95.
The confidence percentage is 100(1 – α) = 95.
In the sections that follow the determination of L and U when estimating the parameters µ, p and σ² will be discussed.

7.4 Confidence interval for the population mean (population variance known)

The determination of the confidence limits is based on the central limit theorem (discussed in the previous chapter). This theorem states that for sufficiently large samples the sample mean $\bar{X}$ ~ N(µ, σ²/n) and hence that $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ ~ N(0, 1).

Formulae for the lower and upper confidence limits can be constructed in the following way.

[Figure: standard normal curve with an area of 0.95 between -1.96 and 1.96]

Since Z ~ N(0, 1), it follows from the above graph that P(-1.96 ≤ Z ≤ 1.96) = 0.95.

$P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$ (substitute $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ in the line above).

By a few steps of mathematical manipulation (not shown here), the above part in brackets can be changed to have only the parameter µ between the inequality signs. This will give

$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$.

Let $L = \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}$ and $U = \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$. Then the above formula can be written as P(L ≤ µ ≤ U) = 0.95.

This formula is interpreted in the following way. Since both L and U are determined by the sample values (which determine $\bar{X}$), they (and the confidence interval) will change for different samples. Since the parameter µ that is being estimated remains constant, these intervals will either include or exclude µ. The central limit theorem states that such intervals will include the parameter µ with probability 0.95 (95 out of 100 times).

In a practical situation the confidence interval will not be determined by many samples, but by only one sample. Therefore the confidence interval that is calculated in a practical situation will involve replacing the random variable $\bar{X}$ by the sample value $\bar{x}$. Then the above formula for a 95% confidence interval for the population mean µ becomes

$\left(\bar{x} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}\right)$ or $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$.

The percentage of confidence associated with the interval is determined by the value (called the z-multiplier) obtained from the standard normal distribution.
In the above formula a z-multiplier of 1.96 determines a 95% confidence interval. If a different percentage of confidence is required, the z-multiplier needs to be changed. The table below is a summary of z-multipliers needed for different percentages associated with confidence intervals.

confidence percentage   99      95     90
z-multiplier            2.576   1.96   1.645
α                       0.01    0.05   0.10

Calculation of confidence interval for µ (σ² known)

Step 1 : Calculate $\bar{x}$. Values of n, σ² and confidence percentage are given.
Step 2 : Look up the z-multiplier for the given confidence percentage.
Step 3 : Confidence interval is $\bar{x} \pm \text{z-multiplier} \times \frac{\sigma}{\sqrt{n}}$.

Example
The actual content of cool drink in a 500 milliliter bottle is known to vary. The standard deviation is known to be 5 milliliters. Thirty (30) of these 500 milliliter bottles were selected at random and their mean content found to be 498.5. Calculate 95% and 99% confidence intervals for the population mean content of these bottles.

Solution
95% confidence interval
Substituting $\bar{x}$ = 498.5, n = 30, σ = 5, z = 1.96 into the above formula gives $498.5 \pm 1.96 \times \frac{5}{\sqrt{30}}$ = (496.71, 500.29).

99% confidence interval
Substituting $\bar{x}$ = 498.5, n = 30, σ = 5, z = 2.576 into the above formula gives $498.5 \pm 2.576 \times \frac{5}{\sqrt{30}}$ = (496.15, 500.85).

7.5 Confidence interval for the population mean (population variance not known)

When the population variance (σ²) is not known, it is replaced by the sample variance (S²) in the formula for Z mentioned in the previous section. In such a case the quantity $t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$ follows a t-distribution with degrees of freedom ν = df = n – 1.

The confidence interval formula used in the previous section is modified by replacing the z-multiplier by the t-multiplier that is looked up from the t-distribution.

Calculation of confidence interval for µ (σ² not known)

Step 1 : Calculate $\bar{x}$ and S. Values of n and confidence percentage are given.
Step 2 : Look up the t-multiplier for the given confidence percentage and degrees of freedom ν = df.
Step 3 : The confidence interval is $\bar{x} \pm \text{t-multiplier} \times S/\sqrt{n}$.

Example
The time (in seconds) taken to complete a simple task was recorded for each of 15 randomly selected employees at a certain company. The values are given below.

38.2 43.9 38.4 26.2 41.3 42.3 37.5 37.2 41.2 42.3 31 50.1 37.3 36.7 31.8

Calculate 95% and 99% confidence intervals for the population mean time it takes to complete this task.

Solution
n = 15 (given), $\bar{x}$ = 38.36, S = 5.78 (calculated from the data)

95% confidence interval
Looking up the t-multiplier involves a row and a column entry in the t-table.
Row entry: df = 15-1 = 14.
Column entry: this is determined from the confidence percentage required. A confidence percentage of 95 gives α = 0.05, so the column used is the one for cumulative probability 1-α/2 = 0.975.
From the t-tables with df = 14 and cumulative probability 0.975, t-multiplier = 2.145.
Substituting $\bar{x}$ = 38.36, n = 15, S = 5.78, t = 2.145 into the above formula gives

$$38.36 \pm 2.145 \times \frac{5.78}{\sqrt{15}} = (35.16,\ 41.56).$$

99% confidence interval
Looking up the t-multiplier:
Row entry: df = 15-1 = 14.
Column entry: a confidence percentage of 99 gives α = 0.01, so the column used is the one for cumulative probability 1-α/2 = 0.995.
From the t-tables with df = 14 and cumulative probability 0.995, t-multiplier = 2.977.
Substituting $\bar{x}$ = 38.36, n = 15, S = 5.78, t = 2.977 into the above formula gives

$$38.36 \pm 2.977 \times \frac{5.78}{\sqrt{15}} = (33.92,\ 42.80).$$

7.6 Confidence interval for population variance

The formulae for the confidence interval of the population variance $\sigma^2$ are based on the fact that $(n-1)S^2/\sigma^2$ follows a chi-square distribution with (n-1) degrees of freedom. Let $\chi^2(1-\alpha/2)$ and $\chi^2(\alpha/2)$ denote the $100(1-\alpha/2)$ and $100(\alpha/2)$ percentile points of the chi-square distribution with (n-1) degrees of freedom. These points are shown in the graph below.

For this distribution, it follows from the graph above that

$$P\left[\chi^2(\alpha/2) \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2(1-\alpha/2)\right] = 1-\alpha.$$

By a few steps of mathematical manipulation (not shown here), the above part in brackets can be changed to have only the parameter $\sigma^2$ between the inequality signs.
This will give

$$P\left[\frac{(n-1)S^2}{\text{upper}} \le \sigma^2 \le \frac{(n-1)S^2}{\text{lower}}\right] = 1-\alpha,$$

where upper = $\chi^2(1-\alpha/2)$, the larger of the two percentile points, and lower = $\chi^2(\alpha/2)$, the smaller of the two percentile points. Dividing by the larger percentile point gives the lower confidence limit.

The values of α and α/2 are calculated from confidence percentage = 100(1-α), e.g. if the confidence percentage is 95, then α = 0.05 and α/2 = 0.025.

Calculation of confidence interval for $\sigma^2$
Step 1 : Calculate $S^2$. The values of n and the confidence percentage are given.
Step 2 : Look up the upper and lower chi-square values for the given confidence percentage and degrees of freedom = df.
Step 3 : The confidence interval is $\left[\dfrac{(n-1)S^2}{\text{upper}},\ \dfrac{(n-1)S^2}{\text{lower}}\right]$.

Example
Calculate 90% and 95% confidence intervals for the population variance of the time taken to complete the simple task (see previous example).

Solution
n = 15, $S^2$ = 33.3811 (calculated from the data)

90% confidence interval
Look up the upper and lower chi-square values by using df = 14 and α = 0.10.
upper = $\chi^2(1-\alpha/2) = \chi^2(0.95)$ = 23.68 for df = 14.
lower = $\chi^2(\alpha/2) = \chi^2(0.05)$ = 6.57 for df = 14.
$(n-1)S^2$ = 14 × 33.3811 = 467.34
The confidence interval is $\left(\dfrac{467.34}{23.68},\ \dfrac{467.34}{6.57}\right)$ = (19.74, 71.13).

95% confidence interval
Look up the upper and lower chi-square values by using df = 14 and α = 0.05.
upper = $\chi^2(1-\alpha/2) = \chi^2(0.975)$ = 26.12 for df = 14.
lower = $\chi^2(\alpha/2) = \chi^2(0.025)$ = 5.63 for df = 14.
$(n-1)S^2$ = 14 × 33.3811 = 467.34
The confidence interval is $\left(\dfrac{467.34}{26.12},\ \dfrac{467.34}{5.63}\right)$ = (17.89, 83.01).

7.7 Confidence interval for population proportion

In some experiments the interest is in whether or not items possess a certain characteristic of interest (e.g. whether a patient improves or not after treatment, whether a manufactured item is acceptable or not, whether an answer to a question is correct or incorrect). The population proportion of items labeled “success” in such an experiment (e.g. patient improves, item is acceptable, answer is correct) is estimated by calculating the sample proportion of “success” items.
The determination of the confidence limits for the population proportion of items labeled “success” is based on the central limit theorem for the sample proportion $\hat{P} = X/n$, where X is the number of items in the sample labeled “success”. This theorem states that for sufficiently large samples the sample proportion of “success” items $\hat{P} \sim N(p, pq/n)$ and hence that

$$Z = \frac{\hat{P} - \mu(\hat{P})}{\sigma(\hat{P})} = \frac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0, 1).$$

Formulae for the lower and upper confidence limits can be constructed in the following way. Since Z ~ N(0, 1), P(-1.96 ≤ Z ≤ 1.96) = 0.95, so that

$$P\left(-1.96 \le \frac{\hat{P} - p}{\sqrt{pq/n}} \le 1.96\right) = 0.95.$$

By a few steps of mathematical manipulation (not shown here), the above part in brackets can be changed to have the parameter p (in the numerator) between the inequality signs. This will give

$$P\left(\hat{P} - 1.96\sqrt{pq/n} \le p \le \hat{P} + 1.96\sqrt{pq/n}\right) = 0.95.$$

Since the confidence interval formula is based on a single sample, the random variable $\hat{P} = X/n$ is replaced by its sample estimate $\hat{p} = x/n$, and the parameters p and q = 1-p are replaced by their respective sample estimates $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$. This gives the following 95% confidence interval for p:

$$\left(\hat{p} - 1.96\sqrt{\hat{p}\hat{q}/n},\ \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/n}\right).$$

If the percentage of confidence is to be changed, the z-multiplier is changed according to the values given in the table below.

confidence percentage    99       95      90
z-multiplier             2.576    1.96    1.645
α                        0.01     0.05    0.10

Calculation of confidence interval for p
Step 1 : Calculate $\hat{p} = x/n$ and $\hat{q} = 1 - \hat{p}$. The values of x, n and the confidence percentage are given.
Step 2 : Look up the z-multiplier for the given confidence percentage.
Step 3 : The confidence interval is $\hat{p} \pm \text{z-multiplier} \times \sqrt{\hat{p}\hat{q}/n}$.

Example
During a marketing campaign for a new product, 176 out of the 200 potential users of this product that were contacted indicated that they would use it. Calculate a 90% confidence interval for the proportion of potential users who would use this product.

Solution
x = 176, n = 200, confidence percentage = 90 (given)
$\hat{p}$ = 176/200 = 0.88, $\hat{q} = 1 - \hat{p}$ = 0.12.
z-multiplier = 1.645 (from the above table)
The confidence interval is $0.88 \pm 1.645\sqrt{0.88 \times 0.12/200}$ = 0.88 ± 0.0378 = (0.842, 0.918).

7.8 Sample size when estimating the population mean

Consider the formula for the confidence interval of the mean (µ) when $\sigma^2$ is known:

$$\bar{x} \pm \text{z-multiplier} \times \frac{\sigma}{\sqrt{n}}.$$

The quantity z-multiplier × $\sigma/\sqrt{n}$ is known as the error (denoted by E). The smaller the error, the more accurately the parameter μ is estimated. Suppose the size of the error is specified in advance and the sample size n is to be determined to achieve this accuracy. This can be done by solving for n from the equation E = z-multiplier × $\sigma/\sqrt{n}$, which gives

$$n = \left(\frac{\text{z-multiplier} \times \sigma}{E}\right)^2.$$

The z-multiplier is determined by the percentage confidence required in the estimation.

Example
Consider the example on the interval estimation of the mean content of 500 milliliter cool drink bottles. The standard deviation σ is known to be 5. Suppose it is desired to estimate the mean with 95% confidence and an error that is not greater than 0.8. What sample size is needed to achieve this accuracy?

Solution
σ = 5, E = 0.8 (given), z-multiplier = 1.96 (from the 95% confidence requirement).

$$n = \left(\frac{1.96 \times 5}{0.8}\right)^2 = 150.0625,$$

which is rounded up to n = 151 (n is always rounded up).

7.9 Sample size for estimation of population proportion

The approach used in determining the sample size for the estimation of the population proportion is much the same as that used when estimating the population mean. The equation to be solved for n is

$$E = \text{z-multiplier} \times \sqrt{\frac{pq}{n}}.$$

When solving for n the formula becomes

$$n = pq\left(\frac{\text{z-multiplier}}{E}\right)^2.$$

A practical problem encountered when using this formula is that values for the parameters p and q = 1-p are needed. Since the purpose of this technique is to estimate p, these values of p and q are obviously not known. If no information on p is available, the value of p that gives the maximum value of p(1-p) = pq is used. It can be shown that p = ½ maximizes this expression. This gives max pq = ¼.
Substituting this maximum value in the above formula gives

$$\max n = \tfrac{1}{4}\left(\frac{\text{z-multiplier}}{E}\right)^2.$$

If more accurate information on the value of p is known (e.g. some range of values), it should be used in the above formula. As explained before, the z-multiplier is determined by the percentage confidence required in the estimation.

Example
Consider the problem (discussed earlier) of estimating the proportion of potential users who would use a new product. Suppose this proportion is to be estimated with 99% confidence and an error not exceeding 2% (a proportion of 0.02) is required. What sample size is needed to achieve this?

Solution
E = 0.02 (given), z-multiplier = 2.576 (99% confidence required)

$$n = \tfrac{1}{4}\left(\frac{2.576}{0.02}\right)^2 = 4147.36,$$

which is rounded up to n = 4148.

Suppose it is known that the value of p is between 0.8 and 0.9. In such a case

$$\max_{0.8 \le p \le 0.9} p(1-p) = 0.8 \times 0.2 = 0.16 \quad \text{(Why is p = 0.8 used?)}.$$

By using this information the value of n can be calculated as

$$n = 0.16\left(\frac{2.576}{0.02}\right)^2 = 2654.31,$$

which is rounded up to n = 2655.

The additional information on possible values for p reduces the sample size by 36%.

7.10 Computer output

1 Confidence interval for the mean ($\sigma^2$ known). For the data in the example in section 7.4, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

mean                   498.5
sigma                  5
n                      30
z multiplier           1.959964
Confidence interval
lower                  496.71
upper                  500.29

2 Confidence interval for the mean ($\sigma^2$ not known). For the data in the example in section 7.5, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

mean                   38.36
stand.dev              5.777642
n                      15
t multiplier           2.144787
Confidence interval
lower                  35.16
upper                  41.56

3 Confidence interval for the variance. For the data in the example in section 7.6, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

variance               33.38114
n                      15
degrees of freedom     14
lower chisq.           5.628726
upper chisq.           26.11895
Confidence interval
lower                  17.89
upper                  83.03

(The upper limit differs slightly from the 83.01 calculated in section 7.6 because the chi-square values are not rounded here.)

4 Confidence interval for the proportion of successes. For the data in the example in section 7.7, the information can be typed on an Excel sheet and the confidence interval calculated as follows.

n                      200
x                      176
z multiplier           1.644854
st.error               0.022978
Confidence interval
lower                  0.842
upper                  0.918

Chapter 8 – Statistical Inference: Testing of hypotheses for one sample

8.1 Formulation of hypotheses and related terminology

A statistical hypothesis is an assertion (claim) made about the value(s) of a population parameter. The purpose of testing of hypotheses is to determine whether a claim that is made could be true. The conclusion about the truth of such a claim is not stated with absolute certainty, but rather in terms of the language of probability.

Examples of claims to be tested
1 A supermarket receives complaints that the mean content of “1 kilogram” sugar bags that are sold by them is less than 1 kilogram.
2 The variability in the drying time of a certain paint (as measured by the variance, in minutes squared) has until recently been 65. It is suspected that the variability has now increased.
3 A construction company suspects that the proportion of jobs they complete behind schedule is 0.20 (20%). They want to test whether this is indeed the case.

Null and alternative hypotheses
The null hypothesis (H0) is a statement concerning the value of the parameter of interest (θ) in a claim that is made. This is formulated as
H0: θ = θ0 (the statement that the parameter θ is equal to the hypothetical value θ0).
The alternative hypothesis (H1) is a statement about the possible values of the parameter θ that are believed to be true if H0 is not true. One of the alternative hypotheses shown below will apply.
H1a: θ < θ0 or H1b: θ > θ0 or H1c: θ ≠ θ0.
Examples
1 In the first example (above) the parameter of interest is the population mean µ and the hypotheses to be tested are
H0: µ = 1 (Population mean is 1 kilogram) versus
H1a: µ < 1 (Population mean is less than 1 kilogram).
In terms of the general notation stated above, θ = µ and θ0 = 1.

2 In the second example (above) the parameter of interest is the population variance σ² and the hypotheses to be tested are
H0: σ² = 65 (Population variance is 65) versus
H1b: σ² > 65 (Population variance is greater than 65).
In terms of the general notation stated above, θ = σ² and θ0 = 65.

3 In the third example (above) the parameter of interest is the population proportion, p, of job completions behind schedule and the hypotheses to be tested are
H0: p = 0.20 (Population proportion is 0.20) versus
H1c: p ≠ 0.20 (Population proportion is not equal to 0.20).
In terms of the general notation stated above, θ = p and θ0 = 0.20.

One and two-sided alternatives
A one-sided alternative hypothesis is one that specifies the alternative values (to the null hypothesis) in a direction that is either below or above that specified by the null hypothesis.

Example
The alternative hypothesis H1a (see example 1 above) is the alternative that the value of the parameter is less than that stated under the null hypothesis, and the alternative H1b (see example 2 above) is the alternative that the value of the parameter is greater than that stated under the null hypothesis.

A two-sided alternative hypothesis is one that specifies the alternative values (to the null hypothesis) in directions that can be either below or above that specified by the null hypothesis.

Example
The alternative hypothesis H1c (see example 3 above) is the alternative that the value of the parameter is either greater than or less than that stated under the null hypothesis.
8.2 Testing of hypotheses for one sample: Terminology and summary of procedure

The testing procedure and terminology will be explained for the test for the population mean μ with population variance σ² known.

The hypotheses to be tested are
H0 : µ = µ0 versus H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ ≠ µ0.

The data set that is needed to perform the test is x1, x2, . . . , xn, a random sample of size n drawn from the population for which the mean is tested. The test is performed to see whether or not the sample data are consistent with what is stated by the null hypothesis. The instrument that is used to perform the test is called a test statistic. A test statistic is a quantity calculated from the sample data. When testing for the population mean, the test statistic used is

$$z_0 = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$$

If the difference between $\bar{x}$ and µ0 (and therefore the value of z0) is reasonably small, H0 will not be rejected. In this case the sample mean is consistent with the value of the population mean that is being tested. If this difference (and therefore the value of z0) is sufficiently large, H0 will be rejected. In this case the sample mean is not consistent with the value of the population mean that is being tested. In order to decide how large this difference between $\bar{x}$ and μ0 (and therefore the value of z0) should be before H0 is rejected, the following should be considered.

Type I error
A type I error is committed when the null hypothesis is rejected when, in fact, it is true, i.e. H0 is wrongly rejected. In this test, a type I error is committed when it is decided that the statement H0: µ = μ0 should be rejected when, in fact, it is true.

Type II error
A type II error is committed when the null hypothesis is not rejected when, in fact, it is false, i.e. a decision not to reject H0 is wrong. In this test, a type II error is committed when it is decided that the statement H0: µ = μ0 should not be rejected when, in fact, it is false.
The following table gives a summary of possible conclusions and their correctness when performing a test of hypotheses.

Actually true / Conclusion    Reject H0             Do not reject H0
H0 is true                    Type I error          Correct conclusion
H0 is false                   Correct conclusion    Type II error

A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error. The hypothesis testing procedure is therefore designed so that there is a guaranteed small probability of rejecting the null hypothesis wrongly. This probability is never 0 (why?). Mathematically the probability of a type I error can be stated as
P(type I error) = P(Reject H0 | H0 is true) = α.
When testing for the population mean,
P(type I error) = P(reject μ = μ0 | μ = μ0 is true) = α and
P(type II error) = P(do not reject µ = µ0 | µ = µ0 is false) = β.

Probabilities of type I and type II errors work in opposite directions. The more reluctant you are to reject H0, the higher the risk of accepting it when, in fact, it is false. The easier you make it to reject H0, the lower the risk of accepting it when, in fact, it is false.

Critical value(s) and critical region
The critical (cut-off) value(s) for a test of hypotheses is the value(s) with which the test statistic is compared in order to determine whether or not the null hypothesis should be rejected. The critical value is determined according to the specified value of α, the probability of a type I error.

For the test of the population mean the critical value is determined in the following way. Assuming that H0 is true, the test statistic

$$Z_0 = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$

(i) When testing H0 versus the alternative hypothesis H1a (µ < µ0), the critical value is the value Zα which is such that the area under the standard normal curve to the left of Zα is α, i.e. P(Z0 < Zα) = α. The graph below illustrates the case α = 0.05, i.e. P(Z0 < -1.645) = 0.05.
(ii) When testing H0 versus the alternative hypothesis H1b (µ > µ0), the critical value is the value Z1-α which is such that the area under the standard normal curve to the right of Z1-α is α, i.e. P(Z0 > Z1-α) = α. The graph below illustrates the case α = 0.05, i.e. P(Z0 > 1.645) = 0.05.

(iii) When testing H0 versus the alternative hypothesis H1c (µ ≠ µ0), the critical values are the values Z1-α/2 and Zα/2 which are such that the area under the standard normal curve to the right of Z1-α/2 is α/2 and the area under the standard normal curve to the left of Zα/2 is α/2, i.e. P(Z0 > Z1-α/2) = α/2 and P(Z0 < Zα/2) = α/2. The area under the normal curve between these two critical values is 1-α. The graph below illustrates the case α = 0.05, i.e. P(Z0 < -1.96 or Z0 > 1.96) = 0.05.

The critical region CR, or rejection region R, is the set of values of the test statistic for which the null hypothesis is rejected.
(i) When testing H0 versus the alternative hypothesis H1a, the rejection region is { z0 | z0 < Zα }.
(ii) When testing H0 versus the alternative hypothesis H1b, the rejection region is { z0 | z0 > Z1-α }.
(iii) When testing H0 versus the alternative hypothesis H1c, the rejection region is { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.

H0 is rejected when there is a sufficiently large difference between the sample mean $\bar{x}$ and the mean (μ0) under H0. Such a large difference is called a significant difference (the result of the test is significant). The value of α is called the level of significance. It specifies the level beyond which this difference (between $\bar{x}$ and μ0) is sufficiently large for H0 to be rejected. The value of α is specified prior to performing the test and is usually taken as either 0.05 (5% level of significance) or 0.01 (1% level of significance).

When H0 is rejected, it does not necessarily mean that it is not true. It means that according to the sample evidence available it appears not to be true.
Similarly, when H0 is not rejected, it does not necessarily mean that it is true. It means that there is not sufficient sample evidence to disprove H0.

Critical values for tests based on the standard normal distribution can be found from the selected percentiles listed at the bottom of the pages of the standard normal table.

8.3 Test for the population mean (population variance known)

A summary of the steps to be followed in the testing procedure is shown below.

Test for µ when σ² is known
1 State the null and alternative hypotheses.
H0: µ = µ0 versus H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ ≠ µ0.
2 Calculate the test statistic $z_0 = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
3 State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.
5 State the conclusion in terms of the original problem.

Examples
1 A supermarket receives complaints that the mean content of “1 kilogram” sugar bags that are sold by them is less than 1 kilogram. A random sample of 40 sugar bags is selected from the shelves and the mean found to be 0.987 kilograms. From past experience the standard deviation of the contents of these bags is known to be 0.025 kilograms. Test, at the 5% level of significance, whether this complaint is justified.

Solution
H0 : μ = 1 (The complaint is not justified)
H1 : μ < 1 (The complaint is justified)
n = 40, $\bar{x}$ = 0.987, σ = 0.025, μ0 = 1 (given)
Test statistic: $z_0 = \dfrac{0.987 - 1}{0.025/\sqrt{40}}$ = -3.289.
α = 0.05. Critical region R = { z0 < Z0.05 = -1.645 }.
Since z0 = -3.289 < -1.645, H0 is rejected.
Conclusion: The complaint is justified.

2 A supermarket manager suspects that the machine filling “500 gram” containers of coffee is overfilling them, i.e. the actual contents of these containers is more than 500 grams.
A random sample of 30 of these containers is selected from the shelves and the mean found to be 501.8 grams. From past experience the variance of the contents of these containers is known to be 60 (grams squared). Test at the 5% level of significance whether the manager’s suspicion is justified.

Solution
H0 : μ = 500 (Suspicion is not justified)
H1 : μ > 500 (Suspicion is justified)
n = 30, $\bar{x}$ = 501.8, σ² = 60, μ0 = 500 (given)
Test statistic: $z_0 = \dfrac{501.8 - 500}{\sqrt{60/30}}$ = 1.273.
α = 0.05. Critical region R = { z0 > Z0.95 = 1.645 }.
Since z0 = 1.273 < 1.645, H0 is not rejected.
Conclusion: The suspicion is not justified.

3 During a quality control exercise the manager of a factory that fills cans of frozen shrimp wants to check whether the mean weights of the cans conform to specifications, i.e. the mean of these cans should be 600 grams as stated on the label of the can. He/she wants to guard against either over or under filling the cans. A random sample of 50 of these cans is selected and the mean found to be 595 grams. From past experience the standard deviation of the contents of these cans is known to be 20 grams. Test, at the 5% level of significance, whether the weights conform to specifications. Repeat the test at the 10% level of significance.

Solution
H0 : μ = 600 (Weights conform to specifications)
H1 : μ ≠ 600 (Weights do not conform to specifications)
n = 50, $\bar{x}$ = 595, σ = 20, μ0 = 600 (given)
Test statistic: $z_0 = \dfrac{595 - 600}{20/\sqrt{50}}$ = -1.768.
α = 0.05. Critical region R = { z0 < Z0.025 = -1.96 or z0 > Z0.975 = 1.96 }.
Since -1.96 < z0 = -1.768 < 1.96, H0 is not rejected.
Conclusion: The weights appear to conform to specifications.

Suppose the test is performed at the 10% level of significance. In such a case α = 0.10. Critical region R = { z0 < Z0.05 = -1.645 or z0 > Z0.95 = 1.645 }.
Since z0 = -1.768 < -1.645, H0 is rejected.
Conclusion: The weights appear not to conform to specifications.
Thus, being less strict about controlling a type I error (changing α from 0.05 to 0.10) results in a different conclusion about H0 (reject instead of do not reject).

Note
1 In example 1 the alternative hypothesis H1a was used, in example 2 the alternative H1b and in example 3 the alternative H1c.
2 Alternatives H1a and H1b [one-sided (tailed) alternatives] are used when there is a particular direction attached to the range of mean values that could be true if H0 is not true.
3 Alternative H1c [two-sided (tailed) alternative] is used when there is no particular direction attached to the range of mean values that could be true if H0 is not true.
4 If, in the above examples, the level of significance had been changed to 1%, the critical values used would have been Z0.01 = -2.326 (in example 1), Z0.99 = 2.326 (in example 2) and Z0.005 = -2.576, Z0.995 = 2.576 (in example 3).

8.4 Test for the population mean (population variance not known): t-test

When performing the test for the population mean for the case where the population variance is not known, the following modifications are made to the procedure.
1 In the test statistic formula the population standard deviation σ is replaced by the sample standard deviation S.
2 Since the test statistic $t_0 = \dfrac{\bar{x} - \mu_0}{S/\sqrt{n}}$ that is used to perform the test follows a t-distribution with n-1 degrees of freedom, critical values are looked up in the t-tables.

Test for µ when σ² is not known (t-test)
1 State the null and alternative hypotheses.
H0: µ = µ0 versus H1a: µ < µ0 or H1b: µ > µ0 or H1c: µ ≠ µ0.
2 Calculate the test statistic $t_0 = \dfrac{\bar{x} - \mu_0}{S/\sqrt{n}}$.
3 State the level of significance α and determine the critical value(s) and critical region. Degrees of freedom = ν = n-1.
(i) For alternative H1a the critical region is R = { t0 | t0 < tα }.
(ii) For alternative H1b the critical region is R = { t0 | t0 > t1-α }.
(iii) For alternative H1c the critical region is R = { t0 | t0 > t1-α/2 or t0 < tα/2 }.
4 If t0 lies in the critical region, reject H0, otherwise do not reject H0.
5 State the conclusion in terms of the original problem.

Example
A paint manufacturer claims that the average drying time for a new paint is 2 hours (120 minutes). The drying times (in minutes) for 20 randomly selected cans of paint were obtained. The results are shown below.

123 127 131 122 109 106 128 133 115 120
139 119 121 116 130 135 130 136 133 109

Assuming that the sample was drawn from a normal distribution,
(a) test whether the population mean drying time is greater than 2 hours (120 minutes)
(i) at the 5% level of significance.
(ii) at the 1% level of significance.
(b) test, at the 5% level of significance, whether the population mean drying time could be 2 hours (120 minutes).

Solution
(a) H0 : μ = 120 (mean is 2 hours)
H1 : μ > 120 (mean is greater than 2 hours)
n = 20, μ0 = 120 (given), $\bar{x}$ = 124.1, S = 9.65674 (calculated from the data).
Test statistic: $t_0 = \dfrac{124.1 - 120}{9.65674/\sqrt{20}}$ = 1.899.

(i) If α = 0.05, 1-α = 0.95. From the t-distribution table with degrees of freedom = ν = n-1 = 19, t0.95 = 1.729.
Critical region R = { t0 > t0.95 = 1.729 }.
Since 1.899 > 1.729, H0 is rejected.
Conclusion: The mean drying time appears to be greater than 2 hours.

(ii) If α = 0.01, 1-α = 0.99. From the t-distribution table with degrees of freedom = ν = n-1 = 19, t0.99 = 2.539.
Critical region R = { t0 > t0.99 = 2.539 }.
Since 1.899 < 2.539, H0 is not rejected.
Conclusion: There is not sufficient evidence that the mean drying time is greater than 2 hours.

Thus, being more strict about controlling a type I error (changing α from 0.05 to 0.01) results in a different conclusion about H0 (do not reject instead of reject).

(b) H0 : μ = 120 (mean is 2 hours)
H1 : μ ≠ 120 (mean is not equal to 2 hours)
n = 20, μ0 = 120 (given), $\bar{x}$ = 124.1, S = 9.65674 (calculated from the data).
Test statistic: $t_0 = \dfrac{124.1 - 120}{9.65674/\sqrt{20}}$ = 1.899 (as calculated in part (a)).
If α = 0.05, α/2 = 0.025, 1-α/2 = 0.975.
From the t-distribution table with degrees of freedom = ν = n-1 = 19, t0.025 = -2.093, t0.975 = 2.093.
Critical region R = { t0 < t0.025 = -2.093 or t0 > t0.975 = 2.093 }.
Since -2.093 < 1.899 < 2.093, H0 is not rejected.
Conclusion: The mean drying time could be 2 hours.

Note: Despite the fact that the same data were used in the above tests, the conclusions were different. In the first test H0 was rejected, but in the next two tests H0 was not rejected.
1 In the first test the probability of a type I error was set at 5%, while in the second test this was changed to 1%. To achieve this, the critical value was moved from 1.729 to 2.539, resulting in the test statistic value (1.899) being less than (instead of greater than) the critical value.
2 In the third test (which has a two-sided alternative hypothesis), the upper critical value was increased to 2.093 (to have an area of 0.025 under the t-curve to its right). Again this resulted in the test statistic value (1.899) being less than (instead of greater than) the critical value.

8.5 Test for population variance

The test for the population variance is based on $\chi^2 = \dfrac{(n-1)S^2}{\sigma^2}$ following a chi-square distribution with n-1 degrees of freedom. The critical values are therefore obtained from the chi-square tables.

Test for the population variance σ²
1. State the null and alternative hypotheses.
H0: $\sigma^2 = \sigma_0^2$ versus H1a: $\sigma^2 < \sigma_0^2$ or H1b: $\sigma^2 > \sigma_0^2$ or H1c: $\sigma^2 \ne \sigma_0^2$.
2. Calculate the test statistic $\chi_0^2 = \dfrac{(n-1)S^2}{\sigma_0^2}$.
3. State the level of significance α and determine the critical value(s) and critical region. Degrees of freedom = ν = n-1.
(i) For alternative H1a the critical region is R = { $\chi_0^2$ | $\chi_0^2 < \chi^2_{\alpha}$ }.
(ii) For alternative H1b the critical region is R = { $\chi_0^2$ | $\chi_0^2 > \chi^2_{1-\alpha}$ }.
(iii) For alternative H1c the critical region is R = { $\chi_0^2$ | $\chi_0^2 > \chi^2_{1-\alpha/2}$ or $\chi_0^2 < \chi^2_{\alpha/2}$ }.
4. If $\chi_0^2$ lies in the critical region, reject H0, otherwise do not reject H0.
5. State the conclusion in terms of the original problem.
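Steps 2 to 4 of the procedure above can be sketched in a few lines of code. This is a minimal illustration, assuming Python; the function names `chi_square_statistic` and `reject_h0` are hypothetical, and the critical values are assumed to be looked up in the chi-square tables and passed in by the user. The numbers in the usage lines are those of the paint drying data (n = 20, S² = 93.2526, hypothesized variance 65, table value 30.14 for 19 degrees of freedom).

```python
def chi_square_statistic(n, s_squared, sigma0_squared):
    """Step 2: test statistic (n - 1) * S^2 / sigma_0^2."""
    return (n - 1) * s_squared / sigma0_squared


def reject_h0(chi_sq_0, alternative, lower_crit=None, upper_crit=None):
    """Steps 3 and 4: compare the statistic with table critical values.

    alternative is "less" (H1a), "greater" (H1b) or "two-sided" (H1c).
    lower_crit and upper_crit come from the chi-square tables.
    """
    if alternative == "less":
        return chi_sq_0 < lower_crit
    if alternative == "greater":
        return chi_sq_0 > upper_crit
    if alternative == "two-sided":
        return chi_sq_0 < lower_crit or chi_sq_0 > upper_crit
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")


# One-sided test H1b at the 5% level with 19 degrees of freedom.
chi_sq_0 = chi_square_statistic(n=20, s_squared=93.2526, sigma0_squared=65)
decision = reject_h0(chi_sq_0, "greater", upper_crit=30.14)
print(round(chi_sq_0, 3), "reject H0" if decision else "do not reject H0")
```

Passing the critical values in as arguments keeps the sketch independent of any statistical library, at the cost of a manual table lookup.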
For a one-sided test with alternative hypothesis H1b the rejection region (highlighted area) is shown in the graph below.

For a two-sided test with alternative hypothesis H1c the rejection region (highlighted area) is shown in the graph below.

Examples
1 Consider the example on the drying time of the paint discussed in the previous section. Until recently it was believed that the variance in the drying time is 65 (minutes squared). Suppose it is suspected that this variance has increased. Test this assertion at the 5% level of significance.

Solution
H0 : σ² = 65 (Variance has not increased)
H1 : σ² > 65 (Variance has increased)
n = 20, $\sigma_0^2$ = 65 (given), S = 9.65674 (calculated from the data).
Test statistic: $\chi_0^2 = \dfrac{19 \times 9.65674^2}{65}$ = 27.258.
α = 0.05, 1-α = 0.95. From the chi-square distribution table with degrees of freedom = ν = n-1 = 19, $\chi^2_{0.95}$ = 30.14.
Critical region R = { $\chi_0^2 > \chi^2_{0.95}$ = 30.14 }.
Since 27.258 < 30.14, H0 is not rejected.
Conclusion: There is not sufficient evidence that the variance has increased.

2 A manufacturer of car batteries guarantees that their batteries will last, on average, 3 years with a standard deviation of 1 year. Ten of the batteries have lifetimes of 1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7 and 4.3 years. Test at the 5% level of significance whether the variability guarantee is still valid.

Solution
H0 : σ² = 1 (Guarantee is valid)
H1 : σ² ≠ 1 (Guarantee is not valid)
n = 10, $\sigma_0^2$ = 1 (given), S = 1.26209702, S² = 1.592889 (calculated from the data).
Test statistic: $\chi_0^2 = \dfrac{9 \times 1.592889}{1}$ = 14.336.
α = 0.05, α/2 = 0.025, 1-α/2 = 0.975. From the chi-square distribution table with degrees of freedom = ν = n-1 = 9, $\chi^2_{0.025}$ = 2.70, $\chi^2_{0.975}$ = 19.02.
Critical region R = { $\chi_0^2 < \chi^2_{0.025}$ = 2.70 or $\chi_0^2 > \chi^2_{0.975}$ = 19.02 }.
Since 2.70 < 14.336 < 19.02, H0 is not rejected.
Conclusion: Variability guarantee appears to still be valid.
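The arithmetic in example 2 can be checked directly from the raw lifetimes. The sketch below assumes Python's standard `statistics` module for the sample variance; the critical values 2.70 and 19.02 are the table values quoted in the solution, and the variable names are chosen for this illustration only.

```python
from statistics import variance

# Battery lifetimes (years) from example 2.
lifetimes = [1.2, 2.5, 3, 3.5, 2.8, 4, 4.3, 1.9, 0.7, 4.3]

n = len(lifetimes)
s_squared = variance(lifetimes)       # sample variance, divisor n - 1
sigma0_squared = 1.0                  # guaranteed variance under H0

chi_sq_0 = (n - 1) * s_squared / sigma0_squared

# Two-sided critical values for 9 degrees of freedom at the 5% level,
# taken from the chi-square table as quoted in the solution.
lower_crit, upper_crit = 2.70, 19.02
reject = chi_sq_0 < lower_crit or chi_sq_0 > upper_crit

print(f"S^2 = {s_squared:.6f}, chi-square = {chi_sq_0:.3f}")
print("reject H0" if reject else "do not reject H0")
```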
8.6 Test for population proportion

The test for the population proportion (p) is based on the fact that the sample proportion $\hat{P} = X/n \sim N(p, pq/n)$, where n is the sample size and X the number of items labeled “success” in the sample. From this result it follows that

$$Z = \frac{\hat{P} - p}{\sqrt{pq/n}} \sim N(0, 1).$$

For this reason the critical value(s) and critical region are the same as those for the test for the population mean (both are based on the standard normal distribution).

Test for the population proportion p
1 State the null and alternative hypotheses.
H0: p = p0 versus H1a: p < p0 or H1b: p > p0 or H1c: p ≠ p0.
2 Calculate the test statistic $z_0 = \dfrac{\hat{p} - p_0}{\sqrt{p_0 q_0/n}}$, where q0 = 1-p0.
3 State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is R = { z0 | z0 < Zα }.
(ii) For alternative H1b the critical region is R = { z0 | z0 > Z1-α }.
(iii) For alternative H1c the critical region is R = { z0 | z0 > Z1-α/2 or z0 < Zα/2 }.
4 If z0 lies in the critical region, reject H0, otherwise do not reject H0.
5 State the conclusion in terms of the original problem.

Examples
1 A construction company suspects that the proportion of jobs they complete behind schedule is 0.20 (20%). Of their 80 most recent jobs 22 were completed behind schedule. Test at the 5% level of significance whether this information confirms their suspicion.

Solution
H0 : p = 0.20 (Suspicion is confirmed)
H1 : p ≠ 0.20 (Suspicion is not confirmed)
n = 80, x = 22 (given), $\hat{p}$ = 22/80 = 0.275, p0 = 0.20.
Test statistic: $z_0 = \dfrac{0.275 - 0.20}{\sqrt{0.20 \times 0.80/80}}$ = 1.677.
α = 0.05. Critical region R = { z0 < Z0.025 = -1.96 or z0 > Z0.975 = 1.96 }.
Since -1.96 < z0 = 1.677 < 1.96, H0 is not rejected.
Conclusion: The sample evidence is consistent with the suspicion.

2 During a marketing campaign for a new product, 176 out of the 200 potential users of this product that were contacted indicated that they would use it. Is this evidence that more than 85% of all the potential users will actually use the product?
Use α = 0.01.

Solution
H0: $p = 0.85$ (85% of all potential users will use the product)
H1: $p > 0.85$ (More than 85% of all potential users will use the product)
$n = 200$, $x = 176$, $p_0 = 0.85$ (given), $\hat{p} = \frac{176}{200} = 0.88$.
Test statistic $z_0 = \frac{0.88 - 0.85}{\sqrt{0.85 \times 0.15 / 200}} = 1.188$.
$\alpha = 0.01$. Critical region $R = \{ z_0 > Z_{0.99} = 2.326 \}$.
Since $z_0 = 1.188 < 2.326$, H0 is not rejected.
Conclusion: There is not sufficient evidence that more than 85% of all potential users will use the product.

8.7 Computer output

1 The output shown below is when the test for the population mean, for the data in example 1 in section 8.4, is performed by using excel.

t-Test: Mean
Mean                    129.1
Variance                93.25263158
Observations            20
Hypothesized Mean       120
df                      19
t Stat                  1.898752271
P(T<=t) one-tail        0.036445557
t Critical one-tail     1.729132792

The value of the test statistic is $t_0 = 1.90$ (2 decimal places). From the table the one-tail probability is $P(T \ge 1.90) = P(T \le -1.90) = 0.036$. This probability is known as the p-value (the probability of getting a t value more remote than the test statistic). When testing at the 5% level of significance, a p-value below 0.05 will cause the null hypothesis to be rejected.

2 The output shown below is when the test for the population variance in example 1 in section 8.5 (the data in example 1 in section 8.4) is performed by using excel.

Chi-square test: Variance
Variance                             93.25263
Observations                         20
Hypothesized variance                65
df                                   19
Chi-square stat                      27.25846
P(Chi-square<=27.25846) one-tail     0.098775
Chi-square critical one-tail         30.14353

The values of the test statistic and critical value are the same as in the example in section 8.5. The p-value is 0.098775 (2nd to last entry in the 2nd column in the table above). Since 0.098775 > 0.05 the null hypothesis cannot be rejected at the 5% level of significance.
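The p-value approach used in the computer output applies equally to the proportion test of section 8.6. The sketch below recomputes both proportion examples and reports a normal-approximation p-value for each. The function name `proportion_z_test` and its `alternative` argument are illustrative choices, not part of the text; `NormalDist` comes from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(x, n, p0, alternative):
    """One-sample z-test for a proportion.

    alternative is 'less' (H1a), 'greater' (H1b) or 'two-sided' (H1c).
    Returns the test statistic z0 and its p-value.
    """
    p_hat = x / n
    z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    std_normal = NormalDist()  # N(0, 1)
    if alternative == "less":
        p_value = std_normal.cdf(z0)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z0)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z0)))
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z0, p_value


# Example 1: 22 of 80 jobs behind schedule, H1: p != 0.20, alpha = 0.05
z1, p1 = proportion_z_test(22, 80, 0.20, "two-sided")

# Example 2: 176 of 200 users, H1: p > 0.85, alpha = 0.01
z2, p2 = proportion_z_test(176, 200, 0.85, "greater")

# One-sided critical value Z_0.99 for alpha = 0.01
z_crit_99 = NormalDist().inv_cdf(0.99)

print(f"example 1: z0 = {z1:.3f}, p-value = {p1:.4f}")
print(f"example 2: z0 = {z2:.3f}, p-value = {p2:.4f}, Z_0.99 = {z_crit_99:.3f}")
```

In both examples the p-value exceeds the stated level of significance, which agrees with the critical-region decisions reached above.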
Chapter 9 – Statistical Inference: Testing of hypotheses for two samples

9.1 Formulation of hypotheses, notation and additional results

The tests discussed in the previous chapter involve hypotheses concerning parameters of a single population and were based on a random sample drawn from a single population of interest. Often the interest is in tests concerning parameters of two different populations (labeled populations 1 and 2) where two random samples (one from each population) are drawn.

Examples
1 Are the mean salaries the same for males and females with the same educational qualifications and work experience?
2 Do smokers and non-smokers have the same mortality rate?
3 Are the variances in drying times for two different types of paints different?
4 Is a particular diet successful in reducing people’s weights?

Null and alternative hypotheses

The following hypotheses involving two samples will be tested.
1 The test for equality of two variances. As an example see example 3 above.
2 The test for equality of two means (independent samples). As an example see example 1 above.
3 The test for equality of two means (paired samples). As an example see example 4 above.
4 The test for equality of two proportions. As an example see example 2 above.

The parameters to be used, when testing the hypotheses, are summarized in the table below.

Parameter     population 1     population 2
mean          $\mu_1$          $\mu_2$
variance      $\sigma_1^2$     $\sigma_2^2$
proportion    $p_1$            $p_2$

The following null and alternative hypotheses (as defined in section 8.1) also apply in the two sample case.
H0: $\theta = \theta_0$ (The statement that the parameter $\theta$ is equal to the hypothetical value $\theta_0$).
H1a: $\theta < \theta_0$ or H1b: $\theta > \theta_0$ or H1c: $\theta \ne \theta_0$.

Examples
1 When testing for equality of variances from 2 different populations labeled 1 and 2 the hypotheses are
H0: $\sigma_1^2 = \sigma_2^2$
H1a: $\sigma_1^2 < \sigma_2^2$ or H1b: $\sigma_1^2 > \sigma_2^2$ or H1c: $\sigma_1^2 \ne \sigma_2^2$.
These hypotheses can also be written as
H0: $\sigma_1^2/\sigma_2^2 = 1$
H1a: $\sigma_1^2/\sigma_2^2 < 1$ or H1b: $\sigma_1^2/\sigma_2^2 > 1$ or H1c: $\sigma_1^2/\sigma_2^2 \ne 1$.
In terms of the general notation stated above $\theta = \sigma_1^2/\sigma_2^2$ and $\theta_0 = 1$.
2 When testing for equality of means from 2 different populations labeled 1 and 2 the hypotheses are
H0: $\mu_1 = \mu_2$
H1a: $\mu_1 < \mu_2$ or H1b: $\mu_1 > \mu_2$ or H1c: $\mu_1 \ne \mu_2$.
These hypotheses can also be written as
H0: $\mu_1 - \mu_2 = 0$
H1a: $\mu_1 - \mu_2 < 0$ or H1b: $\mu_1 - \mu_2 > 0$ or H1c: $\mu_1 - \mu_2 \ne 0$.
In terms of the general notation stated above $\theta = \mu_1 - \mu_2$ and $\theta_0 = 0$.

3 When testing for equality of proportions from 2 different populations labeled 1 and 2 the hypotheses are
H0: $p_1 = p_2$
H1a: $p_1 < p_2$ or H1b: $p_1 > p_2$ or H1c: $p_1 \ne p_2$.
These hypotheses can also be written as
H0: $p_1 - p_2 = 0$
H1a: $p_1 - p_2 < 0$ or H1b: $p_1 - p_2 > 0$ or H1c: $p_1 - p_2 \ne 0$.
In terms of the general notation stated above $\theta = p_1 - p_2$ and $\theta_0 = 0$.

Notation

The following notation will be used in the description of the two sample tests.

Measure                                  notation (population 1)        notation (population 2)
sample size                              $n$                            $m$
sample                                   $x_1, x_2, \ldots, x_n$        $x_1, x_2, \ldots, x_m$
sample mean                              $\bar{x}_1$                    $\bar{x}_2$
sample variance (standard deviation)     $S_1^2$ ($S_1$)                $S_2^2$ ($S_2$)
sample proportion                        $\hat{p}_1 = x_n^*/n$          $\hat{p}_2 = x_m^*/m$

$x_n^*$ and $x_m^*$ are the numbers of “success” items in the samples from populations 1 and 2 respectively.

Standard error and variance formulae

When testing hypotheses for the difference between two means ($\mu_1 - \mu_2$) or the difference between two proportions ($p_1 - p_2$), formulae for the standard errors of the corresponding sample differences ($\bar{X}_1 - \bar{X}_2$ when testing for the mean, $\hat{P}_1 - \hat{P}_2$ when testing for proportions) will be needed. These formulae are summarized in the table that follows.

Sample difference ($\hat{\theta}$)    condition                                                                          standard error [$SE(\hat{\theta})$]
$\bar{X}_1 - \bar{X}_2$               population variances not equal                                                     $(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2}$
$\bar{X}_1 - \bar{X}_2$               population variances equal i.e. $\sigma_1^2 = \sigma_2^2 = \sigma^2$              $\sigma(\frac{1}{n} + \frac{1}{m})^{1/2}$
$\hat{P}_1 - \hat{P}_2$               population proportions not equal                                                   $[\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{m}]^{1/2}$
$\hat{P}_1 - \hat{P}_2$               population proportions equal i.e. $p_1 = p_2 = p$                                  $[p(1-p)(\frac{1}{n} + \frac{1}{m})]^{1/2}$

In each of the formulae the variance is the square of the standard error.

The general form of the test statistic used in most of these tests is $Z = \frac{\hat{\theta} - \theta_0}{SE(\hat{\theta})}$, where $Z$ follows a N(0,1) distribution.
In some small sample cases the test statistic has a general form $t = \frac{\hat{\theta} - \theta_0}{\widehat{SE}(\hat{\theta})}$, where $t$ follows a t-distribution.

Two sample sampling distribution results

1 For sufficiently large random samples (both $n, m \ge 30$) drawn from populations (with known variances) that are not too different from a normal population, the statistic
$Z = \frac{\bar{X}_1 - \bar{X}_2 - (\mu_1 - \mu_2)}{(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2}}$
follows a N(0,1) distribution.

2 When $\sigma_1^2 = \sigma_2^2 = \sigma^2$ the above mentioned result still holds, but with $\sigma(\frac{1}{n} + \frac{1}{m})^{1/2}$ in the denominator.

3 When the population variances $\sigma_1^2$, $\sigma_2^2$ and $\sigma^2$, referred to in the two above mentioned results, are not known they may be replaced by their sample estimates $S_1^2$, $S_2^2$ and $S^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n+m-2}$ respectively. In such a case the resulting statistic follows
(i) a N(0,1) distribution when the sample sizes are large (both $n$ and $m \ge 30$).
(ii) a t-distribution when the sample sizes are small (at least one of $n$ or $m < 30$). The degrees of freedom will depend on whether $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is true or not. If $\sigma_1^2 = \sigma_2^2 = \sigma^2$ is true, the degrees of freedom is $n + m - 2$.

4 For sufficiently large random samples the statistic
$Z = \frac{\hat{P}_1 - \hat{P}_2 - (p_1 - p_2)}{[\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{m}]^{1/2}}$
follows a N(0,1) distribution.

5 When $p_1 = p_2 = p$ the above mentioned result still holds but with $[p(1-p)(\frac{1}{n} + \frac{1}{m})]^{1/2}$ in the denominator.

6 Provided the sample sizes are sufficiently large, the two above mentioned results will still be valid with $p_1$, $p_2$ and $p$ in the denominator replaced by $\hat{p}_1 = \frac{x_n^*}{n}$, $\hat{p}_2 = \frac{x_m^*}{m}$ and $\hat{p} = \frac{x_n^* + x_m^*}{n+m}$ respectively.

9.2 Test for equality of population variances (F-test) and confidence interval for $\sigma_1^2/\sigma_2^2$

A summary of the steps to be followed in the testing procedure is shown below.

Test for $\sigma_1^2 = \sigma_2^2$
Step 1: State null and alternative hypotheses
H0: $\sigma_1^2 = \sigma_2^2$ versus H1a: $\sigma_1^2 < \sigma_2^2$ or H1b: $\sigma_1^2 > \sigma_2^2$ or H1c: $\sigma_1^2 \ne \sigma_2^2$
Step 2: Calculate the test statistic $F_0 = \frac{\max(S_1^2, S_2^2)}{\min(S_1^2, S_2^2)}$.
Step 3: State the level of significance α and determine the critical value(s) and critical region. The degrees of freedom are $df_1$ = sample size (numerator sample variance) - 1 and $df_2$ = sample size (denominator sample variance) - 1.
(i) For alternatives H1a and H1b the critical region is $R = \{ F_0 \mid F_0 > F_{1-\alpha} \}$.
(ii) For alternative H1c the critical region is $R = \{ F_0 \mid F_0 > F_{1-\alpha/2} \}$.
Step 4: If $F_0$ lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

Confidence interval for $\sigma_1^2/\sigma_2^2$
Step 1: Calculate $S_1^2$ and $S_2^2$. Values of $n$, $m$ and confidence percentage are given.
Step 2: Determine the upper and lower F-distribution values for a given confidence percentage, $df_1$ and $df_2$.
Step 3: Confidence interval is $(\frac{S_1^2}{S_2^2} \times \text{lower}, \frac{S_1^2}{S_2^2} \times \text{upper})$.

Examples

1 The following sample information about the daily travel expenses of the sales (population 1) and audit (population 2) staff at a certain company was collected.

sales    1048   1080   1168   1320   1088   1136
audit    1040    816   1032   1142   1192    960   1112

(a) Test at the 10% level of significance whether the population variances could be the same.
(b) Calculate a 95% confidence interval for $\sigma_1^2/\sigma_2^2$.

(a) H0: $\sigma_1^2 = \sigma_2^2$
H1: $\sigma_1^2 \ne \sigma_2^2$
From the above information $n = 6$, $m = 7$, $S_1^2 = 9593.6$ and $S_2^2 = 15884$.
Test statistic: $F_0 = \frac{\max(9593.6, 15884)}{\min(9593.6, 15884)} = \frac{15884}{9593.6} = 1.656$.
$df_1 = 7 - 1 = 6$, $df_2 = 6 - 1 = 5$, $\alpha = 0.10$, $\alpha/2 = 0.05$.
For $df_1 = 6$, $df_2 = 5$, $F_{0.95} = 4.95$. Critical region $R = \{ F_0 > 4.95 \}$.
Since $F_0 = 1.656 < 4.95$, H0 is not rejected.
Conclusion: The population variances could be the same.

(b) $P(F_{0.025} \le \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2} \le F_{0.975}) = 0.95$ or $P(\frac{S_1^2}{S_2^2} F_{0.025} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2}{S_2^2} F_{0.975}) = 0.95$.
In the above expression $S_2^2$ is in the numerator and $S_1^2$ in the denominator. Hence $df_1 = 6$, $df_2 = 5$ and upper = $F_{0.975} = 6.98$.
lower = $F_{0.025}$ is found from $\frac{1}{F_{0.975}}$ with $df_1 = 5$, $df_2 = 6$ i.e. lower = $\frac{1}{5.99} = 0.1669$.
Substituting $\frac{S_1^2}{S_2^2} = 0.604$, $F_{0.025} = 0.1669$ and $F_{0.975} = 6.98$ into the above gives a confidence interval of (0.604*0.1669, 0.604*6.98) = (0.101, 4.216).
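The F-test and the confidence interval in the travel expenses example can be reproduced with the sketch below. The variable names are illustrative, and the F table values (4.95, 6.98 and 5.99) are copied from the example rather than computed, since the Python standard library has no F-distribution.

```python
import statistics

# Daily travel expenses from example 1
sales = [1048, 1080, 1168, 1320, 1088, 1136]        # population 1, n = 6
audit = [1040, 816, 1032, 1142, 1192, 960, 1112]    # population 2, m = 7

s1_sq = float(statistics.variance(sales))   # sample variance S1^2
s2_sq = float(statistics.variance(audit))   # sample variance S2^2

# (a) Test statistic: larger sample variance over the smaller one
f0 = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)
if s1_sq >= s2_sq:
    df1, df2 = len(sales) - 1, len(audit) - 1
else:
    df1, df2 = len(audit) - 1, len(sales) - 1

f_crit = 4.95            # F_0.95 with (6, 5) degrees of freedom, from the text
reject_h0 = f0 > f_crit

# (b) 95% confidence interval for sigma1^2 / sigma2^2
ratio = s1_sq / s2_sq
f_upper = 6.98           # F_0.975 with (6, 5) degrees of freedom, from the text
f_lower = 1 / 5.99       # F_0.025 with (6, 5) df = 1 / F_0.975 with (5, 6) df
ci = (ratio * f_lower, ratio * f_upper)

print(f"S1^2 = {s1_sq:.1f}, S2^2 = {s2_sq:.1f}, F0 = {f0:.3f}, df = ({df1}, {df2})")
print(f"reject H0: {reject_h0}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because 1 lies inside the interval, the confidence interval leads to the same conclusion as the test in part (a).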
2 The waiting times (minutes) for minor treatments were recorded at two different medical centres. Below is a summary of the calculations made from the samples.

centre    sample size    mean     variance
1         12             25.69    7.200
2         10             27.66    22.017

Test at the 5% level of significance whether the population variance for centre 1 is less than that for centre 2.

H0: $\sigma_1^2 = \sigma_2^2$
H1: $\sigma_1^2 < \sigma_2^2$
From the above table $n = 12$, $m = 10$, $S_1^2 = 7.200$ and $S_2^2 = 22.017$.
$F_0 = \frac{\max(22.017, 7.200)}{\min(22.017, 7.200)} = \frac{22.017}{7.200} = 3.058$.
$df_1 = 10 - 1 = 9$, $df_2 = 12 - 1 = 11$, $\alpha = 0.05$.
For $df_1 = 9$, $df_2 = 11$, $F_{0.95} = 2.90$. Critical region $R = \{ F_0 > 2.90 \}$.
Since $F_0 = 3.058 > 2.90$, H0 is rejected.
Conclusion: The variance for population 1 is probably less than that for population 2.

9.3 Test for difference between means for independent samples

(i) For large samples (both sample sizes $n, m \ge 30$)

Test for $\mu_1 - \mu_2 = 0$ (large samples, population variances known)
Step 1: State null and alternative hypotheses
H0: $\mu_1 - \mu_2 = 0$
H1a: $\mu_1 - \mu_2 < 0$ or H1b: $\mu_1 - \mu_2 > 0$ or H1c: $\mu_1 - \mu_2 \ne 0$
Step 2: Calculate the test statistic $z_0 = \frac{\bar{x}_1 - \bar{x}_2}{(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2}}$.
Step 3: State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is $R = \{ z_0 \mid z_0 < Z_{\alpha} \}$.
(ii) For alternative H1b the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha} \}$.
(iii) For alternative H1c the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha/2} \text{ or } z_0 < Z_{\alpha/2} \}$.
Step 4: If $z_0$ lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

A 100(1-α)% confidence interval for $\mu_1 - \mu_2$ is given by $\bar{x}_1 - \bar{x}_2 \pm Z_{1-\alpha/2} (\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2}$.

If the population variances $\sigma_1^2$ and $\sigma_2^2$ are not known, they can be replaced in the above formulae by their sample estimates $S_1^2$ and $S_2^2$ respectively with the testing procedure unchanged.

Examples

1 Data were collected on the length of short term stay of patients at hospitals.
Independent random samples of $n = 40$ male patients (population 1) and $m = 35$ female patients (population 2) were selected. The sample mean stays for male and female patients were $\bar{x}_1 = 9$ days and $\bar{x}_2 = 7$ days respectively. The population variances are known from past experience to be $\sigma_1^2 = 55$ and $\sigma_2^2 = 47$.
(a) Test at the 5% level of significance whether male patients stay longer on average than female patients.
(b) Calculate a 95% confidence interval for the mean difference (in staying time) between males and females.

(a) H0: $\mu_1 - \mu_2 = 0$ (mean staying times for males and females the same)
H1: $\mu_1 - \mu_2 > 0$ (mean staying time for males greater than for females)
Test statistic: $z_0 = \frac{\bar{x}_1 - \bar{x}_2}{(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2}} = \frac{9 - 7}{(\frac{55}{40} + \frac{47}{35})^{1/2}} = \frac{2}{1.6486} = 1.213$.
$\alpha = 0.05$. Critical region $R = \{ z_0 > Z_{0.95} = 1.645 \}$.
Since $z_0 = 1.213 < 1.645$, H0 cannot be rejected.
Conclusion: There is not sufficient evidence that male patients stay longer on average than female patients.

(b) $\bar{x}_1 - \bar{x}_2 = 2$, $(\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2} = 1.6486$ (denominator value when calculating the test statistic), $1 - \alpha = 0.95$, $\alpha = 0.05$, $\alpha/2 = 0.025$, $Z_{1-\alpha/2} = Z_{0.975} = 1.96$.
Lower limit: $\bar{x}_1 - \bar{x}_2 - Z_{1-\alpha/2} (\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2} = 2 - 1.96 \times 1.6486 = -1.231$.
Upper limit: $\bar{x}_1 - \bar{x}_2 + Z_{1-\alpha/2} (\frac{\sigma_1^2}{n} + \frac{\sigma_2^2}{m})^{1/2} = 2 + 1.96 \times 1.6486 = 5.231$.

2 Researchers in obesity want to test the effectiveness of dieting with exercise against dieting without exercise. Seventy three patients who were on the same diet were randomly divided into “exercise” ($n = 37$ patients) and “no exercise” groups ($m = 36$ patients). The results of the weight losses (in kilograms) of the patients after 2 months are summarized in the table below.

Diet with exercise group    Diet without exercise group
$\bar{x}_1 = 7.6$           $\bar{x}_2 = 6.7$
$S_1^2 = 2.53$              $S_2^2 = 5.59$

Test at the 5% level of significance whether there is a difference in weight loss between the 2 groups.

H0: $\mu_1 - \mu_2 = 0$ (No difference in weight loss)
H1: $\mu_1 - \mu_2 \ne 0$ (There is a difference in weight loss)
Test statistic: $z_0 = \frac{\bar{x}_1 - \bar{x}_2}{(\frac{S_1^2}{n} + \frac{S_2^2}{m})^{1/2}} = \frac{7.6 - 6.7}{(\frac{2.53}{37} + \frac{5.59}{36})^{1/2}} = \frac{0.9}{0.473} = 1.903$.
$\alpha = 0.05$.
Critical region $R = \{ z_0 < Z_{0.025} = -1.96 \text{ or } z_0 > Z_{0.975} = 1.96 \}$.
Since $-1.96 < z_0 = 1.903 < 1.96$, H0 cannot be rejected.
Conclusion: There is not sufficient evidence to suggest a difference in weight loss between the 2 groups.

(ii) For small samples (at least one of $n$ or $m < 30$) from normal populations with variances unknown

The test to be performed in this case will be preceded by a test for equality of population variances ($\sigma_1^2 = \sigma_2^2 = \sigma^2$) i.e. the F-test discussed in section 9.2. If the hypothesis of equal variances cannot be rejected, the test described below should be performed. If this hypothesis is rejected, the Welch-Aspin test (not to be discussed here) should be performed. If, in this case, the assumption of samples from normal populations does not hold, a nonparametric test like the Mann-Whitney test (not to be discussed here) should be used.

Test for $\mu_1 - \mu_2 = 0$ (small sample sizes, population variances unknown but equal)
Step 1: State null and alternative hypotheses
H0: $\mu_1 - \mu_2 = 0$
H1a: $\mu_1 - \mu_2 < 0$ or H1b: $\mu_1 - \mu_2 > 0$ or H1c: $\mu_1 - \mu_2 \ne 0$
Step 2: Calculate the test statistic $t_0 = \frac{\bar{x}_1 - \bar{x}_2}{S(\frac{1}{n} + \frac{1}{m})^{1/2}}$ with $S^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n+m-2}$.
Step 3: State the level of significance α and determine the critical value(s) and critical region. Degrees of freedom $= \nu = n + m - 2$.
(i) For alternative H1a the critical region is $R = \{ t_0 \mid t_0 < t_{\alpha} \}$.
(ii) For alternative H1b the critical region is $R = \{ t_0 \mid t_0 > t_{1-\alpha} \}$.
(iii) For alternative H1c the critical region is $R = \{ t_0 \mid t_0 > t_{1-\alpha/2} \text{ or } t_0 < t_{\alpha/2} \}$.
Step 4: If $t_0$ lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

A 100(1-α)% confidence interval for $\mu_1 - \mu_2$ is given by $\bar{x}_1 - \bar{x}_2 \pm t_{n+m-2,\,1-\alpha/2}\, S (\frac{1}{n} + \frac{1}{m})^{1/2}$.

Examples

1 Consider the above example on the comparison of the travel expenses for the sales and audit staff (see section 9.2, example 1 for F-test).
(a) Test, at the 5% level of significance, whether the mean expenses for the two types of staff could be the same.
(b) Calculate a 95% confidence interval for the difference between the mean expenses for the two types of staff.

(a) Since the hypothesis of equal population variances was not rejected, the test described above can be performed. From the data given $\bar{x}_1 = 1140$, $\bar{x}_2 = 1042$, $S_1^2 = 9593.6$ and $S_2^2 = 15884$.
H0: $\mu_1 - \mu_2 = 0$ (Mean travel expenses for sales and audit staff the same)
H1: $\mu_1 - \mu_2 \ne 0$ (Mean travel expenses for sales and audit staff not the same)
$\alpha = 0.05$, $\alpha/2 = 0.025$, $1 - \alpha/2 = 0.975$.
From the t-distribution with $n + m - 2 = 6 + 7 - 2 = 11$ degrees of freedom, $t_{0.975} = 2.201$.
$S^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n+m-2} = \frac{5 \times 9593.6 + 6 \times 15884}{11} = 13024.727$, $S = 114.126$.
Test statistic: $t_0 = \frac{1140 - 1042}{114.126 (\frac{1}{6} + \frac{1}{7})^{1/2}} = 1.543$.
Critical region $R = \{ |t_0| > t_{0.975} = 2.201 \}$.
Since $1.543 < t_{0.975} = 2.201$, H0 cannot be rejected.
Conclusion: Mean travel expenses for sales and audit staff are probably the same.

(b) A 95% confidence interval for the difference between sales and audit staff means is
$1140 - 1042 \pm 2.201 \times 114.126 \times (\frac{1}{6} + \frac{1}{7})^{1/2}$ i.e. (-41.75, 237.75).

2 A certain hospital has been getting complaints that the response to calls from senior citizens is slower (takes longer time on average) than that to calls from other patients. In order to test this claim, a pilot study was carried out. The results are shown below.

Patient type       sample mean response time    sample standard deviation    sample size
Senior citizens    5.60 minutes                 0.25 minutes                 18
Others             5.30 minutes                 0.21 minutes                 13

Test, at the 1% level of significance, whether the complaint is justified.

Label the “senior citizens” and “others” populations as 1 and 2 and their population mean response times as $\mu_1$ and $\mu_2$ respectively.
H0: $\mu_1 - \mu_2 = 0$ (Mean response times the same)
H1: $\mu_1 - \mu_2 > 0$ (Mean response time for senior citizens longer than for others)
The hypothesis that the population variances are equal cannot be rejected (perform the F-test to check this). Hence equal variances for the 2 populations can be assumed.
$S^2 = \frac{17 \times 0.25^2 + 12 \times 0.21^2}{29} = 0.0549$, $S = 0.2343$.
Test statistic: $t_0 = \frac{5.6 - 5.3}{0.2343 (\frac{1}{18} + \frac{1}{13})^{1/2}} = 3.518$.
$\alpha = 0.01$, $1 - \alpha = 0.99$. From the t-distribution table with $n + m - 2 = 18 + 13 - 2 = 29$ degrees of freedom, $t_{0.99} = 2.462$.
Critical region $R = \{ t_0 > t_{0.99} = 2.462 \}$.
Since $t_0 = 3.518 > 2.462$, H0 is rejected.
Conclusion: The claim is justified i.e. the mean response time for senior citizens is longer than that for others.

9.4 Test for difference between means for paired (matched) samples

The tests for the difference between means in the previous section assumed independent samples. In certain situations this assumption is not met.

Examples
1 A group of patients going on a diet is weighed before going on the diet and again after having been on the diet for one month. A test to determine whether the diet has reduced their weight is to be performed.
2 The aptitudes of boys and girls for mathematics are to be compared. In order to eliminate the effect of social factors, pairs of brothers and sisters are used in the comparison. Each (brother, sister) pair is given the same test and the mean marks of boys and girls compared.

In each of these situations the two samples cannot be regarded as independent. In the first example two readings (before and after readings) are made on the same subject. In the second example the two samples are matched via a common factor (family connection).

The data layout for the experiments described above is shown below.

sample 1    sample 2    difference
$x_1$       $y_1$       $d_1 = x_1 - y_1$
$x_2$       $y_2$       $d_2 = x_2 - y_2$
.....       .....       .....
$x_n$       $y_n$       $d_n = x_n - y_n$

The mean of the paired differences of the $(x, y)$ values of the two populations is defined as $\mu_d = \mu_1 - \mu_2$. Under the assumption that the differences are sampled from a normal population, hypotheses concerning the mean of the differences $\mu_d$ can be tested by performing a one sample t-test (described in the previous chapter) with the observed differences $d_1, d_2, \ldots, d_n$ as the sample.
The mean and standard deviation of these sample differences will be denoted by $\bar{d}$ and $S_d$ respectively.

Test for $\mu_d = 0$ (paired samples)
Step 1: State null and alternative hypotheses
H0: $\mu_d = 0$
H1a: $\mu_d < 0$ or H1b: $\mu_d > 0$ or H1c: $\mu_d \ne 0$
Step 2: Calculate the test statistic $t_0 = \frac{\bar{d}}{S_d/\sqrt{n}}$.
Step 3: State the level of significance α and determine the critical value(s) and critical region. Degrees of freedom $= \nu = n - 1$.
(i) For alternative H1a the critical region is $R = \{ t_0 \mid t_0 < t_{\alpha} \}$.
(ii) For alternative H1b the critical region is $R = \{ t_0 \mid t_0 > t_{1-\alpha} \}$.
(iii) For alternative H1c the critical region is $R = \{ t_0 \mid t_0 > t_{1-\alpha/2} \text{ or } t_0 < t_{\alpha/2} \}$.
Step 4: If $t_0$ lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

A 100(1-α)% confidence interval for $\mu_d$ is given by $\bar{d} \pm$ t-multiplier $\times \frac{S_d}{\sqrt{n}}$, where the t-multiplier is obtained from the t-tables with $n - 1$ degrees of freedom with an area $1 - \alpha/2$ under the t-curve below it.

Examples

1 A bank is considering loan applications for buying each of 10 homes. Two different companies (company 1 and company 2) are asked to do an evaluation of each of these 10 homes. The evaluations (thousands of Rand) for these homes are shown in the table below.

Home    company 1    company 2    difference (company 2 - company 1)
1       750          810          60
2       990          1000         10
3       1025         1020         -5
4       1285         1320         35
5       1300         1290         -10
6       875          915          40
7       1240         1250         10
8       880          910          30
9       700          650          -50
10      1315         1290         -25

(a) At the 5% level of significance, is there a difference in the mean evaluations for the 2 companies?
(b) Calculate a 95% confidence interval for the difference between the mean evaluations for companies 1 and 2.

(a) H0: $\mu_d = 0$ (No difference in mean evaluations)
H1: $\mu_d \ne 0$ (There is a difference in mean evaluations)
From the above table $\bar{d} = 9.5$, $S_d = 33.12015$, $n = 10$.
Test statistic: $t_0 = \frac{9.5}{33.12015/\sqrt{10}} = 0.907$.
$\alpha = 0.05$, $\alpha/2 = 0.025$, $1 - \alpha/2 = 0.975$. From the t-tables with $n - 1 = 9$ degrees of freedom, $t_{0.975} = 2.262$.
Critical region $R = \{ |t_0| > 2.262 \}$.
Since $t_0 = 0.907 < 2.262$, H0 is not rejected.
Conclusion: There is not sufficient evidence of a difference in the mean evaluations for the 2 companies.

(b) A 95% confidence interval is given by $9.5 \pm 2.262 \times \frac{33.12015}{\sqrt{10}}$ = (-14.19, 33.19).

2 Each of 15 people going on a diet was weighed before going on the diet and again after having been on the diet for one month. The weights (in kilograms) are shown in the table below.

Person    before    after    difference
1         90        85       -5
2         110       105      -5
3         124       126      2
4         116       118      2
5         105       94       -11
6         88        84       -4
7         86        87       1
8         92        87       -5
9         101       99       -2
10        112       105      -7
11        138       130      -8
12        96        93       -3
13        102       95       -7
14        111       102      -9
15        82        83       1

Test, at the 1% level of significance, whether the mean weight after one month on the diet is less than that before going on the diet.

Let $\mu_d$ denote the mean difference between the weight after having been on the diet for one month and before going on the diet.
H0: $\mu_d = 0$ (No difference in mean weights)
H1: $\mu_d < 0$ (Mean weight after one month on diet less than before going on diet)
From the above table $\bar{d} = -4$, $S_d = 4.1231$, $n = 15$.
Test statistic: $t_0 = \frac{-4}{4.1231/\sqrt{15}} = -3.757$.
$\alpha = 0.01$. From the t-tables with $n - 1 = 14$ degrees of freedom, $t_{0.01} = -t_{0.99} = -2.624$.
Critical region $R = \{ t_0 < -2.624 \}$.
Since $t_0 = -3.757 < -2.624$, H0 is rejected.
Conclusion: The mean weight after one month on the diet is less than before going on diet.

9.5 Test for the difference between proportions for independent samples

When testing for the difference between the proportions of two different populations, the test is based on the sampling distribution results 4-6 described in the first section of this chapter.

Test for $p_1 - p_2 = 0$
Step 1: State null and alternative hypotheses
H0: $p_1 - p_2 = 0$
H1a: $p_1 - p_2 < 0$ or H1b: $p_1 - p_2 > 0$ or H1c: $p_1 - p_2 \ne 0$
Step 2: Calculate the test statistic $z_0 = \frac{\hat{p}_1 - \hat{p}_2}{[\hat{p}(1-\hat{p})(\frac{1}{n} + \frac{1}{m})]^{1/2}}$ with $\hat{p} = \frac{x_n^* + x_m^*}{n+m}$.
Step 3: State the level of significance α and determine the critical value(s) and critical region.
(i) For alternative H1a the critical region is $R = \{ z_0 \mid z_0 < Z_{\alpha} \}$.
(ii) For alternative H1b the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha} \}$.
(iii) For alternative H1c the critical region is $R = \{ z_0 \mid z_0 > Z_{1-\alpha/2} \text{ or } z_0 < Z_{\alpha/2} \}$.
Step 4: If $z_0$ lies in the critical region, reject H0, otherwise do not reject H0.
Step 5: State the conclusion in terms of the original problem.

A 100(1-α)% confidence interval for $p_1 - p_2$ is given by $\hat{p}_1 - \hat{p}_2 \pm Z_{1-\alpha/2} [\hat{p}(1-\hat{p})(\frac{1}{n} + \frac{1}{m})]^{1/2}$.

Example

A perfume company is planning to market a new fragrance. In order to test the popularity of the fragrance, 120 young women and 150 older women were selected at random and asked whether they liked the new fragrance. The results of the survey are shown below.

women    like    did not like    sample size
young    48      72              120
older    72      78              150

(a) Test, at the 5% level of significance, whether older women like the new fragrance more than young women.
(b) Calculate a 95% confidence interval for the difference between the proportions of older and young women who like the fragrance.

(a) Let the older and younger women populations be labeled 1 and 2 respectively and $p_1$ and $p_2$ the respective population proportions that like the fragrance.
H0: $p_1 - p_2 = 0$
H1: $p_1 - p_2 > 0$
From the above table $n = 150$, $m = 120$, $x_n^* = 72$, $x_m^* = 48$.
$\hat{p} = \frac{72 + 48}{150 + 120} = \frac{4}{9}$.
Test statistic: $z_0 = \frac{\frac{72}{150} - \frac{48}{120}}{[(\frac{4}{9})(\frac{5}{9})(\frac{1}{150} + \frac{1}{120})]^{1/2}} = \frac{0.08}{0.060858} = 1.3145$.
$\alpha = 0.05$. Critical region $R = \{ z_0 > Z_{0.95} = 1.645 \}$.
Since $z_0 = 1.3145 < 1.645$, H0 cannot be rejected.
Conclusion: There is not sufficient evidence to suggest that older women like the new fragrance more than young women.

(b) $[\hat{p}(1-\hat{p})(\frac{1}{n} + \frac{1}{m})]^{1/2} = 0.060858$ [denominator of $z_0$ in part (a)].
$\hat{p}_1 - \hat{p}_2 = 0.08$ [numerator of $z_0$ in part (a)], $Z_{0.975} = 1.96$.
$\hat{p}_1 - \hat{p}_2 \pm Z_{1-\alpha/2} [\hat{p}(1-\hat{p})(\frac{1}{n} + \frac{1}{m})]^{1/2} = 0.08 \pm 1.96 \times 0.060858$ = (-0.039, 0.199).

9.6 Computer output

1 The test for the difference between population means in example 1 in section 9.3(ii) (the data in example 1 in section 9.2) can be performed by using excel. What follows is the output.
t-Test: Two-Sample Assuming Equal Variances
                                Variable 1    Variable 2
Mean                            1140          1042
Variance                        9593.6        15884
Observations                    6             7
Pooled Variance                 13024.73
Hypothesized Mean Difference    0
df                              11
t Stat                          1.543458
P(T<=t) two-tail                0.150984
t Critical two-tail             2.200985

The p-value is 0.150984 > 0.05. At the 5% level of significance the null hypothesis cannot be rejected.

2 The output shown below is when the test for equality of population variances for the data in example 1 in section 9.2 is performed by using excel.

F-Test Two-Sample for Variances
                       Variable 1    Variable 2
Mean                   1140          1042
Variance               9593.6        15884
Observations           6             7
df                     5             6
F                      0.603979
P(F<=f)                0.701718
F Critical two-tail    0.143266

The value of the test statistic shown in the above table is $\frac{s_1^2}{s_2^2} = \frac{9593.6}{15884} = 0.603979$. The critical value (last entry under variable 1 in the above table) is $F_{5,6;0.025} = \frac{1}{F_{6,5;0.975}} = \frac{1}{6.98} = 0.143266$ and the p-value (second to last entry under variable 1 in the above table) is 0.701718. Since 0.701718 > 0.025, the null hypothesis cannot be rejected.

Chapter 10 – Linear Correlation and regression

10.1 Bivariate data and scatter diagrams

Often two variables are measured simultaneously and relationships between these variables explored. Data sets involving two variables are known as bivariate data sets. The first step in the exploration of bivariate data is to plot the variables on a graph. From such a graph, which is known as a scatter diagram (scatter plot, scatter graph), an idea can be formed about the nature of the relationship.

Examples

1 The number of copies sold (y) of a new book is dependent on the advertising budget (x) the publisher commits in a pre-publication campaign. The values of x and y for 12 recently published books are shown below.
x (thousands of rands)    8      9.5    7.2    6.5    10     12     11.5   14.8   17.3   27     30     25
y (thousands)             12.5   18.6   25.3   24.8   35.7   45.4   44.4   45.8   65.3   75.7   72.3   79.2

[Scatter diagram: Advertising budget and copies sold. Horizontal axis: advertising budget; vertical axis: copies sold.]

2 In a study of the relationship between the amount of daily rainfall (x) and the quantity of air pollution removed (y), the following data were collected.

Rainfall (centimeters)                                4.3   4.5   5.9   5.6   6.1   5.2   3.8   2.1   7.5
Quantity removed (micrograms per cubic meter)         126   121   116   118   114   118   132   141   108

[Scatter diagram: Rainfall and quantity removed. Horizontal axis: rainfall; vertical axis: quantity removed.]

1 In both cases the relationship can be fairly well described by means of a straight line i.e. both these relationships are linear relationships.
2 In the first example an increase in y is proportional to an increase in x (positive linear relationship).
3 In the second example a decrease in y is proportional to an increase in x (negative linear relationship).
4 In both the examples changes in the values of y are affected by changes in the values of x (not the other way round). The variable x is known as the explanatory (independent) variable and the variable y the response (dependent) variable.

In this section only linear relationships between 2 variables will be explored. The issues to be explored are
1 Measuring the strength of the linear relationship between the 2 variables (the linear correlation problem).
2 Finding the equation of the straight line that will best describe the relationship between the 2 variables (the linear regression problem). Once this line is determined, it can be used to estimate a value of y for a given value of x (linear estimation).

10.2 Linear Correlation

The calculation of the coefficient of correlation (r) is based on the closeness of the plotted points (in the scatter diagram) to the line fitted through them. It can be shown that $-1 \le r \le 1$.
If the plotted points are closely clustered around this line, r will lie close to either 1 or -1 (depending on whether the linear relationship is positive or negative). The further the plotted points are away from the line, the closer the value of r will be to 0. Consider the scatter diagrams below.

[Scatter diagrams: Strong positive correlation (r close to 1); Strong negative correlation (r close to -1); No pattern (r close to 0).]

For a sample of n pairs of values $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the coefficient of correlation can be calculated from the formula
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$.

Example

Consider the data on the advertising budget (x) and the number of copies sold (y) considered earlier. For this data r can be calculated in the following way.

        x        y        xy          x2         y2
        8        12.5     100         64         156.25
        9.5      18.6     176.7       90.25      345.96
        7.2      25.3     182.16      51.84      640.09
        6.5      24.8     161.2       42.25      615.04
        10       35.7     357         100        1274.49
        12       45.4     544.8       144        2061.16
        11.5     44.4     510.6       132.25     1971.36
        14.8     45.8     677.84      219.04     2097.64
        17.3     65.3     1129.69     299.29     4264.09
        27       75.7     2043.9      729        5730.49
        30       72.3     2169        900        5227.29
        25       79.2     1980        625        6272.64
sum     178.8    545      10032.89    3396.92    30656.5

Substituting n = 12, ∑x = 178.8, ∑y = 545, ∑xy = 10032.89, ∑x2 = 3396.92 and ∑y2 = 30656.5 into the equation for r gives
$r = \frac{12 \times 10032.89 - 178.8 \times 545}{\sqrt{[12 \times 3396.92 - (178.8)^2][12 \times 30656.5 - (545)^2]}} = \frac{22948.68}{\sqrt{8793.6 \times 70853}} = 0.9194$.

Comment: Strong positive correlation i.e. the increase in the number of copies sold is closely linked with an increase in advertising budget.

Coefficient of determination

The strength of the linear relationship between 2 variables can also be measured by the square of the correlation coefficient ($r^2$). This quantity, called the coefficient of determination, is the proportion of variability in the y variable that is accounted for by its linear relationship with the x variable.

Example

In the above example on copies sold (y) and advertising budget (x), the coefficient of determination = $r^2 = 0.9194^2 = 0.8453$.
This means that 84.53% of the variability in copies sold is explained by its linear relationship with advertising budget.

10.3 Linear Regression

Finding the equation of the line that best fits the (x, y) points is based on the least squares principle. This principle can best be explained by considering the scatter diagram below. The scatter diagram is a plot of the DBH (diameter at breast height) versus the age for 12 oak trees. The data are shown in the table below.

Age x (years)     97     93     88    81    75     57    52     45    28    15     12    11
DBH y (inch)      12.5   12.5   8     9.5   16.5   11    10.5   9     6     1.5    1     1

According to the least squares principle, the line that “best” fits the plotted points is the one that minimizes the sum of the squares of the vertical deviations (see vertical lines in the above graph) between the plotted y and estimated y (values on the line). For this reason the line fitted according to this principle is called the least squares line.

Calculation of least squares linear regression line

The equation for the line to be fitted to the (x, y) points is $\hat{y} = a + bx$, where $\hat{y}$ is the fitted y value (the y value on the line, which in general differs from the observed y value), a is the y-intercept and b the slope of the line. It can be shown that the coefficients that define the least squares line can be calculated from
$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$ and $a = \bar{y} - b\bar{x}$.

Example

For the above data on age (x) and DBH (y) the least squares line can be calculated as shown below.

        x      y       xy        x2
        97     12.5    1212.5    9409
        93     12.5    1162.5    8649
        88     8       704       7744
        81     9.5     769.5     6561
        75     16.5    1237.5    5625
        57     11      627       3249
        52     10.5    546       2704
        45     9       405       2025
        28     6       168       784
        15     1.5     22.5      225
        12     1       12        144
        11     1       11        121
sum     654    99      6877.5    47240

Substituting n = 12, ∑x = 654, ∑y = 99, ∑xy = 6877.5 and ∑x2 = 47240 into the above equations gives
$b = \frac{12 \times 6877.5 - 654 \times 99}{12 \times 47240 - (654)^2} = \frac{17784}{139164} = 0.12779$ and
$a = \frac{99}{12} - 0.12779 \times \frac{654}{12} = 1.285$.

Therefore the equation of the y on x least squares line that can be used to estimate values of y (DBH) based on x (age) is $\hat{y} = 1.285 + 0.12779x$.

Suppose the DBH of a tree aged 90 years is to be estimated. This can be done by substituting the value of x = 90 into the above equation. Then $\hat{y} = 1.285 + 0.12779 \times 90 = 12.786$.

A word of caution

1 The linear relationship between y and x is often only valid for values of x within a certain range e.g. when estimating the DBH using age as explanatory variable, it should be taken into account that at some age the tree will stop growing. Assuming a linear relationship between age and DBH for values beyond the age where the tree stops growing would be incorrect.

2 Only relationships between variables that could be related in a practical sense are explored e.g. it would be pointless to explore the relationship between the number of vehicles in New York and the number of divorces in South Africa. Even if data collected on such variables might suggest a relationship, it cannot be of any practical value.

3 If variables are not linearly related, it does not mean that they are not related. There are many situations where the relationships between variables are non-linear.

Example

A plot of the banana consumption (y) versus the price (x) is shown in the graph below. A straight line will not describe this relationship very well, but the non-linear curve shown below will describe it well.

[Figure: Nonlinear regression example. Banana consumption (y) plotted against price (x), with a fitted non-linear curve.]

10.4 Computer output

Consider the data on age (x variable) and DBH (y variable). The output when performing a straight line regression on this data on excel is shown below.
SUMMARY OUTPUT

Regression Statistics
R Square      0.689307572

ANOVA
              df    SS             MS          F          Significance F
Regression    1     189.3872553    189.3873    22.1862    0.000828626
Residual      10    85.36274468    8.536274
Total         11    274.75

              Coefficients    Standard Error    t Stat      P-value
Intercept     1.285353971     1.702259153       0.755087    0.46761
X Variable    0.12779167      0.027130722       4.71022     0.00083

1 The coefficient of determination in the above output is R Square = 0.689307572.
2 The ANOVA (Analysis of Variance) table is constructed to test whether there is a significant linear relationship between X and Y. The p-value for this test is the entry under the Significance F heading in the ANOVA table. Since this p-value (0.000828626) is less than 0.05 (and also less than 0.01), the hypothesis of “no linear relationship between X and Y” can be rejected and it can be concluded that there is a significant linear relationship between X and Y.
3 The third of the tables in the summary output shows the intercept and slope values of the line. These are the two entries under Coefficients. The remaining columns to the right of the Coefficients column concern tests of whether the intercept and the slope are zero. From the intercept and slope p-values (0.46761 and 0.00083 respectively) it can be seen that the intercept is not significantly different from zero at the 5% level of significance (0.46761 > 0.05), but that the slope is significantly different from zero at both the 5% and 1% levels of significance (0.00083 < 0.01 < 0.05).

When the correlation coefficient is calculated for the above-mentioned data by using Excel, the output is as shown below.

            Column 1    Column 2
Column 1    1
Column 2    0.83025     1

The above table shows that the correlation between x and y is 0.83025.

TUTORIAL QUESTIONS CHAPTERS 5 TO 10